2nd/10_Wiki/Topics/Other/Command Center.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
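Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections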

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-command-center
title: Command Center
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases:
  - NOC
  - Operations Center
  - War Room
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags:
  - operations
  - incident-response
  - observability
  - sre
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: Grafana, Prometheus, PagerDuty

Command Center

In one line

"The Command Center is the single pane for cross-system situational awareness." It originated with NASA Mission Control and is the ancestor of the modern SRE NOC and the AWS-style war room. In 2026, augmenting the incident commander with an LLM assistant (Claude Opus 4.7) is the standard.

Core concepts

Components

  • Big-board screens: service health, traffic, error budget, deploy state.
  • Roles: Incident Commander (IC), Comms lead, Scribe, SMEs.
  • Comms channels: a dedicated Slack/Teams channel plus a voice bridge.
  • Runbooks: indexed, searchable, version-controlled.
  • Decision log: for each incident, timestamp + decision + reasoning.

Incident phases

  1. Detect: alert or customer report.
  2. Triage: severity classification (sev1 to sev5).
  3. Mitigate: immediate impact reduction.
  4. Resolve: root-cause fix.
  5. Postmortem: blameless review + action items.
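
The five phases above can be modeled as a small state machine. The following is a minimal sketch, assuming a strictly linear progression (real incidents may loop back from Mitigate to Triage, which this sketch does not model); `Phase`, `next_phase`, and `advance` are illustrative names, not part of any tool named in this page.

```python
from enum import Enum


class Phase(Enum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


def next_phase(current: Phase) -> Phase:
    """Return the phase that follows `current` in the 1-to-5 ordering."""
    if current is Phase.POSTMORTEM:
        raise ValueError("postmortem is the final phase")
    return Phase(current.value + 1)


def advance(current: Phase, target: Phase) -> Phase:
    """Move to `target` only if it is the immediate successor of `current`."""
    if target is not next_phase(current):
        raise ValueError(f"cannot jump from {current.name} to {target.name}")
    return target
```

Rejecting skipped phases is a design choice of the sketch: it makes it hard to close an incident without a recorded triage or postmortem step.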

Applications

  1. Quarterly Sev1 game days to maintain muscle memory.
  2. A single-pane dashboard showing SLO, error budget, and on-call status.
  3. An incident bot that automates channel creation, role assignment, and scribe prompts.
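
The error-budget figure on such a dashboard follows from simple arithmetic: the budget is the fraction of requests the SLO allows to fail, and the burn rate is the observed error rate divided by that budget. A minimal sketch, with hypothetical function names and an example SLO chosen only for illustration:

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.999 -> about 0.001)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being consumed exactly at the pace
    the SLO allows; values above 1.0 mean it runs out before the window ends.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (failed / total) / error_budget(slo)
```

For example, 50 failures in 10,000 requests against a 99.9% SLO is an error rate of 0.5% against a 0.1% budget, so `burn_rate(50, 10_000, 0.999)` comes out at roughly 5.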

💻 Patterns

Incident channel bot

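// incident-bot.ts
import { WebClient } from '@slack/web-api';
import dayjs from 'dayjs';
// pagerduty, postRunbookLink, assignRoles, oncall, and backup are project-local helpers (not shown).

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function createIncident(severity: 1 | 2 | 3) {
  // Channel name encodes creation time and severity, e.g. inc-20260510-1432-sev1.
  const res = await slack.conversations.create({
    name: `inc-${dayjs().format('YYYYMMDD-HHmm')}-sev${severity}`,
  });
  const channel = res.channel; // conversations.create wraps the channel in its response
  if (!channel?.id) throw new Error('Slack did not return a channel id');
  await pagerduty.createIncident({ severity });
  await postRunbookLink(channel.id);
  await assignRoles(channel.id, { ic: oncall(), scribe: backup() });
  return channel;
}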

Big-board layout (Grafana)

# dashboard.yml (layout sketch only; not Grafana's native dashboard JSON model)
panels:
  - row: top
    items: [global_qps, global_error_rate, p99_latency, saturation]
  - row: middle
    items: [api_health, db_health, cache_health, queue_depth]
  - row: bottom
    items: [deploy_state, on_call_roster, error_budget_burn]

Severity matrix

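# severity.py
def classify(impact_users: int, impact_revenue_per_hr: float, data_loss: bool) -> int:
    """Return a severity level from 1 (worst) to 4 based on incident impact."""
    # Sev1: any data loss, or revenue impact above 100,000 per hour.
    if data_loss or impact_revenue_per_hr > 100_000:
        return 1
    if impact_users > 10_000:
        return 2
    if impact_users > 100:
        return 3
    return 4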

Incident decision log (markdown)

# inc-2026-0510-1432-sev1
| Time | Actor | Decision | Reasoning |
|---|---|---|---|
| 14:32 | IC | Page DB on-call | DB CPU at 100% for 5m |
| 14:38 | DB-SME | Failover replica | Primary unresponsive |
| 14:41 | IC | Status page yellow | Degraded checkout |
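
A scribe bot could emit rows in this format from structured entries. The following is a minimal sketch; `Decision` and `render_log` are hypothetical names, and the column set is taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    time: str       # wall-clock time as HH:MM, matching the table above
    actor: str
    decision: str
    reasoning: str

    def to_row(self) -> str:
        """Render the entry as one pipe-delimited markdown table row."""
        cells = (self.time, self.actor, self.decision, self.reasoning)
        # Escape pipes so free-text reasoning cannot break the table.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def render_log(title: str, entries: list[Decision]) -> str:
    """Render a full log: heading, table header, then one row per entry."""
    lines = [
        f"# {title}",
        "| Time | Actor | Decision | Reasoning |",
        "|---|---|---|---|",
    ]
    lines.extend(e.to_row() for e in entries)
    return "\n".join(lines)
```

Keeping entries immutable (`frozen=True`) reflects the intent that a decision log is append-only; corrections would be added as new rows rather than edits.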

LLM-assisted incident summary

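# llm_summary.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def summarize(channel_transcript: str) -> str:
    """Return an LLM-drafted summary of an incident channel transcript."""
    prompt = f"""
You are an SRE assistant. Summarize this incident channel transcript:
- Timeline (5 bullets max)
- Root cause hypothesis
- Customer impact
- Followups

Transcript: {channel_transcript}
"""
    summary = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    # The response body is a list of content blocks; the first holds the text.
    return summary.content[0].text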

Runbook structure

# runbook: api-5xx-spike
## Detect: alert "api-5xx > 1% 5m"
## Mitigate
1. Check deploy in last 30m → rollback: `kubectl rollout undo deploy/api`
2. Check DB connections → scale pool: `kubectl scale ...`
## Verify
- Error rate <0.1% for 10m
## Escalate to
- @api-team if 30m without recovery
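
The Verify step ("error rate <0.1% for 10m") can be checked mechanically. The following is a minimal sketch, assuming error-rate samples arrive as `(unix_timestamp, rate)` pairs with the rate expressed as a fraction; `is_recovered` and the `max_gap_s` coverage rule are illustrative assumptions, not part of the runbook.

```python
def is_recovered(
    samples: list[tuple[float, float]],
    now: float,
    threshold: float = 0.001,   # 0.1% expressed as a fraction
    window_s: float = 600.0,    # 10 minutes
    max_gap_s: float = 60.0,    # assumed scrape interval
) -> bool:
    """True if every sample in the last `window_s` seconds is below `threshold`.

    Returns False when data does not reach back to the start of the window,
    so a gap in metrics is never mistaken for recovery.
    """
    start = now - window_s
    recent = sorted((t, r) for t, r in samples if start <= t <= now)
    if not recent:
        return False
    if recent[0][0] - start > max_gap_s:
        return False
    return all(rate < threshold for _, rate in recent)
```

Treating missing data as "not recovered" is deliberate: during an outage the metrics pipeline is often degraded too.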

Postmortem template

## Summary
## Timeline (UTC)
## Root cause
## Impact (users, $, duration)
## What went well
## What didn't
## Action items (DRI, due date)

Decision criteria

| Situation | Approach |
|---|---|
| Sev1 (data loss / outage) | Full war room + exec comms + status page red |
| Sev2 (degraded) | IC + 1 SME + status yellow |
| Sev3 (minor) | On-call solo + ticket |
| Recurring sev3 | Promote to project, root-cause |
| Multi-org incident | Joint war room + shared scribe |

Default: clear-roles + decision-log + blameless-postmortem.
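
The criteria above can be encoded as a lookup so that a bot proposes the same response every time. A minimal sketch with hypothetical names; note one assumption that is not in the source: severities below Sev3 (for example 4) fall back to the Sev3 plan.

```python
RESPONSE_PLAN = {
    1: ["full war room", "exec comms", "status page red"],
    2: ["IC", "1 SME", "status page yellow"],
    3: ["on-call solo", "ticket"],
}


def response_plan(severity: int, recurring: bool = False,
                  multi_org: bool = False) -> list[str]:
    """Return the response steps for an incident, per the decision criteria."""
    if severity < 1:
        raise ValueError("severity must be 1 or greater")
    # Assumption: anything less severe than Sev3 is handled like Sev3.
    plan = list(RESPONSE_PLAN[min(severity, 3)])
    if severity == 3 and recurring:
        plan += ["promote to project", "root-cause"]
    if multi_org:
        plan += ["joint war room", "shared scribe"]
    return plan
```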

🔗 Graph

🤖 LLM usage

When to use: summarizing incident channel transcripts, reconstructing timelines, and first-drafting postmortems. When not to use: critical mitigation decisions, where the human IC makes the final call. The LLM is an advisor only.
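
One way to enforce the advisor-only rule in tooling is an approval gate: any proposal, whether from an LLM or a person, stays blocked until a human IC signs off. The following is a minimal sketch; `Proposal`, `approve`, and `can_execute` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    action: str
    proposed_by: str                    # e.g. "llm" or a person's handle
    approved_by: Optional[str] = None


def approve(proposal: Proposal, approver: str, human_ics: set[str]) -> Proposal:
    """Record approval, accepting only approvers listed as human ICs."""
    if approver not in human_ics:
        raise PermissionError(f"{approver} is not a human incident commander")
    proposal.approved_by = approver
    return proposal


def can_execute(proposal: Proposal) -> bool:
    """A mitigation may run only after a human IC has signed off."""
    return proposal.approved_by is not None
```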

Anti-patterns

  • Hero culture: a single SRE on call 24/7 leads to burnout and a bus factor of 1.
  • Blame-game postmortem: breeds a culture of silence.
  • Runbook rot: six-month-old runbooks with broken commands.
  • Dashboard bloat: 100+ panels with a signal-to-noise ratio of 1:50.
  • Status page lag: customers report the problem before the status page does, which costs trust.
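
Runbook rot in particular is easy to detect automatically. A minimal sketch, assuming each runbook records a last-verified date (for instance in its front matter); `stale_runbooks` and the 180-day default are illustrative choices based on the six-month figure above.

```python
from datetime import date, timedelta


def stale_runbooks(last_verified: dict[str, date], today: date,
                   max_age_days: int = 180) -> list[str]:
    """Return runbook names not verified within `max_age_days`, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [(d, name) for name, d in last_verified.items() if d < cutoff]
    return [name for _, name in sorted(stale)]
```

Running a check like this in CI and opening a ticket per stale runbook turns the anti-pattern into routine maintenance.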

🧪 Verification / duplicates

  • Verified (Google SRE Workbook Ch.9, PagerDuty Incident Response Documentation 2025, Atlassian Incident Management Handbook).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: incident phases, severity matrix, runbook, anti-patterns |