---
id: wiki-2026-0508-command-center
title: Command Center
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [NOC, Operations Center, War Room]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [operations, incident-response, observability, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: Grafana, Prometheus, PagerDuty
---

# Command Center

## One-line summary

> **"A Command Center is the single pane for cross-system situational awareness."** It originated with NASA Mission Control and is the ancestor of the modern SRE NOC and the AWS-style war room. As of 2026, augmenting the incident commander with an LLM assistant (Claude Opus 4.7) is standard practice.

## Core concepts

### Components

- **Big-board screens**: service health, traffic, error budget, deploy state.
- **Roles**: Incident Commander (IC), Comms lead, Scribe, SMEs.
- **Comms channels**: a dedicated Slack/Teams channel plus a voice bridge.
- **Runbooks**: indexed, searchable, version-controlled.
- **Decision log**: timestamp, decision, and reasoning for each call made during the incident.

### Incident phases

1. **Detect**: an alert fires or a customer reports a problem.
2. **Triage**: classify severity (sev1 through sev5).
3. **Mitigate**: reduce the immediate impact.
4. **Resolve**: fix the root cause.
5. **Postmortem**: blameless review plus action items.

### Applications

1. Quarterly Sev1 game days to keep muscle memory fresh.
2. A **single-pane dashboard** showing SLOs, error budget, and on-call status.
3. An **incident bot** that automates channel creation, role assignment, and scribe prompts.

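
The five incident phases and the decision log described above can be tied together in a few lines. The following is a minimal sketch, not a reference implementation: the names `Phase`, `Decision`, and `Incident` are hypothetical, and it assumes phases advance strictly in order, whereas a real incident may loop back from Mitigate to Triage.

```python
# incident_phases.py
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Phase(IntEnum):
    """The five phases, in the order listed in this note."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


@dataclass
class Decision:
    """One decision-log row: timestamp, actor, decision, reasoning."""
    time: datetime
    actor: str
    decision: str
    reasoning: str


@dataclass
class Incident:
    name: str
    phase: Phase = Phase.DETECT
    log: list[Decision] = field(default_factory=list)

    def record(self, actor: str, decision: str, reasoning: str) -> Decision:
        """Append a timestamped (UTC) entry to the decision log."""
        entry = Decision(datetime.now(timezone.utc), actor, decision, reasoning)
        self.log.append(entry)
        return entry

    def advance(self, actor: str, reasoning: str) -> Phase:
        """Move to the next phase and log who made the call and why."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        # Assumption: strictly linear progression, one phase at a time.
        self.phase = Phase(self.phase + 1)
        self.record(actor, f"Enter {self.phase.name.title()}", reasoning)
        return self.phase
```

Calling `Incident("inc-example").advance("IC", "alert confirmed by on-call")` moves the incident from Detect to Triage and leaves one entry in `log`. Logging every phase change through the same `record` path means the timeline needed for the postmortem accumulates as a side effect of running the incident.
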
## 💻 Patterns

### Incident channel bot

```typescript
// incident-bot.ts
import dayjs from 'dayjs';
import { WebClient } from '@slack/web-api';
// Hypothetical project helpers, assumed to be defined in a local module.
import { pagerduty, postRunbookLink, assignRoles, oncall, backup } from './helpers';

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function createIncident(severity: 1 | 2 | 3) {
  // The Slack response nests the new channel under `channel`.
  const { channel } = await slack.conversations.create({
    name: `inc-${dayjs().format('YYYYMMDD-HHmm')}-sev${severity}`,
  });
  if (!channel?.id) throw new Error('Slack returned no channel id');
  await pagerduty.createIncident({ severity });
  await postRunbookLink(channel.id);
  await assignRoles(channel.id, { ic: oncall(), scribe: backup() });
  return channel;
}
```

### Big-board layout (Grafana)

```yaml
# dashboard.yml
panels:
  - row: top
    items: [global_qps, global_error_rate, p99_latency, saturation]
  - row: middle
    items: [api_health, db_health, cache_health, queue_depth]
  - row: bottom
    items: [deploy_state, on_call_roster, error_budget_burn]
```

### Severity matrix

```python
# severity.py
def classify(impact_users: int, impact_revenue_per_hr: float, data_loss: bool) -> int:
    """Return a severity level from 1 (worst) to 4; checks run most severe first."""
    if data_loss or impact_revenue_per_hr > 100_000:
        return 1
    if impact_users > 10_000:
        return 2
    if impact_users > 100:
        return 3
    return 4
```

### Incident decision log (markdown)

```markdown
# inc-20260510-1432-sev1

| Time | Actor | Decision | Reasoning |
|---|---|---|---|
| 14:32 | IC | Page DB on-call | DB cpu 100% 5m |
| 14:38 | DB-SME | Failover replica | Primary unresponsive |
| 14:41 | IC | Status page yellow | Degraded checkout |
```

### LLM-assisted incident summary

```python
# llm_summary.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def summarize(channel_transcript: str) -> str:
    """Ask the model for a structured summary of an incident channel transcript."""
    prompt = f"""
You are an SRE assistant. Summarize this incident channel transcript:
- Timeline (5 bullets max)
- Root cause hypothesis
- Customer impact
- Followups

Transcript:
{channel_transcript}
"""
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

### Runbook structure

```markdown
# runbook: api-5xx-spike
## Detect: alert "api-5xx > 1% 5m"
## Mitigate
1. Check deploy in last 30m → rollback: `kubectl rollout undo deploy/api`
2. Check DB connections → scale pool: `kubectl scale ...`
## Verify
- Error rate <0.1% for 10m
## Escalate to
- @api-team if 30m without recovery
```

### Postmortem template

```markdown
## Summary
## Timeline (UTC)
## Root cause
## Impact (users, $, duration)
## What went well
## What didn't
## Action items (DRI, due date)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Sev1 (data loss / outage) | Full war room + exec comms + status page red |
| Sev2 (degraded) | IC + 1 SME + status yellow |
| Sev3 (minor) | On-call solo + ticket |
| Recurring sev3 | Promote to project, root-cause |
| Multi-org incident | Joint war room + shared scribe |

**Default**: clear roles + decision log + blameless postmortem.

## 🔗 Graph

- Parent: [[Site Reliability Engineering]]
- Variant: [[War Room]]
- Applications: [[Observability]] · [[On-Call]]
- Adjacent: [[Postmortem]] · [[Runbook]]

## 🤖 LLM usage

**When to use**: summarizing incident channel transcripts, reconstructing timelines, and drafting the first version of a postmortem.

**When not to use**: critical mitigation decisions. The human IC makes the final call; the LLM is an advisor only.

## ❌ Anti-patterns

- **Hero culture**: a single SRE covering 24/7 leads to burnout and a bus factor of 1.
- **Blame-game postmortem**: breeds a culture of silence.
- **Runbook rot**: six-month-old runbooks full of broken commands.
- **Dashboard bloat**: 100+ panels with a signal-to-noise ratio of about 1:50.
- **Status page lag**: customers report the outage before the status page does, which costs trust.

## 🧪 Verification / duplicates

- Verified against Google _SRE Workbook_ Ch. 9, PagerDuty _Incident Response Documentation_ 2025, and Atlassian _Incident Management Handbook_.
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: incident phases, severity matrix, runbook, anti-patterns |