2nd/10_Wiki/Topics/Other/Command Center.md

---
id: wiki-2026-0508-command-center
title: Command Center
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [NOC, Operations Center, War Room]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [operations, incident-response, observability, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: Grafana, Prometheus, PagerDuty
---

# Command Center

## In one line
> **"A Command Center is the single pane for cross-system situational awareness."** It originates in NASA Mission Control, the ancestor of the modern SRE NOC and the AWS-style war room. As of 2026, augmenting the incident commander with an LLM assistant (Claude Opus 4.7) is standard practice.

## Core concepts

### Components
- **Big-board screens**: service health, traffic, error budget, deploy state.
- **Roles**: Incident Commander (IC), Comms lead, Scribe, SMEs.
- **Comms channels**: a dedicated Slack/Teams channel plus a voice bridge.
- **Runbooks**: indexed, searchable, version-controlled.
- **Decision log**: one entry per decision with timestamp, decision, and reasoning.

### Incident phases
1. **Detect**: an alert fires or a customer reports a problem.
2. **Triage**: classify severity (sev1 to sev5).
3. **Mitigate**: reduce the immediate impact.
4. **Resolve**: fix the root cause.
5. **Postmortem**: blameless review plus action items.
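
The five phases above can be modeled as a small state machine. This is a minimal sketch, assuming the phases are strictly linear (real incidents sometimes loop back from Resolve to Mitigate); the `Phase` and `Incident` names are illustrative and not taken from any tool.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


class Incident:
    """Tracks the current phase and refuses to move past the final one."""

    def __init__(self) -> None:
        self.phase = Phase.DETECT

    def advance(self) -> Phase:
        # Assumption: phases only move forward, one step at a time.
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase + 1)
        return self.phase
```

Recording each `advance()` call in the decision log gives the postmortem timeline its phase boundaries for free.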

### Applications
1. Run a **Sev1 game day** every quarter to keep muscle memory fresh.
2. **Single-pane dashboard** combining SLOs, error budget, and on-call status.
3. **Incident bot** that automates channel creation, role assignment, and scribe prompts.
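
The error budget figure on a single-pane dashboard can be derived from the SLO target and request counts. The sketch below assumes a request-based availability SLO (good requests / total requests); the function name and parameters are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures
```

For example, with a 99.9% target, 1,000,000 requests, and 250 failures, roughly three quarters of the budget remains.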

## 💻 Patterns

### Incident channel bot
```typescript
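// incident-bot.ts
import dayjs from 'dayjs';
import { WebClient } from '@slack/web-api';
// Hypothetical local helpers; none of these come from an SDK.
import { pagerduty, postRunbookLink, assignRoles, oncall, backup } from './helpers';

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

export async function createIncident(severity: 1 | 2 | 3) {
  // conversations.create wraps the new channel in a `channel` field.
  const { channel } = await slack.conversations.create({
    name: `inc-${dayjs().format('YYYYMMDD-HHmm')}-sev${severity}`,
  });
  if (!channel?.id) throw new Error('Slack did not return a channel id');
  await pagerduty.createIncident({ severity });
  await postRunbookLink(channel.id);
  await assignRoles(channel.id, { ic: oncall(), scribe: backup() });
  return channel;
}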
```

### Big-board layout (Grafana)
```yaml
# dashboard.yml
# Layout sketch only: this is not Grafana's dashboard JSON model.
panels:
  - row: top
    items: [global_qps, global_error_rate, p99_latency, saturation]
  - row: middle
    items: [api_health, db_health, cache_health, queue_depth]
  - row: bottom
    items: [deploy_state, on_call_roster, error_budget_burn]
```

### Severity matrix
```python
# severity.py
def classify(impact_users: int, impact_revenue_per_hr: float, data_loss: bool) -> int:
    """Return a severity level from 1 (worst) to 4 (minor)."""
    if data_loss or impact_revenue_per_hr > 100_000:
        return 1
    if impact_users > 10_000:
        return 2
    if impact_users > 100:
        return 3
    return 4
```

### Incident decision log (markdown)
```markdown
# inc-20260510-1432-sev1
| Time | Actor | Decision | Reasoning |
|---|---|---|---|
| 14:32 | IC | Page DB on-call | DB cpu 100% 5m |
| 14:38 | DB-SME | Failover replica | Primary unresponsive |
| 14:41 | IC | Status page yellow | Degraded checkout |
```
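
A decision log in this table format is easy to generate from structured entries, which keeps the scribe's output consistent under pressure. The following is a minimal sketch; the `Decision` class and `render_log` function are hypothetical names, and it assumes every entry uses the same clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    time: str       # HH:MM, same clock for every entry
    actor: str
    decision: str
    reasoning: str

    def to_row(self) -> str:
        cells = (self.time, self.actor, self.decision, self.reasoning)
        # Escape pipes so free-text cells cannot break the table.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def render_log(incident_id: str, decisions: list[Decision]) -> str:
    """Render the full log: title, table header, then one row per decision."""
    header = [
        f"# {incident_id}",
        "| Time | Actor | Decision | Reasoning |",
        "|---|---|---|---|",
    ]
    return "\n".join(header + [d.to_row() for d in decisions])
```

Keeping entries immutable (`frozen=True`) reflects the idea that a decision log is append-only: corrections are new rows, not edits.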

### LLM-assisted incident summary
```python
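# llm_summary.py
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()


def summarize(channel_transcript: str) -> str:
    """Ask the model for a structured summary of an incident channel."""
    prompt = f"""
You are an SRE assistant. Summarize this incident channel transcript:
- Timeline (5 bullets max)
- Root cause hypothesis
- Customer impact
- Followups

Transcript: {channel_transcript}
"""
    response = client.messages.create(
        model="claude-opus-4-7",  # model ID as given in this note
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply is a list of content blocks; the first one holds the text.
    return response.content[0].text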
```

### Runbook structure
```markdown
# runbook: api-5xx-spike
## Detect: alert "api-5xx > 1% 5m"
## Mitigate
1. Check deploy in last 30m → rollback: `kubectl rollout undo deploy/api`
2. Check DB connections → scale pool: `kubectl scale ...`
## Verify
- Error rate <0.1% for 10m
## Escalate to
- @api-team if 30m without recovery
```
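
The runbook's Verify step ("error rate <0.1% for 10m") can be checked mechanically. This sketch assumes one error-rate sample per minute, expressed as a fraction (0.001 = 0.1%); the function name and defaults are illustrative.

```python
def recovered(error_rates: list[float], threshold: float = 0.001, window: int = 10) -> bool:
    """True when the last `window` per-minute samples are all below `threshold`."""
    if len(error_rates) < window:
        return False  # not enough data to declare recovery
    return all(rate < threshold for rate in error_rates[-window:])
```

Requiring a full window of samples avoids declaring recovery in the first quiet minute after a rollback.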

### Postmortem template
```markdown
## Summary
## Timeline (UTC)
## Root cause
## Impact (users, $, duration)
## What went well
## What didn't
## Action items (DRI, due date)
```

## Decision criteria
| Situation | Approach |
|---|---|
| Sev1 (data loss / outage) | Full war room + exec comms + status page red |
| Sev2 (degraded) | IC + 1 SME + status yellow |
| Sev3 (minor) | On-call solo + ticket |
| Recurring sev3 | Promote to project, root-cause |
| Multi-org incident | Joint war room + shared scribe |

**Default**: clear roles + decision log + blameless postmortem.
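
The first three rows of the table can be encoded as a lookup so an incident bot applies the same response every time. This is a sketch under the assumption that only Sev1 to Sev3 have a defined response; levels the table does not cover raise an error instead of guessing. `RESPONSE_PLAN` and `response_for` are hypothetical names.

```python
RESPONSE_PLAN: dict[int, tuple[str, ...]] = {
    1: ("full war room", "exec comms", "status page red"),
    2: ("IC", "1 SME", "status page yellow"),
    3: ("on-call solo", "ticket"),
}


def response_for(severity: int) -> tuple[str, ...]:
    """Return the response steps for a severity level listed in the table."""
    try:
        return RESPONSE_PLAN[severity]
    except KeyError:
        raise ValueError(f"no response plan defined for sev{severity}") from None
```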

## 🔗 Graph
- Parent: [[Site Reliability Engineering]]
- Variant: [[War Room]]
- Applications: [[Observability]] · [[On-Call]]
- Adjacent: [[Postmortem]] · [[Runbook]]

## 🤖 LLM usage
**When to use**: summarizing incident channel transcripts, reconstructing timelines, and writing the first draft of a postmortem.
**When not to use**: critical mitigation decisions. The human IC makes the final call; the LLM is an advisor only.

## ❌ Anti-patterns
- **Hero culture**: a single SRE covering 24/7 leads to burnout and a bus factor of 1.
- **Blame-game postmortem**: breeds a culture of silence.
- **Runbook rot**: a six-month-old runbook full of broken commands.
- **Dashboard bloat**: 100+ panels push the signal-to-noise ratio to 1:50.
- **Status page lag**: customers report the problem before the status page does, which erodes trust.
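
Runbook rot in particular is easy to detect automatically. The sketch below flags runbooks whose last review is older than about six months; it assumes a last-reviewed date is tracked per runbook (for example from version-control history), and the function name and the 180-day default are illustrative.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 180
) -> list[str]:
    """Return the names of runbooks last reviewed before the cutoff, sorted."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)
```

Running a check like this on a schedule and opening a ticket per stale runbook turns review into routine work rather than something discovered mid-incident.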

## 🧪 Verification / duplicates
- Verified (Google _SRE Workbook_ Ch.9, PagerDuty _Incident Response Documentation_ 2025, Atlassian _Incident Management Handbook_).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — incident phases, severity matrix, runbook, anti-patterns |