2nd/10_Wiki/Topics/Other/Command Center.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
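Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections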

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


---
id: wiki-2026-0508-command-center
title: Command Center
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [NOC, Operations Center, War Room]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [operations, incident-response, observability, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: Grafana, Prometheus, PagerDuty
---
# Command Center
## One-liner
> **"A Command Center is the single pane for cross-system situational awareness."** It originates from NASA Mission Control, the ancestor of the modern SRE NOC and AWS-style war rooms. As of 2026, augmenting the incident commander with an LLM assistant (Claude Opus 4.7) is standard.
## Core
### Components
- **Big-board screens**: service health, traffic, error budget, deploy state.
- **Roles**: Incident Commander (IC), Comms lead, Scribe, SMEs.
- **Comms channels**: a dedicated Slack/Teams channel plus a voice bridge.
- **Runbooks**: indexed, searchable, version-controlled.
- **Decision log**: one entry per decision with timestamp, decision, and reasoning.
### Incident phases
1. **Detect**: alert or customer report.
2. **Triage**: severity classification (sev1 to sev5).
3. **Mitigate**: immediate impact reduction.
4. **Resolve**: root-cause fix.
5. **Postmortem**: blameless review plus action items.
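The phase order above can be sketched as a small state machine that only allows moving forward one phase at a time. This is a minimal illustration, not a prescribed implementation; the `Phase` and `Incident` names are hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five incident phases, numbered in the order listed above."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


class Incident:
    """Hypothetical tracker that refuses to skip or repeat phases."""

    def __init__(self) -> None:
        self.phase = Phase.DETECT

    def advance(self) -> Phase:
        """Move to the next phase; raise once the postmortem is reached."""
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("incident is already in its final phase")
        self.phase = Phase(self.phase + 1)
        return self.phase
```

Forcing a strict order is a design choice for the sketch; real incidents sometimes loop back from Mitigate to Triage, which would need an explicit transition table instead.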
### Applications
1. Quarterly Sev1 game days to keep muscle memory fresh.
2. **Single-pane dashboard** showing SLOs, error budget, and on-call status.
3. **Incident bot** that automates channel creation, role assignment, and scribe prompts.
## 💻 Patterns
### Incident channel bot
```typescript
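// incident-bot.ts
import { WebClient } from '@slack/web-api';
import dayjs from 'dayjs';

// Assumes `slack` is a WebClient from @slack/web-api.
const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

// pagerduty, postRunbookLink, assignRoles, oncall, and backup are project-local helpers (not shown).
async function createIncident(severity: 1 | 2 | 3) {
  // Channel name pattern: inc-<date>-<time>-sev<n>
  const res = await slack.conversations.create({
    name: `inc-${dayjs().format('YYYYMMDD-HHmm')}-sev${severity}`,
  });
  const channel = res.channel!; // the API response wraps the new channel object
  await pagerduty.createIncident({ severity });
  await postRunbookLink(channel.id!);
  await assignRoles(channel.id!, { ic: oncall(), scribe: backup() });
  return channel;
}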
```
### Big-board layout (Grafana)
```yaml
# dashboard.yml (schematic layout, not Grafana's native dashboard JSON)
panels:
  - row: top
    items: [global_qps, global_error_rate, p99_latency, saturation]
  - row: middle
    items: [api_health, db_health, cache_health, queue_depth]
  - row: bottom
    items: [deploy_state, on_call_roster, error_budget_burn]
```
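The `error_budget_burn` panel above needs a number to plot. One common choice is the burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that definition; the function name and arguments are illustrative and not tied to any Grafana or Prometheus API.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the budget lasts exactly the SLO window;
    anything above 1.0 means the budget runs out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate
```

For example, with a 99.9% SLO an observed error rate of 0.2% gives a burn rate of roughly 2, so the budget would be exhausted in about half the window.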
### Severity matrix
```python
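# severity.py
def classify(impact_users: int, impact_revenue_per_hr: float, data_loss: bool) -> int:
    """Return the severity level for an incident (1 is the most severe)."""
    # Data loss or heavy revenue impact is always sev1.
    if data_loss or impact_revenue_per_hr > 100_000:
        return 1
    if impact_users > 10_000:
        return 2
    if impact_users > 100:
        return 3
    return 4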
```
### Incident decision log (markdown)
```markdown
# inc-2026-0510-1432-sev1
| Time | Actor | Decision | Reasoning |
|---|---|---|---|
| 14:32 | IC | Page DB on-call | DB cpu 100% 5m |
| 14:38 | DB-SME | Failover replica | Primary unresponsive |
| 14:41 | IC | Status page yellow | Degraded checkout |
```
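A scribe bot could generate the table above from structured entries. The following is a minimal sketch under that assumption; `Decision` and `render_log` are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    time: str       # wall-clock time, HH:MM
    actor: str      # role or person who made the call
    decision: str
    reasoning: str


def render_log(incident_id: str, decisions: list[Decision]) -> str:
    """Render decisions as a markdown table in the format shown above."""
    lines = [
        f"# {incident_id}",
        "| Time | Actor | Decision | Reasoning |",
        "|---|---|---|---|",
    ]
    for d in decisions:
        # Escape pipes so free text cannot break the table layout.
        cells = [c.replace("|", "\\|") for c in (d.time, d.actor, d.decision, d.reasoning)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Keeping entries immutable (`frozen=True`) matches the intent of a decision log: rows are appended during the incident, never edited afterwards.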
### LLM-assisted incident summary
```python
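# llm_summary.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def summarize(channel_transcript: str) -> str:
    """Ask the model for a structured summary of an incident channel transcript."""
    prompt = f"""
You are an SRE assistant. Summarize this incident channel transcript:
- Timeline (5 bullets max)
- Root cause hypothesis
- Customer impact
- Followups
Transcript: {channel_transcript}
"""
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text  # first content block holds the summary text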
```
### Runbook structure
```markdown
# runbook: api-5xx-spike
## Detect: alert "api-5xx > 1% 5m"
## Mitigate
1. Check deploy in last 30m → rollback: `kubectl rollout undo deploy/api`
2. Check DB connections → scale pool: `kubectl scale ...`
## Verify
- Error rate <0.1% for 10m
## Escalate to
- @api-team if 30m without recovery
```
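The Verify step in the runbook above ("Error rate <0.1% for 10m") can be checked mechanically. The sketch below assumes one error-rate sample per minute, expressed as a fraction (0.001 = 0.1%); the function name and defaults are illustrative.

```python
def recovered(error_rates: list[float], threshold: float = 0.001, window: int = 10) -> bool:
    """True if the last `window` per-minute samples are all below `threshold`."""
    if len(error_rates) < window:
        return False  # not enough data yet to declare recovery
    return all(rate < threshold for rate in error_rates[-window:])
```

Returning `False` when fewer than `window` samples exist is deliberate: the check should never declare recovery early just because the monitoring window has not filled yet.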
### Postmortem template
```markdown
## Summary
## Timeline (UTC)
## Root cause
## Impact (users, $, duration)
## What went well
## What didn't
## Action items (DRI, due date)
```
## Decision criteria
| Situation | Approach |
|---|---|
| Sev1 (data loss / outage) | Full war room + exec comms + status page red |
| Sev2 (degraded) | IC + 1 SME + status yellow |
| Sev3 (minor) | On-call solo + ticket |
| Recurring sev3 | Promote to project, root-cause |
| Multi-org incident | Joint war room + shared scribe |
**Default**: clear roles + decision log + blameless postmortem.
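The first three rows of the table can be encoded as a lookup so that a bot applies the same response every time. This is a sketch only: the field names are hypothetical, and the `exec_comms: False` and `status_page: None` values are assumptions filling in what the table leaves unstated for sev2 and sev3.

```python
RESPONSE_PLAN = {
    1: {"staffing": "full war room", "exec_comms": True, "status_page": "red"},
    # Assumption: the table does not mention exec comms for sev2 or sev3.
    2: {"staffing": "IC + 1 SME", "exec_comms": False, "status_page": "yellow"},
    # Assumption: no status page change for sev3 (the table lists none).
    3: {"staffing": "on-call solo + ticket", "exec_comms": False, "status_page": None},
}


def response_plan(severity: int) -> dict:
    """Return the response plan for a severity, or fail loudly if none is defined."""
    try:
        return RESPONSE_PLAN[severity]
    except KeyError:
        raise ValueError(f"no response plan defined for sev{severity}") from None
```

Failing on an unknown severity, rather than silently defaulting, keeps a misclassified incident from being handled with the wrong staffing.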
## 🔗 Graph
- Parent: [[Site Reliability Engineering]]
- Variant: [[War Room]]
- Applications: [[Observability]] · [[On-Call]]
- Adjacent: [[Postmortem]] · [[Runbook]]
## 🤖 LLM usage
**When**: summarizing incident channel transcripts, reconstructing timelines, drafting the first version of a postmortem.
**When not**: critical mitigation decisions. The human IC makes the final call; the LLM is an advisor only.
## ❌ Anti-patterns
- **Hero culture**: a single SRE on call 24/7 leads to burnout and a bus factor of 1.
- **Blame-game postmortem**: breeds a culture of silence.
- **Runbook rot**: a six-month-old runbook full of broken commands.
- **Dashboard bloat**: 100+ panels with a signal-to-noise ratio of 1:50.
- **Status page lag**: customers are the first to raise the alarm, which costs trust.
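Runbook rot is easy to detect if each runbook records when it was last reviewed. The sketch below assumes such a date is available (for example from front matter or git history) and uses 180 days as a stand-in for the six-month mark; the function and parameter names are hypothetical.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date], today: date, max_age_days: int = 180
) -> list[str]:
    """Return the names of runbooks last reviewed more than `max_age_days` ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)
```

Run on a schedule, a check like this can open a review ticket per stale runbook instead of waiting for an incident to reveal the broken commands.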
## 🧪 Verification / duplicates
- Verified (Google _SRE Workbook_ Ch.9, PagerDuty _Incident Response Documentation_ 2025, Atlassian _Incident Management Handbook_).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: incident phases, severity matrix, runbook, anti-patterns |