---
id: wiki-2026-0508-command-center
title: Command Center
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [NOC, Operations Center, War Room]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [operations, incident-response, observability, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: any
  framework: Grafana, Prometheus, PagerDuty
---

# Command Center

## In one line
> **"A Command Center is the single pane for cross-system situational awareness."** It originates in NASA Mission Control, the ancestor of the modern SRE NOC and the AWS-style war room. As of 2026, augmenting the incident commander with an LLM assistant (Claude Opus 4.7) is treated as standard practice.

## Core

### Components
- **Big-board screens**: service health, traffic, error budget, deploy state.
- **Roles**: Incident Commander (IC), Comms lead, Scribe, SMEs.
- **Comms channels**: a dedicated Slack/Teams channel plus a voice bridge.
- **Runbooks**: indexed, searchable, version-controlled.
- **Decision log**: timestamp, decision, and reasoning for each incident action.

### Incident phases
1. **Detect**: an alert or a customer report.
2. **Triage**: severity classification (sev1 to sev5).
3. **Mitigate**: immediate impact reduction.
4. **Resolve**: root-cause fix.
5. **Postmortem**: blameless review plus action items.

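The five phases above can be modelled as an ordered enum, which lets an incident bot reject out-of-order transitions. This is a minimal sketch: the names `Phase`, `advance`, and `can_transition` are hypothetical, and the rule that only single forward steps are allowed is an assumption of the sketch (real incidents sometimes loop back from Mitigate to Triage).

```python
from enum import IntEnum


class Phase(IntEnum):
    """Incident phases, numbered in the order they normally occur."""
    DETECT = 1
    TRIAGE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


def advance(current: Phase) -> Phase:
    """Return the phase that follows `current`; the last phase has no successor."""
    if current is Phase.POSTMORTEM:
        raise ValueError("postmortem is the final phase")
    return Phase(current + 1)


def can_transition(current: Phase, target: Phase) -> bool:
    """Allow only single forward steps (an assumption of this sketch)."""
    return target == current + 1
```

Using `IntEnum` keeps the ordering explicit, so a timeline can be validated by checking that each recorded transition passes `can_transition`.
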
### Applications
1. Run a quarterly Sev1 game day to keep muscle memory fresh.
2. **Single-pane dashboard** covering SLOs, error budget, and on-call status.
3. **Incident bot** that automates channel creation, role assignment, and scribe prompts.

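## 💻 Patterns

### Incident channel bot
```typescript
// incident-bot.ts
import dayjs from 'dayjs';
import { WebClient } from '@slack/web-api';
// Hypothetical local module wrapping PagerDuty and the on-call roster.
import { pagerduty, postRunbookLink, assignRoles, oncall, backup } from './incident-helpers';

const slack = new WebClient(process.env.SLACK_BOT_TOKEN);

async function createIncident(severity: 1 | 2 | 3) {
  // conversations.create returns the new channel under the `channel` field.
  const { channel } = await slack.conversations.create({
    name: `inc-${dayjs().format('YYYYMMDD-HHmm')}-sev${severity}`,
  });
  if (!channel?.id) throw new Error('Slack channel creation failed');
  await pagerduty.createIncident({ severity });
  await postRunbookLink(channel.id);
  await assignRoles(channel.id, { ic: oncall(), scribe: backup() });
  return channel;
}
```
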
### Big-board layout (Grafana)
```yaml
# dashboard.yml
# Logical layout sketch; Grafana itself stores dashboards as JSON with per-panel gridPos.
panels:
  - row: top
    items: [global_qps, global_error_rate, p99_latency, saturation]
  - row: middle
    items: [api_health, db_health, cache_health, queue_depth]
  - row: bottom
    items: [deploy_state, on_call_roster, error_budget_burn]
```

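The `error_budget_burn` panel needs a number to plot. A common choice is the burn rate: the observed error ratio divided by the error ratio the SLO allows, so a value of 1.0 means the budget runs out exactly at the end of the SLO window. The sketch below assumes a request-based SLO; the function names and parameters are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def budget_remaining(bad_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent over the measured window."""
    if slo_target >= 1.0:
        raise ValueError("slo_target must be below 1.0")
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad
```

For example, with a 99.9% SLO an error ratio of 0.2% gives a burn rate of about 2, meaning the budget is being spent twice as fast as the SLO permits. A negative `budget_remaining` means the budget is already overspent.
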
### Severity matrix
```python
# severity.py
def classify(impact_users: int, impact_revenue_per_hr: float, data_loss: bool) -> int:
    """Map incident impact to a severity level (1 = most severe)."""
    # Data loss or heavy revenue impact is always sev1.
    if data_loss or impact_revenue_per_hr > 100_000:
        return 1
    # Otherwise severity scales with the number of affected users.
    if impact_users > 10_000:
        return 2
    if impact_users > 100:
        return 3
    return 4
```

### Incident decision log (markdown)
```markdown
# inc-20260510-1432-sev1
| Time | Actor | Decision | Reasoning |
|---|---|---|---|
| 14:32 | IC | Page DB on-call | DB CPU at 100% for 5m |
| 14:38 | DB-SME | Failover replica | Primary unresponsive |
| 14:41 | IC | Status page yellow | Degraded checkout |
```

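A scribe tool or incident bot could generate the log table above from structured entries instead of hand-typed rows. The following is a minimal sketch: the `Decision` dataclass, `to_row`, and `render_log` are hypothetical names, and timestamps are assumed to be in UTC already.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Decision:
    time: datetime
    actor: str
    decision: str
    reasoning: str


def to_row(entry: Decision) -> str:
    """Render one entry as a Markdown table row with an HH:MM timestamp."""
    cells = [entry.time.strftime("%H:%M"), entry.actor, entry.decision, entry.reasoning]
    # Escape pipes so free-text cells cannot break the table.
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def render_log(incident_id: str, entries: list[Decision]) -> str:
    """Render the whole log, sorted by time, under an incident heading."""
    header = [f"# {incident_id}", "| Time | Actor | Decision | Reasoning |", "|---|---|---|---|"]
    rows = [to_row(e) for e in sorted(entries, key=lambda e: e.time)]
    return "\n".join(header + rows)


if __name__ == "__main__":
    first = Decision(datetime(2026, 5, 10, 14, 32, tzinfo=timezone.utc),
                     "IC", "Page DB on-call", "DB CPU at 100% for 5m")
    print(render_log("inc-20260510-1432-sev1", [first]))
```

Sorting inside `render_log` keeps the timeline ordered even when entries are recorded late, which helps during postmortem timeline reconstruction.
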
### LLM-assisted incident summary
```python
# llm_summary.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def summarize(channel_transcript: str) -> str:
    """Ask the model for a structured summary of an incident channel transcript."""
    prompt = f"""
You are an SRE assistant. Summarize this incident channel transcript:
- Timeline (5 bullets max)
- Root cause hypothesis
- Customer impact
- Followups

Transcript: {channel_transcript}
"""
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    # The reply text is in the first content block.
    return response.content[0].text
```

### Runbook structure
```markdown
# runbook: api-5xx-spike
## Detect: alert "api-5xx > 1% 5m"
## Mitigate
1. Check deploy in last 30m → rollback: `kubectl rollout undo deploy/api`
2. Check DB connections → scale pool: `kubectl scale ...`
## Verify
- Error rate <0.1% for 10m
## Escalate to
- @api-team if 30m without recovery
```

### Postmortem template
```markdown
## Summary
## Timeline (UTC)
## Root cause
## Impact (users, $, duration)
## What went well
## What didn't
## Action items (DRI, due date)
```

## Decision criteria
| Situation | Approach |
|---|---|
| Sev1 (data loss / outage) | Full war room + exec comms + status page red |
| Sev2 (degraded) | IC + 1 SME + status yellow |
| Sev3 (minor) | On-call solo + ticket |
| Recurring sev3 | Promote to project, root-cause |
| Multi-org incident | Joint war room + shared scribe |

**Default**: clear roles + decision log + blameless postmortem.

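The first three rows of the table can be encoded as a lookup so that a bot proposes the same response every time. This is a sketch only: `ResponsePlan`, `PLANS`, and `plan_for` are hypothetical names, and falling back to the sev3 plan for lower-priority severities is an assumption, since the table does not cover them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResponsePlan:
    responders: str
    status_page: Optional[str]  # None means no status page update


PLANS = {
    1: ResponsePlan("full war room + exec comms", "red"),
    2: ResponsePlan("IC + 1 SME", "yellow"),
    3: ResponsePlan("on-call solo + ticket", None),
}


def plan_for(severity: int) -> ResponsePlan:
    """Look up the response for a severity; sev4 and below reuse the sev3 plan."""
    if severity < 1:
        raise ValueError("severity starts at 1")
    return PLANS.get(severity, PLANS[3])
```

The human IC still decides whether to follow the proposed plan; the lookup only removes guesswork about the default.
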
## 🔗 Graph
- Parent: [[Site Reliability Engineering]]
- Variant: [[War Room]]
- Applications: [[Observability]] · [[On-Call]]
- Adjacent: [[Postmortem]] · [[Runbook]]

## 🤖 LLM usage
**When**: summarizing incident channel transcripts, reconstructing timelines, and writing the first draft of a postmortem.
**When not**: critical mitigation decisions. The human IC makes the final call; the LLM is an advisor only.

## ❌ Anti-patterns
- **Hero culture**: a single SRE covering 24/7 leads to burnout and a bus factor of 1.
- **Blame-game postmortem**: breeds a culture of silence.
- **Runbook rot**: a six-month-old runbook full of broken commands.
- **Dashboard bloat**: 100+ panels with a signal-to-noise ratio of 1:50.
- **Status page lag**: customers become the first alert, which costs trust.

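Runbook rot in particular can be caught mechanically by flagging runbooks that have not been reviewed recently. The sketch below assumes each runbook carries a last-reviewed date (for example from front matter or git history); the 180-day threshold and the name `stale_runbooks` are assumptions chosen to match the six-month figure above.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # roughly six months; an assumed threshold


def stale_runbooks(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return names of runbooks last reviewed more than MAX_AGE ago, oldest first."""
    stale = [(reviewed, name) for name, reviewed in last_reviewed.items()
             if today - reviewed > MAX_AGE]
    return [name for _, name in sorted(stale)]


if __name__ == "__main__":
    reviews = {
        "api-5xx-spike": date(2025, 10, 1),
        "db-failover": date(2026, 4, 20),
    }
    print(stale_runbooks(reviews, today=date(2026, 5, 10)))
```

Running a check like this in CI or before each game day turns runbook review into a routine task instead of something discovered mid-incident.
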
## 🧪 Verification / duplicates
- Verified against Google _SRE Workbook_ Ch. 9, PagerDuty _Incident Response Documentation_ 2025, and Atlassian _Incident Management Handbook_.
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: incident phases, severity matrix, runbook, anti-patterns |