---
id: wiki-2026-0508-science-of-failure
title: Science of Failure
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Failure Science, Postmortem Culture, Learning from Failure]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reliability, postmortem, sre, chaos-engineering, learning-org]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: english
  framework: SRE
---

# Science of Failure

## One-line summary

> **"Every failure is an information signal from the system: learn from it, do not assign blame."** The practice traces back to the investigations of the 1979 Three Mile Island accident and NASA's Challenger disaster; its modern form is the Google SRE blameless postmortem, Netflix Chaos Monkey, and Honeycomb-style observability combined with AI-aided incident review (Claude Opus 4.7 transcript summarization).

## Core concepts

### Classifying failure cultures (Westrum 1988, applied today)
- **Pathological**: messengers are shot; failures are hidden and lead to scapegoating.
- **Bureaucratic**: narrow responsibilities; novelty is treated as a problem.
- **Generative**: high cooperation, failure leads to inquiry, messengers are trained. This is the culture Google and Netflix aim for.

### The 5 components of a blameless postmortem
- **Timeline**: UTC, minute precision.
- **Impact**: user-facing metrics (RPS, error budget burn).
- **Root cause**: 5 whys plus contributing factors.
- **Action items**: owner + due date.
- **Lessons**: process changes, not individual blame.

### Applications
1. SRE error budget: freeze launches when the SLO is violated.
2. Chaos engineering: inject faults in production to surface latent failures.
3. Pre-mortem: before launch, ask "imagine this failed; why?"
4. Game days: quarterly disaster simulations.

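
The error-budget policy in item 1 comes down to simple arithmetic. The sketch below is a minimal illustration, assuming a 99.9% availability SLO over a 30-day window and downtime measured as minutes of full outage. The function names and the "freeze once the budget is fully spent" rule are hypothetical and not taken from any specific SRE tooling.

```python
"""Error-budget arithmetic for an availability SLO (illustrative sketch)."""

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) the SLO allows over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the unspent fraction of the budget (negative once overspent)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo, window_days)


def should_freeze_launches(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical policy: freeze launches once the budget is fully spent."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0.0


if __name__ == "__main__":
    slo = 0.999  # 99.9% availability target (assumed for this example)
    for downtime in (10.0, 47.0):  # minutes of full outage within the window
        print(
            f"downtime={downtime:.0f} min "
            f"remaining={budget_remaining(slo, downtime):+.1%} "
            f"freeze={should_freeze_launches(slo, downtime)}"
        )
```

With a 99.9% target over 30 days the budget is about 43.2 minutes, so under this policy a single 47-minute full outage overspends it and triggers the freeze, while a 10-minute outage does not. Real policies often weight partial outages by the fraction of failed requests instead of counting wall-clock minutes.
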
## 💻 Patterns

### Blameless postmortem template (Markdown)

```markdown
# Incident: (YYYY-MM-DD)

**Severity**: SEV-2
**Duration**: 47 min (14:03–14:50 UTC)
**Impact**: 12% of /api/v2 requests 5xx
**On-call**: @alice (commander), @bob (comms)

## Timeline (UTC)
- 14:03 - deploy v2.41.0 to prod
- 14:05 - error rate alarm fires (PagerDuty)
- 14:12 - rollback initiated
- 14:50 - error rate normal

## Root cause
DB migration added NOT NULL on `users.email` w/o backfill. Old code paths
(canary not yet drained) wrote NULL → constraint violation.

## Contributing factors
- Migration runner did not block on canary drain (process gap)
- Schema diff review missed NOT NULL implication (review gap)

## Action items
- [ ] @alice - migration runner: enforce canary-drain gate (P0, 2026-05-17)
- [ ] @bob - schema-diff bot: flag NOT NULL on existing column (P1, 2026-05-24)

## What went well
- Rollback initiated within 10 min of the deploy (rollback runbook v3 worked)
- On-call comms was fast

## What did not
- Canary drain assumption was tribal knowledge

## Lessons
Migration-runner gate is the structural fix. Not "alice should have known":
the process is the fix.
```

### 5 whys (chained, never individual blame)

```text
Why 5xx? → DB constraint violation
Why violation? → NULL written to NOT NULL col
Why NULL? → old canary still running old code
Why canary running? → migration ran w/o waiting for canary drain
Why no wait?
  → migration runner has no canary-state hook
  → FIX: migration runner must check canary state
```

### Chaos monkey (Litmus / Chaos Mesh, K8s native, 2026)

```yaml
# Chaos Mesh 1.x syntax; 2.x moves scheduling into a separate `Schedule` resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-payments-pod-randomly
spec:
  action: pod-kill
  mode: one                # pick one matching pod at random
  selector:
    namespaces: [payments]
    labelSelectors:
      app: payments-api
  scheduler:
    cron: "@every 30m"     # kill a random pod every 30 min, production hours included
```

### Error budget burn alert (Google SRE, multi-window)

```yaml
# Two burn-rate alerts against a 99.9% SLO:
# fast burn (1h window, 14.4x rate) pages; slow burn (6h window, 6x rate) opens a ticket.
- alert: SLOFastBurn
  expr: |
    (1 - sum(rate(http_requests_success[1h])) / sum(rate(http_requests_total[1h])))
      > (1 - 0.999) * 14.4
  labels: { severity: page }
  annotations: { summary: "Burning SLO 14.4x: page on-call" }
- alert: SLOSlowBurn
  expr: |
    (1 - sum(rate(http_requests_success[6h])) / sum(rate(http_requests_total[6h])))
      > (1 - 0.999) * 6
  labels: { severity: ticket }
```

### Pre-mortem prompt (team session)

```text
"Six months from now, the launch has been a catastrophic failure.
The NYTimes headline reads 'Company X loses $100M'.
How did it happen? Write down the 5 most likely scenarios."

→ A pre-mortem counters cognitive bias (overconfidence) and surfaces risks.
```

### Incident summarizer (Claude Opus 4.7, transcript → postmortem draft)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Raw incident-channel transcript exported to a local file.
with open("incident-2026-05-09.log", encoding="utf-8") as f:
    slack_log = f.read()

msg = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=(
        "You are an SRE writing a blameless postmortem. "
        "Extract: timeline (UTC), impact, root cause (5 whys), "
        "contributing factors, action items. Never name-blame; "
        "frame failures as process gaps."
    ),
    messages=[{"role": "user", "content": slack_log}],
)

# The first content block holds the drafted postmortem text.
print(msg.content[0].text)
```

## Decision criteria

| Situation | Approach |
|---|---|
| SEV-1, user-impacting | full blameless postmortem (24h SLA) |
| SEV-3, internal-only | lightweight 5 whys (1 page) |
| Near-miss (no impact) | "near-miss log": still learn from it |
| Recurring individual error pattern | analyze the process gap (no PIP) |

**Default**: SEV-2 and above → blameless postmortem with action items and owners.

## 🔗 Graph
- Parent: [[SRE]]
- Variant: [[Chaos Engineering]]
- Application: [[Postmortem]]

## 🤖 LLM usage
**When to use**: drafting a first-pass postmortem from Slack/PagerDuty transcripts (Claude Opus 4.7 with 1M context can take in a long incident in one pass), and facilitating 5 whys.
**When not to use**: final attribution of the root cause, which needs human judgment. LLMs risk hallucinating "blame".

## ❌ Anti-patterns
- **Blame culture**: asking "who screwed up?" teaches people to hide future failures.
- **Action-item theater**: no owner and no due date, so the item never gets done.
- **Single root cause**: real failures are multi-factor (Swiss cheese model).
- **Postmortem-as-punishment**: tying postmortems to PIPs kills honesty.

## 🧪 Verification / duplicates
- Verified (Google SRE Book Ch.15, Westrum 1988, Sidney Dekker "The Field Guide to Understanding Human Error").
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: blameless postmortem + chaos eng + LLM-aided draft |