2nd/10_Wiki/Topics/AI_and_ML/Science of Failure.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
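Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 clusters of cross-folder duplicates (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links straight at redirect targets, concept mapping (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections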

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-science-of-failure
title: Science of Failure
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Failure Science, Postmortem Culture, Learning from Failure]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reliability, postmortem, sre, chaos-engineering, learning-org]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: english
  framework: SRE
---
# Science of Failure
## One-line summary
> **"Every failure is an information signal from the system: not something to blame, but something to learn from."** The origins are the 1979 Three Mile Island accident and NASA's Challenger postmortem culture; the modern state is Google SRE blameless postmortems, Netflix Chaos Monkey, and Honeycomb observability plus AI-aided incident review (Claude Opus 4.7 transcript summarization).
## Core ideas
### Classification of failure cultures (Westrum 1988 → modern application)
- **Pathological**: messengers are shot, failures are hidden (countered by a pre-mortem culture).
- **Bureaucratic**: narrow responsibilities, novelty leads to problems.
- **Generative**: high cooperation, failure leads to inquiry, messengers are trained. This is the target for Google/Netflix.
### The 5 components of a blameless postmortem
- **Timeline**: UTC, minute precision.
- **Impact**: user-facing metric (RPS, error budget burn).
- **Root cause**: 5 whys + contributing factors.
- **Action items**: owner + due date.
- **Lessons**: process changes, not individual blame.
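
The five components above can be checked mechanically before a postmortem is marked complete. The following is a minimal sketch, not a standard tool: the class names, field names, and `missing_components` helper are all hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: str = ""               # empty string: nobody owns it yet
    due: Optional[date] = None    # None: no due date set


@dataclass
class Postmortem:
    timeline: List[str] = field(default_factory=list)   # UTC entries
    impact: str = ""
    root_cause: str = ""
    action_items: List[ActionItem] = field(default_factory=list)
    lessons: str = ""


def missing_components(pm: Postmortem) -> List[str]:
    """Return the names of components that are absent or incomplete."""
    problems = []
    if not pm.timeline:
        problems.append("timeline")
    if not pm.impact:
        problems.append("impact")
    if not pm.root_cause:
        problems.append("root cause")
    if not pm.action_items:
        problems.append("action items")
    # An action item only counts if it has both an owner and a due date.
    for item in pm.action_items:
        if not item.owner or item.due is None:
            problems.append(
                f"action item without owner/due date: {item.description}"
            )
    if not pm.lessons:
        problems.append("lessons")
    return problems
```

An empty list from `missing_components` means the document is structurally complete; it says nothing about whether the analysis itself is any good.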
### Applications
1. SRE error budget: launch freeze when the SLO is violated.
2. Chaos engineering: surface latent failures through fault injection in prod.
3. Pre-mortem: before launch, ask "imagine this failed, why?".
4. Game days: quarterly disaster simulations.
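
The error-budget item in the list above reduces to simple arithmetic. Here is a rough sketch, assuming a request-based SLO where the budget is the number of requests allowed to fail in the window. The function names are hypothetical, and a real freeze policy would usually involve more than one number.

```python
def error_budget_remaining(slo: float, total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched; 0.0 or below means the SLO is violated.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO of 100% (or no traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def launch_frozen(slo: float, total_requests: int,
                  failed_requests: int) -> bool:
    """Freeze launches once the budget is fully spent."""
    return error_budget_remaining(slo, total_requests, failed_requests) <= 0.0
```

For example, with a 99.9% SLO and 1,000,000 requests, roughly 1,000 failures are allowed. 500 failures leave about half the budget, while 1,500 failures trigger the freeze.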
## 💻 Patterns
### Blameless postmortem template (Markdown)
```markdown
# Incident: <name> (YYYY-MM-DD)
**Severity**: SEV-2
**Duration**: 47 min (14:03–14:50 UTC)
**Impact**: 12% of /api/v2 requests 5xx
**On-call**: @alice (commander), @bob (comms)
## Timeline (UTC)
- 14:03 — deploy v2.41.0 to prod
- 14:05 — error rate alarm fires (PagerDuty)
- 14:12 — rollback initiated
- 14:50 — error rate normal
## Root cause
DB migration added NOT NULL on `users.email` w/o backfill.
Old code paths (canary not yet drained) wrote NULL → constraint violation.
## Contributing factors
- Migration runner did not block on canary drain (process gap)
- Schema diff review missed NOT NULL implication (review gap)
## Action items
- [ ] @alice — migration runner: enforce canary-drain gate (P0, 2026-05-17)
- [ ] @bob — schema-diff bot: flag NOT NULL on existing column (P1, 2026-05-24)
## What went well
- Rollback under 10 min (rollback runbook v3 worked)
- On-call comms was fast
## What did not
- Canary drain assumption was tribal knowledge
## Lessons
Migration-runner gate is the structural fix.
Not "alice should have known" — process is the fix.
```
### 5 whys (chained, not individual blame)
```text
Why 5xx? → DB constraint violation
Why violation? → NULL written to NOT NULL col
Why NULL? → old canary still running old code
Why canary running? → migration ran w/o waiting for canary drain
Why no wait? → migration runner has no canary-state hook
→ FIX: migration runner must check canary state
```
### Chaos monkey (Litmus / Chaos Mesh, K8s-native, 2026)
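```yaml
# Chaos Mesh 2.x: recurring experiments use a Schedule resource
# (the inline `scheduler` field from 1.x is no longer part of PodChaos).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: kill-payments-pod-randomly
spec:
  schedule: "@every 30m"   # runs during prod hours too: kills one random pod
  type: PodChaos
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [payments]
      labelSelectors:
        app: payments-api
```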
### Error budget burn alert (Google SRE, multi-window)
```yaml
# Fast burn (1h window, 14.4x rate) + slow burn (6h window, 6x rate): two windows
- alert: SLOFastBurn
  expr: |
    (1 - sum(rate(http_requests_success[1h])) / sum(rate(http_requests_total[1h])))
      > (1 - 0.999) * 14.4
  labels: { severity: page }
  annotations: { summary: "Burning SLO at 14.4x: page on-call" }
- alert: SLOSlowBurn
  expr: |
    (1 - sum(rate(http_requests_success[6h])) / sum(rate(http_requests_total[6h])))
      > (1 - 0.999) * 6
  labels: { severity: ticket }
```
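
The 14.4x and 6x multipliers can look arbitrary, so here is a small sketch of where they come from. It assumes a 30-day SLO period, which the rules above do not state; under that assumption, a burn rate sustained over an alert window spends `burn_rate * window / period` of the budget. The function names are illustrative only.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period (720 hours)


def budget_consumed(burn_rate: float, window_hours: float,
                    period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Fraction of the whole error budget spent if `burn_rate`
    is sustained for `window_hours`."""
    return burn_rate * window_hours / period_hours


def error_rate_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the alert expression starts firing."""
    return (1.0 - slo) * burn_rate


if __name__ == "__main__":
    for name, rate, window in [("fast", 14.4, 1), ("slow", 6.0, 6)]:
        spent = budget_consumed(rate, window)
        limit = error_rate_threshold(0.999, rate)
        print(f"{name}: {spent:.1%} of budget per window, "
              f"fires above {limit:.2%} errors")
```

Under the 30-day assumption, the fast-burn alert fires when about 2% of the budget would go in one hour, and the slow-burn alert when about 5% would go in six hours.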
### Pre-mortem prompt (team session)
```text
"Six months from now, the launch has been a catastrophic failure.
The NYTimes headline reads 'Company X loses $100M'.
How did that happen? Write down the 5 most likely scenarios."
→ A pre-mortem counters cognitive bias (overconfidence) and surfaces risks.
```
### Incident summarizer (Claude Opus 4.7, transcript → postmortem draft)
```python
import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Raw export of the incident channel; the file name is only an example.
with open("incident-2026-05-09.log", encoding="utf-8") as f:
    slack_log = f.read()

msg = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=(
        "You are an SRE writing a blameless postmortem. "
        "Extract: timeline (UTC), impact, root cause (5 whys), "
        "contributing factors, action items. Never name-blame; "
        "frame failures as process gaps."
    ),
    messages=[{"role": "user", "content": slack_log}],
)

# The first content block holds the generated draft; a human still reviews it.
print(msg.content[0].text)
```
## Decision criteria
| Situation | Approach |
|---|---|
| SEV-1, user-impacting | full blameless postmortem (24h SLA) |
| SEV-3, internal-only | lightweight 5 whys (1 page) |
| Near-miss (no impact) | "near-miss log": still learn from it |
| Individual error pattern | analyze as a process gap (not a PIP) |

**Default**: SEV-2+ → blameless postmortem with action items + owners.
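
The table and default can be written as a small lookup, which is handy if incident tooling needs to open the right template automatically. This is only a sketch: `choose_review` is a hypothetical name, `None` is used to mean a near-miss, and treating SEV-4 and lower the same as SEV-3 is an assumption the table does not cover.

```python
from typing import Optional


def choose_review(severity: Optional[int]) -> str:
    """Map an incident severity (1 = worst) to a review format.

    `None` stands for a near-miss with no impact.
    """
    if severity is None:
        return "near-miss log"
    if severity < 1:
        raise ValueError("severity levels start at 1")
    if severity == 1:
        return "full blameless postmortem (24h SLA)"
    if severity == 2:
        # Default rule: SEV-2 and above get a postmortem with owned actions.
        return "blameless postmortem with action items + owners"
    # Assumption: anything less severe than SEV-2 gets the lightweight form.
    return "lightweight 5 whys (1 page)"
```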
## 🔗 Graph
- Parent: [[SRE]]
- Variant: [[Chaos Engineering]]
- Application: [[Postmortem]]
## 🤖 LLM usage
**When**: Slack/PagerDuty transcript → postmortem first draft (Claude Opus 4.7 with 1M context can take a long incident in one piece). Also 5-whys facilitation.
**When not**: final attribution of the root cause, which needs human judgment. There is a risk of the LLM hallucinating "blame".
## ❌ Anti-patterns
- **Blame culture**: asking "who screwed up?" → future failures get hidden.
- **Action-item theater**: no owner, no due date → never done.
- **Single root cause**: real failures are multi-factor (the Swiss-cheese model).
- **Postmortem-as-punishment**: tying postmortems to PIPs → honesty dies.
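
Blame language tends to slip into drafts unnoticed, whether a person or an LLM wrote them. The sketch below is a crude lint that flags lines whose wording points at a person rather than a process. The phrase list is hypothetical and deliberately short; it is a heuristic to prompt a second look, not a judgment.

```python
import re
from typing import List, Tuple

# Hypothetical phrase list; tune it to the language your team actually uses.
BLAME_PATTERNS = [
    r"\bshould have (known|caught|checked)\b",
    r"\bfailed to\b",
    r"\bcareless(ly|ness)?\b",
    r"\b(his|her|their) (fault|mistake)\b",
]
_BLAME_RE = re.compile("|".join(BLAME_PATTERNS), re.IGNORECASE)


def blame_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that contain blame-style wording."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if _BLAME_RE.search(line)
    ]


if __name__ == "__main__":
    draft = (
        "Migration runner had no canary-drain gate.\n"
        "Alice should have known about the drain.\n"
    )
    for number, line in blame_lines(draft):
        print(f"line {number}: {line}")
```

A line that quotes blame wording only to reject it will also be flagged, so treat hits as review prompts rather than errors.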
## 🧪 Verification / duplicates
- Verified (Google SRE Book Ch.15, Westrum 1988, Sidney Dekker "Field Guide to Understanding Human Error").
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — blameless postmortem + chaos eng + LLM-aided draft |