"Fix the root, not the symptom." The practice originated in the Toyota Production System (Ohno, 1950s) and is now standard in SRE/DevOps incident postmortems (Google SRE Book, Etsy blameless postmortem).
Symptom ≠ root: stopping at the first plausible cause is not RCA.
Multiple roots: a single root cause is a myth; real failures come from a web of contributing factors.
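To make the gap between symptom and root concrete, a chain of whys can be recorded as ordered question/answer pairs. This is a minimal sketch, not a standard tool: the `Why` dataclass, both helper functions, and the example incident are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


def proximate_cause(chain: list[Why]) -> str:
    """The first answer: the cause closest to the symptom."""
    return chain[0].answer


def root_cause(chain: list[Why]) -> str:
    """The last answer: where the chain of whys bottoms out."""
    return chain[-1].answer


# Hypothetical incident, for illustration only.
chain = [
    Why("Why did checkout return 503s?", "The circuit breaker opened"),
    Why("Why did the breaker open?", "Requests to the orders DB timed out"),
    Why("Why did requests time out?", "A query did a full table scan"),
    Why("Why a full table scan?", "A migration added a column without an index"),
    Why("Why was that not caught?", "Migration review has no performance check"),
]

print("Proximate:", proximate_cause(chain))
print("Root:     ", root_cause(chain))
```

Stopping after the first entry would yield a mitigation (reset the breaker); only the last entry points at something that prevents recurrence.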
Applications
Production incident postmortems (SRE).
Bug investigation (recurring crashes).
Quality control (manufacturing defects).
Process failures (project delays).
💻 Patterns
Postmortem template (markdown)
# Incident: [name] — YYYY-MM-DD
## Summary
1-paragraph what happened.
## Impact
- Users affected: N
- Duration: HH:MM
- Revenue/SLO: ...
## Timeline (UTC)
- HH:MM — alert fired
- HH:MM — engineer paged
- HH:MM — mitigation
- HH:MM — resolved
## Root cause
The proximate cause was X. The underlying root cause was Y because Z.
## What went well
- Detection time was <5min thanks to alert A.
## What went poorly
- Runbook was outdated.
## Action items
- [ ] [P0] Fix Y in service S (owner: @alice, due: YYYY-MM-DD)
- [ ] [P1] Update runbook (owner: @bob)
## Lessons learned
- ...
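Action items only help if each one has an owner and a due date. A small linter along these lines could flag incomplete items before a postmortem is published. This is a sketch under assumptions: it expects items written as `- [ ] [Pn] ...` lines containing an `@handle` and the word `due`; the function name `lint_action_items` and the sample items (including the date) are made up for illustration.

```python
import re

# Matches "- [ ] [P0] text" or "- [x] [P1] text".
ITEM = re.compile(r"^- \[(?P<done>[ xX])\] \[(?P<prio>P\d)\] (?P<rest>.+)$")


def lint_action_items(markdown: str) -> list[str]:
    """Return one warning per action item lacking an owner or a due date."""
    warnings = []
    for raw in markdown.splitlines():
        line = raw.strip()
        m = ITEM.match(line)
        if not m:
            continue  # not an action-item line
        rest = m.group("rest")
        if not re.search(r"@\w+", rest):
            warnings.append(f"no owner: {line}")
        if not re.search(r"\bdue\b", rest):
            warnings.append(f"no due date: {line}")
    return warnings


# Hypothetical action items, for illustration only.
sample = """\
- [ ] [P0] Fix Y in service S, owner @alice, due 2030-01-15
- [ ] [P1] Update runbook, owner @bob
- [ ] [P2] Add p99 latency alert
"""
for warning in lint_action_items(sample):
    print(warning)
```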
Fishbone (Ishikawa) categories
Problem: Database queries timing out
├── People: New engineer didn't index
├── Process: Migration review missed perf check
├── Tools: ORM hides slow query
├── Environment: Prod DB has 10x test data
├── Materials: Schema changed without index
└── Measurement: No p99 latency SLO
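The same diagram can be kept as data, which makes it easy to spot categories nobody has examined yet. The sketch below reuses the causes from the example; the dict layout and the `render` and `unexplored` helpers are assumptions, not an established format.

```python
# Categories and causes copied from the example; the dict layout is an assumption.
fishbone = {
    "problem": "Database queries timing out",
    "causes": {
        "People": ["New engineer didn't index"],
        "Process": ["Migration review missed perf check"],
        "Tools": ["ORM hides slow query"],
        "Environment": ["Prod DB has 10x test data"],
        "Materials": ["Schema changed without index"],
        "Measurement": ["No p99 latency SLO"],
    },
}


def render(fb: dict) -> str:
    """Render the diagram as a text tree, one line per cause."""
    items = [(cat, c) for cat, causes in fb["causes"].items() for c in causes]
    lines = [f"Problem: {fb['problem']}"]
    for i, (cat, cause) in enumerate(items):
        branch = "└──" if i == len(items) - 1 else "├──"
        lines.append(f"{branch} {cat}: {cause}")
    return "\n".join(lines)


def unexplored(fb: dict) -> list[str]:
    """Categories with no hypothesis yet, worth a second look."""
    return [cat for cat, causes in fb["causes"].items() if not causes]


print(render(fishbone))
print("Unexplored:", unexplored(fishbone))
```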
Logical inversion debugging
# Don't ask "why is it broken?"
# Ask "why would it work?"
# list_assumptions() and verify() are placeholders for system-specific checks.
def diagnose(system):
    assumptions = list_assumptions(system)
    for a in assumptions:
        if not verify(a):
            return f"Failing assumption: {a}"
    return "All assumptions hold; symptom misread"
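One way to make the inversion pattern runnable is to pair each assumption with a check function. The version below is a sketch: the `Assumption` alias, the `state` dict standing in for real probes, and the three example assumptions are all hypothetical.

```python
from typing import Callable

# An assumption is a human-readable claim plus a zero-argument check.
Assumption = tuple[str, Callable[[], bool]]


def diagnose(assumptions: list[Assumption]) -> str:
    """Return the first assumption whose check fails."""
    for name, check in assumptions:
        if not check():
            return f"Failing assumption: {name}"
    return "All assumptions hold; symptom misread"


# Hypothetical system state, standing in for real probes.
state = {"db_reachable": True, "index_exists": False, "disk_free_gb": 120}

assumptions: list[Assumption] = [
    ("database is reachable", lambda: state["db_reachable"]),
    ("orders.user_id is indexed", lambda: state["index_exists"]),
    ("disk has at least 10 GB free", lambda: state["disk_free_gb"] >= 10),
]

print(diagnose(assumptions))
```

Ordering the list from cheapest to most expensive check keeps the diagnosis fast when an early assumption fails.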
Bisection for regression
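# Git bisect: binary search for the commit that introduced the regression
git bisect start
git bisect bad HEAD        # current is broken
git bisect good v2.2.0     # known good
# git checks out a midpoint commit each round; test it, then mark it good or bad
git bisect run ./test.sh   # automate: exit 0 = good, 1-127 (except 125) = bad, 125 = skip
git bisect reset           # return to the original HEAD when done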
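The idea behind bisection is an ordinary binary search over history. The sketch below shows it on a linear list of commits; it is a simplification, since real git history is a graph and `git bisect` handles merges, and the names `first_bad`, `history`, and `is_bad` are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad every later commit is bad too (a single regression).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical linear history; the regression landed in "c6".
history = [f"c{i}" for i in range(10)]
tested: list[str] = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # record which commits had to be tested
    return int(commit[1:]) >= 6


print(first_bad(history, is_bad), "found after testing", tested)
```

The number of tests grows with the logarithm of the history length, which is why bisection stays practical even across thousands of commits.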
Causal graph (DAG)
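import networkx as nx

G = nx.DiGraph()
G.add_edge("missing index", "slow query")
G.add_edge("slow query", "request timeout")
G.add_edge("request timeout", "circuit breaker open")
G.add_edge("circuit breaker open", "503 errors")

# Walk back from the symptom
ancestors = nx.ancestors(G, "503 errors")
print(ancestors)  # every upstream cause, not only roots

# Roots are the ancestors with no incoming edge
roots = [n for n in ancestors if G.in_degree(n) == 0]
print(roots)  # ['missing index']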
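The example graph is a single chain, so it has one root. Real incidents usually branch. The dependency-free sketch below walks a graph with several contributing factors back from the symptom; the edge list is partly hypothetical (the data-volume and missing-timeout causes are invented for illustration), and `root_causes` is not a library function.

```python
from collections import defaultdict

# (cause, effect) pairs; some edges are hypothetical additions.
edges = [
    ("missing index", "slow query"),
    ("10x prod data volume", "slow query"),
    ("slow query", "request timeout"),
    ("no query timeout configured", "request timeout"),
    ("request timeout", "503 errors"),
]

# Map each effect to the set of its direct causes.
parents: dict[str, set[str]] = defaultdict(set)
for cause, effect in edges:
    parents[effect].add(cause)


def root_causes(symptom: str) -> set[str]:
    """Walk cause edges backwards; return upstream nodes with no parent."""
    roots: set[str] = set()
    seen: set[str] = set()
    stack = [symptom]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        direct = parents.get(node, set())
        if node != symptom and not direct:
            roots.add(node)
        stack.extend(direct)
    return roots


print(sorted(root_causes("503 errors")))
```

Each root it returns is a candidate for its own action item; fixing only one leaves the others in place.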
Five-whys with LLM assist (2026)
from anthropic import Anthropic

client = Anthropic()


def five_whys(symptom: str, context: str) -> str:
    prompt = f"""Apply 5 Whys RCA to this incident.
Symptom: {symptom}
Context: {context}
Output 5 levels of "Why?" with concrete hypotheses.
End with a falsifiable root cause and a test to verify it."""
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
Decision criteria
| Situation | Approach |
|---|---|
| Single failure event | 5 Whys |
| Complex multi-factor | Fishbone / Causal DAG |
| Regression in code | git bisect |
| Recurring incidents | Pareto analysis on RCA categories |
| Safety-critical (medical/aero) | Formal FMEA / FTA |
Default: 5 Whys + blameless postmortem template; escalate to fishbone if multi-root.
When to use: writing a postmortem after an incident; deep investigation of a recurring bug; enumerating multiple plausible causes of a symptom.
When not to use: trivial bugs (typo, off-by-one), where RCA is overkill; the time-critical mitigation phase, where you fix first and do RCA later.
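For recurring incidents, a Pareto analysis counts the root-cause categories across past postmortems and keeps the few that account for most of them. A minimal sketch follows, assuming each postmortem has already been tagged with one category; the `pareto` function, the 80% threshold default, and the sample labels are hypothetical.

```python
from collections import Counter


def pareto(categories: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the most frequent categories that together reach `threshold`
    of all incidents, as (category, count, cumulative share) rows."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows: list[tuple[str, int, float]] = []
    running = 0
    for cat, n in counts.most_common():
        running += n
        rows.append((cat, n, running / total))
        if running / total >= threshold:
            break
    return rows


# Hypothetical RCA category labels from ten past postmortems.
past = ["config"] * 5 + ["deploy"] * 3 + ["capacity", "dependency"]
for cat, n, cum in pareto(past):
    print(f"{cat:<10} {n:>2}  {cum:.0%}")
```

The categories it returns are where systemic fixes (validation, review gates, automation) are likely to pay off most.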
❌ Anti-patterns
Blame culture: asking "Who broke it?" makes people hide information, and the RCA fails.
Stop at first cause: "Engineer pushed bad code" is not a root; ask why the process allowed it.
No action items: insights get documented but nothing is followed up, so the same incident repeats.
Single-root assumption: complex systems fail through a web of contributing factors; multiple roots are normal.
🧪 Verification / Duplicates
Verified (Toyota TPS, Google SRE Book Ch.15, Etsy blameless postmortem culture).