| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-root-cause-analysis-rca | Root Cause Analysis (RCA) | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
Root Cause Analysis (RCA)
One-line summary
"Fix the root cause, not the symptom." The practice is traced to the Toyota Production System (Ohno, 1950s) and is now standard in SRE/DevOps incident postmortems (Google SRE Book, Etsy blameless postmortems).
Core
5 Whys technique
- Symptom: "Server crashed."
- Why? → "Out of memory."
- Why? → "Memory leak in v2.3."
- Why? → "Connection pool not closed on error."
- Why? → "Test framework didn't cover error path."
- Why? → "Coverage gate excluded `except` blocks." ← root.
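The chain above can be recorded as data so that the candidate root is explicit and reviewable. This is a minimal sketch; `WhyChain` and its methods are hypothetical names introduced here for illustration, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """A symptom plus the ordered answers to successive "Why?" questions."""
    symptom: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Each answer explains the previous level (or the symptom itself).
        self.answers.append(answer)
        return self

    @property
    def root(self) -> str:
        # The deepest answer recorded so far is the candidate root cause.
        return self.answers[-1] if self.answers else self.symptom

    def render(self) -> str:
        lines = [f"Symptom: {self.symptom}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why? -> {answer}")
        return "\n".join(lines)


chain = WhyChain("Server crashed.")
for answer in [
    "Out of memory.",
    "Memory leak in v2.3.",
    "Connection pool not closed on error.",
    "Test framework didn't cover error path.",
    "Coverage gate excluded except blocks.",
]:
    chain.ask_why(answer)

print(chain.render())
print("Candidate root:", chain.root)
```

The root is only a candidate: the chain records hypotheses, and each level still needs evidence before the postmortem treats it as established.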
Core principles
- Blameless: attack the system, not the person.
- Symptom ≠ root: stopping at the first plausible cause is not RCA.
- Multiple roots: the single root cause is a myth; expect a web of contributing factors.
Applications
- Production incident postmortems (SRE).
- Bug investigation (recurring crashes).
- Quality control (manufacturing defects).
- Process failures (project delays).
💻 Patterns
Postmortem template (markdown)
# Incident: [name] — YYYY-MM-DD
## Summary
One paragraph describing what happened.
## Impact
- Users affected: N
- Duration: HH:MM
- Revenue/SLO: ...
## Timeline (UTC)
- HH:MM — alert fired
- HH:MM — engineer paged
- HH:MM — mitigation
- HH:MM — resolved
## Root cause
The proximate cause was X. The underlying root cause was Y because Z.
## What went well
- Detection time was <5min thanks to alert A.
## What went poorly
- Runbook was outdated.
## Action items
- [ ] [P0] Fix Y in service S — owner @alice — due YYYY-MM-DD
- [ ] [P1] Update runbook — owner @bob
## Lessons learned
- ...
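Action items in the template above are only useful if someone owns them. The sketch below lints the `- [ ]` lines for a priority, an owner, and a due date. It assumes every item should carry all three, which is a stricter convention than the template itself shows; `parse_action_items`, `ActionItem`, and the sample date are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

PRIORITY = re.compile(r"\[(P\d)\]")
OWNER = re.compile(r"owner\s+(@\w+)")
DUE = re.compile(r"due\s+(\d{4}-\d{2}-\d{2})")


def _first(pattern: re.Pattern, line: str) -> Optional[str]:
    match = pattern.search(line)
    return match.group(1) if match else None


@dataclass
class ActionItem:
    text: str
    priority: Optional[str]
    owner: Optional[str]
    due: Optional[str]

    def problems(self) -> list[str]:
        # Names of the fields this item is missing.
        fields = {"priority": self.priority, "owner": self.owner, "due date": self.due}
        return [name for name, value in fields.items() if value is None]


def parse_action_items(markdown: str) -> list[ActionItem]:
    items = []
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("- [ ]"):  # open checkbox lines only
            items.append(ActionItem(line, _first(PRIORITY, line),
                                    _first(OWNER, line), _first(DUE, line)))
    return items


SEP = " \u2014 "  # the template separates fields with an em dash (U+2014)
postmortem = "\n".join([
    "## Action items",
    f"- [ ] [P0] Fix Y in service S{SEP}owner @alice{SEP}due 2026-06-01",
    f"- [ ] [P1] Update runbook{SEP}owner @bob",
])

items = parse_action_items(postmortem)
for item in items:
    print(item.priority, item.owner, item.problems())
```

The parser searches for each field independently, so it does not depend on the separator character or on field order.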
Fishbone (Ishikawa) categories
Problem: Database queries timing out
├── People: New engineer didn't index
├── Process: Migration review missed perf check
├── Tools: ORM hides slow query
├── Environment: Prod DB has 10x test data
├── Materials: Schema changed without index
└── Measurement: No p99 latency SLO
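Keeping the fishbone as a dictionary makes it easy to see which of the six categories nobody has examined yet. This is a sketch under that assumption; `render_fishbone` and `unexamined` are hypothetical helper names, and the sample data deliberately leaves two branches empty.

```python
ISHIKAWA_CATEGORIES = ["People", "Process", "Tools", "Environment", "Materials", "Measurement"]


def render_fishbone(problem: str, causes: dict[str, list[str]]) -> str:
    """Render the diagram as a text tree, one branch per category."""
    lines = [f"Problem: {problem}"]
    for i, category in enumerate(ISHIKAWA_CATEGORIES):
        branch = "└──" if i == len(ISHIKAWA_CATEGORIES) - 1 else "├──"
        found = "; ".join(causes.get(category, [])) or "(not examined yet)"
        lines.append(f"{branch} {category}: {found}")
    return "\n".join(lines)


def unexamined(causes: dict[str, list[str]]) -> list[str]:
    """Categories with no recorded cause, i.e. angles nobody has looked at."""
    return [c for c in ISHIKAWA_CATEGORIES if not causes.get(c)]


causes = {
    "People": ["New engineer didn't index"],
    "Process": ["Migration review missed perf check"],
    "Tools": ["ORM hides slow query"],
    "Environment": ["Prod DB has 10x test data"],
}

print(render_fishbone("Database queries timing out", causes))
print("Still to examine:", unexamined(causes))
```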
Logical inversion debugging
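# Don't ask "why is it broken?"
# Ask "why would it work?"
from typing import Callable

def diagnose(assumptions: dict[str, Callable[[], bool]]) -> str:
    """Check each named assumption and report the first one that fails."""
    for name, holds in assumptions.items():
        if not holds():
            return f"Failing assumption: {name}"
    return "All assumptions hold; the symptom was misread"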
Bisection for regression
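# git bisect: binary search for the first bad commit
git bisect start
git bisect bad HEAD        # current commit is broken
git bisect good v2.2.0     # last known-good tag
# Git checks out a midpoint commit; test it, mark it with
# `git bisect good` or `git bisect bad`, and repeat.
# To automate, the script must exit 0 for good and 1-127 (except 125) for bad.
git bisect run ./test.sh
git bisect reset           # return to the original HEAD when done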
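The search that bisection performs can be sketched in a few lines of Python. This is an illustration of the idea, not Git's implementation: it assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and that every commit after the first bad one is also bad. The commit names and the `is_bad` predicate are hypothetical.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> tuple[str, int]:
    """Return the first bad commit and the number of commits tested."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi], tests


commits = [f"c{i:02d}" for i in range(16)]  # hypothetical history, c00 is oldest
culprit, tests = first_bad(commits, lambda c: c >= "c11")
print(culprit, tests)  # c11 4
```

Sixteen commits need four tests here, which is why bisection stays practical even across long histories.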
Causal graph (DAG)
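import networkx as nx

# Each edge points from cause to effect.
G = nx.DiGraph()
G.add_edge("missing index", "slow query")
G.add_edge("slow query", "request timeout")
G.add_edge("request timeout", "circuit breaker open")
G.add_edge("circuit breaker open", "503 errors")

# Walk back from the symptom: every upstream node is a contributing cause.
causes = nx.ancestors(G, "503 errors")
# Root causes are the upstream nodes that have no cause of their own.
roots = {n for n in causes if G.in_degree(n) == 0}
print(roots)  # {'missing index'}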
Five-whys with LLM assist (2026)
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def five_whys(symptom: str, context: str) -> str:
    prompt = (
        "Apply 5 Whys RCA to this incident.\n"
        f"Symptom: {symptom}\n"
        f"Context: {context}\n"
        'Output 5 levels of "Why?" with concrete hypotheses.\n'
        "End with a falsifiable root cause and a test to verify it."
    )
    response = client.messages.create(
        model="claude-opus-4-7",  # model ID as written in this note; use one your account can access
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
Decision criteria
| Situation | Approach |
|---|---|
| Single failure event | 5 Whys |
| Complex multi-factor | Fishbone / Causal DAG |
| Regression in code | git bisect |
| Recurring incidents | Pareto analysis on RCA categories |
| Safety-critical (medical/aero) | Formal FMEA / FTA |
Default: 5 Whys + blameless postmortem template; escalate to fishbone if multi-root.
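For the recurring-incidents row, a Pareto pass over past RCA categories shows which few causes account for most incidents. The sketch below assumes each postmortem has been tagged with one root-cause category; the labels, the counts, and the 80% threshold are hypothetical.

```python
from collections import Counter


def pareto(categories: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the most frequent categories that together reach `threshold` of all incidents.

    Each row is (category, count, cumulative share).
    """
    counts = Counter(categories)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for category, n in counts.most_common():
        cumulative += n
        rows.append((category, n, cumulative / total))
        if cumulative / total >= threshold:
            break
    return rows


# Hypothetical root-cause labels from ten past postmortems.
incidents = (
    ["missing test coverage"] * 4
    + ["stale runbook"] * 3
    + ["config drift"] * 2
    + ["capacity"] * 1
)

rows = pareto(incidents)
for category, n, share in rows:
    print(f"{category:24} {n:2d}  {share:.0%}")
```

With this sample, three of the four categories cover 90% of incidents, so action items aimed at those three would address most recurrences.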
🔗 Graph
- Parent: Problem_Solving
- Variants: Postmortem · FMEA
- Applications: Debugging · Quality-Control
- Adjacent: Wicked-Problems · Causal-Inference
🤖 LLM usage
- When: writing a postmortem after an incident; investigating a recurring bug in depth; enumerating the plausible causes of a symptom.
- When not: trivial bugs (typo, off-by-one), where RCA is overkill; the time-critical mitigation phase, where you fix first and run the RCA later.
❌ Anti-patterns
- Blame culture: asking "Who broke it?" makes people hide information, and the RCA fails.
- Stopping at the first cause: "Engineer pushed bad code" is not a root; ask why the process allowed it.
- No action items: the insight is documented but nothing follows up, so the same incident repeats.
- Single-root assumption: complex systems fail through a web of contributing factors; multiple roots are normal.
🧪 Verification / duplicates
- Verified (Toyota TPS, Google SRE Book Ch.15, Etsy blameless postmortem culture).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RCA with 5 Whys, fishbone, postmortem template, LLM-assist |