2nd/10_Wiki/Topics/Computer_Science_and_Theory/Root-Cause-Analysis-RCA.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
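Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links straight at redirect targets, concept mapping (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections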

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-root-cause-analysis-rca
title: Root Cause Analysis (RCA)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: RCA, 5 Whys, Fishbone Analysis
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: debugging, problem-solving, methodology, sre
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Methodology, framework: SRE/DevOps

Root Cause Analysis (RCA)

In one line

"Fix the root, not the symptom." RCA originated in the Toyota Production System (Ohno, 1950s) and is now standard practice for incident postmortems in modern SRE/DevOps (Google SRE Book, Etsy blameless postmortems).

Core concepts

The 5 Whys technique

  1. Symptom: "Server crashed."
  2. Why? → "Out of memory."
  3. Why? → "Memory leak in v2.3."
  4. Why? → "Connection pool not closed on error."
  5. Why? → "Test framework didn't cover error path."
  6. Why? → "Coverage gate excluded except blocks." ← root.
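
The chain above can be kept as data so the symptom, the intermediate causes, and the root stay linked in a postmortem. This is a minimal sketch; `WhyChain` is a hypothetical helper written for this note, not part of any library, and the answers are the ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class WhyChain:
    """Hypothetical helper: a symptom plus an ordered list of 'why' answers."""
    symptom: str
    answers: list[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "WhyChain":
        # Each answer explains the line directly above it in the chain.
        self.answers.append(answer)
        return self

    @property
    def root(self) -> str:
        # The deepest answer reached so far; the symptom itself if none was given.
        return self.answers[-1] if self.answers else self.symptom

    def render(self) -> str:
        lines = [f"Symptom: {self.symptom}"]
        lines += [f"Why #{i}: {a}" for i, a in enumerate(self.answers, start=1)]
        return "\n".join(lines)


chain = (
    WhyChain("Server crashed.")
    .ask_why("Out of memory.")
    .ask_why("Memory leak in v2.3.")
    .ask_why("Connection pool not closed on error.")
    .ask_why("Test framework didn't cover error path.")
    .ask_why("Coverage gate excluded except blocks.")
)
print(chain.render())
print("Root:", chain.root)
```

The deepest answer is only a candidate root; it still needs a test that could falsify it.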

Core principles

  • Blameless: attack the system, not the person.
  • Symptom ≠ root: stopping at the first plausible cause is not RCA.
  • Multiple roots: the single root cause is a myth; real incidents come from a web of contributing factors.

Applications

  1. Production incident postmortems (SRE).
  2. Bug investigation (recurring crashes).
  3. Quality control (manufacturing defects).
  4. Process failures (project delays).

💻 Patterns

Postmortem template (markdown)

# Incident: [name] — YYYY-MM-DD

## Summary
1-paragraph what happened.

## Impact
- Users affected: N
- Duration: HH:MM
- Revenue/SLO: ...

## Timeline (UTC)
- HH:MM — alert fired
- HH:MM — engineer paged
- HH:MM — mitigation
- HH:MM — resolved

## Root cause
The proximate cause was X. The underlying root cause was Y because Z.

## What went well
- Detection time was <5min thanks to alert A.

## What went poorly
- Runbook was outdated.

## Action items
- [ ] [P0] Fix Y in service S — owner @alice — due YYYY-MM-DD
- [ ] [P1] Update runbook — owner @bob

## Lessons learned
- ...
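
The Impact and Timeline fields of the template lend themselves to a few derived numbers (time to page, time to mitigate, total duration). A small sketch, assuming all timestamps are `HH:MM` on the same UTC day; the times below are made up for illustration and `minutes_between` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Made-up timeline; the keys mirror the template's timeline entries.
timeline = {
    "alert fired": "14:02",
    "engineer paged": "14:05",
    "mitigation": "14:31",
    "resolved": "15:10",
}


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' on the same UTC day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        # This sketch does not handle incidents that cross midnight.
        raise ValueError("end is before start; incident may span midnight")
    return int(delta.total_seconds() // 60)


time_to_page = minutes_between(timeline["alert fired"], timeline["engineer paged"])
time_to_mitigate = minutes_between(timeline["alert fired"], timeline["mitigation"])
duration = minutes_between(timeline["alert fired"], timeline["resolved"])
print(time_to_page, time_to_mitigate, duration)  # 3 29 68
```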

Fishbone (Ishikawa) categories

Problem: Database queries timing out
├── People: New engineer didn't index
├── Process: Migration review missed perf check
├── Tools: ORM hides slow query
├── Environment: Prod DB has 10x test data
├── Materials: Schema changed without index
└── Measurement: No p99 latency SLO
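
A fishbone is useful partly because empty branches show where nobody has looked yet. The sketch below stores the diagram as a plain mapping and lists the categories with no cause recorded; `build_fishbone` and `unexplored` are hypothetical names, and the example deliberately omits two branches from the diagram above.

```python
# The six categories used in the diagram above.
CATEGORIES = ["People", "Process", "Tools", "Environment", "Materials", "Measurement"]


def build_fishbone(causes: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a category -> causes mapping, rejecting unknown categories."""
    unknown = set(causes) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    # Keep every category, even empty ones, so gaps in the analysis stay visible.
    return {c: list(causes.get(c, [])) for c in CATEGORIES}


def unexplored(fishbone: dict[str, list[str]]) -> list[str]:
    """Categories for which no candidate cause has been recorded yet."""
    return [c for c, items in fishbone.items() if not items]


fishbone = build_fishbone({
    "People": ["New engineer didn't index"],
    "Process": ["Migration review missed perf check"],
    "Tools": ["ORM hides slow query"],
    "Environment": ["Prod DB has 10x test data"],
})
print(unexplored(fishbone))  # ['Materials', 'Measurement']
```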

Logical inversion debugging

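# Don't ask "why is it broken?"
# Ask "why would it work?"
from typing import Callable

def diagnose(assumptions: dict[str, Callable[[], bool]]) -> str:
    """Check each assumption the system needs in order to work.

    `assumptions` maps a description to a zero-argument check that returns
    True when the assumption holds (a hypothetical interface for this sketch).
    """
    for description, holds in assumptions.items():
        if not holds():
            return f"Failing assumption: {description}"
    # Everything the system depends on checks out, so the symptom was misread.
    return "All assumptions hold: symptom misread"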

Bisection for regression

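# Git bisect: binary search for the commit that introduced the regression
git bisect start
git bisect bad HEAD          # current is broken
git bisect good v2.2.0       # known good
# git checks out a midpoint commit each round; test it and mark it good or bad,
# or automate the loop (exit 0 = good, 1-127 except 125 = bad, 125 = skip):
git bisect run ./test.sh
git bisect reset             # return to the original HEAD when finished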
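
What bisection does under the hood can be sketched as an ordinary binary search over an ordered commit list. This is an illustration of the idea, not git's implementation: it assumes a linear history that flips from good to bad exactly once, and `first_bad`, `is_good`, and the commit names are made up.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the first bad commit, given commits[0] is good and commits[-1] is bad."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression was introduced at or before mid
    return commits[hi]


commits = ["v2.2.0", "a1", "b2", "c3", "d4", "HEAD"]  # made-up history, oldest first
broken_from = commits.index("c3")  # pretend c3 introduced the regression
tested = []


def is_good(commit: str) -> bool:
    tested.append(commit)  # record which commits actually had to be tested
    return commits.index(commit) < broken_from


print(first_bad(commits, is_good))  # c3
print(tested)                       # ['b2', 'c3']
```

Only two of the four unknown commits are tested here; the number of tests grows with the logarithm of the range, which is why bisection stays practical across hundreds of commits.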

Causal graph (DAG)

import networkx as nx

G = nx.DiGraph()
G.add_edge("missing index", "slow query")
G.add_edge("slow query", "request timeout")
G.add_edge("request timeout", "circuit breaker open")
G.add_edge("circuit breaker open", "503 errors")

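# Walk back from the symptom: every upstream cause
ancestors = nx.ancestors(G, "503 errors")
# Root candidates are the upstream causes that have no cause of their own
roots = {n for n in ancestors if G.in_degree(n) == 0}
print(roots)  # {'missing index'}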

Five-whys with LLM assist (2026)

from anthropic import Anthropic
client = Anthropic()

def five_whys(symptom: str, context: str):
    prompt = f"""Apply 5 Whys RCA to this incident.
Symptom: {symptom}
Context: {context}

Output 5 levels of "Why?" with concrete hypotheses.
End with a falsifiable root cause and a test to verify it."""
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

Decision criteria

| Situation | Approach |
| --- | --- |
| Single failure event | 5 Whys |
| Complex multi-factor | Fishbone / Causal DAG |
| Regression in code | git bisect |
| Recurring incidents | Pareto analysis on RCA categories |
| Safety-critical (medical/aero) | Formal FMEA / FTA |

Default: 5 Whys + blameless postmortem template; escalate to fishbone if there are multiple roots.
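
For the recurring-incidents case, Pareto analysis amounts to counting root-cause categories across past postmortems and finding the few that account for most incidents. A minimal sketch; the category labels, the 80% threshold, and the `pareto` function are assumptions for illustration, not data from real incidents.

```python
from collections import Counter


def pareto(categories: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the most frequent categories that together reach `threshold` of all incidents.

    Each tuple is (category, count, cumulative share).
    """
    counts = Counter(categories)
    total = sum(counts.values())
    result, cumulative = [], 0
    for category, n in counts.most_common():
        cumulative += n
        result.append((category, n, cumulative / total))
        if cumulative / total >= threshold:
            break
    return result


# Made-up root-cause categories from ten past postmortems.
past = ["config", "config", "deploy", "config", "capacity",
        "deploy", "config", "dependency", "config", "deploy"]

for category, n, share in pareto(past):
    print(f"{category:<10} {n:>2}  {share:.0%}")
```

In this made-up sample, two categories cover 80% of incidents, so prevention work would start with configuration and deploy safety.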

🔗 Graph

🤖 LLM usage

When to use: writing the postmortem after an incident; deep investigation of a recurring bug; enumerating the multiple plausible causes of a symptom.

When not to use: trivial bugs (typo, off-by-one), where RCA is overkill; the time-critical mitigation phase, where the fix comes first and RCA later.

Anti-patterns

  • Blame culture: asking "Who broke it?" leads people to hide information, and the RCA fails.
  • Stopping at the first cause: "Engineer pushed bad code" is not a root cause; ask why the process allowed it.
  • No action items: the insight is documented but nothing follows up, so the same incident repeats.
  • Single-root assumption: a complex system fails through a web of contributing factors; multiple roots are normal.
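
The "no action items" anti-pattern can be caught mechanically before a postmortem is closed. The sketch below assumes action items follow the template's `- [ ] [P0] ... owner @name ... due YYYY-MM-DD` shape; the regular expressions, the `lint_action_items` name, and the rule that only P0 items require a due date are assumptions for illustration.

```python
import re

ITEM = re.compile(r"^- \[[ x]\] \[(P\d)\] (.+)$")   # "- [ ] [P0] text"
OWNER = re.compile(r"owner @\w+")
DUE = re.compile(r"due \d{4}-\d{2}-\d{2}")


def lint_action_items(postmortem: str) -> list[str]:
    """Return a list of problems; an empty list means the action items look complete."""
    problems = []
    items = [m for line in postmortem.splitlines() if (m := ITEM.match(line.strip()))]
    if not items:
        problems.append("no action items")
    for m in items:
        priority, text = m.groups()
        if not OWNER.search(text):
            problems.append(f"{priority} item has no owner: {text}")
        # Assumption: only P0 items must carry a due date.
        if priority == "P0" and not DUE.search(text):
            problems.append(f"P0 item has no due date: {text}")
    return problems


doc = """## Action items
- [ ] [P0] Fix Y in service S - owner @alice
- [ ] [P1] Update runbook - owner @bob
"""
print(lint_action_items(doc))
# ['P0 item has no due date: Fix Y in service S - owner @alice']
```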

🧪 Verification / duplicates

  • Verified (Toyota TPS, Google SRE Book Ch.15, Etsy blameless postmortem culture).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RCA with 5 Whys, fishbone, postmortem template, LLM-assist |