Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

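10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: ~2,400 broken-link fixes, direct redirect links, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections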

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

Root Cause Analysis (RCA)

In one line

"Fix the root, not the symptom." Originated in the Toyota Production System (Ohno, 1950s); now standard practice for incident postmortems in modern SRE/DevOps (Google SRE Book, Etsy blameless postmortems).

Core ideas

The 5 Whys technique

Symptom: "Server crashed."
Why? → "Out of memory."
Why? → "Memory leak in v2.3."
Why? → "Connection pool not closed on error."
Why? → "Test framework didn't cover error path."
Why? → "Coverage gate excluded except blocks." ← root.
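
The chain above can also be kept as plain data, which makes it easy to paste into a postmortem or review later. A minimal sketch; `format_whys` is a hypothetical helper, not part of any library:

```python
def format_whys(symptom: str, answers: list[str]) -> str:
    """Render a symptom and its chain of answers, marking the last one as root."""
    if not answers:
        raise ValueError("need at least one 'why' answer")
    lines = [f"Symptom: {symptom}"]
    for i, answer in enumerate(answers):
        marker = "  <- root" if i == len(answers) - 1 else ""
        lines.append(f"Why? -> {answer}{marker}")
    return "\n".join(lines)

# The answers are copied from the example above.
answers = [
    "Out of memory.",
    "Memory leak in v2.3.",
    "Connection pool not closed on error.",
    "Test framework didn't cover error path.",
    "Coverage gate excluded except blocks.",
]
report = format_whys("Server crashed.", answers)
print(report)
```

The count of five is a convention, not a rule: stop when the answer is something the team can change, which may take fewer or more levels.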

Core principles

Blameless: attack the system, not the person.
Symptom ≠ root: stopping at the first plausible cause is not RCA.
Multiple roots: the single root cause is a myth; real incidents come from a web of contributing factors.

Applications

Production incident postmortems (SRE).
Bug investigation (recurring crashes).
Quality control (manufacturing defects).
Process failures (project delays).

💻 Patterns

Postmortem template (markdown)

# Incident: [name] — YYYY-MM-DD

## Summary
1-paragraph what happened.

## Impact
- Users affected: N
- Duration: HH:MM
- Revenue/SLO: ...

## Timeline (UTC)
- HH:MM — alert fired
- HH:MM — engineer paged
- HH:MM — mitigation
- HH:MM — resolved

## Root cause
The proximate cause was X. The underlying root cause was Y because Z.

## What went well
- Detection time was <5min thanks to alert A.

## What went poorly
- Runbook was outdated.

## Action items
- [ ] [P0] Fix Y in service S — owner @alice — due YYYY-MM-DD
- [ ] [P1] Update runbook — owner @bob

## Lessons learned
- ...
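
Action items without an owner or a due date tend not to get done. One way to catch them is a small lint over the postmortem text. This is a sketch that assumes the line shape used in the template above (`- [ ] [P0] ... owner @name ... due YYYY-MM-DD`); `lint_action_items` and the sample date are made up for illustration:

```python
import re

# Assumed shape of an action-item line, taken from the template above.
ITEM = re.compile(r"^- \[(?P<done>[ x])\] \[(?P<prio>P\d)\] (?P<rest>.+)$")
DUE = re.compile(r"due \d{4}-\d{2}-\d{2}")

def lint_action_items(markdown: str) -> list[str]:
    """Return one warning per open action item that lacks an owner or a due date."""
    warnings = []
    for raw in markdown.splitlines():
        line = raw.strip()
        m = ITEM.match(line)
        if not m or m["done"] == "x":
            continue  # not an action item, or already completed
        if "owner @" not in m["rest"]:
            warnings.append(f"no owner: {line}")
        if not DUE.search(m["rest"]):
            warnings.append(f"no due date: {line}")
    return warnings

# Sample input; the date is invented for the example.
sample = """## Action items
- [ ] [P0] Fix Y in service S — owner @alice — due 2026-06-01
- [ ] [P1] Update runbook — owner @bob
"""
warnings = lint_action_items(sample)
for w in warnings:
    print(w)
```

On this sample, only the P1 item is flagged, because it has an owner but no due date.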

Fishbone (Ishikawa) categories

Problem: Database queries timing out
├── People: New engineer didn't index
├── Process: Migration review missed perf check
├── Tools: ORM hides slow query
├── Environment: Prod DB has 10x test data
├── Materials: Schema changed without index
└── Measurement: No p99 latency SLO
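
A fishbone is mainly useful as a checklist: it pushes the team to look at every category instead of stopping at the first one that yields a cause. A minimal sketch of that idea, using the six categories above; `unexplored` and `render` are hypothetical helpers:

```python
# The six category names come from the fishbone above; the rest is illustrative.
CATEGORIES = ("People", "Process", "Tools", "Environment", "Materials", "Measurement")

def unexplored(causes: dict[str, list[str]]) -> list[str]:
    """Return the categories for which no candidate cause has been recorded yet."""
    return [c for c in CATEGORIES if not causes.get(c)]

def render(problem: str, causes: dict[str, list[str]]) -> str:
    """Render the diagram as indented text, one line per recorded cause."""
    lines = [f"Problem: {problem}"]
    for category in CATEGORIES:
        for cause in causes.get(category, []):
            lines.append(f"  {category}: {cause}")
    return "\n".join(lines)

# A partially filled diagram, mid-investigation.
causes = {
    "Process": ["Migration review missed perf check"],
    "Tools": ["ORM hides slow query"],
    "Measurement": ["No p99 latency SLO"],
}
gaps = unexplored(causes)
print(render("Database queries timing out", causes))
print("Not yet examined:", gaps)
```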

Logical inversion debugging

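# Don't ask "why is it broken?"
# Ask "why would it work?"
from typing import Callable

def diagnose(assumptions: dict[str, Callable[[], bool]]) -> str:
    """Check each named assumption in order and report the first one that fails."""
    for name, check in assumptions.items():
        if not check():
            return f"Failing assumption: {name}"
    return "All assumptions hold; the symptom was probably misread"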

Bisection for regression

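# Git bisect: binary search for the first bad commit
git bisect start
git bisect bad HEAD          # current is broken
git bisect good v2.2.0       # known good
# git checks out a midpoint commit; test it and mark it with
# `git bisect good` / `git bisect bad`, or automate the loop:
git bisect run ./test.sh     # exit 0 = good, 1-127 (except 125) = bad, 125 = skip
git bisect reset             # return to the original HEAD when done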
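
The search that git bisect performs can be sketched in a few lines of Python. This is a simplified model, not git's implementation: it assumes a linear history ordered oldest first, a known-good first commit, a known-bad last commit, and a single flip from good to bad. The commit ids and `is_bad` are made up for illustration.

```python
from typing import Callable, Sequence

def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search an ordered history for the first bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and the history
    flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]

history = [f"c{i}" for i in range(8)]  # hypothetical commit ids, oldest first
tested: list[str] = []

def is_bad(commit: str) -> bool:
    """Stand-in for running the test suite; here the regression lands in c5."""
    tested.append(commit)
    return int(commit[1:]) >= 5

culprit = first_bad(history, is_bad)
print(culprit, tested)  # c5 ['c3', 'c5', 'c4']
```

Eight commits need only three test runs, which is why bisection stays practical even across long release ranges.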

Causal graph (DAG)

import networkx as nx

G = nx.DiGraph()
G.add_edge("missing index", "slow query")
G.add_edge("slow query", "request timeout")
G.add_edge("request timeout", "circuit breaker open")
G.add_edge("circuit breaker open", "503 errors")

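# Walk back from the symptom: every upstream cause
ancestors = nx.ancestors(G, "503 errors")
print(ancestors)

# Root causes are the ancestors with no incoming edges
roots = [n for n in ancestors if G.in_degree(n) == 0]
print(roots)  # ['missing index']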

Five-whys with LLM assist (2026)

from anthropic import Anthropic
client = Anthropic()

def five_whys(symptom: str, context: str):
    prompt = f"""Apply 5 Whys RCA to this incident.
Symptom: {symptom}
Context: {context}

Output 5 levels of "Why?" with concrete hypotheses.
End with a falsifiable root cause and a test to verify it."""
    return client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

Decision criteria

Situation	Approach
Single failure event	5 Whys
Complex multi-factor	Fishbone / Causal DAG
Regression in code	git bisect
Recurring incidents	Pareto analysis on RCA categories
Safety-critical (medical/aero)	Formal FMEA / FTA

Default: 5 Whys + blameless postmortem template; escalate to fishbone if there are multiple roots.
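
For recurring incidents, the Pareto step in the table amounts to counting root-cause categories across past postmortems and finding the few that account for most incidents. A minimal sketch; the category labels, the counts, and the 80% threshold are illustrative assumptions, not data from real incidents:

```python
from collections import Counter

def pareto(categories: list[str], threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return the most frequent categories that together reach `threshold` of all incidents.

    Each entry is (category, count, cumulative share).
    """
    counts = Counter(categories)
    total = sum(counts.values())
    vital_few = []
    running = 0
    for category, n in counts.most_common():
        running += n
        vital_few.append((category, n, running / total))
        if running / total >= threshold:
            break
    return vital_few

# Hypothetical log: one root-cause category per past postmortem.
rca_log = (
    ["missing test"] * 5
    + ["stale runbook"] * 3
    + ["config drift"]
    + ["capacity"]
)
vital_few = pareto(rca_log)
for category, n, share in vital_few:
    print(f"{category}: {n} incidents, cumulative {share:.0%}")
```

Here two of the four categories cover 80% of the incidents, so action items aimed at those two are likely to pay off most.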

🔗 Graph

Parent: Problem_Solving
Variants: Postmortem · FMEA
Applications: Debugging · Quality-Control
Adjacent: Wicked-Problems · Causal-Inference

🤖 LLM usage

When to use: writing a postmortem after an incident; deep investigation of a recurring bug; enumerating multiple plausible causes of a symptom.
When not to use: trivial bugs (typo, off-by-one), where RCA is overkill; the time-critical mitigation phase, where you fix first and do RCA later.

❌ Anti-patterns

Blame culture: asking "Who broke it?" makes people hide information, and the RCA fails.
Stopping at the first cause: "Engineer pushed bad code" is not a root cause; ask why the process allowed it.
No action items: the insight gets documented, but with no follow-up the same incident repeats.
Single-root assumption: complex systems fail through a web of contributing factors; multiple roots are normal.

🧪 Verification / duplicates

Verified (Toyota TPS, Google SRE Book Ch.15, Etsy blameless postmortem culture).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — RCA with 5 Whys, fishbone, postmortem template, LLM-assist
