id: wiki-2026-0508-lessons-learned
title: Lessons Learned
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Postmortem, Retrospective, After-Action Review
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: process, postmortem, learning, sre
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language n/a, framework n/a

Lessons Learned

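In one line

"An institutionalized transformer of regret into knowledge." The practice goes back to 1940s post-combat debriefs in the US Army, later formalized as the After-Action Review (AAR) in the 1970s; its modern form is the blameless postmortem of Google SRE (a team that dates to 2003). Each incident is paid-for data; throwing it away means paying twice.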

Core

Mechanism

  1. Incident / project ends.
  2. Timeline is reconstructed.
  3. Root causes (plural) are identified.
  4. Action items are assigned owners and scheduled.
  5. Doc is published, indexed, and re-read.
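
The five steps above can be sketched as a small record plus a readiness check. This is a minimal illustration, not a standard schema: the class names, field names, and `publish_blockers` are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    action: str
    owner: str | None = None   # step 4: every item needs an owner
    due: date | None = None    # step 4: and a scheduled date


@dataclass
class Postmortem:
    title: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # step 2: (time, event)
    root_causes: list[str] = field(default_factory=list)           # step 3: plural
    action_items: list[ActionItem] = field(default_factory=list)   # step 4
    published: bool = False                                        # step 5


def publish_blockers(pm: Postmortem) -> list[str]:
    """Return the reasons the doc is not ready to publish (empty list = ready)."""
    blockers = []
    if not pm.timeline:
        blockers.append("timeline not reconstructed")
    if not pm.root_causes:
        blockers.append("no root causes identified")
    if not pm.action_items:
        blockers.append("no action items")
    for item in pm.action_items:
        if item.owner is None or item.due is None:
            blockers.append(f"action item lacks owner or due date: {item.action}")
    return blockers
```

A review tool could refuse to set `published = True` while `publish_blockers` returns anything, which keeps step 4 from being skipped.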

Modern best practices (Google SRE)

  • Blameless: systems fail, not people.
  • Concrete action items with owners + due dates.
  • 5 Whys or causal analysis using STAMP (no single root cause).
  • Public within org (searchable).

Applications

  1. Production incidents (PagerDuty integration).
  2. Project retros (sprint, quarter).
  3. Security incidents (legal-friendly variant).

💻 Patterns

Postmortem template (Google SRE-style)

# Incident YYYY-MM-DD: <short title>

## Summary
1-2 sentences.

## Impact
- Users affected: ...
- Duration: ...
- Revenue: ...

## Root causes (plural)
1. ...
2. ...

## Trigger
What event started the incident.

## Resolution
What stopped it.

## Detection
How we knew (and how late).

## Timeline (UTC)
| Time | Event |
|---|---|
| 14:32 | Deploy started |
| 14:34 | Error rate spike |
| ... | ... |

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| ID | Action | Owner | Due | Type |
|---|---|---|---|---|
| AI-1 | Add canary deploy | @alice | 2026-05-20 | prevent |
| AI-2 | Improve alert | @bob | 2026-05-15 | detect |
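
Because the action items live in a pipe table, they can be checked mechanically. The sketch below parses the sample table from the template and lists items past their due date. The function names are hypothetical, and since the template has no status column, "overdue" here only means the due date has passed, not that the work is unfinished.

```python
from datetime import date

TABLE = """\
| ID | Action | Owner | Due | Type |
|---|---|---|---|---|
| AI-1 | Add canary deploy | @alice | 2026-05-20 | prevent |
| AI-2 | Improve alert | @bob | 2026-05-15 | detect |
"""


def parse_action_items(table: str) -> list[dict[str, str]]:
    """Parse a pipe table into one dict per data row, keyed by header name."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.splitlines()
        if line.strip().startswith("|")
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]


def overdue(items: list[dict[str, str]], today: date) -> list[str]:
    """IDs of items whose due date is strictly before `today`."""
    return [i["ID"] for i in items if date.fromisoformat(i["Due"]) < today]


items = parse_action_items(TABLE)
print(overdue(items, date(2026, 5, 18)))
```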

Action item tracker (GitHub-issue export)

gh issue create \
  --title "AI-1: Add canary deploy" \
  --label "postmortem,prevent" \
  --assignee alice \
  --milestone "Q2 2026"
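
To export many action items, the same command can be assembled per row. This is a sketch under assumptions: `gh_issue_args` is a hypothetical helper, the label scheme (`postmortem` plus the item type) mirrors the example above, and the leading `@` is stripped from the owner because the example passes a bare username to `--assignee`.

```python
import shlex


def gh_issue_args(item_id: str, action: str, owner: str, kind: str, milestone: str) -> list[str]:
    """Build the argv for one `gh issue create` call (nothing is executed here)."""
    return [
        "gh", "issue", "create",
        "--title", f"{item_id}: {action}",
        "--label", f"postmortem,{kind}",
        "--assignee", owner.lstrip("@"),
        "--milestone", milestone,
    ]


args = gh_issue_args("AI-1", "Add canary deploy", "@alice", "prevent", "Q2 2026")
print(shlex.join(args))  # shell-quoted form, for review before running
```

Passing the list to `subprocess.run(args, check=True)` would run it; keeping it as a list avoids shell-quoting mistakes in titles.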

5 Whys (causal chain)

Why did the site go down? Server OOM.
Why OOM? Cache grew unbounded.
Why unbounded? No eviction policy.
Why no policy? PR review missed it.
Why missed? Checklist had no cache item.
→ Root: missing checklist item (process), not the engineer.
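
A single 5 Whys chain yields one root, which sits awkwardly next to "root causes (plural)". One way to reconcile the two is to store the whys as a graph where an effect may have several causes and collect every leaf. The sketch below does that; `root_causes` is a hypothetical helper, and the second branch ("No memory alert") is invented purely for illustration.

```python
def root_causes(causes: dict[str, list[str]], symptom: str) -> list[str]:
    """Return every cause reachable from `symptom` that has no further cause."""
    seen: set[str] = set()
    roots: list[str] = []
    stack = [symptom]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = causes.get(node, [])
        if not parents:
            roots.append(node)           # nothing deeper recorded: treat as a root
        stack.extend(reversed(parents))  # reversed so causes are visited in listed order
    return roots


chain = {
    "Site down": ["Server OOM"],
    "Server OOM": ["Cache grew unbounded"],
    "Cache grew unbounded": ["No eviction policy"],
    "No eviction policy": ["PR review missed it"],
    "PR review missed it": ["Checklist had no cache item"],
}
print(root_causes(chain, "Site down"))

# Hypothetical second branch: the OOM also went unnoticed.
branched = {**chain, "Server OOM": ["Cache grew unbounded", "No memory alert"]}
print(root_causes(branched, "Site down"))
```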

Aggregation across postmortems (yearly review)

SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= '2025-01-01'
GROUP BY root_cause_category
ORDER BY total_dt DESC;
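
The query can be tried against an in-memory SQLite table. The schema and the four rows below are made up for the demonstration; only the column names come from the query itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postmortems ("
    "date TEXT, root_cause_category TEXT, downtime_minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO postmortems VALUES (?, ?, ?)",
    [
        ("2024-11-02", "deploy", 90),    # before the cutoff, so excluded
        ("2025-02-10", "deploy", 30),
        ("2025-06-01", "capacity", 45),
        ("2025-09-15", "deploy", 20),
    ],
)

rows = conn.execute(
    """
    SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
    FROM postmortems
    WHERE date >= '2025-01-01'
    GROUP BY root_cause_category
    ORDER BY total_dt DESC
    """
).fetchall()

for category, n, total_dt in rows:
    print(category, n, total_dt)
```

Storing dates as ISO 8601 text is what makes the string comparison in the `WHERE` clause sort correctly.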

Retro embedded in the sprint

- ✅ Did → keep
- 🔄 Did → improve
- ❌ Did → stop
- 💡 Didn't → start

Decision criteria

| Situation | Format |
|---|---|
| Production incident | Google SRE postmortem |
| Sprint end | Sailboat / Start-Stop-Continue |
| Project end | After-Action Review |

Defaults: blameless, concrete action items, public, re-read at 30 days.
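
The decision table and the 30-day re-read default are simple enough to encode. This is only a sketch: `FORMATS` and `plan_review` are hypothetical names, and the 30 days are counted from the publication date, which the note does not specify.

```python
from datetime import date, timedelta

# Situation -> review format, copied from the decision table.
FORMATS = {
    "production incident": "Google SRE postmortem",
    "sprint end": "Sailboat / Start-Stop-Continue",
    "project end": "After-Action Review",
}

REREAD_AFTER = timedelta(days=30)


def plan_review(situation: str, published: date) -> tuple[str, date]:
    """Return the format to use and the date the doc should be re-read."""
    return FORMATS[situation], published + REREAD_AFTER


fmt, reread_on = plan_review("production incident", date(2026, 5, 10))
print(fmt, reread_on.isoformat())
```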

🔗 Graph

🤖 LLM usage

When to use: post-incident, project end, security event. When not to use: trivial bug fix (use the commit message instead).

Anti-patterns

  • Blame culture: hides root causes.
  • No follow-through on action items: the same incident happens again.
  • Single root cause: misses the systems-thinking view.
  • Document then forget: an unread postmortem is worthless.

🧪 Verification / duplicates

  • Verified (Google SRE Book Ch. 15; Etsy "blameless postmortem" essay; US Army AAR doctrine).
  • Trust level A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Lessons Learned FULL with SRE template |