---
id: wiki-2026-0508-lessons-learned
title: Lessons Learned
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Postmortem, Retrospective, After-Action Review]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [process, postmortem, learning, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: n/a
  framework: n/a
---

# Lessons Learned

## One-line summary

> **"An institutionalized transformer of regret into knowledge."** Origin: US Army After-Action Reviews (AAR), with roots in 1940s post-combat debriefs. Modern form: the blameless postmortem as practiced by Google SRE (the team was founded in 2003). Every incident is data you have already paid for; throwing it away means paying twice.

## Core

### Mechanism

1. The incident or project ends.
2. The timeline is reconstructed.
3. Root causes (plural) are identified.
4. Action items are assigned an owner and a due date.
5. The document is published, indexed, and re-read.

### Modern best practices (Google SRE)

- **Blameless**: systems fail, not people.
- **Concrete action items** with owners and due dates.
- **5 Whys** or **CAST (Causal Analysis based on STAMP)**: look for contributing causes rather than a single root cause.
- **Public** within the organization (searchable).

### Applications

1. Production incidents (PagerDuty integration).
2. Project retros (sprint, quarter).
3. Security incidents (legal-friendly variant).

## 💻 Patterns

### Postmortem template (Google SRE-style)

```markdown
# Incident YYYY-MM-DD:

## Summary
1-2 sentences.

## Impact
- Users affected: ...
- Duration: ...
- Revenue: ...

## Root causes (plural)
1. ...
2. ...

## Trigger
What event started the incident.

## Resolution
What stopped it.

## Detection
How we knew (and how late).

## Timeline (UTC)
| Time | Event |
|---|---|
| 14:32 | Deploy started |
| 14:34 | Error rate spike |
| ... | ... |

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| ID | Action | Owner | Due | Type |
|---|---|---|---|---|
| AI-1 | Add canary deploy | @alice | 2026-05-20 | prevent |
| AI-2 | Improve alert | @bob | 2026-05-15 | detect |
```

### Action item tracker (GitHub-issue export)

```bash
# One issue per action item. The labels and the milestone must already exist in the repo.
# --body keeps gh from prompting interactively; the text here is a placeholder.
gh issue create \
  --title "AI-1: Add canary deploy" \
  --body "Follow-up action item from the incident postmortem." \
  --label "postmortem,prevent" \
  --assignee alice \
  --milestone "Q2 2026"
```

### 5 Whys (causal chain)

```
Why did the site go down?  Server OOM.
Why OOM?                   Cache grew unbounded.
Why unbounded?             No eviction policy.
Why no policy?             PR review missed it.
Why missed?                Checklist had no cache item.
→ Root: missing checklist item (process), not the engineer.
```

### Aggregation across postmortems (yearly review)

```sql
-- Rank root-cause categories by total downtime since the start of the review period.
SELECT root_cause_category,
       COUNT(*) AS n,
       SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= '2025-01-01'
GROUP BY root_cause_category
ORDER BY total_dt DESC;
```

### Embedded retro into sprint

```markdown
- ✅ Did → keep
- 🔄 Did → improve
- ❌ Did → stop
- 💡 Didn't → start
```

## Decision criteria

| Situation | Format |
|---|---|
| Production incident | Google SRE postmortem |
| Sprint end | Sailboat / Start-Stop-Continue |
| Project end | After-Action Review |

**Default**: blameless, concrete action items, public within the organization, re-read at 30 days.

## 🔗 Graph

- Parent: [[SRE]]
- Variant: [[After-Action-Review]]

## 🤖 LLM usage

**When**: post-incident, project end, security event.
**When not**: trivial bug fix (use the commit message instead).

## ❌ Anti-patterns

- **Blame culture**: hides root causes, because people stop reporting what actually happened.
- **No follow-through on action items**: the same incident happens again.
- **Single root cause**: misses the systems-level view of contributing factors.
- **Document then forget**: an unread postmortem is worthless.

## 🧪 Verification / duplicates

- Verified against the Google SRE Book Ch. 15, the Etsy "blameless postmortem" essay, and US Army AAR doctrine.
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Lessons Learned FULL with SRE template |
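
## 💻 Sketch: running the yearly aggregation

The yearly-review query from the patterns section can be tried end to end without a real incident database. The sketch below is a minimal illustration, not a reference implementation: it assumes a `postmortems` table with `date`, `root_cause_category`, and `downtime_minutes` columns (the column names come from the query, the types are assumptions), loads a few hypothetical rows into an in-memory SQLite database, and runs the same aggregation with the cutoff date passed as a parameter. The function name `yearly_review` and the sample data are invented for this example.

```python
import sqlite3

# Hypothetical sample rows: (date, root_cause_category, downtime_minutes).
SAMPLE_ROWS = [
    ("2025-02-03", "deploy", 42),
    ("2025-06-17", "capacity", 120),
    ("2025-09-01", "deploy", 15),
    ("2024-11-20", "config", 300),  # before the cutoff, so it is excluded
]

# Same aggregation as the yearly-review query, with the cutoff parameterized.
QUERY = """
SELECT root_cause_category,
       COUNT(*) AS n,
       SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= ?
GROUP BY root_cause_category
ORDER BY total_dt DESC;
"""


def yearly_review(rows, since):
    """Return (category, incident_count, total_downtime) rows, worst first."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: ISO-8601 date strings compare correctly as text.
        conn.execute(
            "CREATE TABLE postmortems ("
            "date TEXT, root_cause_category TEXT, downtime_minutes INTEGER)"
        )
        conn.executemany("INSERT INTO postmortems VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY, (since,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for category, n, total_dt in yearly_review(SAMPLE_ROWS, "2025-01-01"):
        print(f"{category}: {n} incident(s), {total_dt} min downtime")
```

Sorting by total downtime rather than incident count is a deliberate choice here: a category with one long outage can outrank a category with several short ones, which is usually the more useful signal when deciding where prevention work should go.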