2nd/10_Wiki/Topics/Backend/Lessons Learned.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
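Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections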

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-lessons-learned
title: Lessons Learned
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Postmortem, Retrospective, After-Action Review]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [process, postmortem, learning, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: n/a
  framework: n/a
---
# Lessons Learned
## In one line
> **"An institutionalized transformer of regret into knowledge."** The practice traces back to US Army post-combat debriefs in the 1940s, later formalized as the After-Action Review (AAR); its modern form is the blameless postmortem practiced by Google SRE (a team founded in 2003). Each incident is data you have already paid for; throwing it away means paying twice.
## Core
### Mechanism
1. Incident / project ends.
2. Timeline is reconstructed.
3. Root causes (plural) are identified.
4. Action items are owned + scheduled.
5. Doc is published, indexed, re-read.
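The five steps above can be sketched as a small record type that refuses to call a postmortem publishable until the timeline, root causes, and owned action items are present. This is a minimal illustration, not a prescribed schema: the class names, the `problems` and `overdue` helpers, and the "at least two root causes" rule are assumptions made here for the example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ActionItem:
    id: str
    action: str
    owner: str          # step 4: every item is owned...
    due: date           # ...and scheduled
    done: bool = False


@dataclass
class Postmortem:
    title: str
    timeline: list[tuple[str, str]]   # step 2: (time, event) pairs
    root_causes: list[str]            # step 3: plural by design
    action_items: list[ActionItem] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return reasons this record is not ready to publish (step 5)."""
        issues = []
        if not self.timeline:
            issues.append("timeline is empty")
        # Assumption for this sketch: "plural" means at least two causes.
        if len(self.root_causes) < 2:
            issues.append("fewer than two root causes listed")
        if not self.action_items:
            issues.append("no action items")
        for item in self.action_items:
            if not item.owner:
                issues.append(f"{item.id} has no owner")
        return issues

    def overdue(self, today: date) -> list[ActionItem]:
        """Open action items whose due date has already passed."""
        return [a for a in self.action_items if not a.done and a.due < today]
```

Calling `problems()` before publishing and `overdue(date.today())` at each review is one way to catch missing owners and stalled follow-through early.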
### Modern best practices (Google SRE)
- **Blameless**: systems fail, not people.
- **Concrete action items** with owners + due dates.
- **5 Whys** or **CAST (Causal Analysis based on STAMP)**; expect no single root cause.
- **Public** within the org (searchable).
### Applications
1. Production incidents (PagerDuty integration).
2. Project retros (sprint, quarter).
3. Security incidents (legal-friendly variant).
## 💻 Patterns
### Postmortem template (Google SRE-style)
```markdown
# Incident YYYY-MM-DD: <short title>
## Summary
1-2 sentences.
## Impact
- Users affected: ...
- Duration: ...
- Revenue: ...
## Root causes (plural)
1. ...
2. ...
## Trigger
What event started the incident.
## Resolution
What stopped it.
## Detection
How we knew (and how late).
## Timeline (UTC)
| Time | Event |
|---|---|
| 14:32 | Deploy started |
| 14:34 | Error rate spike |
| ... | ... |
## What went well
- ...
## What went poorly
- ...
## Where we got lucky
- ...
## Action items
| ID | Action | Owner | Due | Type |
|---|---|---|---|---|
| AI-1 | Add canary deploy | @alice | 2026-05-20 | prevent |
| AI-2 | Improve alert | @bob | 2026-05-15 | detect |
```
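To feed the template's action items into a tracker, the table under `## Action items` can be read back out of the Markdown. The sketch below assumes the exact heading text and pipe-table layout used in the template above; `parse_action_items` is a hypothetical helper name, and real documents may need a proper Markdown parser instead.

```python
def parse_action_items(markdown: str) -> list[dict[str, str]]:
    """Extract rows of the table under '## Action items' as dicts keyed by header."""
    lines = markdown.splitlines()
    try:
        start = next(
            i for i, text in enumerate(lines) if text.strip() == "## Action items"
        )
    except StopIteration:
        return []  # no such section in this document

    rows = []
    for line in lines[start + 1:]:
        line = line.strip()
        if line.startswith("#"):
            break  # the next heading ends the section
        if not line.startswith("|"):
            continue  # skip blank lines and prose
        rows.append([cell.strip() for cell in line.strip("|").split("|")])

    if not rows:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator row
    return [dict(zip(header, row)) for row in body]
```

Each returned dict carries the `ID`, `Action`, `Owner`, `Due`, and `Type` columns as strings, which is enough to build one tracker issue per row.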
### Action item tracker (GitHub-issue export)
```bash
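# Assumes the "postmortem" and "prevent" labels and the "Q2 2026" milestone
# already exist in the repository; --body keeps the command non-interactive.
gh issue create \
  --title "AI-1: Add canary deploy" \
  --body "Action item AI-1 from the incident postmortem." \
  --label "postmortem,prevent" \
  --assignee alice \
  --milestone "Q2 2026"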
```
### 5 Whys (causal chain)
```
Why did the site go down? Server OOM.
Why OOM? Cache grew unbounded.
Why unbounded? No eviction policy.
Why no policy? PR review missed it.
Why missed? Checklist had no cache item.
→ Root: missing checklist item (process), not the engineer.
```
### Aggregation across postmortems (yearly review)
```sql
SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= '2025-01-01'
GROUP BY root_cause_category
ORDER BY total_dt DESC;
```
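The same aggregation can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table layout (`date` stored as ISO-8601 text, plus `root_cause_category` and `downtime_minutes`) is inferred from the query above, and the sample rows are made up purely for illustration.

```python
import sqlite3

QUERY = """
SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= ?
GROUP BY root_cause_category
ORDER BY total_dt DESC;
"""


def downtime_by_category(conn: sqlite3.Connection, since: str) -> list[tuple]:
    """Run the yearly-review aggregation for postmortems dated on or after `since`."""
    return conn.execute(QUERY, (since,)).fetchall()


def demo() -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE postmortems ("
        "date TEXT, root_cause_category TEXT, downtime_minutes INTEGER)"
    )
    # Hypothetical sample data; ISO dates compare correctly as strings.
    conn.executemany(
        "INSERT INTO postmortems VALUES (?, ?, ?)",
        [
            ("2024-11-02", "deploy", 90),     # before the cutoff, excluded
            ("2025-02-10", "deploy", 30),
            ("2025-03-05", "capacity", 120),
            ("2025-06-18", "deploy", 45),
        ],
    )
    return downtime_by_category(conn, "2025-01-01")
```

Passing the cutoff as a bound parameter rather than a literal lets the same query serve any review window.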
### Embedded retro into sprint
```markdown
- ✅ Did, and it worked → keep
- 🔄 Did, with mixed results → improve
- ❌ Did, and it hurt → stop
- 💡 Didn't do → start
```
## Decision criteria
| Situation | Format |
|---|---|
| Production incident | Google SRE postmortem |
| Sprint end | Sailboat / Start-Stop-Continue |
| Project end | After-Action Review |
**Default**: blameless, concrete action items, public, re-read at 30 days.
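The table and the 30-day default can be captured in a few lines, for example to pre-fill a template or schedule a reminder. The situation keys and both function names below are hypothetical, chosen only for this sketch.

```python
from datetime import date, timedelta

# Mirrors the decision table; the dictionary keys are made-up identifiers.
FORMATS = {
    "production_incident": "Google SRE postmortem",
    "sprint_end": "Sailboat / Start-Stop-Continue",
    "project_end": "After-Action Review",
}


def pick_format(situation: str) -> str:
    """Look up the review format for a situation; raises KeyError if unknown."""
    return FORMATS[situation]


def reread_date(published: date, days: int = 30) -> date:
    """Date on which the published document should be re-read."""
    return published + timedelta(days=days)
```

For a document published on 2026-05-10, `reread_date` lands on 2026-06-09.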
## 🔗 Graph
- Parent: [[SRE]]
- Variant: [[After-Action-Review]]
## 🤖 LLM usage
**When**: post-incident, project end, security event.
**When not**: trivial bug fix (use commit message instead).
## ❌ Anti-patterns
- **Blame culture**: hides root causes.
- **No follow-through on action items**: the same incident happens again.
- **Single root cause**: misses the systems-thinking view.
- **Document then forget**: an unread postmortem is worthless.
## 🧪 Verification / duplicates
- Verified (Google SRE Book Ch. 15; Etsy "blameless postmortem" essay; US Army AAR doctrine).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Lessons Learned FULL with SRE template |