---
id: wiki-2026-0508-lessons-learned
title: Lessons Learned
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Postmortem, Retrospective, After-Action Review]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [process, postmortem, learning, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: n/a
  framework: n/a
---

# Lessons Learned
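
## In one line

> **"An institutionalized transformer of regret into knowledge."** The practice traces back to US Army post-combat debriefs of the 1940s, later formalized as the After-Action Review (AAR); its modern form is the blameless postmortem associated with Google SRE (an organization founded in 2003). Each incident is data you have already paid for; throwing it away means paying twice.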

## Core

### Mechanism
1. The incident or project ends.
2. The timeline is reconstructed.
3. Root causes (plural) are identified.
4. Action items are assigned owners and scheduled.
5. The document is published, indexed, and re-read.
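
Step 2 is the most mechanical part of the process, so it lends itself to a small sketch. The example below orders raw incident notes and derives time-to-detect and time-to-resolve. The `Event` type, the `kind` labels, and the sample data (including the 14:51 rollback) are illustrative assumptions, not part of any standard postmortem tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime  # UTC timestamp of the event
    kind: str     # hypothetical labels: "trigger", "detected", "resolved"
    text: str


def reconstruct(events: list[Event]) -> list[Event]:
    """Order raw notes chronologically; they usually arrive out of order."""
    return sorted(events, key=lambda e: e.at)


def first(events: list[Event], kind: str) -> datetime:
    """Timestamp of the earliest event carrying the given label."""
    return min(e.at for e in events if e.kind == kind)


def durations(events: list[Event]) -> dict[str, timedelta]:
    """Measure detection and resolution delay from the trigger."""
    trigger = first(events, "trigger")
    return {
        "time_to_detect": first(events, "detected") - trigger,
        "time_to_resolve": first(events, "resolved") - trigger,
    }


# Made-up sample notes, deliberately out of order.
raw = [
    Event(datetime(2026, 5, 8, 14, 51), "resolved", "Rollback complete"),
    Event(datetime(2026, 5, 8, 14, 32), "trigger", "Deploy started"),
    Event(datetime(2026, 5, 8, 14, 34), "detected", "Error rate spike"),
]
timeline = reconstruct(raw)
metrics = durations(timeline)
for e in timeline:
    print(e.at.strftime("%H:%M"), e.text)
```

Keeping the durations next to the timeline makes the "how late" part of detection a number rather than an impression.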

### Modern best practices (Google SRE)
- **Blameless**: systems fail, not people.
- **Concrete action items** with owners and due dates.
- **5 Whys** or **Causal Analysis using STAMP**: look for several contributing causes, not a single root cause.
- **Public** within the org (searchable).
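
The "owners and due dates" rule is easy to check mechanically before a postmortem is published. The following is a minimal sketch of such a check; the `ActionItem` fields mirror the action-item table used in this note, while the `lint` function and its messages are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActionItem:
    id: str
    action: str
    owner: Optional[str]  # None or "" means nobody owns it yet
    due: Optional[date]   # None means it was never scheduled


def lint(items: list[ActionItem]) -> list[str]:
    """Return one message per action item that lacks an owner or a due date."""
    problems = []
    for item in items:
        if not item.owner:
            problems.append(f"{item.id}: missing owner")
        if item.due is None:
            problems.append(f"{item.id}: missing due date")
    return problems


# Sample data: AI-2 has a due date but no owner.
items = [
    ActionItem("AI-1", "Add canary deploy", "@alice", date(2026, 5, 20)),
    ActionItem("AI-2", "Improve alert", None, date(2026, 5, 15)),
]
problems = lint(items)
for message in problems:
    print(message)
```

A check like this could run as a pre-publish step; an empty list means every item is owned and scheduled.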

### Applications
1. Production incidents (PagerDuty integration).
2. Project retros (sprint, quarter).
3. Security incidents (legal-friendly variant).

## 💻 Patterns

### Postmortem template (Google SRE-style)
```markdown
# Incident YYYY-MM-DD: <short title>

## Summary
1-2 sentences.

## Impact
- Users affected: ...
- Duration: ...
- Revenue: ...

## Root causes (plural)
1. ...
2. ...

## Trigger
What event started the incident.

## Resolution
What stopped it.

## Detection
How we knew (and how late).

## Timeline (UTC)
| Time | Event |
|---|---|
| 14:32 | Deploy started |
| 14:34 | Error rate spike |
| ... | ... |

## What went well
- ...

## What went poorly
- ...

## Where we got lucky
- ...

## Action items
| ID | Action | Owner | Due | Type |
|---|---|---|---|---|
| AI-1 | Add canary deploy | @alice | 2026-05-20 | prevent |
| AI-2 | Improve alert | @bob | 2026-05-15 | detect |
```

### Action item tracker (GitHub-issue export)
```bash
# The labels and the milestone must already exist in the repository.
gh issue create \
  --title "AI-1: Add canary deploy" \
  --body "Action item from the postmortem (type: prevent)." \
  --label "postmortem,prevent" \
  --assignee alice \
  --milestone "Q2 2026"
```
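
When a postmortem has more than a couple of action items, the export can be scripted. The sketch below builds one `gh issue create` command per item using the title, label, assignee, and milestone flags from the command above, plus `--body` so that `gh` should not need to prompt for one. It only prints the commands and executes nothing. The `ActionItem` type and the body text are assumptions made for illustration.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionItem:
    id: str
    action: str
    owner: str      # GitHub handle, with or without a leading "@"
    type: str       # "prevent" or "detect", as in the template table
    milestone: str  # assumed to exist already in the target repository


def gh_issue_args(item: ActionItem) -> list[str]:
    """Build the argv for one `gh issue create` call (nothing is executed)."""
    return [
        "gh", "issue", "create",
        "--title", f"{item.id}: {item.action}",
        "--body", f"Postmortem action item ({item.type})",
        "--label", f"postmortem,{item.type}",
        "--assignee", item.owner.lstrip("@"),
        "--milestone", item.milestone,
    ]


items = [
    ActionItem("AI-1", "Add canary deploy", "@alice", "prevent", "Q2 2026"),
    ActionItem("AI-2", "Improve alert", "@bob", "detect", "Q2 2026"),
]
# shlex.join quotes each argument so the line can be pasted into a shell.
commands = [shlex.join(gh_issue_args(item)) for item in items]
for line in commands:
    print(line)
```

Printing instead of executing keeps a human review step between the postmortem document and the issue tracker.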

### 5 Whys (causal chain)
```
Why did the site go down? Server OOM.
Why OOM? Cache grew unbounded.
Why unbounded? No eviction policy.
Why no policy? PR review missed it.
Why missed? Checklist had no cache item.
→ Root: missing checklist item (process), not the engineer.
```

### Aggregation across postmortems (yearly review)
```sql
-- Rank root-cause categories by total downtime since the start of 2025.
SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= '2025-01-01'
GROUP BY root_cause_category
ORDER BY total_dt DESC;
```
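
To try the aggregation without a real postmortem database, it can be run against an in-memory SQLite table. The schema and the sample rows below are assumptions chosen to match the column names in the query; dates are stored as ISO-8601 text so that the string comparison in the `WHERE` clause orders correctly.

```python
import sqlite3

QUERY = """
SELECT root_cause_category, COUNT(*) AS n, SUM(downtime_minutes) AS total_dt
FROM postmortems
WHERE date >= '2025-01-01'
GROUP BY root_cause_category
ORDER BY total_dt DESC;
"""

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per published postmortem.
conn.execute(
    "CREATE TABLE postmortems ("
    " id INTEGER PRIMARY KEY,"
    " date TEXT NOT NULL,"
    " root_cause_category TEXT NOT NULL,"
    " downtime_minutes INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO postmortems (date, root_cause_category, downtime_minutes)"
    " VALUES (?, ?, ?)",
    [
        ("2024-11-02", "deploy", 90),    # before the cutoff, excluded
        ("2025-02-10", "deploy", 30),
        ("2025-06-01", "capacity", 45),
        ("2025-09-15", "deploy", 20),
        ("2026-01-05", "config", 10),
    ],
)

rows = conn.execute(QUERY).fetchall()
for category, n, total_dt in rows:
    print(f"{category}: {n} incidents, {total_dt} min downtime")
conn.close()
```

Sorting by total downtime rather than by count keeps one long outage from being hidden behind many short ones.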

### Retro embedded in the sprint
```markdown
- ✅ Did → keep
- 🔄 Did → improve
- ❌ Did → stop
- 💡 Didn't → start
```

## Decision criteria
| Situation | Format |
|---|---|
| Production incident | Google SRE postmortem |
| Sprint end | Sailboat / Start-Stop-Continue |
| Project end | After-Action Review |

**Default**: blameless, concrete action items, public, re-read at 30 days.
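
The 30-day re-read is the part of the default that tends to be forgotten, so it helps to compute it rather than remember it. This is a minimal sketch; the function names are hypothetical, and the 30-day interval is the only value taken from the text.

```python
from datetime import date, timedelta

REREAD_AFTER = timedelta(days=30)  # interval from the default above


def reread_on(published: date) -> date:
    """Date on which a postmortem should be read again."""
    return published + REREAD_AFTER


def reread_due(published: date, today: date) -> bool:
    """True once the re-read date has been reached."""
    return today >= reread_on(published)


published = date(2026, 5, 10)
print(reread_on(published).isoformat())
```

A scheduled job could call `reread_due` for every published postmortem and notify the owners of those that return `True`.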

## 🔗 Graph
- Parent: [[SRE]]
- Variant: [[After-Action-Review]]

## 🤖 LLM usage
**When**: post-incident, project end, security event.
**When not**: trivial bug fix (use the commit message instead).

## ❌ Anti-patterns
- **Blame culture**: hides root causes.
- **No follow-through on action items**: the same incident happens again.
- **Single root cause**: misses the systems-thinking view.
- **Document then forget**: an unread postmortem is worthless.

## 🧪 Verification / duplicates
- Verified (Google SRE Book Ch. 15; Etsy "blameless postmortem" essay; US Army AAR doctrine).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full Lessons Learned entry with SRE template |