| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| productivity-postmortem | Postmortem: Blameless / Learning / Improvement | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Postmortem
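A learning document written after an incident. Blameless means the focus is on the system, not the person. It combines a timeline, root cause analysis (RCA), and improvement actions. Standard practice at Google SRE, Etsy, and Stripe.
📖 Key concepts
- Blameless: no blaming individuals. The system or process is what failed.
- Timeline: events in exact chronological order.
- Root cause: 5 whys / fishbone.
- Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).
💻 Code patterns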
Postmortem template
# Incident: API outage 2026-05-09
**Status:** Resolved
**Severity:** P1 (critical)
**Duration:** 14:23 - 15:47 UTC (1h 24min)
**Author:** Alice
**Reviewers:** Bob, Carol
## Summary
A SQL migration in a new deploy locked the orders table in the production DB for 30 minutes; all read queries returned 502.
## Impact
- Users affected: ~12,000 (35% of MAU)
- Failed requests: ~150,000
- Revenue impact: ~$5,000 estimated
- SLO breach: 99.9% → 98.7%
## Timeline (UTC)
- 14:00 — Migration PR merged (CI passed, no review concern)
- 14:23 — Production deploy starts
- 14:23 — Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
- 14:25 — DB CPU spike detected (PagerDuty alert)
- 14:26 — Oncall (Bob) acknowledges
- 14:30 — First user report (Twitter)
- 14:35 — Oncall identifies migration as cause
- 14:40 — Decision: cancel migration?
- 14:43 — Cancel attempt fails (lock too deep)
- 14:55 — Wait for completion
- 15:30 — Migration completes
- 15:35 — Service partially restored
- 15:47 — Full recovery confirmed
- 16:00 — Status page resolved
## Root cause
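`ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT` ran on a 50M-row table. Before Postgres 11, adding a column with a default always rewrote the whole table. We're on PG 16, which skips the rewrite only when the default is non-volatile; this column's default still forced every existing row to be rewritten, and the table held an ACCESS EXCLUSIVE lock for ~30 min.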
## Why? (5 whys)
1. Why: 502 errors → DB queries timed out
2. Why: orders table held ACCESS EXCLUSIVE lock
3. Why: ALTER TABLE held the lock
4. Why: the NOT NULL DEFAULT change triggered a full table rewrite
5. Why: the migration review checklist did not cover this case
## What went well
- Alert fired within 2 minutes
- Oncall responded quickly
- Communication on status page was clear
## What went wrong
- Migration not tested on prod-like dataset
- Migration review checklist incomplete
- No "kill switch" for in-progress migration
- 30-min wait — no fast rollback
## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | open |
| 3 | Use online schema-change tooling for big table changes (`pg_repack` on Postgres; `gh-ost` is MySQL-only) | @bob | 2026-05-20 | open |
| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |
## Related
- PR: #1234
- Slack: #incident-2026-05-09
- Status page: https://status.example.com/incidents/abc123
Severity definitions
P1 (Critical): Major user impact, money loss. Immediate response.
P2 (High): Significant impact, partial outage.
P3 (Medium): Minor user impact, workaround exists.
P4 (Low): Minimal impact, internal only.
Blameless emphasis
✅ "The migration script lacked safeguards"
❌ "Bob ran a bad migration"
✅ "The review process didn't catch this"
❌ "The reviewer should have caught this"
→ No blame; fix the system.
5 Whys
Problem: API outage.
Why 1: DB locked.
Why 2: ALTER TABLE.
Why 3: Big migration in working hours.
Why 4: No staging on prod-like data.
Why 5: No process for "big migration" classification.
Action: Process for big migration.
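The "big migration" classification that Why 5 calls for could be sketched as a small check. The 1M-row threshold comes from the action items in the template above; the `MigrationPlan` type, the `classify_migration` name, and the two labels are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

# Threshold taken from the template's action item ("table size > 1M = staged").
BIG_TABLE_ROWS = 1_000_000


@dataclass
class MigrationPlan:
    """Hypothetical description of a pending schema change."""
    table: str
    estimated_rows: int  # how the estimate is obtained is out of scope here


def classify_migration(plan: MigrationPlan) -> str:
    """Return 'staged' for tables above the threshold, 'standard' otherwise."""
    if plan.estimated_rows > BIG_TABLE_ROWS:
        return "staged"
    return "standard"


# The orders table from the example incident has roughly 50M rows.
print(classify_migration(MigrationPlan("orders", 50_000_000)))
```

A check like this only flags the migration; what "staged" means in practice (off-hours window, prod-like rehearsal, extra review) is for the team's process to define.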
Fishbone (large incidents)
Categories:
- People (training, communication)
- Process (review, runbook)
- Technology (tooling, infra)
- Environment (cloud provider)
→ List the contributing factors for each category.
Time to detect / mitigate / resolve
TTD (Time to Detect): incident start → alarm (target: < 5 min)
TTM (Time to Mitigate): alarm → impact reduced (target: < 30 min)
TTR (Time to Resolve): incident start → incident end (target: < 4 hours for P1)
→ Measure for every incident and track the trend.
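As a rough sketch, the three metrics can be computed from four timestamps. The values below are taken from the example timeline; treating "Service partially restored" (15:35) as the mitigation point is an assumption, and `incident_metrics` and `TARGETS` are illustrative names.

```python
from datetime import datetime, timedelta


def parse(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timestamp (UTC assumed)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def incident_metrics(started, detected, mitigated, resolved):
    """Compute TTD, TTM and TTR as defined above."""
    return {
        "ttd": detected - started,    # incident start -> alarm
        "ttm": mitigated - detected,  # alarm -> impact reduced
        "ttr": resolved - started,    # incident start -> incident end
    }


# Targets from the definitions above (the TTR target is the P1 one).
TARGETS = {
    "ttd": timedelta(minutes=5),
    "ttm": timedelta(minutes=30),
    "ttr": timedelta(hours=4),
}

metrics = incident_metrics(
    started=parse("2026-05-09 14:23"),    # production deploy starts
    detected=parse("2026-05-09 14:25"),   # PagerDuty alert
    mitigated=parse("2026-05-09 15:35"),  # assumption: partial restore counts as mitigation
    resolved=parse("2026-05-09 15:47"),   # full recovery confirmed
)

for name, value in metrics.items():
    status = "ok" if value < TARGETS[name] else "missed"
    print(f"{name.upper()}: {value} ({status})")
```

For the example incident, TTD (2 min) and TTR (84 min) are within target, while TTM (70 min) misses the 30-minute target, which matches the "no fast rollback" finding.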
Communication during incident
1. Status page update (1-2 min after detect)
2. Internal: #incidents Slack (real-time)
3. Customer email (when the impact is large)
4. Post-resolution update
5. Postmortem published (within 1 week)
Status page (statuspage.io / better stack)
Status: investigating / identified / monitoring / resolved
Updates every 30 min during incident
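The 30-minute cadence can be checked after the fact. The sketch below is illustrative only: the update times are made up, and `late_updates` is a hypothetical helper, not part of any status page product's API.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=30)  # cadence stated above


def late_updates(update_times, max_gap=MAX_GAP):
    """Return (previous, next) pairs where the gap between updates exceeded max_gap."""
    ordered = sorted(update_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


# Hypothetical status page update times for one incident.
updates = [
    datetime(2026, 5, 9, 14, 27),
    datetime(2026, 5, 9, 14, 40),
    datetime(2026, 5, 9, 15, 5),
    datetime(2026, 5, 9, 15, 47),
]

for previous, following in late_updates(updates):
    print(f"Gap too long: {previous:%H:%M} -> {following:%H:%M}")
```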
Action items (SMART)
Bad: "Improve migration safety"
Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."
→ Trackable and measurable.
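The bad/good contrast above can be turned into a simple lint. This is a minimal sketch: the `ActionItem` type, the `smart_problems` function, and the list of vague verbs are assumptions for illustration, and a real check would need the team's own rules.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Illustrative list of openers that usually signal an unmeasurable action item.
VAGUE_OPENERS = ("improve", "enhance", "look into", "consider")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def smart_problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is not yet trackable."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if item.description.lower().startswith(VAGUE_OPENERS):
        problems.append("vague wording")
    return problems


bad = ActionItem("Improve migration safety")
good = ActionItem(
    "Add prod-data migration test for tables > 1M rows",
    owner="Alice",
    due=date(2026, 5, 15),
)

print(smart_problems(bad))
print(smart_problems(good))
```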
Tracking action items
- Postmortem action items automatically become tickets
- Quarterly review: check the completion %
- Small teams: mention them at every standup
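A quarterly review mostly needs two numbers: the completion percentage and the overdue items. The sketch below reuses the owners and due dates from the template's action item table; the `done`/`open` statuses and the review date are hypothetical.

```python
from datetime import date

# Owners and due dates mirror the template; statuses are made up for the example.
items = [
    {"id": 1, "owner": "alice", "due": date(2026, 5, 12), "status": "done"},
    {"id": 2, "owner": "bob", "due": date(2026, 5, 15), "status": "open"},
    {"id": 3, "owner": "bob", "due": date(2026, 5, 20), "status": "open"},
    {"id": 4, "owner": "platform", "due": date(2026, 6, 1), "status": "open"},
    {"id": 5, "owner": "alice", "due": date(2026, 5, 12), "status": "done"},
]


def completion_pct(action_items):
    """Percentage of action items marked done (0.0 for an empty list)."""
    if not action_items:
        return 0.0
    done = sum(1 for item in action_items if item["status"] == "done")
    return 100.0 * done / len(action_items)


def overdue_ids(action_items, today):
    """IDs of items that are still open past their due date."""
    return [
        item["id"]
        for item in action_items
        if item["status"] != "done" and item["due"] < today
    ]


print(f"Completed: {completion_pct(items):.0f}%")
print("Overdue:", overdue_ids(items, today=date(2026, 5, 16)))
```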
Postmortem culture
1. Every P1/P2 = postmortem.
2. Written within 1 week.
3. Team review.
4. Track the actions.
5. Public (within the company) so other teams can learn.
Sharing across teams
- Internal blog post
- Engineering all-hands
- Wiki (Notion / Confluence)
- Tag with category (DB / network / deploy / etc.)
Repeated incidents
Does the same root cause keep recurring?
→ The action items are not working.
→ A stronger fix is needed.
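If each postmortem is tagged with a root-cause label, recurrence can be spotted with a simple count. The incident IDs and labels below are invented for illustration, and `repeated_root_causes` is a hypothetical helper.

```python
from collections import Counter


def repeated_root_causes(incidents, min_count=2):
    """Return root-cause labels that appear at least min_count times."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Hypothetical postmortem index; labels would come from the team's own tagging.
incidents = [
    {"id": "inc-001", "root_cause": "db-migration-lock"},
    {"id": "inc-002", "root_cause": "expired-certificate"},
    {"id": "inc-003", "root_cause": "db-migration-lock"},
]

print(repeated_root_causes(incidents))
```

Any label that shows up here is a candidate for the system-level deep dive rather than another incident-specific action item.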
Post-incident review (human learning)
30-60 min meeting:
1. Review the timeline
2. Decisions during the incident: were they right? Were they fast?
3. What went well / wrong
4. Create action items
5. Next steps
→ "How well did we respond?" is the key question.
Tools
- PagerDuty / Opsgenie / Grafana OnCall: alerting
- Statuspage.io / Better Stack — public status
- Jeli / Rootly / FireHydrant — incident management
- Internal: Slack channel + Google Doc + Linear ticket
Customer-facing write-up
# 2026-05-09 Service disruption
We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.
**What happened:** A database migration caused unexpected table locking, leading to API timeouts.
**What we're doing:**
- Reviewing migration practices
- Adding additional safeguards
- Implementing canary deploys
We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.
→ Accurate, apologetic, and focused on improvement.
Near-miss (almost incident)
It did not become a major outage, but it was a close call.
The same RCA process applies, at a smaller scale.
"We dodged a bullet — let's prevent next time."
Game day (preventive)
Deliberately simulated incidents:
- DB primary kill
- Random pod kill (Chaos Engineering)
- Network partition
→ Practice the response and uncover weaknesses.
Postmortem after-action
The postmortem itself is where action starts.
6 months later:
- Are the action items complete?
- Has the same root cause recurred?
- Is the new process working?
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| P1/P2 incident | Always write a postmortem |
| P3/P4 | Optional, when there is learning value |
| Near miss | Short postmortem |
| Customer impact | Add a customer message |
| Repeated incident | Deep dive at the system level |
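The table can be read as a small rule set. The sketch below encodes it directly; the function name, parameter names, and the returned labels are hypothetical, and `learning_value` stands in for the team's own judgment.

```python
from typing import List, Optional


def postmortem_plan(
    severity: Optional[str],
    *,
    near_miss: bool = False,
    customer_impact: bool = False,
    repeated: bool = False,
    learning_value: bool = False,
) -> List[str]:
    """Map an incident's attributes to the follow-up steps from the table."""
    plan = []
    if near_miss:
        plan.append("short postmortem")
    elif severity in ("P1", "P2"):
        plan.append("postmortem")  # always for P1/P2
    elif learning_value:
        plan.append("postmortem (optional)")  # P3/P4 only when worth it
    if customer_impact:
        plan.append("customer message")
    if repeated:
        plan.append("system-level deep dive")
    return plan


# The example outage: P1 with customer impact.
print(postmortem_plan("P1", customer_impact=True))
```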
❌ Anti-patterns
- Blame: blaming individuals creates a defensive culture.
- Vague action items: items like "improve" never get done.
- Untracked action items: the same incident happens again.
- No postmortem for small incidents: the learning is lost.
- Not sharing publicly: other teams hit the same issue.
- Inaccurate timeline: do not guess; cite logs and Slack.
- 5 whys stopping at the first why: only the surface gets fixed.
🤖 LLM usage hints
- Blameless culture + 5 whys.
- SMART action items.
- Cite logs in the timeline.
- Re-review after 6 months.