---
id: productivity-postmortem
title: Postmortem - Blameless / Learning / Improvement
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, postmortem, incident, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [postmortem, blameless, RCA, root cause analysis, incident review, 5 whys]
---

# Postmortem

> A learning document written after an incident. **Blameless = fix the system, not the person**. Timeline + RCA + improvement actions. Standard practice at Google SRE / Etsy / Stripe.

## 📖 Key concepts

- Blameless: no blame on individuals. The system or process is what failed.
- Timeline: exact times, in chronological order.
- Root cause: 5 whys / fishbone.
- Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).

## 💻 Code patterns

### Postmortem template

```markdown
# Incident: API outage 2026-05-09

**Status:** Resolved
**Severity:** P1 (critical)
**Duration:** 14:23 - 15:47 UTC (1h 24min)
**Author:** Alice
**Reviewers:** Bob, Carol

## Summary
A SQL migration shipped with a new deploy held a lock on the production `orders` table for roughly an hour (14:23 to 15:30), so every read query timed out and the API returned 502.

## Impact
- Users affected: ~12,000 (35% of MAU)
- Failed requests: ~150,000
- Revenue impact: ~$5,000 estimated
- SLO breach: 99.9% → 98.7%

## Timeline (UTC)
- 14:00 - Migration PR merged (CI passed, no review concern)
- 14:23 - Production deploy starts
- 14:23 - Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
- 14:25 - DB CPU spike detected (PagerDuty alert)
- 14:26 - Oncall (Bob) acknowledges
- 14:30 - First user report (Twitter)
- 14:35 - Oncall identifies migration as cause
- 14:40 - Decision: cancel migration?
- 14:43 - Cancel attempt fails (lock too deep)
- 14:55 - Wait for completion
- 15:30 - Migration completes
- 15:35 - Service partially restored
- 15:47 - Full recovery confirmed
- 16:00 - Status page resolved

## Root cause
`ALTER TABLE` adding a `NOT NULL DEFAULT` column to a 50M-row table always triggers a full table rewrite in Postgres < 11.
We're on PG 16, but the column's default had to be evaluated for every existing row (the case for a volatile default expression), so the table was still rewritten under an ACCESS EXCLUSIVE lock for ~67 min (14:23 to 15:30).

## Why? (5 whys)
1. Why: 502 errors → DB queries timed out
2. Why: An ACCESS EXCLUSIVE lock was held on the orders table
3. Why: ALTER TABLE held the lock
4. Why: The NOT NULL DEFAULT change triggered a full table rewrite
5. Why: The migration review checklist did not cover this case

## What went well
- Alert fired within 2 minutes
- Oncall responded quickly
- Communication on status page was clear

## What went wrong
- Migration not tested on prod-like dataset
- Migration review checklist incomplete
- No "kill switch" for in-progress migration
- ~30-min wait for completion, with no fast rollback

## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | open |
| 3 | Use `pg_repack` for big table changes (`gh-ost` is MySQL-only) | @bob | 2026-05-20 | open |
| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |

## Related
- PR: #1234
- Slack: #incident-2026-05-09
- Status page: https://status.example.com/incidents/abc123
```

### Severity definitions
```
P1 (Critical): Major user impact, money loss. Immediate response.
P2 (High): Significant impact, partial outage.
P3 (Medium): Minor user impact, workaround exists.
P4 (Low): Minimal impact, internal only.
```

### Blameless wording
```
✅ "The migration script lacked safeguards"
❌ "Bob ran a bad migration"

✅ "The review process didn't catch this"
❌ "The reviewer should have caught this"

→ No blame; fix the system.
```

### 5 Whys
```
Problem: API outage.
Why 1: DB locked.
Why 2: ALTER TABLE.
Why 3: Big migration in working hours.
Why 4: No staging on prod-like data.
Why 5: No process for "big migration" classification.

Action: Define a process for classifying and handling big migrations.
```

### Fishbone (large incidents)
```
Categories:
- People (training, communication)
- Process (review, runbook)
- Technology (tooling, infra)
- Environment (cloud provider)

→ List the contributing factors under each category.
```

### Time to detect / mitigate / resolve
```
TTD (Time to Detect): incident start → alarm (target: < 5 min)
TTM (Time to Mitigate): alarm → impact reduced (target: < 30 min)
TTR (Time to Resolve): until the incident is over (target: < 4 hours for P1)

→ Measure every incident and track the trend.
```

### Communication during incident
```
1. Status page update (1-2 min after detection)
2. Internal: #incidents Slack (real-time)
3. Customer email (when impact is large)
4. Post-resolution update
5. Postmortem published (within 1 week)
```

### Status page (statuspage.io / Better Stack)
```
Status: investigating / identified / monitoring / resolved
Updates every 30 min during incident
```

### Action items (SMART)
```
Bad: "Improve migration safety"
Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."

→ Trackable and measurable.
```

### Tracking action items
```
- Auto-create a ticket for each postmortem action item
- Quarterly review: check the completion percentage
- Small teams: mention open items at every standup
```

### Postmortem culture
```
1. Every P1/P2 gets a postmortem.
2. Written within 1 week.
3. Reviewed by the team.
4. Actions tracked.
5. Public (inside the company) so other teams can learn.
```

### Sharing across teams
```
- Internal blog post
- Engineering all-hands
- Wiki (Notion / Confluence)
- Tag with category (DB / network / deploy / etc.)
```

### Repeated incidents
```
Is the same root cause recurring?
→ The action items were not effective.
→ A stronger fix is needed.
```

### Post-incident review (human learning)
```
30-60 min meeting:
1. Review the timeline
2. Decisions during the incident: were they right? Fast enough?
3. What went well / wrong
4. Create action items
5. Next steps
```

→ "How well did we respond?" is the key question.
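
### Computing TTD / TTM / TTR (sketch)

The TTD / TTM / TTR definitions earlier in this section can be turned into numbers directly from timeline timestamps. The following is a minimal sketch, not a reference implementation: `IncidentTimes`, `response_metrics`, `missed_targets`, and the helper `at` are hypothetical names introduced here, and it assumes TTR is measured from incident start (the note only says "until the incident is over"). The sample timestamps come from the template timeline (deploy start 14:23, alert 14:25, partial restore 15:35, full recovery 15:47).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentTimes:
    """Key timestamps of one incident (hypothetical field names)."""
    started: datetime    # impact begins
    detected: datetime   # first alert fires
    mitigated: datetime  # impact first reduced
    resolved: datetime   # full recovery confirmed


def response_metrics(t: IncidentTimes) -> dict[str, timedelta]:
    """Return TTD, TTM and TTR as timedeltas.

    Assumption: TTR runs from incident start to full recovery, so it
    equals the incident duration. Change this if your team measures
    TTR from the alert instead.
    """
    if not (t.started <= t.detected <= t.mitigated <= t.resolved):
        raise ValueError("timestamps must be in chronological order")
    return {
        "TTD": t.detected - t.started,
        "TTM": t.mitigated - t.detected,
        "TTR": t.resolved - t.started,
    }


def missed_targets(metrics: dict[str, timedelta],
                   targets: dict[str, timedelta]) -> list[str]:
    """Names of metrics that did not stay strictly under their target."""
    return [name for name, value in metrics.items() if value >= targets[name]]


def at(hour: int, minute: int) -> datetime:
    """Build a UTC timestamp on the example incident date."""
    return datetime(2026, 5, 9, hour, minute, tzinfo=timezone.utc)


# Timestamps taken from the template timeline.
EXAMPLE = IncidentTimes(
    started=at(14, 23),
    detected=at(14, 25),
    mitigated=at(15, 35),
    resolved=at(15, 47),
)

# Targets from the TTD / TTM / TTR definitions (P1).
P1_TARGETS = {
    "TTD": timedelta(minutes=5),
    "TTM": timedelta(minutes=30),
    "TTR": timedelta(hours=4),
}

if __name__ == "__main__":
    metrics = response_metrics(EXAMPLE)
    for name, value in metrics.items():
        print(f"{name}: {value}")
    print("missed targets:", missed_targets(metrics, P1_TARGETS))
```

For the template incident this gives a TTD of 2 minutes, a TTM of 70 minutes, and a TTR of 84 minutes, so only the mitigation target is missed. Storing these three values per incident is enough to chart the trend over time.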
### Tools
```
- PagerDuty / Opsgenie / Grafana OnCall: alerting
- Statuspage.io / Better Stack: public status
- Jeli / Rootly / FireHydrant: incident management
- Internal: Slack channel + Google Doc + Linear ticket
```

### Customer-facing write-up
```markdown
# 2026-05-09 Service disruption

We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.

**What happened:**
A database migration caused unexpected table locking, leading to API timeouts.

**What we're doing:**
- Reviewing migration practices
- Adding additional safeguards
- Implementing canary deploys

We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.
```

→ Accurate + apology + improvement.

### Near-miss (almost incident)
```
It did not become a major outage, but it was a close call.
The same RCA process applies (at smaller scale).
"We dodged a bullet; let's prevent it next time."
```

### Game day (preventive)
```
Deliberate incident simulation:
- DB primary kill
- Random pod kill (Chaos Engineering)
- Network partition

→ Practice the response and uncover weaknesses.
```

### Postmortem after-action
```
The postmortem itself is where the actions start.

6 months later:
- Are the action items done?
- Has the same root cause recurred?
- Is the new process working?
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| P1/P2 incident | Always write a postmortem |
| P3/P4 | Optional, when there is learning value |
| Near miss | Short postmortem |
| Customer impact | + customer message |
| Repeated incident | Deep dive at the system level |

## ❌ Anti-patterns

- **Blame**: blaming individuals produces a defensive culture.
- **Vague action items**: items like "improve" never get done.
- **No action item tracking**: the same incident happens again.
- **No postmortem (small incidents)**: the learning is lost.
- **Not shared publicly**: other teams hit the same issue.
- **Inaccurate timeline**: don't guess; cite logs and Slack.
- **5 whys stops at the first why**: only a surface fix.

## 🤖 LLM usage hints

- Blameless culture + 5 whys.
- SMART action items.
- Cite logs in the timeline.
- Re-review after 6 months.

## 🔗 Related documents

- [[Productivity_Oncall_Playbook]]
- [[DevOps_Disaster_Recovery]]
- [[Productivity_Code_Review]]