[G1-Sync] Manual knowledge update
---
id: productivity-postmortem
title: Postmortem - Blameless / Learning / Improvement
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, postmortem, incident, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [postmortem, blameless, RCA, root cause analysis, incident review, 5 whys]
---

# Postmortem

> A learning document written after an incident. **Blameless = fix the system, not the person.** Timeline + RCA + improvement actions. Standard practice at Google SRE, Etsy, and Stripe.

## 📖 Core concepts
- Blameless: no finger-pointing. Assume the system or process failed, not the individual.
- Timeline: events in exact chronological order.
- Root cause: 5 whys / fishbone.
- Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).

## 💻 Code patterns

### Postmortem template
```markdown
# Incident: API outage 2026-05-09

**Status:** Resolved
**Severity:** P1 (critical)
**Duration:** 14:23 - 15:47 UTC (1h 24min)
**Author:** Alice
**Reviewers:** Bob, Carol

## Summary
A SQL migration in a new deploy locked the production `orders` table for about 67 minutes (14:23 to 15:30). Every query against the table timed out and the API returned 502s.

## Impact
- Users affected: ~12,000 (35% of MAU)
- Failed requests: ~150,000
- Revenue impact: ~$5,000 estimated
- SLO breach: 99.9% → 98.7%

## Timeline (UTC)
- 14:00 - Migration PR merged (CI passed, no review concern)
- 14:23 - Production deploy starts
- 14:23 - Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
- 14:25 - DB CPU spike detected (PagerDuty alert)
- 14:26 - Oncall (Bob) acknowledges
- 14:30 - First user report (Twitter)
- 14:35 - Oncall identifies migration as cause
- 14:40 - Decision: cancel migration?
- 14:43 - Cancel attempt fails (lock too deep)
- 14:55 - Wait for completion
- 15:30 - Migration completes
- 15:35 - Service partially restored
- 15:47 - Full recovery confirmed
- 16:00 - Status page resolved

## Root cause
`ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT` ran on a 50M-row table. Before Postgres 11 this always triggered a full table rewrite. We're on PG 16, where a constant default is a metadata-only change, but this column's default had to be evaluated for every existing row (as happens with a volatile default), so Postgres rewrote the table while holding an ACCESS EXCLUSIVE lock for about 67 minutes.

## Why? (5 whys)
1. Why: 502 errors → DB queries timed out
2. Why: the `orders` table was under an ACCESS EXCLUSIVE lock
3. Why: ALTER TABLE held the lock
4. Why: the NOT NULL DEFAULT change triggered a full table rewrite
5. Why: the migration review checklist did not cover this case

## What went well
- Alert fired within 2 minutes
- Oncall responded quickly
- Communication on status page was clear

## What went wrong
- Migration not tested on prod-like dataset
- Migration review checklist incomplete
- No "kill switch" for in-progress migration
- 35-minute wait for completion (14:55 to 15:30) with no fast rollback

## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (tables > 1M rows = staged migration) | @alice | 2026-05-12 | open |
| 2 | Set up migration runbook with kill switch | @bob | 2026-05-15 | open |
| 3 | Use `pg_repack` or a staged add-nullable / backfill / add-constraint migration for big table changes | @bob | 2026-05-20 | open |
| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |

## Related
- PR: #1234
- Slack: #incident-2026-05-09
- Status page: https://status.example.com/incidents/abc123
```

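The template's action items call for staging schema changes on large tables. A minimal sketch of how such a staged plan might be generated, assuming Postgres 12 or later and an integer primary key named `id`. The helper name `staged_not_null_plan`, the table and column names, and the batch size are hypothetical and not taken from the incident above:

```python
from typing import Iterator


def staged_not_null_plan(table: str, column: str, col_type: str,
                         default_sql: str, max_id: int,
                         batch_size: int = 10_000) -> Iterator[str]:
    """Yield SQL statements that add a NOT NULL column in short-lock steps.

    Hypothetical helper: assumes an integer primary key named `id`.
    """
    # Step 1: a nullable column with no default is a metadata-only change.
    yield f"ALTER TABLE {table} ADD COLUMN {column} {col_type};"
    # Step 2: setting a default only affects rows written from now on.
    yield f"ALTER TABLE {table} ALTER COLUMN {column} SET DEFAULT {default_sql};"
    # Step 3: backfill existing rows in small batches (short row locks only).
    for start in range(1, max_id + 1, batch_size):
        end = min(start + batch_size - 1, max_id)
        yield (f"UPDATE {table} SET {column} = {default_sql} "
               f"WHERE id BETWEEN {start} AND {end} AND {column} IS NULL;")
    # Step 4: prove there are no NULLs without a long exclusive lock, then
    # promote to NOT NULL (Postgres 12+ can reuse the validated constraint).
    constraint = f"{table}_{column}_not_null"
    yield (f"ALTER TABLE {table} ADD CONSTRAINT {constraint} "
           f"CHECK ({column} IS NOT NULL) NOT VALID;")
    yield f"ALTER TABLE {table} VALIDATE CONSTRAINT {constraint};"
    yield f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;"
    yield f"ALTER TABLE {table} DROP CONSTRAINT {constraint};"


if __name__ == "__main__":
    # Hypothetical column on the `orders` table from the template.
    for statement in staged_not_null_plan("orders", "channel", "text",
                                          "'web'", max_id=25, batch_size=10):
        print(statement)
```

Each statement holds its lock only briefly, so a stuck step can be cancelled between batches, which is the "kill switch" property the template found missing.
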
### Severity definitions
```
P1 (Critical): Major user impact, money loss. Immediate response.
P2 (High): Significant impact, partial outage.
P3 (Medium): Minor user impact, workaround exists.
P4 (Low): Minimal impact, internal only.
```

### Blameless wording
```
✅ "The migration script lacked safeguards"
❌ "Bob ran a bad migration"

✅ "The review process didn't catch this"
❌ "The reviewer should have caught this"

→ Don't assign blame; fix the system.
```

### 5 Whys
```
Problem: API outage.
Why 1: DB locked.
Why 2: ALTER TABLE.
Why 3: Big migration in working hours.
Why 4: No staging on prod-like data.
Why 5: No process for "big migration" classification.

Action: Process for big migration.
```

### Fishbone (large incidents)
```
Categories:
- People (training, communication)
- Process (review, runbook)
- Technology (tooling, infra)
- Environment (cloud provider)

→ List the contributing factors under each category.
```

### Time to detect / mitigate / resolve
```
TTD (Time to Detect): incident start → alert fires (target: < 5 min)
TTM (Time to Mitigate): alert → impact reduced (target: < 30 min)
TTR (Time to Resolve): until the incident is fully over (target: < 4 hours for P1)

→ Measure on every incident and track the trend.
```

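As a rough illustration, the three metrics can be computed from the template's timeline. This sketch assumes all timestamps fall on the same UTC day, treats "Service partially restored" (15:35) as the mitigation point, and measures TTR from incident start. The function names and the `TARGETS_MIN` table are hypothetical:

```python
from datetime import datetime

# Targets in minutes, taken from the definitions above (P1 for TTR).
TARGETS_MIN = {"ttd_min": 5, "ttm_min": 30, "ttr_min": 240}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(started: str, detected: str,
                     mitigated: str, resolved: str) -> dict[str, int]:
    return {
        "ttd_min": minutes_between(started, detected),    # start -> alert
        "ttm_min": minutes_between(detected, mitigated),  # alert -> impact reduced
        "ttr_min": minutes_between(started, resolved),    # start -> fully over
    }


def missed_targets(metrics: dict[str, int]) -> list[str]:
    """Names of metrics that did not beat their target."""
    return [name for name, value in metrics.items()
            if value >= TARGETS_MIN[name]]


metrics = incident_metrics("14:23", "14:25", "15:35", "15:47")
print(metrics)                  # {'ttd_min': 2, 'ttm_min': 70, 'ttr_min': 84}
print(missed_targets(metrics))  # ['ttm_min']
```

For the example incident, detection and resolution are within target, while mitigation (70 minutes) is well over the 30-minute goal, which matches the "no fast rollback" finding.
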
### Communication during incident
```
1. Status page update (1-2 min after detection)
2. Internal: #incidents Slack (real-time)
3. Customer email (when impact is large)
4. Post-resolution update
5. Postmortem published (within 1 week)
```

### Status page (Statuspage.io / Better Stack)
```
Status: investigating / identified / monitoring / resolved
Updates every 30 min during incident
```

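The four states and the 30-minute update cadence could be enforced with a small state machine. This is only a sketch: the rule that states move forward (with a fallback from monitoring to investigating when a fix does not hold) is an assumption, and none of these names come from the Statuspage.io or Better Stack APIs:

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


ORDER = list(Status)  # declaration order is the normal progression


def can_transition(current: Status, new: Status) -> bool:
    """Allow forward moves, plus monitoring -> investigating as a fallback."""
    if current is Status.RESOLVED:
        return False  # a resolved incident stays resolved; open a new one
    if current is Status.MONITORING and new is Status.INVESTIGATING:
        return True   # assumed fallback when the fix does not hold
    return ORDER.index(new) > ORDER.index(current)


def update_overdue(last_update: datetime, now: datetime,
                   interval: timedelta = timedelta(minutes=30)) -> bool:
    """True when more than `interval` has passed since the last public update."""
    return now - last_update > interval


if __name__ == "__main__":
    print(can_transition(Status.INVESTIGATING, Status.IDENTIFIED))
    print(update_overdue(datetime(2026, 5, 9, 14, 30),
                         datetime(2026, 5, 9, 15, 5)))
```
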
### Action items (SMART)
```
Bad: "Improve migration safety"
Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."

→ Trackable and measurable.
```

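A crude lint for the bad/good contrast above might look like the following. It is a heuristic sketch, not a real SMART checker: the `ActionItem` type, the list of vague prefixes, and the "contains a number" test for measurability are all assumptions made for illustration:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical list of openers that usually signal a vague action item.
VAGUE_PREFIXES = ("improve", "enhance", "look into")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def smart_problems(item: ActionItem) -> list[str]:
    """Return the reasons an action item is not yet trackable."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    if item.description.lower().startswith(VAGUE_PREFIXES):
        problems.append("vague verb")
    # Rough proxy for "measurable": the text names at least one number.
    if not any(ch.isdigit() for ch in item.description):
        problems.append("no measurable threshold")
    return problems


bad = ActionItem("Improve migration safety")
good = ActionItem("Add prod-data migration test for tables > 1M rows",
                  owner="Alice", due=date(2026, 5, 15))
print(smart_problems(bad))
print(smart_problems(good))
```
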
### Action item tracking
```
- Auto-create a ticket for each postmortem action item
- Quarterly review: check the completion rate
- Small teams: mention open items at every standup
```

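The quarterly completion check can be as simple as the sketch below. The `TrackedItem` type and the done/open states are hypothetical; in practice the data would come from whatever ticket system the team uses:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedItem:
    title: str
    due: date
    done: bool = False


def completion_rate(items: list[TrackedItem]) -> float:
    """Fraction of action items marked done (0.0 when there are none)."""
    if not items:
        return 0.0
    return sum(item.done for item in items) / len(items)


def overdue(items: list[TrackedItem], today: date) -> list[str]:
    """Titles of open items whose due date has passed."""
    return [item.title for item in items
            if not item.done and item.due < today]


# Items modeled on the template's action table; the done flags are made up.
items = [
    TrackedItem("Migration safety checklist", date(2026, 5, 12), done=True),
    TrackedItem("Migration runbook with kill switch", date(2026, 5, 15)),
    TrackedItem("pg_repack for big table changes", date(2026, 5, 20)),
    TrackedItem("Canary deploy for migrations", date(2026, 6, 1)),
    TrackedItem("DB onboarding doc", date(2026, 5, 12), done=True),
]
print(completion_rate(items))               # 0.4
print(overdue(items, date(2026, 5, 18)))    # ['Migration runbook with kill switch']
```
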
### Postmortem culture
```
1. Every P1/P2 gets a postmortem.
2. Written within 1 week.
3. Reviewed by the team.
4. Actions tracked.
5. Public inside the company so other teams can learn.
```

### Sharing across teams
```
- Internal blog post
- Engineering all-hands
- Wiki (Notion / Confluence)
- Tag with category (DB / network / deploy / etc.)
```

### Repeated incidents
```
Is the same root cause recurring?
→ The action items were not effective.
→ A stronger fix is needed.
```

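If postmortems are tagged with a root-cause category, recurrence can be spotted mechanically. A minimal sketch, assuming each postmortem is a dict with a hypothetical `root_cause_category` key and that two occurrences count as a repeat:

```python
from collections import Counter


def repeated_root_causes(postmortems: list[dict],
                         threshold: int = 2) -> dict[str, int]:
    """Root-cause categories seen at least `threshold` times."""
    counts = Counter(pm["root_cause_category"] for pm in postmortems)
    return {cat: n for cat, n in counts.items() if n >= threshold}


# Made-up history for illustration only.
history = [
    {"id": "2026-02-11", "root_cause_category": "db-migration"},
    {"id": "2026-03-30", "root_cause_category": "network"},
    {"id": "2026-05-09", "root_cause_category": "db-migration"},
]
print(repeated_root_causes(history))  # {'db-migration': 2}
```

A category that shows up here is a signal that the earlier action items did not address the underlying cause.
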
### Post-incident review (human learning)
```
30-60 min meeting:
1. Review the timeline
2. Decisions during the incident: were they right? Fast enough?
3. What went well / wrong
4. Create action items
5. Next steps
```

→ "How well did we respond?" is the key question.

### Tools
```
- PagerDuty / Opsgenie / Grafana OnCall: alerting
- Statuspage.io / Better Stack: public status
- Jeli / Rootly / FireHydrant: incident management
- Internal: Slack channel + Google Doc + Linear ticket
```

### Customer-facing post
```markdown
# 2026-05-09 Service disruption

We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.

**What happened:** A database migration caused unexpected table locking, leading to API timeouts.

**What we're doing:**
- Reviewing migration practices
- Adding additional safeguards
- Implementing canary deploys

We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.
```

→ Accurate + apology + improvements.

### Near-miss (almost incident)
```
No major outage happened, but it was a close call.
The same RCA process applies, at a smaller scale.

"We dodged a bullet; let's prevent the next one."
```

### Game day (preventive)
```
Deliberate incident simulation:
- Kill the DB primary
- Random pod kill (chaos engineering)
- Network partition

→ Practice the response and uncover weaknesses.
```

### Postmortem after-action
```
The postmortem itself is only the start of the action.
6 months later:
- Are the action items done?
- Has the same root cause recurred?
- Are the new processes working?
```

## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| P1/P2 incident | Always write a postmortem |
| P3/P4 | Optional, when there is learning value |
| Near miss | Short postmortem |
| Customer impact | Add a customer message |
| Repeated incident | Deep dive at the system level |

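The decision table can be read as a small rule function. This sketch assumes severity takes precedence over the near-miss flag and that the customer-impact and repeat rows stack on top of the base recommendation; the function name and return strings are hypothetical:

```python
def postmortem_recommendation(severity: str, near_miss: bool = False,
                              customer_impact: bool = False,
                              repeated: bool = False) -> list[str]:
    """Map an incident's attributes to the recommendations in the table."""
    recs = []
    if severity in ("P1", "P2"):
        recs.append("full postmortem")
    elif near_miss:
        recs.append("short postmortem")
    else:
        recs.append("optional postmortem if there is learning value")
    if customer_impact:
        recs.append("customer message")
    if repeated:
        recs.append("system-level deep dive")
    return recs


print(postmortem_recommendation("P1", customer_impact=True))
print(postmortem_recommendation("P3", near_miss=True, repeated=True))
```
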
## ❌ Anti-patterns
- **Blame**: blaming individuals creates a defensive culture.
- **Vague action items**: items like "improve X" never get done.
- **No action item tracking**: the same incident happens again.
- **No postmortem (small incidents)**: the learning is lost.
- **Not sharing publicly**: other teams hit the same issue.
- **Inaccurate timeline**: don't guess; cite logs and Slack messages.
- **5 whys stopping at the first why**: only the surface gets fixed.

## 🤖 LLM usage hints
- Blameless culture + 5 whys.
- SMART action items.
- Cite logs in the timeline.
- Re-review after 6 months.

## 🔗 Related documents
- [[Productivity_Oncall_Playbook]]
- [[DevOps_Disaster_Recovery]]
- [[Productivity_Code_Review]]