[G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,296 @@
+---
+id: productivity-postmortem
+title: Postmortem - Blameless / Learning / Improvement
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [productivity, postmortem, incident, vibe-coding]
+tech_stack: { language: "Process", applicable_to: ["Engineering"] }
+applied_in: []
+aliases: [postmortem, blameless, RCA, root cause analysis, incident review, 5 whys]
+---
+
+# Postmortem
+
+> A learning document written after an incident. **Blameless = fix the system, not the person**. Timeline + RCA + improvement actions. Standard practice at Google SRE / Etsy / Stripe.
+
+## 📖 Core concepts
+- Blameless: no blaming of individuals. The system / process is what failed.
+- Timeline: exact times, in chronological order.
+- Root cause: 5 whys / fishbone.
+- Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).
+
+## 💻 Code patterns
+
+### Postmortem template
+```markdown
+# Incident: API outage 2026-05-09
+
+**Status:** Resolved
+**Severity:** P1 (critical)
+**Duration:** 14:23 - 15:47 UTC (1h 24min)
+**Author:** Alice
+**Reviewers:** Bob, Carol
+
+## Summary
+A SQL migration in the new deploy locked the production DB's orders table for about 67 minutes (14:23 to 15:30). All read queries on the table timed out and the API returned 502.
+
+## Impact
+- Users affected: ~12,000 (35% of MAU)
+- Failed requests: ~150,000
+- Revenue impact: ~$5,000 estimated
+- SLO breach: 99.9% → 98.7%
+
+## Timeline (UTC)
+- 14:00 — Migration PR merged (CI passed, no review concern)
+- 14:23 — Production deploy starts
+- 14:23 — Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
+- 14:25 — DB CPU spike detected (PagerDuty alert)
+- 14:26 — Oncall (Bob) acknowledges
+- 14:30 — First user report (Twitter)
+- 14:35 — Oncall identifies migration as cause
+- 14:40 — Decision: cancel migration?
+- 14:43 — Cancel attempt fails (lock too deep)
+- 14:55 — Wait for completion
+- 15:30 — Migration completes
+- 15:35 — Service partially restored
+- 15:47 — Full recovery confirmed
+- 16:00 — Status page resolved
+
+## Root cause
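+`ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT` on the 50M-row orders table forced a full table rewrite. Before Postgres 11, any ADD COLUMN with a default rewrote the table. We're on PG 16, where a constant default is a metadata-only change, but this column's default had to be evaluated for every existing row (as happens with a volatile default), so the rewrite held an ACCESS EXCLUSIVE lock until the migration completed (~67 min).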
+
+## Why? (5 whys)
+1. Why: 502 errors → DB queries timed out
+2. Why: orders table held ACCESS EXCLUSIVE lock
+3. Why: ALTER TABLE held the lock
+4. Why: the NOT NULL DEFAULT change triggered a full table rewrite
+5. Why: the migration review checklist did not cover this case
+
+## What went well
+- Alert fired within 2 minutes
+- Oncall responded quickly
+- Communication on status page was clear
+
+## What went wrong
+- Migration not tested on prod-like dataset
+- Migration review checklist incomplete
+- No "kill switch" for in-progress migration
+- 30-min wait — no fast rollback
+
+## Action items
+| # | Action | Owner | Due | Status |
+|---|--------|-------|-----|--------|
+| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
+| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | open |
+| 3 | Use online schema-change tooling for big table changes (e.g. `pg_repack` on Postgres; `gh-ost` is MySQL-only) | @bob | 2026-05-20 | open |
+| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
+| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |
+
+## Related
+- PR: #1234
+- Slack: #incident-2026-05-09
+- Status page: https://status.example.com/incidents/abc123
+```
+
+### Severity definitions
+```
+P1 (Critical):    Major user impact, money loss. Immediate response.
+P2 (High):        Significant impact, partial outage.
+P3 (Medium):      Minor user impact, workaround exists.
+P4 (Low):         Minimal impact, internal only.
+```
+
+### Blameless wording
+```
+✅ "The migration script lacked safeguards"
+❌ "Bob ran a bad migration"
+
+✅ "The review process didn't catch this"
+❌ "The reviewer should have caught this"
+
+→ Don't blame; fix the system.
+```
+
+### 5 Whys
+```
+Problem: API outage.
+Why 1: DB locked.
+Why 2: ALTER TABLE.
+Why 3: Big migration in working hours.
+Why 4: No staging on prod-like data.
+Why 5: No process for "big migration" classification.
+
+Action: Process for big migration.
+```
+
+### Fishbone (large incidents)
+```
+Categories:
+- People (training, communication)
+- Process (review, runbook)
+- Technology (tooling, infra)
+- Environment (cloud provider)
+
+→ List the contributing factors under each category.
+```
+
+### Time to detect / mitigate / resolve
+```
+TTD (Time to Detect): incident start → alert fires (target: < 5 min)
+TTM (Time to Mitigate): alert → impact reduced (target: < 30 min)
+TTR (Time to Resolve): until the incident is fully over (target: < 4 hours for P1)
+
+→ Measure on every incident and track the trend.
+```
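The three metrics above can be computed directly from timeline timestamps. The following is a minimal sketch, not a definitive implementation: the function names and the `TARGETS` table are hypothetical, TTR is assumed to run from incident start to full recovery, and the sample timestamps are taken from the template timeline (with "service partially restored" treated as the mitigation point).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Targets copied from the definitions above (P1 target used for TTR).
TARGETS = {
    "ttd": timedelta(minutes=5),
    "ttm": timedelta(minutes=30),
    "ttr": timedelta(hours=4),
}


def incident_metrics(started, detected, mitigated, resolved):
    """Compute TTD, TTM and TTR from four 'YYYY-MM-DD HH:MM' UTC strings."""
    start, detect, mitigate, resolve = (
        datetime.strptime(ts, FMT) for ts in (started, detected, mitigated, resolved)
    )
    return {
        "ttd": detect - start,     # incident start -> alert
        "ttm": mitigate - detect,  # alert -> impact reduced
        "ttr": resolve - start,    # assumption: start -> full recovery
    }


def missed_targets(metrics):
    """Return the names of metrics that did not stay under their target."""
    return [name for name, value in metrics.items() if value >= TARGETS[name]]


if __name__ == "__main__":
    m = incident_metrics(
        "2026-05-09 14:23",  # migration begins
        "2026-05-09 14:25",  # PagerDuty alert
        "2026-05-09 15:35",  # service partially restored
        "2026-05-09 15:47",  # full recovery confirmed
    )
    for name, value in m.items():
        print(name, value)
    print("missed:", missed_targets(m))
```

Storing these values per incident is what makes the trend visible across postmortems.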
+
+### Communication during incident
+```
+1. Status page update (1-2 min after detect)
+2. Internal: #incidents Slack (real-time)
+3. Customer email (when impact is large)
+4. Post-resolution update
+5. Postmortem published (within 1 week)
+```
+
+### Status page (Statuspage.io / Better Stack)
+```
+Status: investigating / identified / monitoring / resolved
+Updates every 30 min during incident
+```
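The 30-minute update cadence above is easy to forget mid-incident, so a reminder check can help. This is a small sketch under stated assumptions: `update_overdue` is a hypothetical helper, not part of any status page product's API, and it assumes a resolved incident needs no further periodic updates.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)
STATUSES = ("investigating", "identified", "monitoring", "resolved")


def update_overdue(last_status, last_update, now):
    """Return True when an open incident has gone 30 minutes without an update."""
    if last_status not in STATUSES:
        raise ValueError(f"unknown status: {last_status}")
    if last_status == "resolved":
        return False  # assumption: no periodic updates after resolution
    return now - last_update >= UPDATE_INTERVAL


if __name__ == "__main__":
    last = datetime(2026, 5, 9, 14, 30)
    print(update_overdue("identified", last, datetime(2026, 5, 9, 15, 5)))
```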
+
+### Action items (SMART)
+```
+Bad:  "Improve migration safety"
+Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."
+
+→ Trackable and measurable.
+```
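The Bad / Good contrast above can be turned into a rough lint for action items. The sketch below is only a heuristic: the `ActionItem` class, the `VAGUE_VERBS` list, and the rule "a vague verb with no number in the sentence" are assumptions made for illustration, not an established definition of SMART.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical list of verbs that tend to signal an unmeasurable action.
VAGUE_VERBS = ("improve", "enhance", "review", "look into")


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None


def smart_problems(item):
    """Return a list of reasons why the item is not trackable or measurable."""
    problems = []
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    text = item.action.strip().lower()
    has_number = any(ch.isdigit() for ch in text)
    # Heuristic: a vague leading verb with no numeric threshold is flagged.
    if text.startswith(VAGUE_VERBS) and not has_number:
        problems.append("vague wording")
    return problems


if __name__ == "__main__":
    print(smart_problems(ActionItem("Improve migration safety")))
    print(smart_problems(ActionItem(
        "Add prod-data migration test for tables > 1M rows",
        owner="Alice",
        due=date(2026, 5, 15),
    )))
```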
+
+### Tracking action items
+```
+- Auto-create a ticket for each postmortem action item
+- Quarterly review: check completion %
+- Small teams: mention at every standup
+```
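For the quarterly review, the completion percentage and the list of overdue items can be derived from the action item table. A minimal sketch follows, assuming each item is a dict with hypothetical `id`, `status`, and `due` keys and that a finished item is marked `"done"` (the template only shows `open`).

```python
from datetime import date


def review_action_items(items, today):
    """Return (completion percentage, ids of unfinished items past their due date)."""
    if not items:
        return 0.0, []
    done = [item for item in items if item["status"] == "done"]
    overdue = [
        item["id"]
        for item in items
        if item["status"] != "done" and item["due"] < today
    ]
    return 100.0 * len(done) / len(items), overdue


if __name__ == "__main__":
    items = [
        {"id": 1, "status": "done", "due": date(2026, 5, 12)},
        {"id": 2, "status": "open", "due": date(2026, 5, 15)},
        {"id": 3, "status": "open", "due": date(2026, 5, 20)},
        {"id": 4, "status": "open", "due": date(2026, 6, 1)},
        {"id": 5, "status": "done", "due": date(2026, 5, 12)},
    ]
    rate, overdue = review_action_items(items, today=date(2026, 5, 18))
    print(f"{rate:.0f}% complete, overdue: {overdue}")
```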
+
+### Postmortem culture
+```
+1. Every P1/P2 = postmortem.
+2. Written within 1 week.
+3. Team review.
+4. Track the actions.
+5. Public (inside the company) so other teams can learn.
+```
+
+### Sharing across teams
+```
+- Internal blog post
+- Engineering all-hands
+- Wiki (Notion / Confluence)
+- Tag with category (DB / network / deploy / etc.)
+```
+
+### Repeated incidents
+```
+Is the same root cause recurring?
+→ The action items were not effective.
+→ A stronger fix is needed.
+```
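Spotting a recurring root cause is straightforward once postmortems carry a category tag. This sketch assumes each postmortem is recorded as a hypothetical `(incident_id, root_cause_tag)` pair; the tag names and incident ids are illustrative only.

```python
from collections import defaultdict


def repeated_root_causes(postmortems):
    """Group incident ids by root-cause tag and keep tags seen more than once."""
    by_cause = defaultdict(list)
    for incident_id, cause in postmortems:
        by_cause[cause].append(incident_id)
    return {cause: ids for cause, ids in by_cause.items() if len(ids) > 1}


if __name__ == "__main__":
    history = [
        ("INC-1", "db-migration-lock"),
        ("INC-2", "expired-certificate"),
        ("INC-3", "db-migration-lock"),
    ]
    # Any tag listed here means earlier action items did not hold.
    print(repeated_root_causes(history))
```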
+
+### Post-incident review (human learning)
+```
+30-60 min meeting:
+1. Review the timeline
+2. Decisions during the incident: were they right? Fast enough?
+3. What went well / wrong
+4. Create action items
+5. Next steps
+```
+
+→ The key question is "How well did we respond?"
+
+### Tools
+```
+- PagerDuty / Opsgenie / Grafana Oncall — alerting
+- Statuspage.io / Better Stack — public status
+- Jeli / Rootly / FireHydrant — incident management
+- Internal: Slack channel + Google Doc + Linear ticket
+```
+
+### Customer-facing write-up
+```markdown
+# 2026-05-09 Service disruption
+
+We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.
+
+**What happened:** A database migration caused unexpected table locking, leading to API timeouts.
+
+**What we're doing:**
+- Reviewing migration practices
+- Adding additional safeguards
+- Implementing canary deploys
+
+We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.
+```
+
+→ Accurate + apology + improvement.
+
+### Near-miss (almost incident)
+```
+It did not become a major outage, but it was a close call.
+The same RCA process applies (at a smaller scale).
+
+"We dodged a bullet — let's prevent next time."
+```
+
+### Game day (preventive)
+```
+Intentional incident simulation:
+- DB primary kill
+- Random pod kill (Chaos Engineering)
+- Network partition
+
+→ Practice the response + discover weaknesses.
+```
+
+### Postmortem after-action
+```
+The postmortem itself = the start of action.
+6 months later:
+- Action items completed?
+- Same root cause recurred?
+- New process working?
+```
+
+## 🤔 Decision criteria
+| Situation | Recommendation |
+|---|---|
+| P1/P2 incident | Always write a postmortem |
+| P3/P4 | Optional, when there is learning value |
+| Near miss | Short postmortem |
+| Customer impact | + customer message |
+| Repeated incident | Deep dive at the system level |
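
The table above can be encoded as a small rule function, for example to drive an incident bot or a ticket template. This is a sketch only: `postmortem_plan`, its parameters, and the returned labels are hypothetical, and the order in which the rules are checked is an assumption.

```python
def postmortem_plan(severity=None, near_miss=False, customer_impact=False,
                    repeated=False, learning_value=False):
    """Map an incident's attributes to the follow-up steps from the table."""
    steps = []
    if severity in ("P1", "P2"):
        steps.append("postmortem")            # always for P1/P2
    elif near_miss:
        steps.append("short postmortem")      # close call, smaller RCA
    elif severity in ("P3", "P4") and learning_value:
        steps.append("postmortem")            # optional, only when worth it
    if customer_impact:
        steps.append("customer message")
    if repeated:
        steps.append("system-level deep dive")
    return steps


if __name__ == "__main__":
    print(postmortem_plan("P1", customer_impact=True))
    print(postmortem_plan("P4"))
    print(postmortem_plan(near_miss=True, repeated=True))
```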
+
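+## ❌ Anti-patterns
+- **Blame**: blaming individuals creates a defensive culture.
+- **Vague action items**: items worded like "improve" never get done.
+- **Untracked action items**: the same incident happens again.
+- **No postmortem (small incidents)**: the learning is lost.
+- **Not sharing publicly**: other teams hit the same issue.
+- **Inaccurate timeline**: don't guess; cite logs and Slack.
+- **5 whys stopping at the first why**: only a surface fix.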
+
+## 🤖 LLM usage hints
+- Blameless culture + 5 whys.
+- SMART action items.
+- Cite logs in the timeline.
+- Re-review 6 months later.
+
+## 🔗 Related documents
+- [[Productivity_Oncall_Playbook]]
+- [[DevOps_Disaster_Recovery]]
+- [[Productivity_Code_Review]]