[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
---
id: productivity-postmortem
title: Postmortem - Blameless / Learning / Improvement
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, postmortem, incident, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [postmortem, blameless, RCA, root cause analysis, incident review, 5 whys]
---
# Postmortem
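> A learning document written after an incident. **Blameless = fix the system, not the person**. Timeline + RCA + improvement actions. Standard practice at Google SRE, Etsy, and Stripe.
## 📖 Core concepts
- Blameless: no blame. The system or process is at fault, not an individual.
- Timeline: events in exact chronological order.
- Root cause: 5 whys / fishbone.
- Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-bound).
## 💻 Code patterns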
### Postmortem template
```markdown
# Incident: API outage 2026-05-09
**Status:** Resolved
**Severity:** P1 (critical)
**Duration:** 14:23 - 15:47 UTC (1h 24min)
**Author:** Alice
**Reviewers:** Bob, Carol
## Summary
A SQL migration in a new deploy locked the `orders` table in the production DB for about 67 minutes (14:23 to 15:30 UTC), so all read queries failed with 502.
## Impact
- Users affected: ~12,000 (35% of MAU)
- Failed requests: ~150,000
- Revenue impact: ~$5,000 estimated
- SLO breach: 99.9% → 98.7%
## Timeline (UTC)
- 14:00 — Migration PR merged (CI passed, no review concern)
- 14:23 — Production deploy starts
- 14:23 — Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
- 14:25 — DB CPU spike detected (PagerDuty alert)
- 14:26 — Oncall (Bob) acknowledges
- 14:30 — First user report (Twitter)
- 14:35 — Oncall identifies migration as cause
- 14:40 — Decision: cancel migration?
- 14:43 — Cancel attempt fails (lock too deep)
- 14:55 — Wait for completion
- 15:30 — Migration completes
- 15:35 — Service partially restored
- 15:47 — Full recovery confirmed
- 16:00 — Status page resolved
## Root cause
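`ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT` on the 50M-row `orders` table triggered a full table rewrite. On Postgres < 11 any default forces a rewrite. We run PG 16, where a constant default is metadata-only, but this column's default was volatile and had to be evaluated for every existing row. The rewrite held an ACCESS EXCLUSIVE lock until the migration completed at 15:30 (~67 min).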
## Why? (5 whys)
1. Why: 502 errors → DB queries timed out
2. Why: orders table held ACCESS EXCLUSIVE lock
3. Why: ALTER TABLE held the lock
4. Why: the NOT NULL DEFAULT change triggered a full table rewrite
5. Why: the migration review checklist did not cover this case
## What went well
- Alert fired within 2 minutes
- Oncall responded quickly
- Communication on status page was clear
## What went wrong
- Migration not tested on prod-like dataset
- Migration review checklist incomplete
- No "kill switch" for in-progress migration
- 30-min wait — no fast rollback
## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | open |
| 3 | Use `pg_repack` for big table changes (`gh-ost` is MySQL-only) | @bob | 2026-05-20 | open |
| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |
## Related
- PR: #1234
- Slack: #incident-2026-05-09
- Status page: https://status.example.com/incidents/abc123
```
### Severity definitions
```
P1 (Critical): Major user impact, money loss. Immediate response.
P2 (High): Significant impact, partial outage.
P3 (Medium): Minor user impact, workaround exists.
P4 (Low): Minimal impact, internal only.
```
### Emphasizing blamelessness
```
✅ "The migration script lacked safeguards"
❌ "Bob ran a bad migration"
✅ "The review process didn't catch this"
❌ "The reviewer should have caught this"
→ Don't blame; fix the system.
```
### 5 Whys
```
Problem: API outage.
Why 1: DB locked.
Why 2: ALTER TABLE.
Why 3: Big migration in working hours.
Why 4: No staging on prod-like data.
Why 5: No process for "big migration" classification.
Action: Process for big migration.
```
### Fishbone (large incidents)
```
Categories:
- People (training, communication)
- Process (review, runbook)
- Technology (tooling, infra)
- Environment (cloud provider)
→ List the contributing factors under each category.
```
### Time to detect / mitigate / resolve
```
TTD (Time to Detect): incident start → alert fires (target: < 5 min)
TTM (Time to Mitigate): alert → impact reduced (target: < 30 min)
TTR (Time to Resolve): until the incident ends (target: < 4 hours for P1)
→ Measure on every incident and track the trend.
```
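A minimal sketch of how these three metrics could be computed from timeline timestamps. The function names are hypothetical, and two choices are assumptions rather than definitions from this note: TTR is measured from incident start, and "service partially restored" is treated as the mitigation point. The timestamps come from the sample timeline in the template.

```python
from datetime import datetime, timedelta

# Targets from the list above; the TTR target applies to P1 incidents.
TARGETS = {
    "ttd": timedelta(minutes=5),
    "ttm": timedelta(minutes=30),
    "ttr": timedelta(hours=4),
}


def incident_metrics(started, detected, mitigated, resolved):
    """Return TTD, TTM, and TTR as timedeltas.

    Assumption: TTR runs from incident start to full resolution.
    """
    return {
        "ttd": detected - started,
        "ttm": mitigated - detected,
        "ttr": resolved - started,
    }


def missed_targets(metrics, targets=TARGETS):
    """Return the names of metrics that did not beat their target."""
    return [name for name, value in metrics.items() if value >= targets[name]]


def at(hhmm):
    """Parse an HH:MM time on the sample incident date (UTC)."""
    return datetime.strptime("2026-05-09 " + hhmm, "%Y-%m-%d %H:%M")


metrics = incident_metrics(
    started=at("14:23"),    # migration begins
    detected=at("14:25"),   # PagerDuty alert
    mitigated=at("15:35"),  # service partially restored (assumed mitigation point)
    resolved=at("15:47"),   # full recovery confirmed
)
print({name: int(value.total_seconds() // 60) for name, value in metrics.items()})
print(missed_targets(metrics))
```

For the sample incident this gives a TTD of 2 minutes, a TTM of 70 minutes, and a TTR of 84 minutes, so only the TTM target is missed.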
### Communication during incident
```
1. Status page update (1-2 min after detect)
2. Internal: #incidents Slack (real-time)
3. Customer email (when impact is large)
4. Post-resolution update
5. Postmortem published (within 1 week)
```
### Status page (Statuspage.io / Better Stack)
```
Status: investigating / identified / monitoring / resolved
Updates every 30 min during incident
```
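The 30-minute cadence can be checked mechanically. The sketch below is illustrative only: `next_update_due` is a hypothetical helper and does not call any status page provider's API.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(minutes=30)
STATUSES = ("investigating", "identified", "monitoring", "resolved")


def next_update_due(last_update, status):
    """Return when the next status page update is due, or None once resolved."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    if status == "resolved":
        return None  # no further updates needed
    return last_update + UPDATE_INTERVAL


last = datetime(2026, 5, 9, 14, 35)
print(next_update_due(last, "identified"))
print(next_update_due(last, "resolved"))
```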
### Action items (SMART)
```
Bad: "Improve migration safety"
Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."
→ Trackable and measurable.
```
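A rough lint for action items, as a sketch. The `ActionItem` type, the vague-verb list, and the "contains a number" heuristic for measurability are all assumptions made for illustration; a real check would need human review on top.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical list of openers that usually signal a vague item.
VAGUE_STARTS = ("improve", "enhance", "look into")


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def smart_problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item is not trackable and measurable."""
    problems = []
    text = item.description.strip().lower()
    # Heuristic: a number in the text suggests a measurable threshold.
    has_number = any(ch.isdigit() for ch in text)
    if text.startswith(VAGUE_STARTS) and not has_number:
        problems.append("vague verb with no measurable threshold")
    if not item.owner:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    return problems


bad = ActionItem("Improve migration safety")
good = ActionItem(
    "Add prod-data migration test for tables > 1M rows",
    owner="Alice",
    due=date(2026, 5, 15),
)
print(smart_problems(bad))
print(smart_problems(good))
```

The bad example fails all three checks, while the good example passes with an empty list.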
### Tracking action items
```
- Auto-create a ticket for each postmortem action item
- Quarterly review: check the completion percentage
- Small teams: mention at every standup
```
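The quarterly review boils down to a completion percentage plus a list of overdue items. A minimal sketch, assuming action items are exported as plain dicts; the statuses and the review date below are hypothetical, while the due dates mirror the sample action item table.

```python
from datetime import date


def completion_report(items, today):
    """Summarize action items: completion percentage and overdue item ids."""
    done = [item for item in items if item["status"] == "done"]
    overdue = [
        item["id"]
        for item in items
        if item["status"] != "done" and item["due"] < today
    ]
    pct = 100.0 * len(done) / len(items) if items else 0.0
    return {"completion_pct": round(pct, 1), "overdue": overdue}


# Due dates from the sample table; statuses are made up for illustration.
action_items = [
    {"id": 1, "due": date(2026, 5, 12), "status": "done"},
    {"id": 2, "due": date(2026, 5, 15), "status": "open"},
    {"id": 3, "due": date(2026, 5, 20), "status": "done"},
    {"id": 4, "due": date(2026, 6, 1), "status": "open"},
    {"id": 5, "due": date(2026, 5, 12), "status": "open"},
]
print(completion_report(action_items, today=date(2026, 5, 25)))
```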
### Postmortem culture
```
1. Every P1/P2 = postmortem.
2. Written within 1 week.
3. Team review.
4. Track actions.
5. Public (within the company) so other teams can learn.
```
### Sharing across teams
```
- Internal blog post
- Engineering all-hands
- Wiki (Notion / Confluence)
- Tag with category (DB / network / deploy / etc.)
```
### Repeated incidents
```
Does the same root cause keep recurring?
→ The action items are not working.
→ A stronger fix is needed.
```
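Recurrence is easy to spot if every postmortem is tagged with a root-cause label. The sketch below assumes such a label exists; the incident ids and label names are hypothetical.

```python
from collections import Counter


def repeated_root_causes(incidents, threshold=2):
    """Return root-cause labels seen at least `threshold` times, with counts."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= threshold}


# Hypothetical incident log for illustration.
incident_log = [
    {"id": "INC-1", "root_cause": "db-migration-lock"},
    {"id": "INC-2", "root_cause": "expired-certificate"},
    {"id": "INC-3", "root_cause": "db-migration-lock"},
]
print(repeated_root_causes(incident_log))
```

Any label returned here is a candidate for a system-level deep dive rather than another incremental action item.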
### Post-incident review (human learning)
```
30-60 min meeting:
1. Review the timeline
2. Decisions during the incident: were they right? Fast enough?
3. What went well / wrong
4. Create action items
5. Next steps
5. Next steps
```
→ The key question is "How well did we respond?"
### Tools
```
- PagerDuty / Opsgenie / Grafana Oncall — alerting
- Statuspage.io / Better Stack — public status
- Jeli / Rootly / FireHydrant — incident management
- Internal: Slack channel + Google Doc + Linear ticket
```
### Customer-facing post
```markdown
# 2026-05-09 Service disruption
We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.
**What happened:** A database migration caused unexpected table locking, leading to API timeouts.
**What we're doing:**
- Reviewing migration practices
- Adding additional safeguards
- Implementing canary deploys
We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.
```
→ Accurate, apologetic, and focused on improvement.
### Near-miss (almost incident)
```
It did not become a major outage, but it was a close call.
The same RCA process applies (at a smaller scale).
"We dodged a bullet — let's prevent next time."
```
### Game day (preventive)
```
Deliberately simulate incidents:
- DB primary kill
- Random pod kill (Chaos Engineering)
- Network partition
→ Practice the response and uncover weaknesses.
```
### Postmortem after-action
```
The postmortem itself is where action starts.
6 months later:
- Are the action items completed?
- Has the same root cause recurred?
- Is the new process working?
```
## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| P1/P2 incident | Always write a postmortem |
| P3/P4 | Optional, when there is learning value |
| Near miss | Short postmortem |
| Customer impact | Also send a customer message |
| Repeated incident | Deep dive at the system level |
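
The table above can be encoded as a small lookup so that tooling (for example, a ticket template) suggests the right follow-up. This is a sketch: the function name is hypothetical, and the precedence (repeated incident first, then near miss, then severity) is an assumption, since the table does not rank its rows.

```python
def postmortem_recommendation(severity=None, near_miss=False,
                              customer_impact=False, repeated=False):
    """Map an incident's attributes to the recommended follow-ups."""
    recommendations = []
    if repeated:
        recommendations.append("deep dive (system level)")
    elif near_miss:
        recommendations.append("short postmortem")
    elif severity in ("P1", "P2"):
        recommendations.append("postmortem required")
    elif severity in ("P3", "P4"):
        recommendations.append("postmortem optional (if there is learning value)")
    if customer_impact:
        # Customer messaging is additive to whatever else applies.
        recommendations.append("customer message")
    return recommendations


print(postmortem_recommendation(severity="P1", customer_impact=True))
print(postmortem_recommendation(near_miss=True))
```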
## ❌ Anti-patterns
- **Blame**: blaming individuals creates a defensive culture.
- **Vague action items**: items like "improve" never get done.
- **No action item tracking**: the same incident happens again.
- **No postmortem (small incidents)**: the learning is lost.
- **Not sharing publicly**: other teams hit the same issue.
- **Inaccurate timeline**: don't guess; cite logs and Slack.
- **5 whys that stop at the first why**: only a surface fix.
## 🤖 Hints for LLM use
- Blameless culture + 5 whys.
- SMART action items.
- Cite logs in the timeline.
- Re-review after 6 months.
## 🔗 Related documents
- [[Productivity_Oncall_Playbook]]
- [[DevOps_Disaster_Recovery]]
- [[Productivity_Code_Review]]