Antigravity Agent 93ec7e9056 [G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00

Postmortem

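A learning document written after an incident. Blameless: focus on the system, not on people. Contents: timeline, root cause analysis (RCA), and improvement actions. Standard practice at Google SRE, Etsy, and Stripe.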

📖 Core concepts

Blameless: no blame. The system or process is at fault, not the individual.
Timeline: events in exact chronological order.
Root cause: 5 whys / fishbone.
Action items: SMART (Specific, Measurable, Assignable, Realistic, Time-related).

💻 Code patterns

Postmortem template

# Incident: API outage 2026-05-09

**Status:** Resolved
**Severity:** P1 (critical)
**Duration:** 14:23 - 15:47 UTC (1h 24min)
**Author:** Alice
**Reviewers:** Bob, Carol

## Summary
A SQL migration in a new deploy locked the orders table in the production DB for 30 minutes; all read queries returned 502.

## Impact
- Users affected: ~12,000 (35% of MAU)
- Failed requests: ~150,000
- Revenue impact: ~$5,000 estimated
- SLO breach: 99.9% → 98.7%

## Timeline (UTC)
- 14:00 — Migration PR merged (CI passed, no review concern)
- 14:23 — Production deploy starts
- 14:23 — Migration begins: `ALTER TABLE orders ADD COLUMN ... NOT NULL`
- 14:25 — DB CPU spike detected (PagerDuty alert)
- 14:26 — Oncall (Bob) acknowledges
- 14:30 — First user report (Twitter)
- 14:35 — Oncall identifies migration as cause
- 14:40 — Decision: cancel migration?
- 14:43 — Cancel attempt fails (lock too deep)
- 14:55 — Wait for completion
- 15:30 — Migration completes
- 15:35 — Service partially restored
- 15:47 — Full recovery confirmed
- 16:00 — Status page resolved

## Root cause
ALTER TABLE on a 50M-row table with NOT NULL DEFAULT. On Postgres < 11 this always triggers a full table rewrite. We're on PG 16, which avoids the rewrite only for non-volatile defaults; this column's default still had to be applied to every existing row, so the statement held an ACCESS EXCLUSIVE lock for ~30 min.

## Why? (5 whys)
1. Why: 502 errors → DB queries timed out
2. Why: orders table held ACCESS EXCLUSIVE lock
3. Why: ALTER TABLE held the lock
4. Why: the NOT NULL DEFAULT change triggered a full table rewrite
5. Why: the migration review checklist did not cover this case

## What went well
- Alert fired within 2 minutes
- Oncall responded quickly
- Communication on status page was clear

## What went wrong
- Migration not tested on prod-like dataset
- Migration review checklist incomplete
- No "kill switch" for in-progress migration
- 30-min wait — no fast rollback

## Action items
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | open |
| 3 | Use `pg_repack` / `gh-ost` for big table changes | @bob | 2026-05-20 | open |
| 4 | Add canary deploy for migrations | @platform | 2026-06-01 | open |
| 5 | Document this in DB onboarding doc | @alice | 2026-05-12 | open |

## Related
- PR: #1234
- Slack: #incident-2026-05-09
- Status page: https://status.example.com/incidents/abc123

Severity definitions

P1 (Critical):    Major user impact, money loss. Immediate response.
P2 (High):        Significant impact, partial outage.
P3 (Medium):      Minor user impact, workaround exists.
P4 (Low):         Minimal impact, internal only.

Emphasizing blamelessness

✅ "The migration script lacked safeguards"
❌ "Bob ran a bad migration"

✅ "The review process didn't catch this"
❌ "The reviewer should have caught this"

→ Don't blame; fix the system.

5 Whys

Problem: API outage.
Why 1: DB locked.
Why 2: ALTER TABLE.
Why 3: Big migration in working hours.
Why 4: No staging on prod-like data.
Why 5: No process for "big migration" classification.

Action: Process for big migration.
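
The "big migration" classification that the fifth why calls for can be sketched as a small check. This is a minimal sketch, not an established process: the 1M-row threshold is taken from action item 1 in the template above, while `MigrationPlan` and `classify_migration` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Threshold from action item 1 in the template ("table size > 1M = staged").
BIG_TABLE_ROW_THRESHOLD = 1_000_000


@dataclass(frozen=True)
class MigrationPlan:
    """Hypothetical description of a pending schema migration."""
    table: str
    estimated_rows: int


def classify_migration(plan: MigrationPlan) -> str:
    """Return "staged" for tables above the threshold, else "standard"."""
    if plan.estimated_rows > BIG_TABLE_ROW_THRESHOLD:
        return "staged"
    return "standard"


orders = MigrationPlan(table="orders", estimated_rows=50_000_000)
print(classify_migration(orders))  # staged
```

A check like this could run in CI so that the classification does not depend on a reviewer remembering the rule.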

Fishbone (for large incidents)

Categories:
- People (training, communication)
- Process (review, runbook)
- Technology (tooling, infra)
- Environment (cloud provider)

→ List the contributing factors under each category.

Time to detect / mitigate / resolve

TTD (Time to Detect): from incident start to the alarm (target: < 5 min)
TTM (Time to Mitigate): from the alarm until impact is reduced (target: < 30 min)
TTR (Time to Resolve): until the incident is over (target: < 4 hours for P1)

→ Measure these for every incident and track the trend.
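
The three metrics can be computed directly from timeline timestamps. The sketch below uses the times from the template timeline and makes two assumptions: "Service partially restored" (15:35) is treated as the mitigation point, and TTR is measured from incident start to full recovery. The function names are illustrative.

```python
from datetime import datetime, timedelta

# Targets from the definitions above, in minutes (4 hours = 240 min for P1).
TARGETS_MIN = {"ttd_min": 5, "ttm_min": 30, "ttr_min": 240}


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start) / timedelta(minutes=1))


def incident_metrics(started: datetime, detected: datetime,
                     mitigated: datetime, resolved: datetime) -> dict:
    return {
        "ttd_min": minutes_between(started, detected),    # start -> alarm
        "ttm_min": minutes_between(detected, mitigated),  # alarm -> impact reduced
        "ttr_min": minutes_between(started, resolved),    # start -> incident over (assumption)
    }


def at(hhmm: str) -> datetime:
    """Parse a UTC time on the incident date used in the template."""
    return datetime.strptime(f"2026-05-09 {hhmm}", "%Y-%m-%d %H:%M")


metrics = incident_metrics(at("14:23"), at("14:25"), at("15:35"), at("15:47"))
missed = [name for name, value in metrics.items() if value >= TARGETS_MIN[name]]

print(metrics)  # {'ttd_min': 2, 'ttm_min': 70, 'ttr_min': 84}
print(missed)   # ['ttm_min']
```

For the example incident, detection and resolution meet their targets but mitigation does not, which matches the "no fast rollback" finding.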

Communication during incident

1. Status page update (1-2 min after detect)
2. Internal: #incidents Slack (real-time)
3. Customer email (when impact is large)
4. Post-resolution update
5. Postmortem published (1 week)

Status page (Statuspage.io / Better Stack)

Status: investigating / identified / monitoring / resolved
Updates every 30 min during incident
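
The status sequence and the 30-minute update cadence can be expressed as a small helper. This is a sketch only: it does not call any status page API, and the forward-only transition rule is an assumption made for simplicity (a real incident may move back from monitoring to investigating).

```python
from datetime import datetime, timedelta

# Status order as listed above.
STATUSES = ["investigating", "identified", "monitoring", "resolved"]
UPDATE_INTERVAL = timedelta(minutes=30)


def can_transition(current: str, new: str) -> bool:
    """Allow only forward moves through STATUSES (simplifying assumption)."""
    return STATUSES.index(new) > STATUSES.index(current)


def update_overdue(status: str, last_update: datetime, now: datetime) -> bool:
    """True when an open incident has gone more than 30 min without an update."""
    return status != "resolved" and now - last_update > UPDATE_INTERVAL


last = datetime(2026, 5, 9, 14, 30)
print(can_transition("investigating", "identified"))                  # True
print(update_overdue("identified", last, last + timedelta(minutes=31)))  # True
```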

Action items (SMART)

Bad:  "Improve migration safety"
Good: "Add prod-data migration test for tables > 1M rows. Owner: Alice. Due: 2026-05-15."

→ Trackable and measurable.
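
A rough lint for action items can catch the "Bad" pattern before a postmortem is published. The sketch below is heuristic and the rules are assumptions: the list of vague verbs and the "contains a number" test for measurability are illustrative, not a definition of SMART.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumption: leading verbs that usually signal an unmeasurable action item.
VAGUE_VERBS = {"improve", "enhance", "optimize", "review"}


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def smart_problems(item: ActionItem) -> List[str]:
    """Return a list of reasons the item is not trackable; empty means OK."""
    problems = []
    words = item.description.split()
    first_word = words[0].lower() if words else ""
    if first_word in VAGUE_VERBS:
        problems.append(f"vague verb: {first_word!r}")
    if not re.search(r"\d", item.description):
        problems.append("no measurable threshold or number")
    if item.owner is None:
        problems.append("no owner")
    if item.due is None:
        problems.append("no due date")
    return problems


bad = ActionItem("Improve migration safety")
good = ActionItem("Add prod-data migration test for tables > 1M rows",
                  owner="Alice", due=date(2026, 5, 15))

print(smart_problems(bad))
print(smart_problems(good))  # []
```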

Tracking action items

- Auto-create a ticket for each postmortem action item
- Quarterly review: check the completion percentage
- Small teams: mention them at every standup
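
Ticket creation and the completion percentage can both start from the action-item table in the postmortem. The sketch below parses a table in the template's format into ticket payloads; it does not call any real tracker API, the payload fields are hypothetical, and the `done` status on row 2 is changed from the template purely to illustrate the calculation.

```python
from typing import Dict, List

# Two rows in the template's table format; status of row 2 altered for illustration.
TABLE = """\
| # | Action | Owner | Due | Status |
|---|--------|-------|-----|--------|
| 1 | Add migration safety checklist (table size > 1M = staged) | @alice | 2026-05-12 | open |
| 2 | Setup migration runbook with kill switch | @bob | 2026-05-15 | done |
"""


def parse_action_items(markdown_table: str) -> List[Dict[str, str]]:
    """Turn a Markdown table into one dict per row, keyed by lowercased header."""
    rows = [line.strip() for line in markdown_table.splitlines()
            if line.strip().startswith("|")]
    header = [cell.strip().lower() for cell in rows[0].strip("|").split("|")]
    items = []
    for row in rows[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in row.strip("|").split("|")]
        items.append(dict(zip(header, cells)))
    return items


def completion_percent(items: List[Dict[str, str]]) -> float:
    """Share of items whose status is "done", as a percentage."""
    if not items:
        return 0.0
    done = sum(1 for item in items if item["status"] == "done")
    return 100.0 * done / len(items)


items = parse_action_items(TABLE)
# Hypothetical ticket payloads; map these onto whatever tracker is in use.
tickets = [{"title": i["action"], "assignee": i["owner"].lstrip("@"), "due": i["due"]}
           for i in items]

print(tickets[1])                 # {'title': 'Setup migration runbook with kill switch', 'assignee': 'bob', 'due': '2026-05-15'}
print(completion_percent(items))  # 50.0
```

The parser assumes no cell contains a literal `|`, which holds for the template table.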

Postmortem culture

1. Every P1/P2 = postmortem.
2. Written within 1 week.
3. Reviewed by the team.
4. Actions tracked.
5. Public (inside the company), so other teams can learn.

- Internal blog post
- Engineering all-hands
- Wiki (Notion / Confluence)
- Tag with category (DB / network / deploy / etc.)

Repeated incidents

Does the same root cause keep recurring?
→ The action items were not effective.
→ A stronger fix is needed.
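
If each postmortem is tagged with a root-cause label, recurrence can be detected by counting. This is a minimal sketch: the incident records and tag names below are made-up sample data, and the `root_cause` field is an assumed convention.

```python
from collections import Counter
from typing import Dict, List


def repeated_root_causes(incidents: List[Dict[str, str]],
                         min_count: int = 2) -> Dict[str, int]:
    """Return root-cause tags that appear at least min_count times."""
    counts = Counter(incident["root_cause"] for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_count}


# Hypothetical sample data for illustration only.
incidents = [
    {"id": "INC-1", "root_cause": "unsafe-migration"},
    {"id": "INC-2", "root_cause": "expired-certificate"},
    {"id": "INC-3", "root_cause": "unsafe-migration"},
]

print(repeated_root_causes(incidents))  # {'unsafe-migration': 2}
```

Any tag this returns is a candidate for a system-level deep dive rather than another round of local action items.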

Post-incident review (human learning)

30-60 min meeting:
1. Review the timeline
2. Decisions during the incident: were they right? Fast enough?
3. What went well / wrong
4. Create action items
5. Next steps

→ The key question is "How well did we respond?"

Tools

- PagerDuty / Opsgenie / Grafana Oncall — alerting
- Statuspage.io / Better Stack — public status
- Jeli / Rootly / FireHydrant — incident management
- Internal: Slack channel + Google Doc + Linear ticket

Customer-facing write-up

# 2026-05-09 Service disruption

We experienced an outage from 14:23 to 15:47 UTC affecting 35% of users.

**What happened:** A database migration caused unexpected table locking, leading to API timeouts.

**What we're doing:**
- Reviewing migration practices
- Adding additional safeguards
- Implementing canary deploys

We apologize for the disruption. If you experienced lost transactions, please contact support@example.com.

→ Accurate, apologetic, and focused on improvement.

Near-miss (almost incident)

It did not become a major outage, but it was a close call.
The same RCA process applies (at a smaller scale).

"We dodged a bullet — let's prevent next time."

Game day (preventive)

Deliberate incident simulation:
- DB primary kill
- Random pod kill (Chaos Engineering)
- Network partition

→ Practice the response and discover weaknesses.

Postmortem after-action

The postmortem itself is where action starts.
6 months later:
- Are the action items complete?
- Has the same root cause recurred?
- Is the new process working?

🤔 Decision criteria

Situation	Recommendation
P1/P2 incident	Always write a postmortem
P3/P4	Optional, when there is learning value
Near miss	Short postmortem
Customer impact	Add a customer message
Repeated incident	Deep dive at the system level
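
The decision table can be encoded as a function so that incident tooling suggests the follow-up consistently. This is a sketch under assumptions: the table does not say how rows combine, so the ordering below (severity first, then near miss, with customer message and deep dive as additions) is one possible reading, and the function name is illustrative.

```python
from typing import List, Optional


def postmortem_recommendation(severity: Optional[str],
                              near_miss: bool = False,
                              customer_impact: bool = False,
                              repeated: bool = False) -> List[str]:
    """Map an incident's attributes to the follow-ups from the decision table."""
    recommendations = []
    if severity in {"P1", "P2"}:
        recommendations.append("postmortem")
    elif near_miss:
        recommendations.append("short postmortem")
    elif severity in {"P3", "P4"}:
        recommendations.append("optional postmortem")
    if customer_impact:
        recommendations.append("customer message")
    if repeated:
        recommendations.append("system-level deep dive")
    return recommendations


print(postmortem_recommendation("P1", customer_impact=True))
# ['postmortem', 'customer message']
print(postmortem_recommendation(None, near_miss=True))
# ['short postmortem']
```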

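❌ Anti-patterns

Blame: blaming individuals creates a defensive culture.
Vague action items: wording like "improve" never gets done.
Untracked action items: the same incident happens again.
No postmortem (small incidents): the learning is lost.
Not shared publicly: other teams hit the same issue.
Inaccurate timeline: don't guess; quote logs and Slack.
5 whys stopping at the first why: only a surface fix.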

🤖 Hints for LLM use

Blameless culture + 5 whys.
SMART action items.
Timelines that quote logs.
Re-review 6 months later.
