[G1-Sync] Manual knowledge update
---
id: productivity-migration-runbook
title: Migration Runbook - Plan / Verify / Rollback
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, migration, runbook, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [migration runbook, deploy plan, rollback plan, schema migration, change management]
---

# Migration Runbook

> Critical change = advance plan + verify + rollback. **Small steps per phase + checklist + rollback at each step.** Applies to DB, config, and feature changes alike.

## 📖 Key Concepts
- Phase: a small step.
- Verify: a check after every step.
- Rollback: the reverse of each step.
- Comms: who / when / what.

## 💻 Code Patterns

### Template
````markdown
# Migration: Add `status` column to orders table

**Owner:** @alice
**Date:** 2026-05-12 02:00 UTC
**Estimated duration:** 30 min
**Severity:** Medium (read-only mode required)
**Reviewers:** @bob (DB), @carol (oncall)

## Goal
Add a `status TEXT` column to the `orders` table for the new state machine feature.

## Background
- Spec: <link>
- Related PRs: #123, #124, #125

## Pre-checks
- [ ] Latest production backup snapshot taken (auto)
- [ ] Migration tested on staging (yes, ran 25 min)
- [ ] Rollback script ready
- [ ] Status page event scheduled
- [ ] On-call notified
- [ ] Engineering team notified in #engineering
- [ ] Customer comms sent (email, 24h before)

## Steps

### Phase 1: Add nullable column (00:00 - 00:05)
```sql
-- Online safe: adding a nullable column with no default is metadata-only
-- (PG 11+ also covers constant defaults). Still takes a brief ACCESS EXCLUSIVE lock.
ALTER TABLE orders ADD COLUMN status TEXT;
```
**Verify:** `\d orders` shows new column.
**Rollback:** `ALTER TABLE orders DROP COLUMN status;`

### Phase 2: Backfill (00:05 - 00:25)
Backfill existing rows in batches:
```sql
-- Run outside an explicit transaction: COMMIT inside DO needs PG 11+.
DO $$
DECLARE
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE orders SET status = 'paid'
    WHERE status IS NULL AND id IN (
      SELECT id FROM orders WHERE status IS NULL LIMIT batch_size
    );
    EXIT WHEN NOT FOUND;
    COMMIT;                 -- release row locks after every batch
    PERFORM pg_sleep(0.5);  -- reduce impact on other queries
  END LOOP;
END $$;
```
**Verify:** `SELECT COUNT(*) FROM orders WHERE status IS NULL;` = 0
**Rollback:** N/A (safe; backfilled values disappear if Phase 1 is rolled back)

### Phase 3: Add NOT NULL (00:25 - 00:27)
```sql
-- Read-only mode starts (~3 min). SET NOT NULL takes an ACCESS EXCLUSIVE lock and scans the table.
-- Set the default first so no insert can land a NULL once the constraint exists.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```
**Verify:** `\d orders` shows `status` as `not null` with default `'pending'`.
**Rollback:** `ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;`

### Phase 4: Deploy app code (00:27 - 00:30)
- App reads the `status` column (existing code falls back to 'paid' if NULL)
- Deploy via standard CI/CD.
**Verify:** Test endpoint works.
**Rollback:** Revert deploy.

## Post-checks
- [ ] All endpoints respond < 200ms
- [ ] Error rate < 1%
- [ ] No DB lock contention (`pg_stat_activity`)
- [ ] Customer-facing test scenarios pass

## Communication
- **T-24h:** Email customers about scheduled maintenance
- **T-30min:** Slack #engineering announcement
- **T-0:** Status page event start
- **T+30min:** Status page event end + summary

## Risk assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Migration timeout | Low | High | Backfill in small batches |
| Lock contention | Medium | High | Read-only mode for Phase 3 |
| Backfill runs long | Medium | Low | Sleep 0.5s between batches |
| App code break | Low | Medium | Fallback for NULL |

## Abort criteria
- Phase 2 takes longer than 1 hour → continue (the backfill is safe to keep running)
- Phase 3 holds its lock for more than 5 min → kill + investigate
- Phase 4 deploy fails → revert + investigate

## Sign-off
- [ ] @alice (Author)
- [ ] @bob (DB)
- [ ] @carol (Oncall)
- [ ] @dave (PM)

## Post-mortem (if incident)
[link]
````

### Small steps per phase
```
Not one big PR.
Multiple small, reversible steps.

Each phase:
1. Action
2. Verify
3. Rollback (if needed)
```

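The action / verify / rollback loop can be sketched as a small runner. This is a minimal sketch under stated assumptions, not a real tool: `Phase`, `RunResult`, and `runPhases` are hypothetical names, and the steps are synchronous only to keep the example short (real steps would call the database or deploy system and would likely be async).

```typescript
// Hypothetical shape of one runbook phase; none of these names come from a real library.
interface Phase {
  name: string;
  action: () => void;    // apply the change
  verify: () => boolean; // true when the change took effect
  rollback: () => void;  // reverse of `action`; should be safe to call after a partial action
}

interface RunResult {
  ok: boolean;
  verified: string[];   // phases that were applied and passed verification
  rolledBack: string[]; // phases that were reverted, in the order they were reverted
  failedAt?: string;
}

// Run phases in order. If an action throws or a verify fails, roll back the
// failing phase first and then every earlier phase, newest first.
function runPhases(phases: Phase[]): RunResult {
  const applied: Phase[] = [];
  for (const phase of phases) {
    let verified = false;
    try {
      phase.action();
      verified = phase.verify();
    } catch {
      verified = false;
    }
    if (!verified) {
      const toRevert = [phase, ...applied.slice().reverse()];
      for (const p of toRevert) p.rollback();
      return {
        ok: false,
        verified: applied.map((p) => p.name),
        rolledBack: toRevert.map((p) => p.name),
        failedAt: phase.name,
      };
    }
    applied.push(phase);
  }
  return { ok: true, verified: applied.map((p) => p.name), rolledBack: [] };
}
```

Rolling back everything on the first failed verify is one possible policy; a runbook may instead stop and hold at the last verified phase.
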
### DB schema migration pattern
```
1. Add nullable column (deploy)
2. App writes new + old (deploy)
3. Backfill (background)
4. App reads from new (deploy)
5. Drop old column (deploy)

→ N deploy stages. Each stage is reversible, except that the final drop discards the old data.
```

→ This is the expand-contract pattern. Dual-write goes live before the backfill so that rows written during the backfill are not missed.

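The app-side half of expand-contract (dual-write, then read from the new column with a fallback) might look like the sketch below. The row shape, the `legacyState` field, and the stage names are hypothetical, chosen only for illustration.

```typescript
// Hypothetical row shape during the migration:
// `legacyState` stands for the old column, `status` for the new one.
interface OrderRow {
  id: number;
  legacyState?: string | null;
  status?: string | null;
}

type MigrationStage = "dual-write" | "read-new" | "contracted";

// Dual-write stage: write both columns so either app version can read the row.
// After the old column is dropped ("contracted"), write only the new one.
function writeStatus(row: OrderRow, value: string, stage: MigrationStage): OrderRow {
  if (stage === "contracted") {
    return { id: row.id, status: value };
  }
  return { ...row, legacyState: value, status: value };
}

// Read stage: prefer the new column, falling back to the old one for rows
// the backfill has not reached yet.
function readStatus(row: OrderRow, stage: MigrationStage): string | null {
  if (stage === "dual-write") {
    return row.legacyState ?? null; // the old column is still the source of truth
  }
  return row.status ?? row.legacyState ?? null;
}
```

Because every stage keeps the previous stage's data readable, reverting a deploy only means moving the stage value back by one.
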
### Online migration tools
```
gh-ost (GitHub): MySQL online schema change
pt-online-schema-change (Percona): MySQL online schema change
pg_repack: Postgres online table rewrite (removes bloat)
pg_squeeze: Postgres bloat removal
Citus: distributed Postgres (sharding)
```

→ Change large tables without long-held locks.

### Backfill strategy
```sql
-- Batch + sleep: walk the primary key in fixed id ranges, committing each batch.
-- Run outside an explicit transaction; COMMIT inside DO needs PG 11+.
DO $$
DECLARE
  low BIGINT := 0;
  max_id BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
  WHILE low < max_id LOOP
    UPDATE orders SET status = 'paid'
    WHERE id > low AND id <= low + 10000 AND status IS NULL;
    COMMIT;
    low := low + 10000;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```

```ts
// App-level batch: walk the primary key in fixed-size id ranges.
// `db.execute` is an assumed driver helper that takes `?` placeholders.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfill(value: string, maxId: number, batchSize = 10000) {
  for (let low = 0; low < maxId; low += batchSize) {
    await db.execute(
      'UPDATE orders SET status = ? WHERE id > ? AND id <= ? AND status IS NULL',
      [value, low, low + batchSize]
    );
    await sleep(500); // pause between batches to limit load
  }
}
```

→ Less impact on production traffic.

### Read-only mode (locked phase)
```ts
// App-level, inside a request handler (`flags`, `req`, `res` come from the app)
const MAINTENANCE = await flags.get('maintenance');
if (MAINTENANCE === 'readonly' && req.method !== 'GET') {
  return res.status(503).json({ error: 'maintenance' });
}
```

```sql
-- DB-level (Postgres). Applies to new sessions only, and a session can override it,
-- so pair it with the app-level gate. The migration session must run READ WRITE.
ALTER DATABASE app SET default_transaction_read_only = on;
-- ... run the migration ...
ALTER DATABASE app SET default_transaction_read_only = off;
```

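A framework-independent version of the app-level gate could look like this. It is a sketch only: `gateRequest`, `GateResult`, and the 180-second retry hint (taken from the roughly 3-minute read-only window in the template) are assumptions, and it treats `HEAD` and `OPTIONS` as read-only in addition to `GET`.

```typescript
type MaintenanceMode = "off" | "readonly";

// Methods that do not modify state under HTTP semantics.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

interface GateResult {
  allowed: boolean;
  status: number;              // HTTP status to send when blocked, 200 otherwise
  body?: { error: string };
  retryAfterSeconds?: number;  // value for a Retry-After response header
}

// Decide whether a request may proceed while the service is in read-only mode.
function gateRequest(method: string, mode: MaintenanceMode, retryAfterSeconds = 180): GateResult {
  if (mode === "readonly" && !SAFE_METHODS.has(method.toUpperCase())) {
    return { allowed: false, status: 503, body: { error: "maintenance" }, retryAfterSeconds };
  }
  return { allowed: true, status: 200 };
}
```

Keeping the decision in a pure function makes the gate easy to test before the maintenance window.
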
### Feature flag migration
```
1. Add feature behind flag (off)
2. Deploy
3. Enable for 1% (canary)
4. Monitor 30 min
5. Enable for 10%
6. Monitor 1 hour
7. 100%
8. Cleanup flag (1 week later)
```

→ See [[Backend_Feature_Flags_Deep]].

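Percentage rollout needs each user to land in a stable bucket, so that everyone enabled at 1% stays enabled at 10%. One common way is to hash the flag name and user id; the sketch below uses an FNV-1a-style hash over UTF-16 code units. The function names are hypothetical and this is not the API of any particular flag service.

```typescript
// FNV-1a-style 32-bit hash: deterministic, so a user keeps the same bucket
// as the rollout percentage grows.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Map (flag, user) to a stable bucket in [0, 100).
function bucketFor(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// A user is in the rollout when their bucket is below the current percentage.
function isEnabled(flagName: string, userId: string, rolloutPercent: number): boolean {
  if (rolloutPercent <= 0) return false;
  if (rolloutPercent >= 100) return true;
  return bucketFor(flagName, userId) < rolloutPercent;
}
```

Including the flag name in the hash input keeps the same users from being the canary group for every flag.
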
### Canary deploy
```
1% → 10% → 50% → 100%

Check after each step:
- Error rate
- Latency p95
- User signals

→ Roll back when an issue is found.
```

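The check after each canary step can be written as a promote-or-rollback decision. A minimal sketch, assuming only two signals (error rate and p95 latency) and hypothetical threshold names; user signals would need a human or a separate feed.

```typescript
interface CanaryMetrics {
  errorRate: number;    // fraction of failed requests, e.g. 0.004 = 0.4%
  latencyP95Ms: number; // 95th percentile latency in milliseconds
}

interface CanaryThresholds {
  maxErrorRate: number;         // absolute ceiling on the canary error rate
  maxLatencyRegression: number; // allowed p95 growth vs baseline, e.g. 0.2 = +20%
}

// Rollout stages taken from the list above.
const STAGES = [1, 10, 50, 100] as const;

type Decision =
  | { action: "promote"; nextPercent: number }
  | { action: "done" }
  | { action: "rollback"; reason: string };

// Compare canary metrics with the baseline and pick the next move.
function decide(
  currentPercent: number,
  baseline: CanaryMetrics,
  canary: CanaryMetrics,
  thresholds: CanaryThresholds,
): Decision {
  if (canary.errorRate > thresholds.maxErrorRate) {
    return { action: "rollback", reason: "error rate above threshold" };
  }
  const latencyCeiling = baseline.latencyP95Ms * (1 + thresholds.maxLatencyRegression);
  if (canary.latencyP95Ms > latencyCeiling) {
    return { action: "rollback", reason: "p95 latency regression" };
  }
  const next = STAGES.find((percent) => percent > currentPercent);
  return next === undefined ? { action: "done" } : { action: "promote", nextPercent: next };
}
```

For example, with `maxErrorRate: 0.01` (the 1% post-check from the template), a canary at 10% with healthy metrics is promoted to 50%.
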
### Communication template
```markdown
# Customer email (T-24h)

Subject: Scheduled maintenance: 2026-05-12 02:00 UTC

Hi {name},

We'll be performing maintenance on our service on **2026-05-12 from 02:00 to 02:30 UTC**.

During this time:
- API will be in read-only mode for ~3 minutes
- Web app will show a maintenance banner
- No data will be lost

We're sorry for any inconvenience. If you have questions, contact support.

Thanks,
The Acme Team
```

### War room
```
Big migration = open a Slack channel + Zoom call with:
- @oncall
- @migration-author
- @manager
- Related teams

→ Fast decisions + shared information.
```

### Dry run
```
Run on a production-like staging environment:
- Same data volume
- Same access pattern
- Measure: time, locks, errors

→ Expect the dry run to show roughly ½ of the actual prod time.
```

### Backout (hard rollback)
```
Verify "Can we roll back?":
- DB schema: the column can be dropped (data is lost)
- App code: is the old version still compatible?
- Data: can the backfill be undone?

→ State the rollback for every step.
```

### Types of production change
```
1. Schema migration (DB)
2. Big data migration (table → table)
3. Config change (env, secret)
4. Infra change (instance type, region)
5. Major release (new architecture)

→ Keep a runbook template for each type.
```

### Tooling
```
- Atlas / Liquibase / Flyway: schema migration
- SQL guard rail (Postgres lock_timeout)
- Statement timeout (Postgres statement_timeout)
- Monitoring (Grafana)
- ChatOps (Slack /command)
```

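The two timeout guard rails can be applied by wrapping each migration in a transaction that sets them locally. `lock_timeout` and `statement_timeout` are Postgres settings; the helper name `withGuardRails` and the default values are assumptions for illustration. Statements that cannot run inside a transaction, such as `CREATE INDEX CONCURRENTLY`, do not fit this wrapper.

```typescript
// Wrap migration statements in a transaction with session-local timeouts, so a
// statement that cannot get its lock fails fast instead of queueing behind traffic.
function withGuardRails(
  statements: string[],
  lockTimeout = "5s",
  statementTimeout = "60s",
): string {
  // Normalise each statement to end with exactly one semicolon.
  const body = statements.map((sql) => sql.trim().replace(/;+$/, "") + ";");
  return [
    "BEGIN;",
    `SET LOCAL lock_timeout = '${lockTimeout}';`,
    `SET LOCAL statement_timeout = '${statementTimeout}';`,
    ...body,
    "COMMIT;",
  ].join("\n");
}
```

`SET LOCAL` limits the timeouts to the migration transaction, so other sessions keep their normal settings.
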
### Postmortem (on failure)
```
Even when it goes well, record lessons learned.
On failure: run a postmortem immediately.

→ Input for the next migration.
```

→ See [[Productivity_Postmortem]].

### Sign-off process
```
Big migration:
- Author writes the runbook
- DB / oncall review
- Manager approval
- Customer comms (legal / PM)
- Day-of: everyone acks

→ Makes responsibility explicit.
```

### Day-of routine
```
T-1h: Final pre-check, all hands ack
T-30min: Status page open
T-0: Migration start
T+...: Verify after every phase
T+end: Status page resolve, summary
```

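The T-minus schedule can be derived from the planned start time so that the timestamps in announcements stay consistent. A small sketch; `buildSchedule` and `ScheduleEntry` are hypothetical names, and the T-24h entry comes from the customer email step in the template.

```typescript
interface ScheduleEntry {
  label: string; // e.g. "T-30min"
  at: string;    // ISO 8601 timestamp in UTC
  task: string;
}

// Compute the day-of timeline from the migration start (ISO 8601) and its duration.
function buildSchedule(startIso: string, durationMin: number): ScheduleEntry[] {
  const start = new Date(startIso).getTime();
  if (Number.isNaN(start)) throw new Error(`invalid start time: ${startIso}`);
  const at = (offsetMin: number) => new Date(start + offsetMin * 60000).toISOString();
  return [
    { label: "T-24h", at: at(-24 * 60), task: "Email customers" },
    { label: "T-1h", at: at(-60), task: "Final pre-check, all hands ack" },
    { label: "T-30min", at: at(-30), task: "Status page open" },
    { label: "T-0", at: at(0), task: "Migration start" },
    { label: "T+end", at: at(durationMin), task: "Status page resolve, summary" },
  ];
}
```

Calling `buildSchedule("2026-05-12T02:00:00Z", 30)` yields the timeline for the template's 30-minute window.
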
### Common migration types
```
1. Add column: Easy. nullable → backfill → NOT NULL.
2. Drop column: Confirm app code no longer references it, then drop.
3. Rename column: Add new, dual-write, app switch, drop old.
4. Type change: Same as rename.
5. Add unique constraint: Verify no duplicates → add.
6. Big table partition / rewrite: pg_repack (online rewrite) / Citus (sharding).
7. Index add: CREATE INDEX CONCURRENTLY.
```

## 🤔 Decision Criteria
| Change type | Recommendation |
|---|---|
| Schema add column | Standard runbook |
| Big data migration | Detailed + dry run |
| Critical infra | War room + 24h notice |
| Feature flag rollout | Canary + monitor |
| Quick fix | Light runbook (still required) |
| Reversible | Deploy + verify only |

## ❌ Anti-patterns
- **Production change without a plan**: leads to major incidents.
- **No rollback plan**: a failure turns into panic.
- **No dry run**: duration and failure modes stay unknown.
- **One big single-phase change**: any failure forces a full rollback.
- **No communication**: users and other teams are surprised.
- **Skipping sign-off**: responsibility is unclear.
- **Big migration during working hours**: high user impact.

## 🤖 Hints for LLM Use
- Small reversible steps per phase.
- Each phase: action + verify + rollback.
- Pre-check + post-check checklists.
- Communicate at least 24h in advance.

## 🔗 Related Documents
- [[Productivity_Postmortem]]
- [[Productivity_Oncall_Playbook]]
- [[DB_Migration_Safety]]
- [[DevOps_Disaster_Recovery]]