| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| productivity-migration-runbook | Migration Runbook — Plan / Verify / Rollback | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
# Migration Runbook
A critical change needs an up-front plan, verification, and rollback. Break it into small steps per phase, each with a checklist and its own rollback. This applies to DB, config, and feature changes alike.
## 📖 Key Concepts
- Phase: a small step.
- Verify: check after every step.
- Rollback: the reverse of each step.
- Comms: who / when / what.
## 💻 Code Patterns
### Template
# Migration: Add `status` column to orders table
**Owner:** @alice
**Date:** 2026-05-12 02:00 UTC
**Estimated duration:** 30 min
**Severity:** Medium (read-only mode required)
**Reviewers:** @bob (DB), @carol (oncall)
## Goal
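Add a `status` column to the `orders` table for the new state machine feature. The steps below create it as `TEXT`; adjust the Phase 1 statement if a native `ENUM` type is wanted.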
## Background
- Spec: <link>
- Related PRs: #123, #124, #125
## Pre-checks
- [ ] Latest production backup snapshot taken (auto)
- [ ] Migration tested on staging (yes, ran 25 min)
- [ ] Rollback script ready
- [ ] Status page event scheduled
- [ ] On-call notified
- [ ] Engineering team notified in #engineering
- [ ] Customer comms (email 24h ago)
## Steps
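### Phase 1: Add nullable column (00:00 - 00:05)
```sql
-- Online-safe: adding a nullable column with no default is a metadata-only change
-- (PG 11+ extends this to columns with a constant default).
ALTER TABLE orders ADD COLUMN status TEXT;
```
Verify: `\d orders` shows the new column.

Rollback: `ALTER TABLE orders DROP COLUMN status;`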
### Phase 2: Backfill (00:05 - 00:25)
Backfill existing rows in batches:
```sql
-- Run outside an explicit transaction so COMMIT is allowed inside DO (PG 11+).
DO $$
DECLARE
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE orders SET status = 'paid'
    WHERE status IS NULL AND id IN (
      SELECT id FROM orders WHERE status IS NULL LIMIT batch_size
    );
    EXIT WHEN NOT FOUND;
    COMMIT;                  -- release row locks after each batch
    PERFORM pg_sleep(0.5);   -- reduce impact on other queries
  END LOOP;
END $$;
```
Verify: `SELECT COUNT(*) FROM orders WHERE status IS NULL;` returns 0.

Rollback: N/A (safe)
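### Phase 3: Add NOT NULL (00:25 - 00:27)
```sql
-- Read-only mode starts here (about 3 min)
-- Set the default first so new rows always get a value once NOT NULL is enforced.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```
Verify: `\d orders` shows `status` as NOT NULL.

Rollback: `ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;`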
### Phase 4: Deploy app code (00:27 - 00:30)
- App reads the `status` column (existing code falls back to 'paid' if NULL)
- Deploy via standard CI/CD.

Verify: Test endpoint works.

Rollback: Revert deploy.
## Post-checks
- All endpoints respond < 200ms
- Error rate < 1%
- No DB lock contention (`pg_stat_activity`)
- Customer-facing test scenarios pass
## Communication
- T-24h: Email customers about scheduled maintenance
- T-30min: Slack #engineering announcement
- T-0: Status page event start
- T+30: Status page event end + summary
## Risk assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration timeout | Low | High | Backfill in small batches |
| Lock contention | Medium | High | Read-only mode for Phase 3 |
| Backfill runs long | Medium | Low | Sleep 0.5s between batches |
| App code break | Low | Medium | Fallback for NULL |
## Abort criteria
- If Phase 2 takes more than 1 hour → continue (the backfill is safe)
- If Phase 3 holds a lock for more than 5 min → kill + investigate
- If the Phase 4 deploy fails → revert + investigate
## Sign-off
- @alice (Author)
- @bob (DB)
- @carol (Oncall)
- @dave (PM)
## Post-mortem (if incident)
[link]
### Small steps per phase
Not one big PR. Use multiple small, reversible steps.
Each phase:
- Action
- Verify
- Rollback (if needed)
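
The action / verify / rollback triple above can be sketched as a small runner. This is a minimal sketch under assumptions, not a prescribed implementation: `runPhases` and the phase object shape (`name`, `action`, `verify`, `rollback`) are hypothetical names introduced here, and the callbacks are synchronous to keep it short. It runs phases in order and, when a verify fails, rolls back the completed phases in reverse order.

```javascript
// Hypothetical runner: each phase is { name, action, verify, rollback }.
function runPhases(phases, log = []) {
  const done = [];
  for (const phase of phases) {
    phase.action();
    log.push(`action:${phase.name}`);
    done.push(phase);
    if (!phase.verify()) {
      log.push(`verify-failed:${phase.name}`);
      // Undo in reverse order, including the phase that just failed.
      for (const p of done.reverse()) {
        if (p.rollback) {
          p.rollback();
          log.push(`rollback:${p.name}`);
        }
      }
      return { ok: false, failedAt: phase.name, log };
    }
    log.push(`verified:${phase.name}`);
  }
  return { ok: true, failedAt: null, log };
}

// Example: the second phase fails verification, so both phases are undone.
const state = { column: false, backfilled: false };
const result = runPhases([
  {
    name: 'add-column',
    action: () => { state.column = true; },
    verify: () => state.column,
    rollback: () => { state.column = false; },
  },
  {
    name: 'backfill',
    action: () => { state.backfilled = false; }, // simulated failure
    verify: () => state.backfilled,
    rollback: () => { state.backfilled = false; },
  },
]);
console.log(result.ok, result.failedAt, state);
```

Keeping the rollback next to the action in one object makes a missing rollback visible at review time rather than during an incident.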
### DB schema migration pattern
- Add nullable column (deploy)
- Backfill (background)
- App writes new + old (deploy)
- App reads from new (deploy)
- Drop old column (deploy)
→ An N-step deploy. Every step is reversible.
→ Expand-contract pattern.
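
The dual-write and read-switch steps of this pattern can be illustrated at the app level. The sketch below is an assumption-laden example: the stage names, the `legacy_state` column, and the helpers `buildRow` and `readValue` are all hypothetical, chosen only to show which column is written and read at each deploy step.

```javascript
// Hypothetical stages, one per deploy in the expand-contract sequence.
const STAGES = ['expand', 'dual-write', 'read-new', 'contract'];

// Build the row to persist. From 'dual-write' on, the new column is written too;
// at 'contract' the old column is no longer written and can be dropped.
function buildRow(stage, value) {
  const i = STAGES.indexOf(stage);
  if (i === -1) throw new Error(`unknown stage: ${stage}`);
  const row = {};
  if (i < 3) row.legacy_state = value;
  if (i >= 1) row.status = value;
  return row;
}

// Read the value. Before 'read-new' the old column is the source of truth;
// afterwards the new column is, with a fallback for rows not yet backfilled.
function readValue(stage, row) {
  const i = STAGES.indexOf(stage);
  if (i === -1) throw new Error(`unknown stage: ${stage}`);
  if (i < 2) return row.legacy_state;
  return row.status ?? row.legacy_state;
}

console.log(buildRow('dual-write', 'paid'));
console.log(readValue('read-new', { legacy_state: 'paid' }));
```

Because each stage only widens or narrows what the app touches, reverting one deploy returns the app to the previous stage without a schema change.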
### Online migration tools
- gh-ost (GitHub): MySQL online schema change
- pt-online-schema-change (Percona): MySQL
- pg_repack: Postgres table rewrite
- pg_squeeze: Postgres bloat
- Citus: Postgres sharding (distributed tables)

→ Change large tables without long-held locks.
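### Backfill strategy
```sql
-- Batch + sleep: walk the id range in fixed windows (orders/status reused from the template)
DO $$
DECLARE
  low BIGINT := 0;
  step BIGINT := 10000;
  max_id BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
  WHILE low < max_id LOOP
    UPDATE orders SET status = 'paid' WHERE id > low AND id <= low + step AND status IS NULL;
    COMMIT;  -- PG 11+, outside an explicit transaction: release locks between batches
    low := low + step;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```
```javascript
// App-level batch (keyset pagination). `db` is assumed to be a thin wrapper whose
// execute() resolves to rows for SELECT and to { affectedRows } for UPDATE.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfill(db, value) {
  let lastId = 0;
  while (true) {
    // Find the upper id bound of the next batch.
    const rows = await db.execute(
      'SELECT id FROM orders WHERE id > ? ORDER BY id LIMIT 10000',
      [lastId]
    );
    if (rows.length === 0) break;
    const highId = rows[rows.length - 1].id;
    await db.execute(
      'UPDATE orders SET status = ? WHERE id > ? AND id <= ? AND status IS NULL',
      [value, lastId, highId]
    );
    lastId = highId;
    await sleep(500);
  }
}
```
→ Less impact on production traffic.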
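
To size a windowed backfill before running it, the id range can be split into batches up front. This is a rough sketch: `planBatches` and `totalSleepMs` are hypothetical helpers, and the numbers in the example are illustrative, not measurements. Each window is `(low, high]`, that is `id > low AND id <= high`.

```javascript
// Hypothetical helper: split the id range (0, maxId] into windows of `batchSize`.
function planBatches(maxId, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches = [];
  for (let low = 0; low < maxId; low += batchSize) {
    batches.push({ low, high: Math.min(low + batchSize, maxId) });
  }
  return batches;
}

// Lower bound on total pause time: one sleep between consecutive batches.
function totalSleepMs(batchCount, sleepMs) {
  return Math.max(batchCount - 1, 0) * sleepMs;
}

const batches = planBatches(25000, 10000);
console.log(batches);
console.log(totalSleepMs(batches.length, 500));
```

The sleep total is only a floor for the run time; the per-batch update time has to come from a dry run.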
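### Read-only mode (locked phase)
```javascript
// App-level (Express-style middleware; `flags` is an assumed feature-flag client)
async function maintenanceGuard(req, res, next) {
  const mode = await flags.get('maintenance');
  const isRead = req.method === 'GET' || req.method === 'HEAD';
  if (mode === 'readonly' && !isRead) {
    return res.status(503).json({ error: 'maintenance' });
  }
  next();
}
```
```sql
-- DB-level (Postgres). Applies to new sessions only; existing connections keep their setting.
ALTER DATABASE app SET default_transaction_read_only = on;
-- Migration (its own session must opt out: SET default_transaction_read_only = off;)
ALTER DATABASE app SET default_transaction_read_only = off;
```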
### Feature flag migration
1. Add feature behind flag (off)
2. Deploy
3. Enable for 1% (canary)
4. Monitor 30 min
5. Enable for 10%
6. Monitor 1 hour
7. 100%
8. Cleanup flag (1 week later)
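
The percentage steps above rely on each user landing in the same bucket every time, so that raising the percentage only adds users. A minimal sketch of that idea follows; `isEnabled` and the flag name `new-checkout` are hypothetical, and the hash is an FNV-1a-style hash over UTF-16 code units rather than any particular flag vendor's algorithm.

```javascript
// FNV-1a-style 32-bit hash: small and non-cryptographic, good enough for bucketing.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Hypothetical flag check: a user falls into a stable bucket 0..99 and is
// enabled when the bucket is below the rollout percentage.
function isEnabled(flagName, userId, percent) {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  return bucket < percent;
}

// Example: count how many of 1000 synthetic users see the flag at 10%.
const users = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const enabled = users.filter((u) => isEnabled('new-checkout', u, 10)).length;
console.log(`enabled for ${enabled} of ${users.length} users at 10%`);
```

Including the flag name in the hashed key keeps the same users from being the canary group for every flag.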
### Canary deploy
1% → 10% → 50% → 100%
After each step, check:
- Error rate
- Latency p95
- User signals

→ Roll back if an issue is found.
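
The promote-or-rollback decision after each canary step can be written down as a small gate. This is a sketch with assumed values: the thresholds reuse the template's post-check numbers (error rate below 1%, latency below 200 ms), `nextCanaryStep` is a hypothetical name, and user signals are left out because they are hard to reduce to a number.

```javascript
// Assumed thresholds; tune them to the service's own baseline.
const LIMITS = { maxErrorRate: 0.01, maxP95Ms: 200 };
const STEPS = [1, 10, 50, 100];

// Decide what to do after observing metrics at the current traffic percentage.
function nextCanaryStep(currentPercent, metrics, limits = LIMITS) {
  const healthy =
    metrics.errorRate < limits.maxErrorRate && metrics.p95Ms < limits.maxP95Ms;
  if (!healthy) return { action: 'rollback', percent: 0 };
  const i = STEPS.indexOf(currentPercent);
  if (i === -1) throw new Error(`unknown step: ${currentPercent}`);
  if (i === STEPS.length - 1) return { action: 'done', percent: 100 };
  return { action: 'promote', percent: STEPS[i + 1] };
}

console.log(nextCanaryStep(1, { errorRate: 0.001, p95Ms: 120 }));
console.log(nextCanaryStep(10, { errorRate: 0.05, p95Ms: 120 }));
```

Writing the thresholds down before the rollout avoids debating them while the canary is live.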
### Communication template
# Customer email (T-24h)
Subject: Scheduled maintenance: 2026-05-12 02:00 UTC
Hi {name},
We'll be performing maintenance on our service on **2026-05-12 from 02:00 to 02:30 UTC**.
During this time:
- API will be in read-only mode for ~3 minutes
- Web app will show a maintenance banner
- No data will be lost
We're sorry for any inconvenience. If you have questions, contact support.
Thanks,
The Acme Team
### War room
For a big migration, open a Slack channel and a Zoom call with:
- @oncall
- @migration-author
- @manager
- Related teams

→ Fast decisions and shared information.
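### Dry run
Run on a production-like staging environment:
- Same data volume
- Same access pattern
- Measure: time, locks, errors

→ Rule of thumb: the dry run takes about ½ of the actual prod time, so budget accordingly.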
### Backout (hard rollback)
Verify "Can we roll back?":
- DB schema: dropping the column is possible (data is lost)
- App code: is the old version still compatible?
- Data: can the backfill be undone?

→ State the rollback for every step.
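
The rule that every step states its rollback can be checked mechanically before sign-off. The sketch below assumes a hypothetical step shape (`name`, `rollback`, `irreversible`, `reason`) and a hypothetical helper `findRollbackGaps`; a step passes if it names a rollback or is explicitly marked irreversible with a reason.

```javascript
// Hypothetical check: return the names of steps with no stated rollback
// and no explicit "irreversible, because ..." note.
function findRollbackGaps(steps) {
  return steps
    .filter((s) => !s.rollback && !(s.irreversible && s.reason))
    .map((s) => s.name);
}

const steps = [
  { name: 'add-column', rollback: 'ALTER TABLE orders DROP COLUMN status;' },
  { name: 'backfill', irreversible: true, reason: 'safe to leave in place' },
  { name: 'deploy-app' }, // missing rollback
];
console.log(findRollbackGaps(steps)); // [ 'deploy-app' ]
```

Requiring a reason for irreversible steps keeps "N/A" from becoming the default answer.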
### Types of production change
1. Schema migration (DB)
2. Big data migration (table → table)
3. Config change (env, secret)
4. Infra change (instance type, region)
5. Major release (new architecture)

→ Keep a runbook template for each type.
### Tooling
- Atlas / Liquibase / Flyway: schema migration
- SQL guard rail (Postgres `lock_timeout`)
- Statement timeout (Postgres `statement_timeout`)
- Monitoring (Grafana)
- ChatOps (Slack slash commands)
### Postmortem (on failure)
Even if it went well, capture lessons learned.
On failure, run a postmortem immediately.
→ Input for the next migration.
### Sign-off process
For a big migration:
- Author writes the runbook
- DB / oncall review
- Manager approval
- Customer comms (legal / PM)
- Day-of: everyone acks

→ Responsibility is explicit.
### Day-of routine
- T-1h: Final pre-check, all hands ack
- T-30min: Status page open
- T-0: Migration start
- T+...: Verify after every phase
- T+end: Status page resolve, summary
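
The relative offsets in this routine can be turned into absolute UTC timestamps once T-0 is fixed. This is a small sketch: `buildTimeline` and the `ROUTINE` table are hypothetical, only the fixed-offset entries are included, and the start time reuses the template's example date.

```javascript
// Offsets in minutes relative to T-0, taken from the fixed entries of the routine.
const ROUTINE = [
  { offsetMin: -60, label: 'Final pre-check, all hands ack' },
  { offsetMin: -30, label: 'Status page open' },
  { offsetMin: 0, label: 'Migration start' },
];

// Hypothetical helper: convert relative offsets into ISO 8601 UTC timestamps.
function buildTimeline(startIso, routine = ROUTINE) {
  const t0 = Date.parse(startIso);
  if (Number.isNaN(t0)) throw new Error(`invalid start time: ${startIso}`);
  return routine.map(({ offsetMin, label }) => ({
    at: new Date(t0 + offsetMin * 60000).toISOString(),
    label,
  }));
}

const timeline = buildTimeline('2026-05-12T02:00:00Z');
for (const { at, label } of timeline) console.log(at, label);
```

Publishing absolute times in UTC avoids timezone confusion between teams on the day.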
### Common migration types
1. Add column: Easy. nullable → backfill → NOT NULL.
2. Drop column: Once app code no longer references it, drop.
3. Rename column: Add new, dual-write, app switch, drop old.
4. Type change: Like rename.
5. Add unique constraint: Verify there are no duplicates, then add.
6. Big table partition: pg_repack / Citus.
7. Add index: `CREATE INDEX CONCURRENTLY`.
## 🤔 Decision Criteria
| Change type | Recommendation |
|---|---|
| Schema add column | Standard runbook |
| Big data migration | Detailed + dry run |
| Critical infra | War room + 24h notice |
| Feature flag rollout | Canary + monitor |
| Quick fix | Light runbook (still!) |
| Reversible | Deploy + verify only |
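## ❌ Anti-patterns
- Production change without a plan: leads to big incidents.
- No rollback plan: breakage = panic.
- No dry run: you don't know what will happen.
- Single-phase big change: a failure means rolling back everything.
- No communication: users and other teams are surprised.
- Skipping sign-off: responsibility is unclear.
- Big migration during working hours: large user impact.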
## 🤖 LLM Usage Hints
- Small reversible steps per phase.
- Each phase: action + verify + rollback.
- Pre-check + post-check checklist.
- Communicate 24h+ in advance.