---
id: productivity-migration-runbook
title: Migration Runbook - Plan / Verify / Rollback
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, migration, runbook, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [migration runbook, deploy plan, rollback plan, schema migration, change management]
---

# Migration Runbook

> Critical change = up-front plan + verify + rollback. **Small steps per phase + checklist + rollback at each step**. Applies to DB, config, and feature changes alike.

## 📖 Core concepts

- Phase: a small step.
- Verify: check after every step.
- Rollback: the reverse of each step.
- Comms: who / when / what.

## 💻 Code patterns

### Template

````markdown
# Migration: Add `status` column to orders table

**Owner:** @alice
**Date:** 2026-05-12 02:00 UTC
**Estimated duration:** 30 min
**Severity:** Medium (read-only mode required)
**Reviewers:** @bob (DB), @carol (oncall)

## Goal
Add a `status` column (stored as `TEXT`) to the `orders` table for the new state machine feature.

## Background
- Spec:
- Related PRs: #123, #124, #125

## Pre-checks
- [ ] Production backup latest snapshot taken (auto)
- [ ] Migration tested on staging (yes, ran 25 min)
- [ ] Rollback script ready
- [ ] Status page event scheduled
- [ ] On-call notified
- [ ] Engineering team notified in #engineering
- [ ] Customer comms (email 24h ago)

## Steps

### Phase 1: Add nullable column (00:00 - 00:05)
```sql
-- Online-safe: a nullable column with no default is a metadata-only change
-- (PG 11+ extends this to constant defaults)
ALTER TABLE orders ADD COLUMN status TEXT;
```
**Verify:** `\d orders` shows new column.
**Rollback:** `ALTER TABLE orders DROP COLUMN status;`

### Phase 2: Backfill (00:05 - 00:25)
Backfill existing rows in batches:
```sql
DO $$
DECLARE
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE orders SET status = 'paid'
    WHERE status IS NULL
      AND id IN (
        SELECT id FROM orders WHERE status IS NULL LIMIT batch_size
      );
    EXIT WHEN NOT FOUND;
    COMMIT;                 -- PG 11+: release row locks per batch (run the DO outside an explicit transaction)
    PERFORM pg_sleep(0.5);  -- reduce impact on other queries
  END LOOP;
END $$;
```
**Verify:** `SELECT COUNT(*) FROM orders WHERE status IS NULL;` = 0
**Rollback:** N/A (safe: the column is still nullable, and the Phase 1 rollback drops it)

### Phase 3: Add NOT NULL (00:25 - 00:27)
```sql
-- Read-only mode starts (3 min)
-- Set the default first so rows inserted from now on are never NULL.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- Catch up any rows inserted after the Phase 2 backfill finished.
UPDATE orders SET status = 'paid' WHERE status IS NULL;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```
**Verify:** `\d orders` shows `status` as `not null`.
**Rollback:** `ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;`

### Phase 4: Deploy app code (00:27 - 00:30)
- App reads the `status` column (existing code falls back to 'paid' if NULL)
- Deploy via standard CI/CD.

**Verify:** Test endpoint works.
**Rollback:** Revert deploy.

## Post-checks
- [ ] All endpoints respond < 200ms
- [ ] Error rate < 1%
- [ ] No DB lock contention (`pg_stat_activity`)
- [ ] Customer-facing test scenarios pass

## Communication
- **T-24h:** Email customers about scheduled maintenance
- **T-30min:** Slack #engineering announcement
- **T-0:** Status page event start
- **T+30min:** Status page event end + summary

## Risk assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Migration timeout | Low | High | Backfill in small batches |
| Lock contention | Medium | High | Read-only mode for Phase 3 |
| Backfill long | Medium | Low | Sleep 0.5s between batches |
| App code break | Low | Medium | Fallback for NULL |

## Abort criteria
- If Phase 2 takes longer than 1 hour → continue (the backfill is safe)
- If Phase 3 holds a lock for more than 5 min → kill + investigate
- If the Phase 4 deploy fails → revert + investigate

## Sign-off
- [ ] @alice (Author)
- [ ] @bob (DB)
- [ ] @carol (Oncall)
- [ ] @dave (PM)

## Post-mortem (if incident)
[link]
````

### Small steps per phase
```
No single big PR. Many small reversible steps.

Each phase:
1. Action
2. Verify
3. Rollback (if needed)
```

### DB schema migration pattern
```
1. Add nullable column (deploy)
2. Backfill (background)
3. App writes new + old (deploy)
4. App reads from new (deploy)
5. Drop old column (deploy)

→ N deploy steps. Every step is reversible.
```

→ Expand-contract pattern.

### Online migration tools
```
gh-ost (GitHub): MySQL online schema change
pt-online-schema-change (Percona): MySQL
pg_repack: Postgres online table rewrite
pg_squeeze: Postgres bloat removal
Citus: Postgres sharding (distributed tables)
```

→ Change large tables with minimal locking.

### Backfill strategy
```sql
-- Batch + sleep (PG 11+; run outside an explicit transaction so COMMIT is allowed)
-- `my_table`, `col`, and the `...` expression are placeholders.
DO $$
DECLARE
  batch_size BIGINT := 10000;
  low_id     BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM my_table;
  WHILE low_id < max_id LOOP
    UPDATE my_table SET col = ...
    WHERE id > low_id AND id <= low_id + batch_size;
    low_id := low_id + batch_size;
    COMMIT;                -- release locks after each batch
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```

```ts
// App-level batch (keyset pagination).
// `db.execute` is assumed to return an array of rows for SELECT; `my_table` and `col` are placeholders.
async function backfill(value: string) {
  let lastId = 0;
  while (true) {
    // Find the upper bound of the next batch.
    const rows = await db.execute(
      'SELECT id FROM my_table WHERE id > ? ORDER BY id LIMIT 10000',
      [lastId]
    );
    if (rows.length === 0) break;
    const maxId = rows[rows.length - 1].id;
    // Update exactly that id range, then advance the cursor.
    await db.execute(
      'UPDATE my_table SET col = ? WHERE id > ? AND id <= ?',
      [value, lastId, maxId]
    );
    lastId = maxId;
    await sleep(500);
  }
}
```

→ Lowers the impact on production traffic.

### Read-only mode (locked phase)
```ts
// App-level
const MAINTENANCE = await flags.get('maintenance');
const isWrite = !['GET', 'HEAD', 'OPTIONS'].includes(req.method);
if (MAINTENANCE === 'readonly' && isWrite) {
  return res.status(503).json({ error: 'maintenance' });
}
```

```sql
-- DB-level (Postgres). Applies to new sessions only; existing connections keep their setting.
ALTER DATABASE app SET default_transaction_read_only = on;
-- ... run the migration (the migration session itself must stay read-write) ...
ALTER DATABASE app SET default_transaction_read_only = off;
```

### Feature flag migration
```
1. Add feature behind flag (off)
2. Deploy
3. Enable for 1% (canary)
4. Monitor 30 min
5. Enable for 10%
6. Monitor 1 hour
7. 100%
8. Cleanup flag (1 week later)
```

→ [[Backend_Feature_Flags_Deep]].

### Canary deploy
```
1% → 10% → 50% → 100%

Check after each step:
- Error rate
- Latency p95
- User signals

→ Roll back if an issue is found.
```

### Communication template
```markdown
# Customer email (T-24h)

Subject: Scheduled maintenance: 2026-05-12 02:00 UTC

Hi {name},

We'll be performing maintenance on our service on **2026-05-12 from 02:00 to 02:30 UTC**.

During this time:
- API will be in read-only mode for ~3 minutes
- Web app will show a maintenance banner
- No data will be lost

We're sorry for any inconvenience. If you have questions, contact support.

Thanks,
The Acme Team
```

### War room
```
Big migration = Slack channel + open Zoom call:
- @oncall
- @migration-author
- @manager
- Related teams

→ Fast decisions + shared information.
```

### Dry run
```
Run on production-like staging:
- Same data volume
- Same access pattern
- Measure: time, locks, errors

→ Expect the staging run to predict roughly ½ of the actual prod time.
```

### Backout (hard rollback)
```
Verify "can we roll back?":
- DB schema: the column can be dropped (data is lost)
- App code: is the old version still compatible?
- Data: can backfilled data be reverted?

→ State the rollback for every step.
```

### Types of production change
```
1. Schema migration (DB)
2. Big data migration (table → table)
3. Config change (env, secret)
4. Infra change (instance type, region)
5. Major release (new architecture)

→ A runbook template for each type.
```

### Tooling
```
- Atlas / Liquibase / Flyway: schema migration
- SQL guard rail (lock_timeout)
- Statement timeout (statement_timeout)
- Monitoring (Grafana)
- ChatOps (Slack /command)
```

### Postmortem (on failure)
```
Capture lessons learned even when it goes well.
On failure: run a postmortem immediately.

→ Input for the next migration.
```

→ [[Productivity_Postmortem]].

### Sign-off process
```
Big migration:
- Author writes the runbook
- DB / on-call review
- Manager approval
- Customer comms (legal / PM)
- Day-of: everyone acks

→ Makes ownership explicit.
```

### Day-of routine
```
T-1h: Final pre-check, all hands ack
T-30min: Status page open
T-0: Migration start
T+...: Verify after every phase
T+end: Status page resolve, summary
```

### Common migration types
```
1. Add column: Easy. nullable → backfill → NOT NULL.
2. Drop column: Drop once app code no longer references it.
3. Rename column: Add new, dual-write, app switch, drop old.
4. Type change: Like rename.
5. Add unique constraint: Verify no duplicates → add.
6. Big table rewrite / partition: pg_repack (online rewrite) / Citus (sharding).
7. Index add: CREATE INDEX CONCURRENTLY.
```

## 🤔 Decision criteria

| Change type | Recommendation |
|---|---|
| Schema add column | Standard runbook |
| Big data migration | Detailed + dry run |
| Critical infra | War room + 24h notice |
| Feature flag rollout | Canary + monitor |
| Quick fix | Light runbook (still!) |
| Reversible | Deploy + verify only |

## ❌ Anti-patterns

- **Production change without a plan**: leads to major incidents.
- **No rollback plan**: a breakage turns into panic.
- **No dry run**: you do not know how long it takes or what it locks.
- **One big single-phase change**: a failure forces a full rollback.
- **No communication**: users and other teams are surprised.
- **Skipping sign-off**: ownership is unclear.
- **Big migration during working hours**: large user impact.

## 🤖 LLM usage hints

- Small reversible steps per phase.
- Each phase: action + verify + rollback.
- Pre-check + post-check checklists.
- Communicate 24h+ in advance.

## 🔗 Related documents

- [[Productivity_Postmortem]]
- [[Productivity_Oncall_Playbook]]
- [[DB_Migration_Safety]]
- [[DevOps_Disaster_Recovery]]
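
### Appendix: phase runner sketch

The action / verify / rollback loop that this runbook repeats for every phase can be sketched as a small runner. This is a minimal illustration, not a tool the runbook prescribes: the `Phase` and `RunResult` types and the `runPhases` function are hypothetical names invented here, and it assumes each rollback is idempotent and safe to call even when its action only partly ran.

```typescript
// Hypothetical types for illustration; not taken from any library.
interface Phase {
  name: string;
  action: () => Promise<void>;     // apply the change
  verify: () => Promise<boolean>;  // true when the change looks correct
  rollback: () => Promise<void>;   // reverse of the action (assumed idempotent)
}

interface RunResult {
  ok: boolean;
  completed: string[];   // phases that passed verify
  rolledBack: string[];  // phases undone, in the order they were undone
  failedAt?: string;     // name of the phase that failed, if any
}

// Run phases in order. On the first failed action or verify,
// undo the failed phase and then every earlier phase, newest first.
async function runPhases(phases: Phase[]): Promise<RunResult> {
  const done: Phase[] = [];
  for (const phase of phases) {
    let verified = false;
    try {
      await phase.action();
      verified = await phase.verify();
    } catch {
      verified = false; // a thrown error counts as a failed phase
    }
    if (!verified) {
      const toUndo = [phase, ...done.slice().reverse()];
      const rolledBack: string[] = [];
      for (const p of toUndo) {
        await p.rollback();
        rolledBack.push(p.name);
      }
      return {
        ok: false,
        completed: done.map((p) => p.name),
        rolledBack,
        failedAt: phase.name,
      };
    }
    done.push(phase);
  }
  return { ok: true, completed: done.map((p) => p.name), rolledBack: [] };
}
```

In practice each `action` would wrap one step from the template (for example the Phase 1 `ALTER TABLE`), each `verify` its check query, and each `rollback` its documented reverse. A real migration would usually stop and page a human rather than roll back everything automatically, so treat the full unwind here as one possible policy.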