id, title, category, status, source_trust_level, verification_status, created_at, updated_at, tags, tech_stack, applied_in, aliases
productivity-migration-runbook Migration Runbook — Plan / Verify / Rollback Coding draft B conceptual 2026-05-09 2026-05-09
productivity
migration
runbook
vibe-coding
language applicable_to
Process
Engineering
migration runbook
deploy plan
rollback plan
schema migration
change management

Migration Runbook

Critical change = upfront plan + verify + rollback. Split it into phases of small steps, each with a checklist and its own rollback. Applies to DB / config / feature changes alike.

📖 Core concepts

  • Phase: a small step.
  • Verify: check after every step.
  • Rollback: the reverse of each step.
  • Comms: who / when / what.
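
As a minimal sketch of how these four concepts fit together, the following JavaScript runs phases in order, verifies each one, and rolls back the attempted phases in reverse order when a step fails. The `runPhases` function and the `action` / `verify` / `rollback` / `notify` names are illustrative assumptions, not part of any existing tool, and rollbacks are assumed to be idempotent.

```javascript
// Hypothetical phase runner: all names and shapes here are assumptions.
async function runPhases(phases, notify = () => {}) {
  const attempted = [];
  for (const phase of phases) {
    notify(`start: ${phase.name}`);          // Comms: who / when / what
    attempted.push(phase);                   // a failed action may still need undoing
    let ok = false;
    try {
      await phase.action();                  // Phase: one small step
      ok = await phase.verify();             // Verify: check after every step
    } catch (err) {
      notify(`error in ${phase.name}: ${err.message}`);
    }
    if (!ok) {
      // Rollback: undo attempted phases, newest first
      for (const p of [...attempted].reverse()) {
        notify(`rollback: ${p.name}`);
        await p.rollback();
      }
      return { ok: false, failedAt: phase.name };
    }
  }
  return { ok: true, failedAt: null };
}
```

Keeping `verify` and `rollback` next to each `action` makes it hard to add a step without also writing down how to check it and how to undo it.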

💻 Code patterns

### Template

# Migration: Add `status` column to orders table

**Owner:** @alice
**Date:** 2026-05-12 02:00 UTC
**Estimated duration:** 30 min
**Severity:** Medium (read-only mode required)
**Reviewers:** @bob (DB), @carol (oncall)

## Goal
Add a `status` column (`TEXT`) to the `orders` table for the new state machine feature.

## Background
- Spec: <link>
- Related PRs: #123, #124, #125

## Pre-checks
- [ ] Latest production backup snapshot taken (auto)
- [ ] Migration tested on staging (yes, ran 25 min)
- [ ] Rollback script ready
- [ ] Status page event scheduled
- [ ] On-call notified
- [ ] Engineering team notified in #engineering
- [ ] Customer comms (email 24h ago)

## Steps

### Phase 1: Add nullable column (00:00 - 00:05)
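```sql
-- Online-safe: a nullable column with no default is a metadata-only change (no table rewrite).
-- PG 11+ extends this to columns with a constant default.
ALTER TABLE orders ADD COLUMN status TEXT;
```

**Verify:** `\d orders` shows the new column.
**Rollback:** `ALTER TABLE orders DROP COLUMN status;`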

### Phase 2: Backfill (00:05 - 00:25)

Backfill existing rows in batches:

```sql
-- Run outside an explicit transaction block so the per-batch COMMIT is allowed (PG 11+).
DO $$
DECLARE
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE orders SET status = 'paid'
    WHERE status IS NULL AND id IN (
      SELECT id FROM orders WHERE status IS NULL LIMIT batch_size
    );
    EXIT WHEN NOT FOUND;
    COMMIT;                 -- release row locks after each batch
    PERFORM pg_sleep(0.5);  -- reduce impact on other queries
  END LOOP;
END $$;
```

**Verify:** `SELECT COUNT(*) FROM orders WHERE status IS NULL;` returns 0.
**Rollback:** N/A (safe).

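### Phase 3: Add NOT NULL (00:25 - 00:27)

```sql
-- Read-only mode starts (3 min)
-- Set the default first so any row inserted from here on already has a status.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
```

**Verify:** `\d orders` shows `status` as `NOT NULL`.
**Rollback:** `ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;`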

### Phase 4: Deploy app code (00:27 - 00:30)

  • App reads the `status` column (existing code falls back to 'paid' if NULL)
  • Deploy via standard CI/CD.

**Verify:** Test endpoint works.
**Rollback:** Revert deploy.

## Post-checks

  • All endpoints respond < 200ms
  • Error rate < 1%
  • No DB lock contention (pg_stat_activity)
  • Customer-facing test scenarios pass

## Communication

  • T-24h: Email customers about scheduled maintenance
  • T-30min: Slack #engineering announcement
  • T-0: Status page event start
  • T+30: Status page event end + summary

## Risk assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Migration timeout | Low | High | Backfill in small batches |
| Lock contention | Medium | High | Read-only mode for Phase 3 |
| Long backfill | Medium | Low | Sleep 0.5s between batches |
| App code breaks | Low | Medium | Fallback for NULL |

## Abort criteria

  • Phase 2 takes longer than 1 hour → continue (the backfill is safe)
  • Phase 3 holds a lock for more than 5 min → kill + investigate
  • Phase 4 deploy fails → revert + investigate

## Sign-off

  • @alice (Author)
  • @bob (DB)
  • @carol (Oncall)
  • @dave (PM)

## Post-mortem (if incident)

[link]


### Small steps per phase

Not one big PR. Many small, reversible steps instead.

Each phase:

  1. Action
  2. Verify
  3. Rollback (if needed)

### DB schema migration pattern
  1. Add nullable column (deploy)
  2. Backfill (background)
  3. App writes new + old (deploy)
  4. App reads from new (deploy)
  5. Drop old column (deploy)

→ N deploys. Each step is reversible, except the final drop, which loses data.

→ Expand-contract pattern.

### Online migration tools

- gh-ost (GitHub): MySQL online schema change
- pt-online-schema-change (Percona): MySQL
- pg_repack: Postgres table rewrite
- pg_squeeze: Postgres bloat
- Citus: Postgres sharding (distributed tables)

→ Change large tables without holding long locks.

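### Backfill strategies
```sql
-- Batch + sleep: walk the id range in fixed-size windows.
-- my_table / col are placeholder names; fill in the real value expression for "...".
DO $$
DECLARE
  batch_size BIGINT := 10000;
  low        BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM my_table;
  WHILE low < max_id LOOP   -- never entered when the table is empty (max_id IS NULL)
    UPDATE my_table SET col = ... WHERE id > low AND id <= low + batch_size;
    low := low + batch_size;
    COMMIT;                 -- allowed in DO on PG 11+ when not inside a transaction block
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
```

```js
// App-level batch. Assumes a hypothetical `db` wrapper where query() resolves to rows
// and execute() runs a statement, plus a sleep(ms) helper.
async function backfill(db, value) {
  let lastId = 0;
  while (true) {
    // Keyset pagination: find the upper id of the next batch.
    const rows = await db.query(
      'SELECT id FROM my_table WHERE id > ? ORDER BY id LIMIT 10000',
      [lastId]
    );
    if (rows.length === 0) break;
    const maxId = rows[rows.length - 1].id;
    await db.execute(
      'UPDATE my_table SET col = ? WHERE id > ? AND id <= ?',
      [value, lastId, maxId]
    );
    lastId = maxId;
    await sleep(500);
  }
}
```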

→ Less impact on production traffic.
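
To make the batching arithmetic concrete, here is a small sketch of the id windows a backfill walks through. `idWindows` is a hypothetical helper; it assumes integer ids and yields half-open `(low, high]` windows that a caller would plug into `WHERE id > low AND id <= high`.

```javascript
// Hypothetical helper: yields (low, high] id windows covering minId..maxId.
function* idWindows(minId, maxId, batchSize) {
  if (batchSize <= 0) throw new RangeError('batchSize must be positive');
  for (let low = minId - 1; low < maxId; low += batchSize) {
    yield { low, high: Math.min(low + batchSize, maxId) };
  }
}

// Example: ids 1..25 in batches of 10 give the windows (0,10], (10,20], (20,25].
const windows = [...idWindows(1, 25, 10)];
```

With sparse ids the windows contain uneven row counts, which is usually acceptable for a backfill; the keyset variant (select the next N ids, then update that range) keeps batch sizes constant instead.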

### Read-only mode (locked phase)

```js
// App-level
const MAINTENANCE = await flags.get('maintenance');
if (MAINTENANCE === 'readonly' && req.method !== 'GET') {
  return res.status(503).json({ error: 'maintenance' });
}
```

```sql
-- DB-level (Postgres)
-- Applies to new sessions only; existing connections keep their setting until they reconnect.
ALTER DATABASE app SET default_transaction_read_only = on;
-- Migration
ALTER DATABASE app SET default_transaction_read_only = off;
```
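
The app-level check can also be pulled into a pure function so it is easy to unit-test. The sketch below is one possible shape: `readOnlyGuard` is a hypothetical name, treating `HEAD` and `OPTIONS` as safe alongside `GET` is an added assumption, and the `Retry-After` value of 180 seconds simply mirrors the roughly 3-minute read-only window.

```javascript
// Hypothetical guard: returns a 503 response description, or null to let the request through.
const SAFE_METHODS = new Set(['GET', 'HEAD', 'OPTIONS']);

function readOnlyGuard(mode, method) {
  if (mode === 'readonly' && !SAFE_METHODS.has(method.toUpperCase())) {
    return {
      status: 503,
      headers: { 'Retry-After': '180' }, // seconds; assumed ~3 min window
      body: { error: 'maintenance' },
    };
  }
  return null;
}
```

A middleware would call `readOnlyGuard(await flags.get('maintenance'), req.method)` and send the returned response when it is not `null`.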

### Feature flag migration

1. Add feature behind flag (off)
2. Deploy
3. Enable for 1% (canary)
4. Monitor 30 min
5. Enable for 10%
6. Monitor 1 hour
7. 100%
8. Cleanup flag (1 week later)

See Backend_Feature_Flags_Deep.
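
The percentage steps (1% → 10% → 100%) are commonly implemented by hashing each user into a stable bucket, so a user who is enabled at 1% stays enabled at 10%. The following is a minimal sketch under that assumption; `bucketOf` and `isEnabled` are hypothetical names, not the API of any particular flag service.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: maps (flag, user) to a stable bucket in 0..99.
function bucketOf(flag, userId) {
  const digest = createHash('sha256').update(`${flag}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket is below the current percentage.
function isEnabled(flag, userId, percent) {
  return bucketOf(flag, userId) < percent;
}
```

Including the flag name in the hash input keeps the same users from always landing in the first 1% of every rollout.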

### Canary deploy

1% → 10% → 50% → 100%

Check after each step:
- Error rate
- Latency p95
- User signals

→ Roll back when an issue is found.
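
The check after each canary step can be written down as an explicit gate. This is a sketch only: `canaryDecision` is a hypothetical function, the 1% error-rate limit is borrowed from the post-checks in the template, and the 1.2× p95 ratio is an illustrative assumption rather than a recommendation.

```javascript
// Hypothetical canary gate: compares canary metrics against the baseline.
function canaryDecision(baseline, canary, limits = { maxErrorRate: 0.01, maxP95Ratio: 1.2 }) {
  const reasons = [];
  if (canary.errorRate > limits.maxErrorRate) {
    reasons.push('error rate above limit');
  }
  if (canary.p95Ms > baseline.p95Ms * limits.maxP95Ratio) {
    reasons.push('p95 latency regressed');
  }
  return { action: reasons.length === 0 ? 'promote' : 'rollback', reasons };
}
```

User signals (support tickets, complaints) are hard to encode this way and would still need a human look before promoting to the next percentage.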

### Communication template

# Customer email (T-24h)

Subject: Scheduled maintenance: 2026-05-12 02:00 UTC

Hi {name},

We'll be performing maintenance on our service on **2026-05-12 from 02:00 to 02:30 UTC**.

During this time:
- API will be in read-only mode for ~3 minutes
- Web app will show a maintenance banner
- No data will be lost

We're sorry for any inconvenience. If you have questions, contact support.

Thanks,
The Acme Team

### War room

Big migration = open a Slack channel + Zoom with:
- @oncall
- @migration-author
- @manager
- Related teams

→ Fast decisions + shared information.

### Dry run

Run on a production-like staging environment:
- Same data volume
- Same access pattern
- Measure: time, locks, errors

→ Predicts the actual prod duration (about ½ hour here).

### Backout (hard rollback)

Verify "can we actually roll back?":
- DB schema: the column can be dropped (data is lost)
- App code: is the old version still compatible?
- Data: can the backfilled data be reverted?

→ State the rollback for every step.

### Types of production change

1. Schema migration (DB)
2. Big data migration (table → table)
3. Config change (env, secret)
4. Infra change (instance type, region)
5. Major release (new architecture)

→ One runbook template per type.

### Tooling

- Atlas / Liquibase / Flyway: schema migration
- SQL guard rail (Postgres `lock_timeout`)
- Statement timeout (Postgres `statement_timeout`)
- Monitoring (Grafana)
- ChatOps (Slack /command)
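
As a sketch of the lock-timeout guard rail, the helper below sets Postgres `lock_timeout` before a DDL statement and retries when the lock could not be acquired in time (SQLSTATE `55P03`). `runDdlWithLockTimeout` is a hypothetical name, and `client` is assumed to expose `query(sql)` returning a Promise and to put the SQLSTATE on `err.code`, as node-postgres does; `lockTimeout` is assumed to come from trusted config because it is interpolated into the statement.

```javascript
const LOCK_NOT_AVAILABLE = '55P03'; // SQLSTATE raised when lock_timeout expires

// Hypothetical helper: fail fast on lock waits instead of queueing behind long transactions.
async function runDdlWithLockTimeout(
  client,
  ddl,
  { lockTimeout = '5s', attempts = 3, waitMs = 1000 } = {}
) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await client.query(`SET lock_timeout = '${lockTimeout}'`);
      await client.query(ddl);
      return attempt; // number of attempts it took
    } catch (err) {
      // Only lock timeouts are retried; anything else is re-thrown immediately.
      if (err.code !== LOCK_NOT_AVAILABLE || attempt === attempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, waitMs * attempt)); // linear backoff
    }
  }
}
```

A short lock timeout with retries keeps an `ALTER TABLE` from blocking every other query on the table while it waits behind a long-running transaction.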

### Postmortem (on failure)

Capture lessons learned even when it went well.
On failure: run a postmortem immediately.

→ Input for the next migration.

See Productivity_Postmortem.

### Sign-off process

Big migration:
- Author writes the runbook
- DB / oncall review
- Manager approval
- Customer comms (legal / PM)
- Day-of: everyone acks

→ Makes ownership explicit.

### Day-of routine

T-1h:    Final pre-check, all hands ack
T-30min: Status page open
T-0:     Migration start
T+...:   Verify after every phase
T+end:   Status page resolve, summary
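
The relative offsets can be turned into absolute UTC times once T-0 is fixed, which is handy for calendar invites and status page scheduling. `buildTimeline` below is a hypothetical helper written for illustration; it covers only the fixed-offset rows, since per-phase verification happens at times that depend on the run.

```javascript
// Hypothetical helper: converts a T-0 timestamp into absolute UTC times.
function buildTimeline(t0Iso, durationMin) {
  const t0 = new Date(t0Iso).getTime();
  const at = (offsetMin) => new Date(t0 + offsetMin * 60_000).toISOString();
  return [
    { label: 'T-1h', at: at(-60), task: 'Final pre-check, all hands ack' },
    { label: 'T-30min', at: at(-30), task: 'Status page open' },
    { label: 'T-0', at: at(0), task: 'Migration start' },
    { label: 'T+end', at: at(durationMin), task: 'Status page resolve, summary' },
  ];
}

// Example with the template's window: 2026-05-12 02:00 UTC, 30 minutes.
const timeline = buildTimeline('2026-05-12T02:00:00Z', 30);
```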

### Common migration types

1. Add column:            Easy. nullable → backfill → NOT NULL.
2. Drop column:           Drop once app code no longer references it.
3. Rename column:         Add new, dual-write, app switch, drop old.
4. Type change:           Like rename.
5. Add unique constraint: Verify no dup → add.
6. Big table partition:   pg_repack / Citus.
7. Index add:             CONCURRENTLY.
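
The rename pattern in item 3 (add new, dual-write, app switch, drop old) can be sketched at the application level as below. Everything here is hypothetical: `state` stands for the old column, `status` for the new one, and the four `stage` values are illustrative names for the deploy steps.

```javascript
// Stages: 'legacy' (old only) → 'dual-write' → 'switch' (read new) → 'contract' (new only).
function toRow(order, stage) {
  const row = { id: order.id };
  if (stage !== 'contract') row.state = order.status; // keep writing the old column
  if (stage !== 'legacy') row.status = order.status;  // start writing the new column
  return row;
}

function readStatus(row, stage) {
  // Until reads are switched, the old column stays the source of truth.
  if (stage === 'legacy' || stage === 'dual-write') return row.state;
  // After the switch: prefer the new column, fall back to the old one.
  return row.status ?? row.state;
}
```

Because each stage is a separate deploy, rolling back one stage only means redeploying the previous one; the old column is dropped only after the `contract` stage has been stable.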

🤔 Decision criteria

| Change type | Recommendation |
|---|---|
| Schema add column | Standard runbook |
| Big data migration | Detailed + dry run |
| Critical infra | War room + 24h notice |
| Feature flag rollout | Canary + monitor |
| Quick fix | Light runbook (still!) |
| Reversible | Deploy + verify only |

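Anti-patterns

  • Production change without a plan: leads to big incidents.
  • No rollback plan: breakage = panic.
  • No dry run: you don't know what will happen.
  • One big single-phase change: a failure means rolling back everything.
  • No communication: surprises users / other teams.
  • Skipping sign-off: ownership is unclear.
  • Big migration during working hours: large user impact.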

🤖 LLM usage hints

  • Small reversible steps per phase.
  • Each phase: action + verify + rollback.
  • Pre-check + post-check checklists.
  • Communicate 24h+ in advance.

🔗 Related documents