[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,385 @@
---
id: productivity-migration-runbook
title: Migration Runbook — Plan / Verify / Rollback
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [productivity, migration, runbook, vibe-coding]
tech_stack: { language: "Process", applicable_to: ["Engineering"] }
applied_in: []
aliases: [migration runbook, deploy plan, rollback plan, schema migration, change management]
---
# Migration Runbook
> Critical change = plan + verify + rollback, all prepared in advance. **Split the work into small steps per phase, each with a checklist and its own rollback.** Applies to DB, config, and feature changes alike.
## 📖 Core Concepts
- Phase: one small step.
- Verify: check after every step.
- Rollback: the reverse of each step.
- Comms: who / when / what.
## 💻 Code Patterns
### Template
```markdown
# Migration: Add `status` column to orders table
**Owner:** @alice
**Date:** 2026-05-12 02:00 UTC
**Estimated duration:** 30 min
**Severity:** Medium (read-only mode required)
**Reviewers:** @bob (DB), @carol (oncall)
## Goal
Add a `status` column (`TEXT`) to the `orders` table for the new state machine feature.
## Background
- Spec: <link>
- Related PRs: #123, #124, #125
## Pre-checks
- [ ] Latest production backup snapshot taken (automatic)
- [ ] Migration tested on staging (yes, ran 25 min)
- [ ] Rollback script ready
- [ ] Status page event scheduled
- [ ] On-call notified
- [ ] Engineering team notified in #engineering
- [ ] Customer comms (email 24h ago)
## Steps
### Phase 1: Add nullable column (00:00 - 00:05)
```sql
-- Online-safe: a nullable column with no default is a metadata-only change (a constant DEFAULT is also metadata-only on PG 11+)
ALTER TABLE orders ADD COLUMN status TEXT;
```
**Verify:** `\d orders` shows new column.
**Rollback:** `ALTER TABLE orders DROP COLUMN status;`
### Phase 2: Backfill (00:05 - 00:25)
Backfill existing rows in batches:
```sql
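-- Run outside an explicit transaction block so COMMIT is allowed (PG 11+).
DO $$
DECLARE
  batch_size INT := 10000;
BEGIN
  LOOP
    UPDATE orders SET status = 'paid'
    WHERE status IS NULL AND id IN (
      SELECT id FROM orders WHERE status IS NULL LIMIT batch_size
    );
    EXIT WHEN NOT FOUND;    -- nothing left to backfill
    COMMIT;                 -- release row locks after each batch
    PERFORM pg_sleep(0.5);  -- reduce impact on other queries
  END LOOP;
END $$;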
```
**Verify:** `SELECT COUNT(*) FROM orders WHERE status IS NULL;` = 0
**Rollback:** N/A (safe: the column is still nullable and the deployed app does not read it yet)
### Phase 3: Add NOT NULL (00:25 - 00:27)
```sql
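-- Read-only mode starts (3 min)
-- Set the default first so new inserts always have a value for status.
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- SET NOT NULL scans the table while holding an ACCESS EXCLUSIVE lock.
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;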
```
**Verify:** `\d orders` shows `status` as `not null`.
**Rollback:** `ALTER TABLE orders ALTER COLUMN status DROP NOT NULL;`
### Phase 4: Deploy app code (00:27 - 00:30)
- App reads the `status` column (existing code falls back to 'paid' if NULL)
- Deploy via standard CI/CD.
**Verify:** Test endpoint works.
**Rollback:** Revert deploy.
## Post-checks
- [ ] All endpoints respond < 200ms
- [ ] Error rate < 1%
- [ ] No DB lock contention (`pg_stat_activity`)
- [ ] Customer-facing test scenarios pass
## Communication
- **T-24h:** Email customers about scheduled maintenance
- **T-30min:** Slack #engineering announcement
- **T-0:** Status page event start
- **T+30:** Status page event end + summary
## Risk assessment
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Migration timeout | Low | High | Backfill in small batches |
| Lock contention | Medium | High | Read-only mode for Phase 3 |
| Backfill runs long | Medium | Low | Sleep 0.5s between batches |
| App code break | Low | Medium | Fallback for NULL |
## Abort criteria
- If Phase 2 takes longer than 1 hour → continue (the backfill is safe)
- If Phase 3 holds a lock for more than 5 min → kill + investigate
- If the Phase 4 deploy fails → revert + investigate
## Sign-off
- [ ] @alice (Author)
- [ ] @bob (DB)
- [ ] @carol (Oncall)
- [ ] @dave (PM)
## Post-mortem (if incident)
[link]
```
### Small steps per phase
```
Not one big PR.
Many small reversible steps.
Each phase:
1. Action
2. Verify
3. Rollback (if needed)
```
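The action / verify / rollback loop can be sketched as a small runner. This is only an illustration: the `Phase` and `RunResult` types and `runPhases` are hypothetical names, and real actions would be asynchronous calls to a database or deploy system. When a phase fails verification, only that phase is rolled back and the run stops, leaving earlier verified phases in place.

```typescript
// Hypothetical types for illustration; not part of any library.
interface Phase {
  name: string;
  action: () => void;    // apply the change
  verify: () => boolean; // true when the change landed as expected
  rollback: () => void;  // reverse of `action`
}

interface RunResult {
  ok: boolean;
  log: string[];
}

function runPhases(phases: Phase[]): RunResult {
  const log: string[] = [];
  for (const phase of phases) {
    phase.action();
    log.push(`action:${phase.name}`);
    if (!phase.verify()) {
      // Undo only the failed phase, then stop so a human can investigate.
      log.push(`verify-failed:${phase.name}`);
      phase.rollback();
      log.push(`rollback:${phase.name}`);
      return { ok: false, log };
    }
    log.push(`verified:${phase.name}`);
  }
  return { ok: true, log };
}
```

Keeping each phase small means a failed verify undoes one step rather than the whole migration.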
### DB schema migration 패턴
```
1. Add nullable column (deploy)
2. Backfill (background)
3. App writes new + old (deploy)
4. App reads from new (deploy)
5. Drop old column (deploy)
→ N deploys. Every step is reversible.
```
→ Expand-contract pattern.
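Steps 3 and 4 are where application code changes. A minimal sketch, assuming a hypothetical row shape in which `state` is the old column and `status` is the new one (both names are invented for illustration):

```typescript
// Hypothetical row shape: `state` is the old column, `status` the new one.
interface OrderRow {
  id: number;
  state?: string;  // old column, dropped in step 5
  status?: string; // new column, added in step 1
}

// Step 3: write both columns so either app version can read the row.
function writeStatus(row: OrderRow, value: string): OrderRow {
  return { ...row, state: value, status: value };
}

// Step 4: read the new column, falling back to the old one for rows
// that the backfill has not reached yet.
function readStatus(row: OrderRow): string | undefined {
  return row.status ?? row.state;
}
```

Because writes go to both columns, reverting the step 4 deploy does not lose data: the old reader still finds current values in the old column.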
### Online migration tools
```
gh-ost (GitHub): MySQL online schema change
pt-online-schema-change (Percona): MySQL
pg_repack: Postgres table rewrite
pg_squeeze: Postgres bloat
Citus: Postgres sharding (distributed tables)
```
→ Change large tables without long-held locks.
### Backfill 전략
```sql
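-- Batch by id range + sleep (my_table / col are placeholders; run outside a transaction block)
DO $$
DECLARE
  batch_size BIGINT := 10000;
  low_id BIGINT := 0;
  max_id BIGINT;
BEGIN
  SELECT max(id) INTO max_id FROM my_table;
  WHILE low_id <= max_id LOOP
    UPDATE my_table SET col = ...   -- backfill expression goes here
    WHERE id BETWEEN low_id AND low_id + batch_size - 1;
    COMMIT;                         -- keep each batch in its own transaction (PG 11+)
    low_id := low_id + batch_size;
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;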
```
```ts
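// App-level batch (sketch: `db`, `my_table`, `col`, and `value` are placeholders)
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfill(maxId: number, batchSize = 10000) {
  // Walk fixed id ranges so every batch touches new rows and the loop terminates.
  for (let lowId = 0; lowId <= maxId; lowId += batchSize) {
    await db.execute(
      // Skip rows that are already filled so a re-run is safe.
      'UPDATE my_table SET col = ? WHERE id >= ? AND id < ? AND col IS NULL',
      [value, lowId, lowId + batchSize]
    );
    await sleep(500); // give production traffic room between batches
  }
}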
```
→ Less impact on production traffic.
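Both batch loops above depend on splitting the id space into bounded ranges. A small helper makes that arithmetic explicit and easy to test separately from the database; `batchRanges` is a hypothetical name, and the sketch assumes numeric, roughly dense ids.

```typescript
// Split the inclusive id range [minId, maxId] into batches of at most `size` ids.
function batchRanges(minId: number, maxId: number, size: number): Array<[number, number]> {
  if (size <= 0) throw new Error("size must be positive");
  const ranges: Array<[number, number]> = [];
  for (let low = minId; low <= maxId; low += size) {
    // The last batch is clamped so it never runs past maxId.
    ranges.push([low, Math.min(low + size - 1, maxId)]);
  }
  return ranges;
}
```

For example, `batchRanges(1, 25, 10)` yields `[1, 10]`, `[11, 20]`, `[21, 25]`. With sparse ids some ranges update few or no rows, which is harmless but makes progress per batch uneven.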
### Read-only mode (locked phase)
```ts
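// App-level (Express-style middleware; `app` and the `flags` client are assumed to exist)
app.use(async (req, res, next) => {
  const MAINTENANCE = await flags.get('maintenance');
  if (MAINTENANCE === 'readonly' && req.method !== 'GET') {
    return res.status(503).json({ error: 'maintenance' });
  }
  next(); // not in maintenance, or a read request: continue as usual
});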
```
```sql
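-- DB-level (Postgres). Applies to new sessions only; existing connections keep their setting.
ALTER DATABASE app SET default_transaction_read_only = on;
-- Run the migration here (in a session that sets default_transaction_read_only = off)
ALTER DATABASE app SET default_transaction_read_only = off;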
```
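The app-level check above can be pulled into a pure function, which keeps the decision testable without a running server. This is a sketch under assumptions: `MaintenanceMode`, `SAFE_METHODS`, and `isBlocked` are hypothetical names, and treating `HEAD` and `OPTIONS` as reads alongside `GET` is a choice the original snippet does not make.

```typescript
type MaintenanceMode = "off" | "readonly";

// HTTP methods defined as safe (read-only) by HTTP semantics.
const SAFE_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// True when the request should be rejected with 503 during maintenance.
function isBlocked(mode: MaintenanceMode, method: string): boolean {
  return mode === "readonly" && !SAFE_METHODS.has(method.toUpperCase());
}
```

A middleware would call `isBlocked(mode, req.method)` and return the 503 response when it is true.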
### Feature flag migration
```
1. Add feature behind flag (off)
2. Deploy
3. Enable for 1% (canary)
4. Monitor 30 min
5. Enable for 10%
6. Monitor 1 hour
7. 100%
8. Cleanup flag (1 week later)
```
→ [[Backend_Feature_Flags_Deep]].
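Percentage rollout (steps 3 to 7) is commonly implemented by hashing each user into a stable bucket, so a user enabled at 1% stays enabled at 10%. The sketch below is one way to do it and does not describe a specific flag service: `fnv1a` and `isEnabled` are hypothetical helpers, and the hash is an FNV-1a-style function over UTF-16 code units chosen only because it needs no dependencies.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: deterministic, no dependencies.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a user to a stable bucket in [0, 100) and enable the flag
// for buckets below `percent`.
function isEnabled(flagName: string, userId: string, percent: number): boolean {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  return bucket < percent;
}
```

Including the flag name in the hash input gives each flag its own canary group, so the same users are not always the first to see every change.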
### Canary deploy
```
1% → 10% → 50% → 100%
Check after each step:
- Error rate
- Latency p95
- User signals
→ Roll back when an issue is found.
```
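The promote-or-rollback decision between canary stages can be written as a small gate. This is a sketch only: `CanaryMetrics`, `CanaryLimits`, and `nextStage` are hypothetical names, the thresholds are supplied by the caller, and user signals are left to human judgment.

```typescript
const STAGES = [1, 10, 50, 100]; // percent of traffic per canary stage

interface CanaryMetrics { errorRate: number; latencyP95Ms: number; }
interface CanaryLimits { maxErrorRate: number; maxLatencyP95Ms: number; }

// Returns the next traffic percentage, or 0 to signal a rollback.
function nextStage(current: number, metrics: CanaryMetrics, limits: CanaryLimits): number {
  const healthy =
    metrics.errorRate <= limits.maxErrorRate &&
    metrics.latencyP95Ms <= limits.maxLatencyP95Ms;
  if (!healthy) return 0;
  const index = STAGES.indexOf(current);
  if (index === -1) throw new Error(`unknown stage: ${current}`);
  // Stay at 100 once the final stage is reached.
  const next = STAGES[Math.min(index + 1, STAGES.length - 1)];
  return next ?? current;
}
```

Returning 0 on any breached limit keeps the rule simple: a canary either advances one stage or is pulled entirely.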
### Communication template
```markdown
# Customer email (T-24h)
Subject: Scheduled maintenance: 2026-05-12 02:00 UTC
Hi {name},
We'll be performing maintenance on our service on **2026-05-12 from 02:00 to 02:30 UTC**.
During this time:
- API will be in read-only mode for ~3 minutes
- Web app will show a maintenance banner
- No data will be lost
We're sorry for any inconvenience. If you have questions, contact support.
Thanks,
The Acme Team
```
### War room
```
Big migration = Slack channel + open Zoom call:
- @oncall
- @migration-author
- @manager
- Related teams
→ Fast decisions + shared information.
```
### Dry run
```
Run on production-like staging:
- Same data volume
- Same access pattern
- Measure: time, locks, errors
→ Gives a rough estimate of the real prod duration (treat the measurement as about ½ of it).
```
### Backout (hard rollback)
```
Verify "Can we roll back?":
- DB schema: dropping the column is possible (but loses its data)
- App code: is the old version still compatible?
- Data: can the backfilled values be reverted?
→ State the rollback for every step.
```
### Types of production change
```
1. Schema migration (DB)
2. Big data migration (table → table)
3. Config change (env, secret)
4. Infra change (instance type, region)
5. Major release (new architecture)
→ Keep a runbook template for each type.
```
### Tooling
```
- Atlas / Liquibase / Flyway: schema migration
- SQL guard rail (`lock_timeout`)
- Statement timeout (`statement_timeout`)
- Monitoring (Grafana)
- ChatOps (Slack slash commands)
```
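The timeout guard rails can be applied per migration by wrapping the DDL in a transaction that sets the Postgres `lock_timeout` and `statement_timeout` settings with `SET LOCAL`. The helper below is a hypothetical sketch that only builds the statements; running them is left to whatever client the project uses. It does not fit statements that cannot run inside a transaction block, such as `CREATE INDEX CONCURRENTLY`.

```typescript
// Wrap one DDL statement in a transaction with Postgres timeouts.
// SET LOCAL lasts only until the end of the enclosing transaction.
function withGuardRails(ddl: string, lockTimeoutMs: number, statementTimeoutMs: number): string[] {
  const statement = ddl.trim().replace(/;*$/, "") + ";"; // exactly one trailing semicolon
  return [
    "BEGIN;",
    `SET LOCAL lock_timeout = '${lockTimeoutMs}ms';`,
    `SET LOCAL statement_timeout = '${statementTimeoutMs}ms';`,
    statement,
    "COMMIT;",
  ];
}
```

With a short `lock_timeout`, a migration that cannot acquire its lock fails fast instead of queueing behind long transactions and blocking every query that arrives after it.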
### Postmortem (on failure)
```
Capture lessons learned even when it went well.
On failure: run a postmortem immediately.
→ Input for the next migration.
```
→ [[Productivity_Postmortem]].
### Sign-off process
```
Big migration:
- Author writes the runbook
- DB / on-call review
- Manager approval
- Customer comms (legal / PM)
- Day-of: everyone acks
→ Responsibility is explicit.
```
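Since pre-checks and sign-offs are Markdown checkboxes in the runbook, the "everyone acks" gate can be checked mechanically before T-0. A minimal sketch, assuming the runbook uses the `- [ ]` / `- [x]` convention shown in the template; `uncheckedItems` is a hypothetical helper.

```typescript
// Return the text of every unchecked `- [ ]` item in a Markdown runbook.
function uncheckedItems(markdown: string): string[] {
  const items: string[] = [];
  for (const line of markdown.split(/\r?\n/)) {
    const text = /^\s*- \[ \] (.+)$/.exec(line)?.[1];
    if (text !== undefined) items.push(text.trim());
  }
  return items;
}
```

An empty result means every box is ticked; anything else is the list of people or checks still outstanding.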
### Day-of routine
```
T-1h: Final pre-check, all hands ack
T-30min: Status page open
T-0: Migration start
T+...: Verify after every phase
T+end: Status page resolve, summary
```
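The relative offsets in the routine can be turned into concrete UTC timestamps for the status page and announcements. The sketch below assumes a known start time and estimated duration; `Milestone` and `schedule` are hypothetical names.

```typescript
interface Milestone {
  label: string;
  at: string; // ISO 8601 UTC timestamp
}

// Compute day-of milestones from the T-0 start time and the estimated duration.
function schedule(startIso: string, durationMin: number): Milestone[] {
  const start = new Date(startIso).getTime();
  const minuteMs = 60000;
  const offsets: Array<[string, number]> = [
    ["T-1h", -60],
    ["T-30min", -30],
    ["T-0", 0],
    ["T+end", durationMin],
  ];
  return offsets.map(([label, offsetMin]) => ({
    label,
    at: new Date(start + offsetMin * minuteMs).toISOString(),
  }));
}
```

For the template's migration, `schedule("2026-05-12T02:00:00Z", 30)` places the final pre-check at 01:00 UTC and the status page resolution at 02:30 UTC.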
### Common migration types
```
1. Add column: easy. Nullable → backfill → NOT NULL.
2. Drop column: once no app code references it, drop.
3. Rename column: add new, dual-write, switch the app, drop old.
4. Type change: same as rename.
5. Add unique constraint: verify there are no duplicates → add.
6. Big table partition: pg_repack / Citus.
7. Add index: CONCURRENTLY.
```
## 🤔 Decision Criteria
| Change type | Recommendation |
|---|---|
| Schema add column | Standard runbook |
| Big data migration | Detailed + dry run |
| Critical infra | War room + 24h notice |
| Feature flag rollout | Canary + monitor |
| Quick fix | Light runbook (still!) |
| Reversible | Deploy + verify only |
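## ❌ Anti-patterns
- **Production change without a plan**: leads to major incidents.
- **No rollback plan**: a failure turns into panic.
- **No dry run**: duration, locks, and errors are unknown until prod.
- **One big single-phase change**: any failure forces a full rollback.
- **No communication**: users and other teams are surprised.
- **Skipped sign-off**: responsibility is unclear.
- **Big migration during working hours**: high user impact.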
## 🤖 LLM Usage Hints
- Small reversible steps per phase.
- Each phase: action + verify + rollback.
- Pre-check + post-check checklists.
- Communicate at least 24h in advance.
## 🔗 Related Documents
- [[Productivity_Postmortem]]
- [[Productivity_Oncall_Playbook]]
- [[DB_Migration_Safety]]
- [[DevOps_Disaster_Recovery]]