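---
id: devops-disaster-recovery
title: Disaster Recovery - RPO / RTO / Backup / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, dr, backup, vibe-coding]
tech_stack: { language: "Bash / SQL / Terraform", applicable_to: ["DevOps"] }
applied_in: []
aliases: [disaster recovery, RPO, RTO, backup, restore, failover, runbook]
---

# Disaster Recovery

> "We have backups" is not enough. **Define RPO (how much time's worth of data you can afford to lose) + RTO (how long recovery may take) → test restores regularly.** Covers region failover, restore drills, and runbooks.

## 📖 Core Concepts

- RPO (Recovery Point Objective): the maximum acceptable window of data loss, i.e. the time between the last recoverable point (backup or archived WAL) and the disaster.
- RTO (Recovery Time Objective): the maximum acceptable time from the disaster until service is restored.
- Backup ≠ DR. A backup only counts if it can actually be restored.
- 3-2-1 rule: 3 copies, on 2 different media, with 1 offsite.

## 💻 Code Patterns

### Postgres backup (pg_basebackup + WAL)

```bash
# Base backup (tar format, gzip-compressed, with progress)
pg_basebackup -D /backup/base -Ft -z -P

# WAL archiving (continuous)
# The next two lines are postgresql.conf settings, not shell commands
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'

# Point-in-time recovery
# 1. Restore the base backup into the data directory
# 2. Set the options below, create recovery.signal (PostgreSQL 12+), start the server
# 3. WAL is replayed up to the target timestamp
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-05-09 14:00:00'
```

→ RPO is close to 0 (bounded by the last archived WAL segment); RTO is typically minutes, growing with the size of the base backup and the amount of WAL to replay.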
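The RPO note above says the loss window is bounded by the last archived WAL segment. As a rough illustration of that arithmetic, here is a minimal Bash sketch. The function names `rpo_seconds` and `rpo_within_target` are hypothetical, and the inputs are assumed to be Unix epoch seconds (for example, the modification time of the newest object in the WAL archive and the time of the incident).

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare the achieved RPO against a target.
# All timestamps are Unix epoch seconds (an assumption of this sketch).

# rpo_seconds LAST_ARCHIVED_EPOCH INCIDENT_EPOCH
# Prints the data-loss window in seconds, clamped at 0.
rpo_seconds() {
  local last_archived=$1 incident=$2
  local delta=$(( incident - last_archived ))
  if (( delta < 0 )); then
    delta=0
  fi
  echo "$delta"
}

# rpo_within_target LAST_ARCHIVED_EPOCH INCIDENT_EPOCH TARGET_SECONDS
# Exit status 0 when the loss window is within the target, 1 otherwise.
rpo_within_target() {
  local actual
  actual=$(rpo_seconds "$1" "$2")
  (( actual <= $3 ))
}
```

For instance, `rpo_within_target 1000 1090 300` succeeds, because a 90-second loss window fits inside a 300-second target. A check like this could run after each drill to record "RPO actual" alongside the target.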
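
### AWS RDS automated

```hcl
resource "aws_db_instance" "main" {
  # engine, instance_class, etc. omitted
  backup_retention_period  = 30 # days
  backup_window            = "03:00-04:00"
  delete_automated_backups = false
  copy_tags_to_snapshot    = true
  deletion_protection      = true
}

# Cross-region replica
resource "aws_db_instance" "replica" {
  # A cross-region replica must reference the source by ARN, not identifier
  replicate_source_db = aws_db_instance.main.arn
  # Per-resource region assumes AWS provider v6+; on older versions use an aliased provider
  region              = "us-west-2"
}
```

### Postgres logical backup

```bash
# Daily (custom format is already compressed; gzip kept from the original pipeline)
pg_dump -F c -d app | gzip | aws s3 cp - s3://backup/db-$(date +%Y%m%d).dump.gz

# Restore (object name must match the %Y%m%d pattern above; target DB must exist)
createdb app_new
aws s3 cp s3://backup/db-20260509.dump.gz - | gunzip | pg_restore -d app_new
```

### S3 versioning + cross-region replication

```hcl
resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_replication_configuration" "main" {
  # Versioning must be enabled on both source and destination buckets
  depends_on = [aws_s3_bucket_versioning.main]

  bucket = aws_s3_bucket.data.id
  role   = aws_iam_role.replication.arn # assumed IAM role, defined elsewhere

  rule {
    status = "Enabled"
    destination {
      bucket = aws_s3_bucket.replica.arn
    }
  }
}
```

→ Accidental deletes are recoverable too, because versioning keeps prior object versions.

### Restore drill (scheduled)

```yaml
# Quarterly cron (steps are pseudo-commands)
- name: DR drill
  schedule: '0 3 1 1,4,7,10 *'  # 03:00 on the 1st of Jan, Apr, Jul, Oct
  steps:
    - terraform apply -auto-approve -var=env=dr-test  # new stack
    - psql -h dr-test.db -d app < latest_backup.sql
    - run integration tests on dr-test
    - measure RTO + RPO
    - terraform destroy -auto-approve
    - report to Slack
```

→ A backup that has never actually been restored proves nothing.

### Runbook example

````markdown
# Incident: Primary DB region down

## Detection
- CloudWatch alarm: RDS connections = 0
- Pingdom: api.acme.com unreachable

## Action (RTO 30 min)
1. Confirm region issue (AWS Status, Route53)
2. Promote replica (run against the replica's region):
   ```
   aws rds promote-read-replica --region us-west-2 --db-instance-identifier db-replica
   ```
3. Update DNS:
   ```
   aws route53 change-resource-record-sets --hosted-zone-id Z1 --change-batch file://failover.json
   ```
4. Update app config:
   ```
   kubectl set env deployment/api DB_URL="$NEW_DB_URL"
   ```
5. Verify: smoke test
6. Notify stakeholders

## Postmortem
- Within 24h
- RPO actual / RTO actual
- Root cause
````

### Multi-region active-active

```
Primary (us-east-1) ←→ Active (eu-west-1)
        ↓
Cross-region replication
Conflict resolution policy
```

Complex to build by hand. Databases designed for it, such as Spanner or CockroachDB, are the more natural fit.

### Backup encryption

```bash
# Compress + encrypt (passphrase is read from the exported PASS environment variable,
# so it does not appear in the process list; -pbkdf2 needs OpenSSL 1.1.1+)
pg_dump -F c db | openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:PASS | aws s3 cp - s3://backup/...
```

### Verify backup (automated)

```bash
# After every backup: restore immediately and run a simple query.
# `alarm` is a placeholder for your alerting command.
createdb test_restore \
  && pg_restore -d test_restore latest.dump \
  && psql -d test_restore -c 'SELECT count(*) FROM users' \
  || alarm
```

→ "Backup OK" and "restore OK" are not the same thing.

### Retention policy

```
Daily:   30 days
Weekly:  12 weeks
Monthly: 12 months
Yearly:  7 years (compliance)
```

S3 lifecycle configuration (move to Glacier after 30 days, expire after 2555 days, i.e. 7 years):

```json
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```

## 🤔 Decision Criteria

| Requirement | Recommendation |
|---|---|
| RPO 1h / RTO 1h | WAL archiving or hourly backups + warm standby (a daily backup alone gives an RPO of up to 24h) |
| RPO 1min / RTO 5min | Streaming replication + auto-failover |
| RPO 0 (financial) | Multi-region active-active with synchronous replication |
| Compliance backup | S3 Glacier + 7-year retention |
| Simple SaaS | RDS automated backups + cross-region replica |
| Large enterprise | Multi-cloud DR |

## ❌ Anti-patterns

- **Backups only, restores never tested**: the restore fails when the disaster hits.
- **Backups in the same region**: they go down together with the region.
- **No encryption**: a leaked backup is a data leak.
- **No runbook**: people flail around at 4 a.m.
- **Single person responsible**: when that person is on vacation, nobody can recover.
- **No DR drill**: run one at least once a year.
- **No retention policy**: storage grows without bound.
- **Ignoring application state**: only the DB is covered and other systems are left out.

## 🤖 LLM Usage Hints

- Define RPO + RTO first → then design the system.
- Run regular drills (quarterly).
- Write the runbook down explicitly + automate it.

## 🔗 Related Documents

- [[DevOps_Secrets_Rotation_Automation]]
- [[Backend_Geo_Replication]]
- [[DB_Read_Replica_Patterns]]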