---
id: devops-disaster-recovery
title: Disaster Recovery — RPO / RTO / Backup / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, dr, backup, vibe-coding]
tech_stack: { language: "Bash / SQL / Terraform", applicable_to: ["DevOps"] }
applied_in: []
aliases: [disaster recovery, RPO, RTO, backup, restore, failover, runbook]
---

# Disaster Recovery

> "We have backups" is not enough. **Define RPO (how much time's worth of data you can afford to lose) and RTO (how long recovery may take), then test restores regularly.** Covers region failover, restore drills, and runbooks.

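## 📖 Core Concepts
- RPO (Recovery Point Objective): the data lost between the last recoverable backup and the disaster, expressed as a time window.
- RTO (Recovery Time Objective): the time from the disaster until service is restored.
- Backup ≠ DR. A backup only matters if it can actually be restored.
- 3-2-1 rule: 3 copies, on 2 different media, with 1 offsite.
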
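
The RPO and RTO definitions above reduce to simple timestamp arithmetic. A minimal sketch, assuming Unix epoch seconds as input; the function names `rpo_seconds`, `rto_seconds`, and `within_target` are hypothetical and only illustrate how an incident's actual values could be compared against the targets.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; all timestamps are Unix epoch seconds.

# Actual RPO: everything written between the last good backup and the disaster is lost.
rpo_seconds() {
  local last_backup=$1 disaster=$2
  echo $(( disaster - last_backup ))
}

# Actual RTO: time from the disaster until service is restored.
rto_seconds() {
  local disaster=$1 recovered=$2
  echo $(( recovered - disaster ))
}

# Exit status 0 when the measured value is within the target, 1 otherwise.
within_target() {
  local actual=$1 target=$2
  (( actual <= target ))
}
```

For example, `within_target "$(rto_seconds "$disaster" "$recovered")" 1800` would check a 30-minute RTO target.
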
## 💻 Code Patterns

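### Postgres backup (pg_basebackup + WAL)
```bash
# Base backup (tar format, gzip-compressed, with progress reporting)
pg_basebackup -D /backup/base -Ft -z -P

# Continuous WAL archiving: settings for postgresql.conf on the primary
# ($PGDATA is assumed to point at the data directory)
cat >> "$PGDATA/postgresql.conf" <<'EOF'
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'
EOF

# Point-in-time recovery
# 1. Restore the base backup into an empty data directory
# 2. Replay WAL up to a specific timestamp
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-05-09 14:00:00'
EOF
touch "$PGDATA/recovery.signal"  # PostgreSQL 12+: start the server in recovery mode
```

→ RPO is close to zero (bounded by the last archived WAL segment); RTO is on the order of minutes.
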
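
An `archive_command` should return non-zero on any failure and should refuse to overwrite a segment that is already archived. A minimal sketch of that behavior, assuming a local directory stands in for the S3 bucket; the function name `archive_wal` and all paths are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical archive helper; a local directory stands in for s3://wal-archive.
archive_wal() {
  local src=$1 dest_dir=$2 name
  name=$(basename "$src")
  # Refuse to overwrite: an existing file with the same name usually means
  # two clusters are archiving to the same location.
  if [ -e "$dest_dir/$name" ]; then
    echo "refusing to overwrite $dest_dir/$name" >&2
    return 1
  fi
  # Copy under a temporary name, then rename, so a partial copy is never
  # visible under the final segment name.
  cp "$src" "$dest_dir/$name.tmp" && mv "$dest_dir/$name.tmp" "$dest_dir/$name"
}

# Usage (not executed here):
#   archive_wal "$PGDATA/pg_wal/000000010000000000000001" /mnt/wal-archive
```

PostgreSQL retries a segment until the command succeeds, so failing loudly is safer than silently replacing an archived file.
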
### AWS RDS automated
```hcl
resource "aws_db_instance" "main" {
  # engine, instance_class, and other required arguments omitted
  backup_retention_period  = 30 # days
  backup_window            = "03:00-04:00"
  delete_automated_backups = false
  copy_tags_to_snapshot    = true
  deletion_protection      = true
}

# Cross-region replica
resource "aws_db_instance" "replica" {
  # A cross-region replica must reference the source by ARN, not by identifier
  replicate_source_db = aws_db_instance.main.arn
  region              = "us-west-2" # resource-level region needs AWS provider v6+; use a provider alias on older versions
}
```

### Postgres logical backup
```bash
# Daily (the custom format is already compressed, so the extra gzip is optional)
pg_dump -F c -d app | gzip | aws s3 cp - "s3://backup/db-$(date +%Y%m%d).dump.gz"

# Restore (the key must use the same YYYYMMDD date format as the backup)
aws s3 cp s3://backup/db-20260509.dump.gz - | gunzip | pg_restore -d app_new
```

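
The backup job and the restore command have to agree on the object key, and a mismatch in the date format is an easy mistake to make. A minimal sketch, assuming a hypothetical `backup_key` helper that both sides call; the bucket name and key layout follow the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: builds the S3 key for a bucket and a date in YYYYMMDD form,
# so the backup job and the restore command cannot drift apart.
backup_key() {
  local bucket=$1 day=$2
  # Reject anything that is not exactly eight digits.
  [[ $day =~ ^[0-9]{8}$ ]] || { echo "invalid date: $day" >&2; return 1; }
  printf 's3://%s/db-%s.dump.gz\n' "$bucket" "$day"
}

# Usage (not executed here):
#   pg_dump -F c -d app | gzip | aws s3 cp - "$(backup_key backup "$(date +%Y%m%d)")"
#   aws s3 cp "$(backup_key backup 20260509)" - | gunzip | pg_restore -d app_new
```
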
### S3 versioning + cross-region replication
```hcl
resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_replication_configuration" "main" {
  # Replication requires versioning on both the source and destination buckets
  depends_on = [aws_s3_bucket_versioning.main]

  bucket = aws_s3_bucket.data.id
  role   = aws_iam_role.replication.arn # required; this IAM role name is an assumption

  rule {
    status = "Enabled"
    destination { bucket = aws_s3_bucket.replica.arn }
  }
}
```

→ Accidental deletions are recoverable too.

### Restore drill (scheduled)
```yaml
# Quarterly cron
- name: DR drill
  schedule: '0 3 1 1,4,7,10 *' # 03:00 on the 1st of Jan, Apr, Jul, Oct
  steps:
    - terraform apply -var=env=dr-test # fresh stack
    - psql -h dr-test.db -d app < latest_backup.sql
    - run integration tests on dr-test
    - measure RTO + RPO
    - terraform destroy
    - report to Slack
```

→ A backup that has never actually been restored proves nothing.

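
The "measure RTO" step of the drill can be as simple as timing the restore command and comparing it with the target. A minimal sketch, assuming a hypothetical `measure_rto` wrapper; it relies on bash's built-in `SECONDS` counter, so the resolution is whole seconds.

```bash
#!/usr/bin/env bash
# Hypothetical drill helper: measure_rto <target_seconds> <command> [args...]
# Exit status: 0 = restored within target, 1 = restored but too slow, 2 = restore failed.
measure_rto() {
  local target=$1; shift
  local start=$SECONDS elapsed
  "$@" || { echo "restore command failed" >&2; return 2; }
  elapsed=$(( SECONDS - start ))
  echo "rto_seconds=$elapsed target_seconds=$target"
  (( elapsed <= target ))
}

# Usage (not executed here), with a 30-minute target:
#   measure_rto 1800 pg_restore -d app_new latest.dump
```
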
### Runbook example
````markdown
# Incident: Primary DB region down

## Detection
- CloudWatch alarm: RDS connections = 0
- Pingdom: api.acme.com unreachable

## Action (RTO 30 min)

1. Confirm region issue (AWS Status, Route53)
2. Promote replica:
   ```
   aws rds promote-read-replica --db-instance-identifier db-replica
   ```
3. Update DNS:
   ```
   aws route53 change-resource-record-sets --hosted-zone-id Z1 --change-batch file://failover.json
   ```
4. Update app config:
   ```
   kubectl set env deployment/api DB_URL=$NEW_DB_URL
   ```
5. Verify: smoke test
6. Notify stakeholders

## Postmortem
- Within 24h
- RPO actual / RTO actual
- Root cause
````

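
Step 3 of the runbook passes `file://failover.json`, whose contents are not shown. A minimal sketch of what that change batch might look like, assuming the database is reached through a CNAME record; the helper name `failover_change_batch`, the record name, the endpoint, and the 60-second TTL are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints a Route 53 change batch that repoints a CNAME
# at the promoted replica. Record name, endpoint, and TTL are assumptions.
failover_change_batch() {
  local record_name=$1 new_endpoint=$2 ttl=${3:-60}
  cat <<EOF
{
  "Comment": "DR failover to promoted replica",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "${record_name}",
      "Type": "CNAME",
      "TTL": ${ttl},
      "ResourceRecords": [{ "Value": "${new_endpoint}" }]
    }
  }]
}
EOF
}

# Usage (not executed here):
#   failover_change_batch db.acme.com "$NEW_DB_HOST" > failover.json
```

Generating the file ahead of time and keeping it next to the runbook avoids hand-editing JSON during an incident.
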
### Multi-region active-active
```
Primary (us-east-1) ←→ Active (eu-west-1)
                    ↓
        Cross-region replication
        Conflict resolution policy
```

Hard to build yourself; databases designed for it, such as Spanner or CockroachDB, are the natural fit.

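### Backup encryption
```bash
# Compress + encrypt. The passphrase is read from the exported PASS variable
# instead of the command line; -pbkdf2 needs OpenSSL 1.1.1+.
pg_dump -F c db | openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:PASS | aws s3 cp - s3://backup/...
```
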
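
An encrypted backup is only useful if the decryption path is known and tested. A minimal sketch of a matching encrypt/decrypt pair, assuming OpenSSL 1.1.1 or later and a passphrase in an exported `BACKUP_PASS` variable; the function and variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; both read stdin and write stdout.
# The passphrase comes from the exported BACKUP_PASS environment variable.
encrypt_stream() {
  openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:BACKUP_PASS
}

# Decryption must use the same cipher and key-derivation flags as encryption.
decrypt_stream() {
  openssl enc -d -aes-256-cbc -pbkdf2 -pass env:BACKUP_PASS
}

# Usage (not executed here):
#   pg_dump -F c db | encrypt_stream > db.dump.enc
#   decrypt_stream < db.dump.enc | pg_restore -d app_new
```
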
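### Verify backup (automated)
```bash
# Right after every backup: restore into a scratch database and run a simple query.
# `alarm` is a placeholder for whatever alerting command is in use.
pg_restore --clean --if-exists -d test_restore latest.dump \
  && psql -d test_restore -c 'SELECT count(*) FROM users' \
  || alarm
```

→ "Backup OK" and "restore OK" are not the same thing.
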
### Retention policy
```
Daily:   30 days
Weekly:  12 weeks
Monthly: 12 months
Yearly:  7 years (compliance)
```

S3 lifecycle configuration (2555 days ≈ 7 years):

```json
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```

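
The retention tiers can also be enforced by a pruning job that checks each backup's age against its tier. A minimal sketch, assuming ages are given in days and approximating 12 weeks as 84 days and 12 months as 365 days; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: map a retention tier to its limit in days.
retention_limit_days() {
  case $1 in
    daily)   echo 30 ;;    # 30 days
    weekly)  echo 84 ;;    # 12 weeks
    monthly) echo 365 ;;   # 12 months (approximation)
    yearly)  echo 2555 ;;  # 7 years
    *)       return 1 ;;
  esac
}

# Exit status: 0 = keep, 1 = expired, 2 = unknown tier.
should_keep() {
  local tier=$1 age_days=$2 limit
  limit=$(retention_limit_days "$tier") || return 2
  (( age_days <= limit ))
}

# Usage (not executed here):
#   should_keep weekly 90 || echo "delete this weekly backup"
```
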
## 🤔 Decision Criteria
| Requirement | Recommendation |
|---|---|
| RPO 1h / RTO 1h | Hourly backups or WAL archiving + warm standby (daily backups alone cannot meet a 1h RPO) |
| RPO 1min / RTO 5min | Streaming replication + auto-failover |
| RPO 0 (financial) | Multi-region active-active |
| Compliance backup | S3 Glacier + 7-year retention |
| Simple SaaS | RDS automated backups + cross-region replica |
| Large enterprise | Multi-cloud DR |

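
The first rows of the table can be read as a lookup from targets to a strategy. A rough sketch, assuming targets are given in seconds and that the tighter of the two targets decides the tier; the function name `recommend_strategy`, the thresholds' exact handling, and the fallback choice are assumptions, not a definitive rule.

```bash
#!/usr/bin/env bash
# Hypothetical lookup: recommend_strategy <rpo_seconds> <rto_seconds>
recommend_strategy() {
  local rpo=$1 rto=$2
  if (( rpo == 0 )); then
    echo "Multi-region active-active"
  elif (( rpo <= 60 || rto <= 300 )); then
    echo "Streaming replication + auto-failover"
  elif (( rpo <= 3600 || rto <= 3600 )); then
    echo "Regular backups + warm standby"
  else
    # Assumed fallback for relaxed targets (the "simple SaaS" row).
    echo "RDS automated backups + cross-region replica"
  fi
}
```

For instance, `recommend_strategy 60 300` selects the streaming-replication row.
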
## ❌ Anti-patterns
- **Backups only, no restore testing**: the restore fails when disaster strikes.
- **Backups in the same region**: they go down together with the region.
- **No encryption**: a leaked backup is a data leak.
- **No runbook**: people flounder at 4 a.m.
- **Single person responsible**: when that person is on vacation, nobody can recover.
- **No DR drills**: run one at least once a year.
- **No retention policy**: storage fills up.
- **Ignoring application state**: backing up only the DB leaves other systems out.

## 🤖 LLM Usage Hints
- Define RPO and RTO first, then design the system around them.
- Run drills regularly (quarterly).
- Write the runbook down explicitly and automate it.

## 🔗 Related Documents
- [[DevOps_Secrets_Rotation_Automation]]
- [[Backend_Geo_Replication]]
- [[DB_Read_Replica_Patterns]]