id: devops-disaster-recovery
title: Disaster Recovery - RPO / RTO / Backup / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: devops, dr, backup, vibe-coding
tech_stack:
  language: Bash / SQL / Terraform
  applicable_to: DevOps
applied_in:
aliases: disaster recovery, RPO, RTO, backup, restore, failover, runbook

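Disaster Recovery

Having a backup is not enough on its own. Define the RPO (how much data, measured in time, you can afford to lose) and the RTO (how long recovery may take), then test restores on a schedule. The building blocks are region failover, restore drills, and runbooks.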

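📖 Core concepts

  • RPO (Recovery Point Objective): the maximum acceptable data loss, measured as the time between the last recoverable point (for example, the last backup) and the disaster.
  • RTO (Recovery Time Objective): the maximum acceptable time from the disaster until service is restored.
  • Backup ≠ DR. A backup only matters if it can actually be restored.
  • 3-2-1 rule: 3 copies, on 2 different media, with 1 offsite.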
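To make the two objectives concrete, here is a minimal sketch that computes the RPO and RTO actually achieved in an incident from three timestamps (epoch seconds) and compares them with the targets. The function names and the sample numbers are illustrative assumptions, not part of any tool.

```shell
# Hypothetical helpers; all names and numbers are illustrative.

# Data-loss window in seconds: disaster time minus last recoverable point.
achieved_rpo() {
  local last_backup_epoch=$1 disaster_epoch=$2
  echo $(( disaster_epoch - last_backup_epoch ))
}

# Downtime in seconds: service-restored time minus disaster time.
achieved_rto() {
  local disaster_epoch=$1 restored_epoch=$2
  echo $(( restored_epoch - disaster_epoch ))
}

# Prints "met" when the actual value is within the target, else "missed".
check_objective() {
  local actual=$1 target=$2
  if [ "$actual" -le "$target" ]; then echo "met"; else echo "missed"; fi
}

# Example: last backup at t=1000, disaster at t=3700, service back at t=5200.
rpo=$(achieved_rpo 1000 3700)   # 2700 seconds of lost writes
rto=$(achieved_rto 3700 5200)   # 1500 seconds of downtime
echo "RPO ${rpo}s: $(check_objective "$rpo" 3600), RTO ${rto}s: $(check_objective "$rto" 1800)"
```

With an RPO target of 3600 s and an RTO target of 1800 s, both objectives are met in this example. The same two numbers are what a postmortem would record as "RPO actual / RTO actual".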

💻 Code patterns

Postgres backup (pg_basebackup + WAL)

# Base backup
pg_basebackup -D /backup/base -Ft -z -P

# WAL archiving (continuous)
# postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'

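# Point-in-time recovery
# 1. Restore the base backup into the data directory
# 2. Replay WAL up to a specific timestamp
# PostgreSQL 12+: put these in postgresql.conf and create an empty recovery.signal file in the data directory
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-05-09 14:00:00'

→ RPO is close to zero (bounded by the last archived WAL segment); RTO is typically minutes, growing with the size of the base backup and the amount of WAL to replay.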

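AWS RDS automated backups

resource "aws_db_instance" "main" {
  # engine, instance_class, and other required arguments omitted
  backup_retention_period  = 30             # days (RDS allows up to 35)
  backup_window            = "03:00-04:00"  # UTC
  delete_automated_backups = false
  copy_tags_to_snapshot    = true
  deletion_protection      = true
}

# Cross-region replica
resource "aws_db_instance" "replica" {
  # A cross-region replica must reference the source by ARN, not by identifier
  replicate_source_db = aws_db_instance.main.arn
  region              = "us-west-2"  # per-resource region needs AWS provider v6+; older versions use a provider alias
}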

Postgres logical backup

# Daily
pg_dump -F c -d app | gzip | aws s3 cp - s3://backup/db-$(date +%Y%m%d).dump.gz

# Restore (the key must match the %Y%m%d name written above; app_new must already exist)
aws s3 cp s3://backup/db-20260509.dump.gz - | gunzip | pg_restore -d app_new

S3 versioning + cross-region replication

resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}

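resource "aws_s3_bucket_replication_configuration" "main" {
  # Versioning must be enabled on both source and destination buckets first
  depends_on = [aws_s3_bucket_versioning.main]

  bucket = aws_s3_bucket.data.id
  role   = aws_iam_role.replication.arn  # required; this IAM role name is an assumption, defined elsewhere

  rule {
    id     = "replicate-all"
    status = "Enabled"
    destination { bucket = aws_s3_bucket.replica.arn }
  }
}

→ Accidental deletions are recoverable as well: versioning keeps earlier object versions, and the replica bucket holds a second copy.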

Restore drill (scheduled)

# Quarterly cron
- name: DR drill
  schedule: '0 3 1 1,4,7,10 *'  # 03:00 on the 1st of Jan/Apr/Jul/Oct
  steps:
    - terraform apply -auto-approve -var=env=dr-test    # fresh stack
    - psql -h dr-test.db -d app < latest_backup.sql
    - run integration tests on dr-test
    - measure RTO + RPO
    - terraform destroy -auto-approve -var=env=dr-test
    - report to Slack

→ A backup that has never actually been restored proves nothing.
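The "measure RTO" step of the drill can be sketched as a small wrapper that times any restore command and fails when the target is exceeded. The name `measure_rto` and the message format are assumptions made for illustration; only `date +%s` and ordinary shell arithmetic are used.

```shell
# measure_rto TARGET_SECONDS COMMAND [ARGS...]
# Runs COMMAND, prints how long it took, and succeeds only if the command
# succeeded AND finished within TARGET_SECONDS. (Hypothetical helper.)
measure_rto() {
  local target=$1
  shift
  local start end elapsed status
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "restore took ${elapsed}s (target ${target}s, exit code ${status})"
  [ "$status" -eq 0 ] && [ "$elapsed" -le "$target" ]
}
```

A drill step could then read `measure_rto 1800 ./restore.sh || alarm`, where `restore.sh` and `alarm` stand in for your own restore script and alerting command. Timing only the restore command gives a lower bound on RTO; detection and decision time during a real incident come on top.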

Runbook example

# Incident: Primary DB region down

## Detection
- CloudWatch alarm: RDS connections = 0
- Pingdom: api.acme.com unreachable

## Action (RTO 30 min)

1. Confirm region issue (AWS Status, Route53)
2. Promote replica:

aws rds promote-read-replica --db-instance-identifier db-replica

3. Update DNS:

aws route53 change-resource-record-sets --hosted-zone-id Z1 --change-batch file://failover.json

4. Update app config:

kubectl set env deployment/api DB_URL=$NEW_DB_URL

5. Verify: smoke test
6. Notify stakeholders

## Postmortem
- Within 24h
- RPO actual / RTO actual
- Root cause

Multi-region active-active

Primary (us-east-1)  ←→ Active (eu-west-1)
              ↓
       Cross-region replication
       Conflict resolution policy

Complex to operate. Databases built for multi-region writes, such as Spanner or CockroachDB, are the natural fit.

Backup encryption

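# Compress + encrypt (custom-format dumps are already compressed)
# PASS is read from the environment so the passphrase does not appear in the process list
pg_dump -F c db | openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:PASS | aws s3 cp - s3://backup/...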

Verify backup (자동)

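# After every backup: restore it right away and run a simple query
# (alarm stands for whatever alerting command you use)
createdb test_restore \
  && pg_restore -d test_restore latest.dump \
  && psql -d test_restore -c 'SELECT count(*) FROM users' \
  || alarm

→ "Backup succeeded" and "restore succeeded" are not the same thing.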
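A query that merely runs does not prove the data came back: an empty `users` table also returns successfully. As a hedged sketch, the check below validates the row count itself. The helper name `verify_row_count` and the minimum threshold are assumptions.

```shell
# verify_row_count COUNT MIN
# Succeeds only when COUNT is a non-negative integer and COUNT >= MIN.
# Returns 2 when COUNT is empty or not numeric (e.g. the query failed).
verify_row_count() {
  local count=$1 min=$2
  case "$count" in
    ''|*[!0-9]*)
      echo "row count is not a number: '$count'" >&2
      return 2
      ;;
  esac
  [ "$count" -ge "$min" ]
}
```

It can be fed from psql in unaligned, tuples-only mode, for example `verify_row_count "$(psql -At -d test_restore -c 'SELECT count(*) FROM users')" 1 || alarm`, with `alarm` again standing in for your alerting command.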

Retention policy

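Daily: 30 days
Weekly: 12 weeks
Monthly: 12 months
Yearly: 7 years (compliance)

# S3 lifecycle: move to Glacier after 30 days, expire after 2555 days (about 7 years)
{
  "Rules": [{
    "ID": "backup-retention",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}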
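For backups kept outside S3 lifecycle rules (for example on a backup host), the tiers above can be sketched as a pruning predicate. The helper name `should_keep` is hypothetical, and the day counts are assumptions derived from the tiers: 12 weeks as 84 days, 12 months approximated as 365 days, and 7 years as the same 2555 days used in the lifecycle rule.

```shell
# should_keep KIND AGE_DAYS
# KIND is one of: daily, weekly, monthly, yearly.
# Succeeds while a backup of that kind is still inside its retention window.
should_keep() {
  local kind=$1 age_days=$2 max_days
  case "$kind" in
    daily)   max_days=30 ;;
    weekly)  max_days=84 ;;    # 12 weeks
    monthly) max_days=365 ;;   # 12 months, approximated
    yearly)  max_days=2555 ;;  # 7 years, same figure as the lifecycle rule
    *)
      echo "unknown backup kind: $kind" >&2
      return 2
      ;;
  esac
  [ "$age_days" -le "$max_days" ]
}
```

A pruning job would call it once per backup, for example `should_keep daily 45 || echo "delete"`, and remove whatever falls outside its window.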

🤔 Decision criteria

| Requirement | Recommendation |
|---|---|
| RPO 1h / RTO 1h | Daily backups + warm standby |
| RPO 1min / RTO 5min | Streaming replication + auto-failover |
| RPO 0 (financial) | Multi-region active-active |
| Compliance backups | S3 Glacier + 7-year retention |
| Simple SaaS | RDS automated backups + cross-region replica |
| Large enterprise | Multi-cloud DR |

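Anti-patterns

  • Backups without restore tests: the restore fails when the disaster comes.
  • Backups in the same region: they go down together with the region.
  • No encryption: a leaked backup is a data leak.
  • No runbook: people scramble at 4 a.m.
  • A single person responsible: when that person is on vacation, nobody can recover.
  • No DR drills: run one at least once a year.
  • No retention policy: storage fills up.
  • Ignoring application state: backing up only the DB leaves other systems out.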

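🤖 Hints for LLM use

  • Define RPO and RTO first, then design the system to meet them.
  • Run drills regularly (quarterly).
  • Write runbooks down explicitly and automate them.

🔗 Related documents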