[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,213 @@
---
id: devops-disaster-recovery
title: Disaster Recovery — RPO / RTO / Backup / Failover
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, dr, backup, vibe-coding]
tech_stack: { language: "Bash / SQL / Terraform", applicable_to: ["DevOps"] }
applied_in: []
aliases: [disaster recovery, RPO, RTO, backup, restore, failover, runbook]
---
# Disaster Recovery
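> Having a backup is not enough. Define **RPO (how much time's worth of data you can afford to lose) and RTO (how long recovery may take)**, then test restores on a schedule. Covers region failover, restore drills, and runbooks.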
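## 📖 Core Concepts
- RPO (Recovery Point Objective): the maximum acceptable data loss, measured as the time between the last recoverable point (backup or archived WAL) and the disaster.
- RTO (Recovery Time Objective): the maximum acceptable time from the disaster until service is restored.
- Backup ≠ DR. A backup only counts if it can actually be restored.
- 3-2-1 rule: 3 copies, on 2 different media, with 1 offsite.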
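To make the RPO and RTO definitions concrete, here is a minimal sketch that computes the actual data-loss window and downtime of one incident and compares them with the targets. The function names and the timestamps are illustrative assumptions, not part of any tool mentioned in this note.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recoverable_point, disaster_at, restored_at):
    """Return (data_loss_window, downtime) for one incident as timedeltas."""
    data_loss = disaster_at - last_recoverable_point  # compare against RPO
    downtime = restored_at - disaster_at              # compare against RTO
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True only when both the data-loss window and the downtime are within target."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident: last backup 13:00, disaster 13:45, service back 14:10.
loss, down = measure_recovery(
    datetime(2026, 5, 9, 13, 0),
    datetime(2026, 5, 9, 13, 45),
    datetime(2026, 5, 9, 14, 10),
)
print(loss, down)  # 0:45:00 0:25:00
print(meets_objectives(loss, down, rpo=timedelta(hours=1), rto=timedelta(minutes=30)))  # True
```

The same two numbers are what a postmortem would record as "RPO actual" and "RTO actual".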
## 💻 Code Patterns
### Postgres backup (pg_basebackup + WAL)
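```bash
# Base backup (tar format, gzip-compressed, with progress output)
pg_basebackup -D /backup/base -Ft -z -P
# Continuous WAL archiving: these two settings go in postgresql.conf
archive_mode = on
archive_command = 'aws s3 cp %p s3://wal-archive/%f'
# Point-in-time recovery
# 1. Restore the base backup into the data directory
# 2. Replay WAL up to the target timestamp
# On PostgreSQL 12+ the settings below also go in postgresql.conf,
# together with an empty recovery.signal file in the data directory
restore_command = 'aws s3 cp s3://wal-archive/%f %p'
recovery_target_time = '2026-05-09 14:00:00'
```
→ RPO approaches zero (bounded by the last archived WAL segment). RTO is typically minutes, but grows with the size of the base backup and the amount of WAL to replay.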
### AWS RDS automated
```hcl
resource "aws_db_instance" "main" {
  backup_retention_period  = 30 # days
  backup_window            = "03:00-04:00"
  delete_automated_backups = false
  copy_tags_to_snapshot    = true
  deletion_protection      = true
}
# Cross-region replica
resource "aws_db_instance" "replica" {
  # A cross-region replica must reference the source by ARN, not by identifier
  replicate_source_db = aws_db_instance.main.arn
  # Resource-level region needs AWS provider v6+; on older versions use a provider alias
  region              = "us-west-2"
}
```
### Postgres logical backup
```bash
# Daily
pg_dump -F c -d app | gzip | aws s3 cp - s3://backup/db-$(date +%Y%m%d).dump.gz
# Restore (the key must match the %Y%m%d naming above; app_new must already exist)
aws s3 cp s3://backup/db-20260509.dump.gz - | gunzip | pg_restore -d app_new
```
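A restore only finds the dump if it uses exactly the key format the backup job wrote. One way to avoid drift is to derive both from a single helper. This is a small sketch: the bucket name `backup` and the `db-YYYYMMDD.dump.gz` pattern come from the commands above, while the function names are hypothetical.

```python
from datetime import date, datetime

BUCKET = "backup"  # bucket name used in the shell example


def backup_key(day: date) -> str:
    """Object key for the dump taken on `day`, matching `date +%Y%m%d`."""
    return f"db-{day:%Y%m%d}.dump.gz"


def backup_uri(day: date) -> str:
    """Full S3 URI that both the backup job and the restore job should use."""
    return f"s3://{BUCKET}/{backup_key(day)}"


def key_date(key: str) -> date:
    """Parse the backup date back out of an object key."""
    return datetime.strptime(key, "db-%Y%m%d.dump.gz").date()


print(backup_uri(date(2026, 5, 9)))  # s3://backup/db-20260509.dump.gz
```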
### S3 versioning + cross-region replication
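```hcl
resource "aws_s3_bucket_versioning" "main" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_replication_configuration" "main" {
  # Replication requires versioning on both the source and the destination bucket
  depends_on = [aws_s3_bucket_versioning.main]
  bucket     = aws_s3_bucket.data.id
  role       = aws_iam_role.replication.arn # required; the role name is a placeholder
  rule {
    status = "Enabled"
    destination { bucket = aws_s3_bucket.replica.arn }
  }
}
```
→ Accidental deletions and overwrites can be recovered from a previous object version.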
### Restore drill (scheduled)
```yaml
# Quarterly cron
- name: DR drill
  schedule: '0 3 1 1,4,7,10 *' # 03:00 on the 1st of Jan, Apr, Jul, Oct
  steps:
    - terraform apply -var=env=dr-test # new stack
    - psql -h dr-test.db -d app < latest_backup.sql
    - run integration tests on dr-test
    - measure RTO + RPO
    - terraform destroy
    - report to Slack
```
→ A backup that has never actually been restored proves nothing.
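The "measure RTO" step of a drill can be as simple as timing each step and comparing the total with the target. The sketch below assumes each step is a Python callable that raises on failure. `run_drill` and the step names are hypothetical, and the no-op lambdas stand in for the real commands.

```python
import time
from datetime import timedelta


def run_drill(steps, rto_target, clock=time.monotonic):
    """Run (name, callable) steps in order and time them.

    A step that raises aborts the drill, which is the desired outcome:
    a failed restore must not be reported as a measured RTO.
    """
    started = clock()
    timings = []
    for name, step in steps:
        t0 = clock()
        step()
        timings.append((name, timedelta(seconds=clock() - t0)))
    total = timedelta(seconds=clock() - started)
    return {"timings": timings, "rto_actual": total, "within_rto": total <= rto_target}


# Placeholder steps; a real drill would shell out to terraform, psql and the test suite.
report = run_drill(
    [("provision", lambda: None), ("restore", lambda: None), ("smoke test", lambda: None)],
    rto_target=timedelta(minutes=30),
)
print(report["rto_actual"], report["within_rto"])
```

Passing the clock in as a parameter keeps the timing logic testable without waiting for a real restore.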
### Runbook example
````markdown
# Incident: Primary DB region down
## Detection
- CloudWatch alarm: RDS connections = 0
- Pingdom: api.acme.com unreachable
## Action (RTO 30 min)
1. Confirm region issue (AWS Status, Route53)
2. Promote replica:
   ```
   aws rds promote-read-replica --db-instance-identifier db-replica
   ```
3. Update DNS:
   ```
   aws route53 change-resource-record-sets --hosted-zone-id Z1 --change-batch file://failover.json
   ```
4. Update app config:
   ```
   kubectl set env deployment/api DB_URL=$NEW_DB_URL
   ```
5. Verify: smoke test
6. Notify stakeholders
## Postmortem
- Within 24h
- RPO actual / RTO actual
- Root cause
````
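The DNS step in the runbook reads a change batch from `failover.json`. As a rough sketch, that file could be generated ahead of time like this. The record name, the replica endpoint, the `CNAME` record type and the 60-second TTL are all assumptions chosen for illustration, and only the overall `Changes` / `ResourceRecordSet` shape follows the Route 53 change-batch format.

```python
import json


def failover_change_batch(record_name, new_target, ttl=60):
    """Build a Route 53 change batch that repoints one CNAME record."""
    return {
        "Comment": "DR failover: point the DB endpoint at the promoted replica",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a short TTL lets clients pick up the change quickly
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


# Hypothetical names; replace with the real record and replica endpoint.
batch = failover_change_batch("db.acme.com", "db-replica.example.internal")
print(json.dumps(batch, indent=2))
```

Keeping the file generated and checked in means nobody has to write JSON by hand during an incident.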
### Multi-region active-active
```
Primary (us-east-1) ←→ Active (eu-west-1)
Cross-region replication
Conflict resolution policy
```
This is complex to build by hand; globally distributed databases such as Spanner or CockroachDB are a natural fit.
### Backup encryption
```bash
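# Compress (custom format) + encrypt. PASS is read from the environment rather than
# passed with -k, and -pbkdf2 replaces openssl's weak default key derivation
pg_dump -F c db | openssl enc -aes-256-cbc -pbkdf2 -salt -pass env:PASS | aws s3 cp - s3://backup/...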
```
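### Verify backup (automated)
```bash
# After every backup: restore immediately and run a simple query.
# `alarm` stands for whatever alerting command you use; it fires if either step fails
pg_restore -d test_restore latest.dump &&
  psql -d test_restore -c 'SELECT count(*) FROM users' || alarm
```
→ "Backup succeeded" and "restore succeeded" are two different things.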
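The same check can be wrapped in a function that reports why verification failed. This is a sketch under assumptions: `verify_backup` is a hypothetical name, the `users` table and `test_restore` database come from the shell example, and treating an empty table as a failure is a choice made here, not something the original check does.

```python
import subprocess


def verify_backup(dump_path, scratch_db="test_restore", run=subprocess.run):
    """Restore a dump into a scratch database and run a sanity query.

    Returns (ok, detail). `run` is injectable so the logic can be tested
    without a PostgreSQL server.
    """
    restore = run(["pg_restore", "--dbname", scratch_db, dump_path])
    if restore.returncode != 0:
        return False, "restore failed"

    query = run(
        ["psql", "--dbname", scratch_db, "--tuples-only", "--no-align",
         "--command", "SELECT count(*) FROM users"],
        capture_output=True,
        text=True,
    )
    if query.returncode != 0:
        return False, "query failed"

    rows = int(query.stdout.strip())
    if rows == 0:
        return False, "users table is empty"  # assumption: an empty table is suspicious
    return True, f"{rows} rows"


# Usage (needs pg_restore and psql on PATH):
#   ok, detail = verify_backup("latest.dump")
```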
### Retention policy
```
Daily: 30 days
Weekly: 12 weeks
Monthly: 12 months
Yearly: 7 years (compliance)
```
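S3 lifecycle configuration (move to Glacier after 30 days, expire after 2555 days, roughly 7 years):
```json
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "GLACIER" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```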
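The tiered retention schedule (daily, weekly, monthly, yearly) is not expressible as a single lifecycle rule, so a pruning job usually decides which dumps survive. The sketch below is one possible interpretation: it assumes Sunday backups represent the week, first-of-month backups represent the month, and 1 January backups represent the year. Those choices and the function name are assumptions.

```python
from datetime import date


def backups_to_keep(backup_dates, today):
    """Return the subset of backup dates retained under the tiered policy."""
    keep = set()
    for d in backup_dates:
        age = (today - d).days
        if age < 0:
            continue  # ignore dates in the future
        daily = age < 30                                         # 30 days
        weekly = d.weekday() == 6 and age < 12 * 7               # Sundays, 12 weeks
        monthly = d.day == 1 and age < 365                       # 1st of month, 12 months
        yearly = d.month == 1 and d.day == 1 and age < 7 * 365   # 1 Jan, 7 years
        if daily or weekly or monthly or yearly:
            keep.add(d)
    return keep


today = date(2026, 5, 9)
candidates = [
    date(2026, 5, 1),   # 8 days old: daily tier
    date(2026, 3, 29),  # a Sunday, 41 days old: weekly tier
    date(2026, 3, 30),  # a Monday, 40 days old: dropped
    date(2025, 12, 1),  # first of month, under a year old: monthly tier
    date(2024, 1, 1),   # 1 January, under 7 years old: yearly tier
    date(2019, 1, 1),   # older than 7 years: dropped
]
print(sorted(backups_to_keep(candidates, today)))
```

Anything not in the returned set is a candidate for deletion, ideally after a dry run that only logs what would be removed.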
## 🤔 Decision Criteria
| Requirement | Recommendation |
|---|---|
| RPO 1h / RTO 1h | Hourly backups or WAL archiving + warm standby |
| RPO 1min / RTO 5min | Streaming replication + auto-failover |
| RPO 0 (financial) | Synchronous replication, e.g. multi-region active-active |
| Compliance backup | S3 Glacier + 7-year retention |
| Simple SaaS | RDS automated backups + cross-region replica |
| Large enterprise | Multi-cloud DR |
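## ❌ Anti-patterns
- **Backups without restore tests**: the restore fails when the disaster actually hits.
- **Backups in the same region**: they go down together with the region.
- **No encryption**: a leaked backup is a data leak.
- **No runbook**: people scramble at 4 a.m.
- **A single person responsible**: if that person is on vacation, nobody can recover.
- **No DR drills**: run one at least once a year.
- **No retention policy**: backups pile up until storage runs out.
- **Ignoring application state**: restoring only the DB leaves the other systems behind.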
## 🤖 LLM Usage Hints
- Define RPO and RTO first, then design the system around them.
- Run drills regularly (every quarter).
- Write the runbook down and automate it.
## 🔗 Related Documents
- [[DevOps_Secrets_Rotation_Automation]]
- [[Backend_Geo_Replication]]
- [[DB_Read_Replica_Patterns]]