---
id: wiki-2026-0508-availability-and-persistence
title: Availability and Persistence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [HA, durability, ACID, replication, SLA, 99.999, distributed system, RPO, RTO]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [availability, persistence, distributed-systems, replication, sla, acid, durability, rpo-rto, sre]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: distributed systems
  framework: Kubernetes / Postgres / Kafka / S3
---

# Availability and Persistence

## 📌 One-line insight

> **"Always there + never forget."** Availability means the system can respond right away. Persistence (durability) means that data, once committed, is never lost. These are the two foundations of a distributed system and the currency of an SLA.

## 📖 Core concepts

### Availability

- The fraction of time the system can deliver its intended service.
- Measure: uptime / total time.

| Nines | Downtime / year |
|---|---|
| 99% | 3.65 days |
| 99.9% (3 nines) | 8.76 hours |
| 99.99% (4 nines) | 52.6 minutes |
| 99.999% (5 nines) | 5.26 minutes |
| 99.9999% (6 nines) | 31.5 seconds |

→ The cost of each additional nine grows roughly exponentially.

### Durability

- The probability that data survives once it has been committed (equivalently, one minus the probability of loss).
- S3: designed for 11 nines (99.999999999%).
- Disk MTBF: on the order of 1 million hours (vendor-rated).

### RPO / RTO

- **RPO** (Recovery Point Objective): the maximum age of data you can afford to lose.
- **RTO** (Recovery Time Objective): the maximum time allowed to restore service.

| RPO / RTO | Strategy |
|---|---|
| ~0 / ~0 | Synchronous replication, multi-region |
| Minutes / minutes | Hot standby |
| Hours / hours | Periodic (e.g. hourly) snapshots or backups |
| About a day / days | Daily or cold (offline) backup |

### Designing for availability

#### Redundancy

- N+1 / N+2 spare capacity, in active-passive or active-active topologies.
- Multi-AZ / multi-region.
- Load balancer + health check.

#### Fault tolerance

- Graceful degradation.
- Circuit breaker.
- Bulkhead.
- Retry with backoff.

#### Auto-recovery

- Self-healing (k8s).
- Auto-scaling.
- Chaos engineering to verify that recovery actually works.
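
The downtime figures in the nines table, and the effect of redundancy on them, come down to simple arithmetic. The sketch below is illustrative only: the function names are made up for this note, and the `parallel` formula assumes replica failures are independent, which real deployments (shared power, shared deploy pipeline) often violate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability given as a fraction (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Every component must be up (e.g. LB -> app -> DB): availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(availability: float, replicas: int) -> float:
    """At least one of `replicas` copies must be up.

    Assumes failures are independent (an idealization).
    """
    return 1 - (1 - availability) ** replicas


if __name__ == "__main__":
    print(f"{downtime_minutes_per_year(0.9999):.2f} min/year at 4 nines")
    print(f"two 99% replicas in parallel: {parallel(0.99, 2):.4%}")
    print(f"two 99.9% components in series: {serial(0.999, 0.999):.4%}")
```

Under these assumptions, chaining dependencies lowers availability while adding redundant copies raises it, which is why the redundancy items above come first in the design list.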

### Designing for persistence

#### ACID (RDBMS)

- **Atomicity**: all-or-nothing.
- **Consistency**: invariants are preserved.
- **Isolation**: concurrent transactions do not interfere with each other.
- **Durability**: a committed transaction stays committed.

#### Replication

- **Sync**: commit only after N replicas acknowledge (costs latency).
- **Async**: the leader commits first and propagates afterwards (risk of data loss on failover).
- **Quorum** (Paxos / Raft): commit once a majority acknowledges.

#### Backup

- **Full / incremental / differential**.
- **3-2-1 rule**: 3 copies, 2 different media, 1 offsite.
- **Test restore** (critical, and often neglected).

#### Storage tier

- **Hot** (S3 Standard): millisecond access.
- **Warm** (Standard-IA): cheaper storage, but with a retrieval fee.
- **Cold** (Glacier): retrieval takes hours.
- **Deep archive**: retrieval within about 12 hours, cheapest.

### CAP / PACELC

- **CAP**: Consistency, Availability, Partition tolerance: pick only 2. In practice partitions happen, so the real choice during a partition is between C and A.
- **PACELC**: under a partition (P), choose availability (A) or consistency (C); else (E), choose latency (L) or consistency (C).

### Modern best practices

1. **Multi-AZ / multi-region** (depending on cost).
2. **Health check + auto-failover**.
3. **Database replica + read replica**.
4. **CDN / cache** (acts as an availability proxy).
5. **Backup + test restore**.
6. **SLO / SLI / error budget** (Google SRE).
7. **Chaos engineering**.
8. **Postmortem culture**.
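
The majority-ack rule from the replication list above can be sketched as a counting exercise. This is a minimal illustration with hypothetical helper names; it shows only the quorum arithmetic, not leader election, log matching, or anything else Raft and Paxos do.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of acknowledgements that forms a majority."""
    return n_replicas // 2 + 1


def is_committed(acks: int, n_replicas: int) -> bool:
    """A write counts as committed once a majority of replicas has acked it."""
    return acks >= majority(n_replicas)


def tolerated_failures(n_replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return n_replicas - majority(n_replicas)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"n={n}: majority={majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

One consequence worth noting: going from 3 to 4 replicas raises the majority size without tolerating any additional failure, which is why odd cluster sizes are the usual choice.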

## 💻 Patterns

### Health check

```yaml
# k8s deployment (container spec excerpt)
livenessProbe:
  httpGet: { path: /health, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
```

### Circuit breaker (failure limit)

```ts
class ServiceUnavailable extends Error {}

class CircuitBreaker {
  state: 'closed' | 'open' | 'half-open' = 'closed';
  failures = 0;
  lastFailure = 0;

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      // After a 30 s cool-down, let one trial request through.
      if (Date.now() - this.lastFailure > 30_000) this.state = 'half-open';
      else throw new ServiceUnavailable();
    }
    try {
      const result = await fn();
      // Success closes the circuit and resets the failure count.
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (e) {
      this.failures++;
      this.lastFailure = Date.now();
      // Open after 5 consecutive failures, or at once if the trial request fails.
      if (this.state === 'half-open' || this.failures >= 5) this.state = 'open';
      throw e;
    }
  }
}
```

### Postgres replication (sync)

```sql
-- On the primary. Names must match each standby's application_name.
ALTER SYSTEM SET synchronous_standby_names = 'replica1, replica2';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();

-- Then start streaming replication on the replicas.
-- With this plain list, a commit waits for the highest-priority connected
-- standby (replica1); replica2 takes over if replica1 disconnects.
-- Use 'FIRST 2 (replica1, replica2)' to wait for both, or
-- 'ANY 1 (replica1, replica2)' for quorum-style acknowledgement.
```

### S3 lifecycle (storage tier)

```json
{
  "Rules": [{
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [
      { "Days": 30, "StorageClass": "STANDARD_IA" },
      { "Days": 90, "StorageClass": "GLACIER" },
      { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
    ],
    "Expiration": { "Days": 2555 }
  }]
}
```

The 2555-day expiration is roughly 7 years. The empty-prefix `Filter` applies the rule to every object in the bucket.

### SLO / Error budget

```python
def error_budget(sli_target=0.999, period_days=30):
    """Error budget in minutes: a 99.9% SLI target leaves 0.1% of the period."""
    total_minutes = period_days * 24 * 60
    budget = total_minutes * (1 - sli_target)
    return budget  # minutes


def burn_rate(actual_errors, budget, elapsed_fraction):
    """Ratio of budget actually consumed to budget expected at this point in the period."""
    expected = budget * elapsed_fraction
    return actual_errors / expected if expected > 0 else 0


# burn_rate > 1    → the budget is burning faster than planned.
# burn_rate > 14.4 → critical (2% of a 30-day budget consumed in 1 hour).
```

### Backup test restore

```bash
#!/bin/bash
# Weekly automated restore test
set -euo pipefail

# Hypothetical alert hook; replace with a real pager or chat notification.
alert() { echo "ALERT: $*" >&2; }
trap 'alert "Backup restore failed!"' ERR

# Pick the newest backup (assumes object names sort chronologically)
LATEST=$(aws s3 ls s3://backups/db/ | tail -1 | awk '{print $4}')
aws s3 cp "s3://backups/db/$LATEST" /tmp/

# Restore into the staging database
pg_restore -d staging_test "/tmp/$LATEST"

# Verify with a sample query
psql staging_test -c "SELECT count(*) FROM users;" > /tmp/result
diff /tmp/result expected.txt
```

→ A backup is only worth as much as its verified restore.

### Multi-region failover (DNS)

```python
# Schematic of Route 53 health checks + failover routing
# (illustrative dict, not an actual API payload)
failover_config = {
    'primary': {'region': 'us-east-1', 'health_check': 'http://primary/health'},
    'secondary': {'region': 'us-west-2', 'health_check': 'http://secondary/health'},
    'failover': 'PRIMARY_FAILS_TO_SECONDARY',
}
```

### Distributed lock (Redis, single instance)

This is the single-node lock that Redlock builds on; Redlock proper acquires the same lock on a majority of independent Redis nodes.

```python
import uuid

import redis  # redis-py; pass a redis.Redis instance as `client`


def acquire_lock(client, key, ttl=10000):
    """Try to take the lock; returns an owner token, or None if it is held."""
    token = str(uuid.uuid4())
    # SET key token NX PX ttl: set only if absent, expire after ttl milliseconds
    if client.set(key, token, nx=True, px=ttl):
        return token
    return None


def release_lock(client, key, token):
    """Delete the lock only if we still own it (atomic compare-and-delete in Lua)."""
    script = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """
    return client.eval(script, 1, key, token)
```

## 🤔 Decision criteria

| Requirement | Strategy |
|---|---|
| 99.9% (3 nines) | Multi-AZ + auto-failover |
| 99.99% (4 nines) | Multi-region + sync replica |
| 99.999% (5 nines) | Active-active multi-region + chaos |
| Critical durability | S3 + cross-region replication |
| Long-term archive | Glacier Deep Archive |
| Hot path | RDS + read replica + cache |
| Eventual consistency is acceptable | DynamoDB + async |

**Default**: Multi-AZ + replica + backup test + SLO + chaos.

## 🔗 Graph

- Parents: [[Distributed-Systems]] · [[SRE]] · [[Reliability]]
- Variants: [[High-Availability]] · [[Durability]] · [[Replication]] · [[Backup-Strategy]]
- Applications: [[ACID]] · [[CAP-Theorem]] · [[PACELC]] · [[Raft]] · [[Paxos]]
- Applications (cloud): [[Multi-Region]] · [[Chaos-Engineering]]
- Adjacent: [[Circuit-Breaker]] · [[Postmortem]]

## 🤖 LLM usage

**When to use**: System design. SLA negotiation. Incident response.
Backup strategy review.

**When not to use**: Prototypes (over-engineering). Single-user apps.

## ❌ Anti-patterns

- **No backup test**: the durability is fake.
- **Demanding 5 nines from a single region**: not realistically achievable.
- **Sync replication across regions** (high latency): every user write becomes slow.
- **Health checks with deep dependencies**: one failing dependency cascades into mass restarts.
- **Retry without backoff**: thundering herd.
- **No SLO**: you either over-engineer or under-deliver.
- **Single point of failure**: often invisible until it fails.

## 🧪 Verification / duplicates

- Verified (Google SRE book, AWS Well-Architected, CAP / PACELC).
- Trust level A.
- Related: [[CAP-Theorem]] · [[Replication]] · [[SLO-SLI]] · [[Chaos-Engineering]] · [[ACID]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: nines, RPO/RTO, replication, SLO, plus K8s / Postgres / S3 / Redis code |