---
id: wiki-2026-0508-high-availability-systems
title: High Availability Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [HA, high availability, SLO, SLA, redundancy, failover, multi-AZ, multi-region]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [reliability, ha, sre, sla, slo, redundancy, distributed-systems]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Universal
  framework: Kubernetes / AWS / GCP
---

# High Availability Systems

## One-line summary

> **"A failure inside the service should have no impact on users."** Availability is counted in nines (three, four, or five nines; five nines is about 5 min of downtime per year). The building blocks are redundancy, failover, and cell-based isolation. Modern practice adds multi-region active-active deployment, chaos engineering, and SLO error budgets.

## Core concepts

### The nines

- **99.0%** (two nines): 87.6 hr/yr of downtime.
- **99.9%** (three nines): 8.76 hr/yr.
- **99.95%**: 4.38 hr/yr.
- **99.99%** (four nines): 52.6 min/yr.
- **99.999%** (five nines): 5.26 min/yr.

### Strategies

- **Redundancy**: N+1, 2N.
- **Failover**: active-passive, active-active.
- **Multi-AZ / Multi-region**.
- **Cell-based architecture**.
- **Circuit breaker**.
- **Graceful degradation**.

### Applications

1. Fintech (ACID transactions).
2. Medical (life-critical).
3. E-commerce checkout.
4. B2B SaaS.

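The downtime figures quoted for each availability level follow from simple arithmetic: the allowed downtime is the window length multiplied by one minus the availability target. A minimal sketch, assuming a 365-day year and a hypothetical helper name `downtime_budget_minutes` (not part of any library):

```python
def downtime_budget_minutes(slo_percent: float, window_days: float = 365.0) -> float:
    """Return the allowed downtime, in minutes, for an availability target.

    slo_percent is the target as a percentage (e.g. 99.95).
    window_days is the measurement window; 365 is an assumption for "per year".
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1.0 - slo_percent / 100.0)


if __name__ == "__main__":
    # Yearly budgets for the common availability levels.
    for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
        minutes = downtime_budget_minutes(slo)
        print(f"{slo}% -> {minutes:.2f} min/yr ({minutes / 60:.2f} hr/yr)")
```

The same function covers shorter windows: `downtime_budget_minutes(99.95, window_days=30)` gives roughly 21.6 minutes, which is the error budget of a 99.95% SLO measured over 30 days.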
## 💻 Patterns

### SLO definition

```yaml
service: payments
slo: 99.95%
window: 30 days
indicator:
  type: availability
  good: status_code in [200, 201, 204]
  total: all_requests
error_budget_minutes: 21.6  # 0.05% of 30 days (43,200 min)
```

### Circuit breaker

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is rejected."""


class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=60):
        self.failures = 0
        self.state = 'closed'
        self.opened_at = None
        self.threshold = fail_threshold
        self.timeout = reset_timeout

    def call(self, fn):
        if self.state == 'open':
            if time.time() - self.opened_at > self.timeout:
                self.state = 'half_open'  # allow one trial call
            else:
                raise CircuitOpen()
        try:
            r = fn()
        except Exception:
            self.failures += 1
            # A failed trial call reopens at once; otherwise open at the threshold.
            if self.state == 'half_open' or self.failures >= self.threshold:
                self.state = 'open'
                self.opened_at = time.time()
            raise
        # Any success closes the breaker and clears the failure count.
        self.state = 'closed'
        self.failures = 0
        return r
```

### Health check

```python
# `app`, check_db, check_redis, and check_kafka are assumed to be defined
# elsewhere; each check returns True when the dependency is reachable.
@app.get('/health')
def health():
    checks = {
        'db': check_db(),
        'cache': check_redis(),
        'queue': check_kafka(),
    }
    # Load balancers key on the HTTP status code, so return a non-2xx
    # response here if a failed check should take the instance out of rotation.
    return {
        'status': 'ok' if all(checks.values()) else 'degraded',
        'checks': checks,
    }
```

### Multi-AZ DB (RDS)

```yaml
RDS:
  Engine: postgres
  MultiAZ: true  # synchronous standby in a second AZ
  BackupRetentionPeriod: 7
  DeletionProtection: true
```

### Active-active multi-region

```typescript
// Read from the local region; replicate writes asynchronously.
async function readUser(id: string) {
  return db.local.read(id); // fast local read
}

async function writeUser(user: User) {
  await db.local.write(user);
  await db.replicate(user); // asynchronous replication to the other regions
}
```

### Failover (DNS)

```bash
# Route 53 failover record (primary).
# HOSTED_ZONE_ID is a placeholder for your hosted zone.
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "primary",
      "Failover": "PRIMARY",
      "AliasTarget": {...},
      "HealthCheckId": "abc"
    }
  }]
}'
```

### Graceful degradation

```typescript
async function getRecommendations(userId: string) {
  try {
    return await mlService.recommend(userId); // personalized results
  } catch (e) {
    log.warn('ML down', e);
    return await getPopularItems(); // cached fallback
  }
}
```

### Cell-based architecture (AWS)

```yaml
# Each cell is an isolated, complete copy of the service.
# Users are routed to a cell by a hash of their user ID.
cells:
  - cell-1: { region: us-east-1, capacity: 25%, users: hash(uid) % 4 == 0 }
  - cell-2: { region: us-east-1, capacity: 25%, users: ... == 1 }
  - cell-3: { region: us-west-2, capacity: 25%, users: ... == 2 }
  - cell-4: { region: eu-west-1, capacity: 25%, users: ... == 3 }
# A failure of one cell affects only 25% of users.
```

### Auto-scaling

```yaml
autoscaling:
  min: 2
  max: 100
  target_cpu: 60
  scale_up_cooldown: 60s
  scale_down_cooldown: 300s
```

### Bulkhead

```python
import asyncio


class Bulkhead:
    def __init__(self, max_concurrent=10):
        # Cap concurrent calls so one slow dependency cannot consume every worker.
        self.sem = asyncio.Semaphore(max_concurrent)

    async def call(self, coro):
        async with self.sem:
            return await coro
```

### Chaos engineering

```python
import random


class SimulatedFailure(Exception):
    """Injected fault used for chaos experiments."""


def chaos_inject(probability=0.01):
    # Fail roughly `probability` of calls to exercise error-handling paths.
    if random.random() < probability:
        raise SimulatedFailure('Chaos!')
```

### Disaster recovery test

```python
# primary_db, replica_db, app, promote, rollback, and log_drill_results are
# assumed to come from the drill harness; they are not defined here.
def quarterly_dr_drill():
    primary_db.simulate_failure()
    assert app.reads_from(replica_db)   # reads fail over to the replica
    promote(replica_db)
    assert app.writes_to(replica_db)    # writes follow after promotion
    rollback()
    log_drill_results()
```

### SLO + error budget alert

```yaml
- alert: ErrorBudgetBurn
  # Assumes a 99.9% SLO: the error budget is 0.001, so a 14.4x burn rate
  # means an error ratio above 14.4 * 0.001 over the last hour.
  # The matcher "5.." covers both numeric codes (500) and a literal "5xx" label.
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    > (14.4 * 0.001)
  for: 5m
  annotations:
    summary: "Burn rate exceeds 14.4x, page on-call"
```

### Redundancy calculation

```python
from scipy.stats import binom


def availability_redundant(per_node_avail, n_nodes, k_required=1):
    """Availability of N nodes when at least K must be up, failing independently."""
    p_fail = 1 - per_node_avail
    # The system is up when at most (N - K) nodes have failed.
    return binom.cdf(n_nodes - k_required, n_nodes, p_fail)
```

### Load balancer (AWS ALB)

```yaml
ALB:
  Listeners:
    - Port: 443
      Protocol: HTTPS
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup
  HealthCheck:
    Path: /health
    Interval: 10
    Threshold: 2
```

## Decision criteria

| Situation | Approach |
|---|---|
| Critical | Multi-region active-active |
| High traffic | Cell-based |
| Tight budget | Multi-AZ + auto-scale |
| Latency sensitive | Active-active region |
| External deps | Circuit breaker + fallback |

**Default**: multi-AZ + auto-scaling + health checks + SLOs + chaos drills + graceful degradation.

## 🔗 Graph

- Parent: [[Reliability]] · [[SRE]]
- Variant: [[Multi-Region]]
- Applications: [[Failable-Task-Handling]] · [[Distributed-Systems]]
- Adjacent: [[Chaos-Engineering]] · [[SLO]] · [[Circuit-Breaker]]

## 🤖 LLM usage

**When**: production-critical systems.
**When not**: internal tools.

## ❌ Anti-patterns

- **Five-nines SLO without a business case**: the cost is overkill.
- **Single-AZ "production"**: a single point of failure.
- **No DR drill**: HA that exists only on paper.
- **No graceful degradation**: the service is either fully up or fully down.
- **No SLO**: problems stay invisible.

## 🧪 Verification / duplicates

- Verified (Google SRE Book, AWS Well-Architected, Netflix Chaos).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: nines + SLO / circuit / cell / chaos / failover code |