id: wiki-2026-0508-high-availability-systems
title: High Availability Systems
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: HA, high availability, SLO, SLA, redundancy, failover, multi-AZ, multi-region
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: reliability, ha, sre, sla, slo, redundancy, distributed-systems
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Universal; framework: Kubernetes / AWS / GCP

High Availability Systems

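In one line

"The failure of any single service should have no impact on users." Availability is quoted in nines (3-9, 4-9, 5-9; five nines allows about 5 min/yr of downtime). The building blocks are redundancy, failover, and cell-based isolation. Modern practice adds multi-region active-active deployment, chaos engineering, and SLO error budgets.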

Key points

The nines

  • 99.0% (2-9): 87.6 hr/yr of downtime.
  • 99.9% (3-9): 8.76 hr/yr.
  • 99.95%: 4.38 hr/yr.
  • 99.99% (4-9): 52.6 min/yr.
  • 99.999% (5-9): 5.26 min/yr.
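
The downtime figures above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year (the function name `allowed_downtime_minutes` is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def allowed_downtime_minutes(availability_pct: float,
                             window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime allowed in the window for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window_minutes * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}%: {minutes:.2f} min/yr ({minutes / 60:.2f} hr/yr)")
```

Passing a different `window_minutes` gives the budget for a shorter window, such as a 30-day SLO period.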

Strategies

  • Redundancy: N+1, 2N.
  • Failover: active-passive, active-active.
  • Multi-AZ / Multi-region.
  • Cell-based architecture.
  • Circuit breaker.
  • Graceful degradation.

Applications

  1. Fintech (ACID).
  2. Medical (life-critical).
  3. E-commerce checkout.
  4. B2B SaaS.

💻 Patterns

SLO definition

service: payments
slo: 99.95%
window: 30 days
indicator:
  type: availability
  good: status_code in [200, 201, 204]
  total: all_requests
error_budget_minutes: 21.6  # 0.05% of 30 days (43,200 min)
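
The error budget in the definition above can also be tracked per request rather than per minute. A hedged sketch, assuming `good` and `total` are request counts over the SLO window; both function names are hypothetical:

```python
def error_budget_minutes(slo_pct: float, window_days: int) -> float:
    """Time-based budget: minutes of full outage the SLO tolerates."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)


def budget_remaining(slo_pct: float, good: int, total: int) -> float:
    """Request-based budget: fraction of the error budget still unspent.

    Returns 1.0 when nothing is spent, 0.0 when exactly exhausted, and a
    negative value when the SLO has been violated.
    """
    if total == 0:
        return 1.0  # no traffic, so no budget consumed
    allowed_bad = total * (1 - slo_pct / 100)
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


print(round(error_budget_minutes(99.95, 30), 2))
print(round(budget_remaining(99.95, good=999_800, total=1_000_000), 2))
```

With a 99.95% target, one million requests allow about 500 failures, so 200 failures leave roughly 60% of the budget.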

Circuit breaker

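import time

class CircuitOpen(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=60):
        self.failures = 0
        self.state = 'closed'
        self.opened_at = None
        self.threshold = fail_threshold
        self.timeout = reset_timeout

    def call(self, fn):
        if self.state == 'open':
            # After the reset timeout, let one trial call through (half-open).
            if time.monotonic() - self.opened_at > self.timeout:
                self.state = 'half_open'
            else:
                raise CircuitOpen()
        try:
            r = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == 'half_open' or self.failures >= self.threshold:
                self.state = 'open'
                self.opened_at = time.monotonic()
            raise
        # Any success closes the circuit and resets the failure count.
        self.state = 'closed'
        self.failures = 0
        return r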

Health check

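@app.get('/health')
def health():
    # Each check_* helper is assumed to return True when its dependency is reachable.
    checks = {
        'db': check_db(),
        'cache': check_redis(),
        'queue': check_kafka(),
    }
    # Report 'ok' only if every dependency check passed.
    return {
        'status': 'ok' if all(checks.values()) else 'degraded',
        'checks': checks,
    }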

Multi-AZ DB (RDS)

RDS:
  Engine: postgres
  MultiAZ: true  # synchronous standby in a second AZ
  BackupRetentionPeriod: 7
  DeletionProtection: true

Active-active multi-region

// Read from the local region; replicate writes to the other regions
async function readUser(id: string) {
  return db.local.read(id);  // fast local read
}

async function writeUser(user: User) {
  await db.local.write(user);
  await db.replicate(user);  // replicated asynchronously to other regions
}

Failover (DNS)

# Route 53 failover record (primary); $HOSTED_ZONE_ID is a placeholder for your zone
aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "api.example.com",
      "Type": "A",
      "SetIdentifier": "primary",
      "Failover": "PRIMARY",
      "AliasTarget": {...},
      "HealthCheckId": "abc"
    }
  }]
}'

Graceful degradation

async function getRecommendations(userId: string) {
  try {
    return await mlService.recommend(userId);  // personalized results
  } catch (e) {
    log.warn('ML down', e);
    return await getPopularItems();  // cached, non-personalized fallback
  }
}

Cell-based architecture (AWS)

# Each cell is an isolated, self-contained copy of the service
# Users are routed to a cell by hashing the user ID
cells:
  - cell-1: { region: us-east-1, capacity: 25%, users: hash(uid) % 4 == 0 }
  - cell-2: { region: us-east-1, capacity: 25%, users: ... == 1 }
  - cell-3: { region: us-west-2, capacity: 25%, users: ... == 2 }
  - cell-4: { region: eu-west-1, capacity: 25%, users: ... == 3 }
# If one cell fails, only 25% of users are affected
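
The `hash(uid) % 4` routing rule above needs a hash that is stable across processes and restarts. Python's built-in `hash()` is salted per process for strings, so this sketch uses SHA-256 instead. The cell names mirror the config; `cell_for_user` is a hypothetical helper:

```python
import hashlib
from collections import Counter

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]


def cell_for_user(uid: str, cells: list[str] = CELLS) -> str:
    """Map a user ID to a cell with a process-independent hash."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# The same user always lands in the same cell; load spreads roughly evenly.
counts = Counter(cell_for_user(f"user-{i}") for i in range(4000))
print(counts)
```

Plain modulo routing remaps most users when the number of cells changes; if cells are added often, a consistent-hashing or lookup-table scheme may be a better fit.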

Auto-scaling

autoscaling:
  min: 2
  max: 100
  target_cpu: 60
  scale_up_cooldown: 60s
  scale_down_cooldown: 300s
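
How a CPU target turns into a replica count depends on the autoscaler. As one illustration, this sketch assumes the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler, clamped to the `min` and `max` above. Cooldowns are ignored, and `desired_replicas` is a hypothetical name:

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float,
                     target_cpu: float = 60, min_replicas: int = 2,
                     max_replicas: int = 100) -> int:
    """Scale replicas in proportion to CPU usage, then clamp to the bounds."""
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    raw = math.ceil(current_replicas * current_cpu / target_cpu)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, current_cpu=90))    # above target: scale out
print(desired_replicas(2, current_cpu=10))    # below target: floor at min
print(desired_replicas(80, current_cpu=120))  # large spike: cap at max
```

Keeping `min` at 2 or more means a single instance failure never removes all capacity, which matters more for availability than the scaling rule itself.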

Bulkhead

import asyncio
class Bulkhead:
    def __init__(self, max_concurrent=10):
        self.sem = asyncio.Semaphore(max_concurrent)
    async def call(self, coro):
        async with self.sem:
            return await coro
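
A small demonstration of the bulkhead limiting concurrency. The class is repeated so the sketch is self-contained; `demo` and `slow_dependency` are illustrative names, and the sleep stands in for a slow downstream call:

```python
import asyncio


class Bulkhead:
    """Caps the number of calls running concurrently (same shape as above)."""

    def __init__(self, max_concurrent: int = 10):
        self.sem = asyncio.Semaphore(max_concurrent)

    async def call(self, coro):
        async with self.sem:
            return await coro


async def demo(max_concurrent: int = 3, n_calls: int = 20) -> int:
    """Run n_calls through one bulkhead and return the peak concurrency seen."""
    bulkhead = Bulkhead(max_concurrent)  # created inside the running loop
    running = 0
    peak = 0

    async def slow_dependency() -> None:
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for a slow downstream call
        running -= 1

    await asyncio.gather(*(bulkhead.call(slow_dependency()) for _ in range(n_calls)))
    return peak


print("peak concurrency:", asyncio.run(demo()))
```

Giving each dependency its own bulkhead keeps one slow service from exhausting the capacity that other calls need.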

Chaos engineering

import random

class SimulatedFailure(Exception): pass

def chaos_inject(probability=0.01):
    if random.random() < probability:
        raise SimulatedFailure('Chaos!')

Disaster recovery test

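def quarterly_dr_drill():
    # primary_db, replica_db, app, and the helpers are placeholders for real tooling.
    primary_db.simulate_failure()
    assert app.reads_from(replica_db)   # reads fail over first
    promote(replica_db)                 # replica becomes the new primary
    assert app.writes_to(replica_db)    # writes follow after promotion
    rollback()                          # restore the original topology
    log_drill_results()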

SLO + error budget alert

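- alert: ErrorBudgetBurn
  # Fires when the 1h error ratio exceeds 14.4x the budget of a 99.9% SLO.
  # Assumes the `status` label holds the numeric HTTP status code.
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h]))
    > 14.4 * 0.001
  for: 5m
  annotations:
    summary: "Burn rate exceeds 14.4x: page on-call"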
Redundancy calculation

def availability_redundant(per_node_avail, n_nodes, k_required=1):
    """Availability of N nodes when at least K must be up and failures are independent."""
    from scipy.stats import binom
    p_fail = 1 - per_node_avail
    p_at_least_k = 1 - sum(binom.pmf(i, n_nodes, p_fail) for i in range(n_nodes - k_required + 1, n_nodes + 1))
    return p_at_least_k
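
As a cross-check that needs no SciPy, the same k-of-n availability can be computed with `math.comb` by summing over the number of healthy nodes. This is a sketch under the same independence assumption, and `k_of_n_availability` is a hypothetical name:

```python
from math import comb


def k_of_n_availability(per_node_avail: float, n_nodes: int,
                        k_required: int = 1) -> float:
    """Probability that at least k_required of n_nodes independent nodes are up."""
    a = per_node_avail
    return sum(
        comb(n_nodes, up) * a**up * (1 - a) ** (n_nodes - up)
        for up in range(k_required, n_nodes + 1)
    )


# Two 99% nodes, one required: 1 - 0.01**2
print(round(k_of_n_availability(0.99, 2, 1), 6))
# Three 99% nodes, two required (N+1 with N = 2)
print(round(k_of_n_availability(0.99, 3, 2), 6))
```

Real nodes rarely fail independently (shared AZ, shared deploys, shared dependencies), so results like these are an upper bound rather than a forecast.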

Load balancer (AWS ALB)

ALB:
  Listeners:
    - Port: 443
      Protocol: HTTPS
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup
  HealthCheck:
    Path: /health
    Interval: 10
    Threshold: 2

Decision criteria

| Situation | Approach |
|---|---|
| Critical | Multi-region active-active |
| High traffic | Cell-based |
| Tight budget | Multi-AZ + auto-scale |
| Latency sensitive | Active-active region |
| External deps | Circuit breaker + fallback |

Default: multi-AZ + auto-scaling + health checks + SLO + chaos drills + graceful degradation.

🔗 Graph

🤖 LLM usage

When to use: production-critical systems. When not to: internal tools.

Anti-patterns

  • 5-9 SLO without a business case: cost overkill.
  • Single-AZ "production": a single point of failure.
  • No DR drill: HA that exists only on paper.
  • No graceful degradation: the service is either fully up or fully down.
  • No SLO: problems stay invisible.

🧪 Verification / duplicates

  • Verified (Google SRE Book, AWS Well-Architected, Netflix Chaos).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: nines, SLO / circuit breaker / cell / chaos / failover code |