---
id: wiki-2026-0508-fault-tolerance
title: Fault Tolerance
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Fault Tolerance, 장애 내성, Resilience Engineering]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [architecture, distributed-systems, resilience, erlang]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: erlang
  framework: otp
---

# Fault Tolerance

## One-Line Summary

> **"Every system fails: the question is 'when', not 'if'."** Fault tolerance means designing a system so that it keeps operating when individual components fail. The idea has evolved from Erlang/OTP's "let it crash" philosophy to modern Kubernetes self-healing. In 2026 cloud-native practice, chaos engineering, circuit breakers, and bulkheads are treated as defaults.

## Core Concepts

### Fault vs Error vs Failure

- **Fault**: the root cause (bug, hardware glitch, network partition)
- **Error**: the manifestation of a fault (incorrect internal state)
- **Failure**: the service violates its contract (user-visible)
- Goal: contain faults before they become errors, and stop errors before they become failures

### Erlang Philosophy

- **Let it crash**: instead of defensive coding, let the process die and have a supervisor restart it
- **Process isolation**: one lightweight process per actor, shared-nothing
- **Hot code reload**: upgrades with zero downtime
- Often credited as one reason WhatsApp could serve about 900 million users with roughly 50 engineers (as reported in 2015)

### Applications

1. Erlang/OTP supervisor tree (telecom, WhatsApp, Discord).
2. Kubernetes pod restart + liveness probes.
3. Circuit breaker (Hystrix, resilience4j).
4. Distributed databases (Cassandra hinted handoff, Spanner).

## 💻 Patterns

### Erlang Supervisor Tree

```erlang
-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart only the child that crashed.
    %% The supervisor itself gives up if more than 5 restarts occur within 10 seconds.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [
        #{id => worker1,
          start => {worker, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker}
    ],
    {ok, {SupFlags, Children}}.
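```

### Circuit Breaker (Python)

```python
from pybreaker import CircuitBreaker, CircuitBreakerError

# Open the circuit after 5 consecutive failures; allow a trial call after 60 s.
db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def query_db(sql: str):
    # `db` is assumed to be an existing database handle.
    return db.execute(sql)

def get_users():
    try:
        return query_db("SELECT * FROM users")
    except CircuitBreakerError:
        # Circuit is open: fail fast and serve a fallback.
        # `cached_response` is assumed to be defined elsewhere.
        return cached_response()
```

### Retry with Exponential Backoff

```python
import asyncio
import random

async def retry_with_backoff(fn, max_retries=5, base=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus random jitter to spread out retries.
            delay = base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```

### Bulkhead Pattern (Go)

```go
import (
	"context"

	"golang.org/x/sync/semaphore"
)

type Service struct {
	dbSem  *semaphore.Weighted // 10 concurrent DB calls
	apiSem *semaphore.Weighted // 50 concurrent API calls
}

func NewService() *Service {
	return &Service{
		dbSem:  semaphore.NewWeighted(10),
		apiSem: semaphore.NewWeighted(50),
	}
}

func (s *Service) CallDB(ctx context.Context) error {
	// Blocks until a DB slot is free or ctx is cancelled.
	if err := s.dbSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer s.dbSem.Release(1)
	return doDBWork() // doDBWork is assumed to be defined elsewhere
}
```

### Kubernetes Liveness/Readiness

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                      # placeholder name
spec:
  containers:
    - name: app
      image: example/app:latest  # placeholder image
      livenessProbe:             # failing this restarts the container
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      readinessProbe:            # failing this removes the pod from Service endpoints
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```

### Chaos Engineering (Litmus)

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: PODS_AFFECTED_PERC
              value: '50'
```

### Saga Pattern (Compensation)

```python
# charge_card, reserve_stock, ship_order, and compensate are assumed
# to be async helpers defined elsewhere.
class OrderSaga:
    async def execute(self, order):
        steps = []  # compensations recorded for each completed step
        try:
            payment = await charge_card(order)
            steps.append(("refund", payment.id))
            inventory = await reserve_stock(order)
            steps.append(("release", inventory.id))
            await ship_order(order)
        except Exception:
            # Undo completed steps in reverse order, then re-raise.
            for action, ref in reversed(steps):
                await compensate(action, ref)
            raise
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Telecom-grade uptime (5 nines) | Erlang/OTP supervisor tree |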
| Microservices REST | Circuit breaker + retry + timeout |
| Stateful distributed DB | Quorum + hinted handoff |
| Container orchestration | K8s liveness/readiness + PodDisruptionBudget |
| Cross-service transactions | Saga + compensation |

**Default**: the three-piece set of timeout + retry + circuit breaker, plus chaos testing.

## 🔗 Graph

- Parent: [[Distributed Systems]]
- Variant: [[Circuit Breaker]]
- Applications: [[Kubernetes]] · [[Microservices]]
- Adjacent: [[Chaos Engineering]] · [[Eventual Consistency]] · [[CAP Theorem]]

## 🤖 LLM Usage

**When to use**: enumerating failure modes during distributed system design, designing supervisor trees, recommending retry strategies.

**When not to use**: single-process scripts, where the overhead of fault tolerance outweighs its value.

## ❌ Anti-patterns

- **Catch-all exception swallow**: logging an error and then ignoring it → silent corruption.
- **Infinite retry**: retrying without backoff → thundering herd, cascading failure.
- **Shared fate**: every service depends on a single DB → single point of failure.
- **No timeout**: a hung dependency exhausts the caller's resources.

## 🧪 Verification / Duplicates

- Verified (Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003).
- Verified (Netflix Chaos Engineering principles, principlesofchaos.org).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Erlang/OTP + modern resilience patterns |
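
## 💻 Sketch: Timeout + Retry + Circuit Breaker Combined

The decision criteria name timeout + retry + circuit breaker as the default combination. The following is a minimal, stdlib-only sketch of how the three might compose around a single async call. `SimpleBreaker`, `CircuitOpenError`, and `resilient_call` are hypothetical names invented for this illustration; they are not the pybreaker API, and a production system would more likely use an established library.

```python
import asyncio
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class SimpleBreaker:
    """Hypothetical minimal circuit breaker (illustration only)."""

    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def before_call(self):
        if self.opened_at is None:
            return
        if self.clock() - self.opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit open; failing fast")
        # Reset window has elapsed: half-open, let one trial call through.

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.fail_max:
            self.opened_at = self.clock()


async def resilient_call(fn, breaker, timeout=1.0, max_retries=3, base=0.1):
    """Call async fn with a per-attempt timeout, jittered backoff, and a breaker."""
    for attempt in range(max_retries):
        breaker.before_call()  # fail fast while the circuit is open
        try:
            result = await asyncio.wait_for(fn(), timeout=timeout)
        except Exception:
            breaker.record_failure()
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries.
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, base))
        else:
            breaker.record_success()
            return result
```

In this sketch the breaker check sits inside the retry loop, so once the failure threshold is reached the remaining retries are rejected immediately instead of adding load to a dependency that is already struggling. The timeout applies per attempt, which bounds how long a hung dependency can hold the caller.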