id: wiki-2026-0508-fault-tolerance
title: Fault Tolerance
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Fault Tolerance, 장애 내성, Resilience Engineering
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: architecture, distributed-systems, resilience, erlang
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language erlang, framework otp

Fault Tolerance

In one line

"Every system fails; the question is 'when', not 'if'." Fault tolerance means designing a system so that it keeps operating when individual components fail. The idea runs from Erlang/OTP's "let it crash" philosophy to modern Kubernetes self-healing. In cloud-native systems as of 2026, chaos engineering, circuit breakers, and bulkheads are treated as defaults.

Core concepts

Fault vs Error vs Failure

  • Fault: the root cause (bug, hardware glitch, network partition)
  • Error: the manifestation of a fault (incorrect internal state)
  • Failure: the service violates its contract (user-visible)
  • Goal: contain faults before they become errors, and stop errors from becoming failures
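The fault → error → failure chain above can be sketched in a few lines. This is only an illustration under assumed names: `Reading`, `read_sensor`, `validate`, and `current_temperature` are hypothetical, and the plausibility range is an arbitrary choice. A corrupted input (fault) either fails to parse or produces an implausible value (error); the outer function contains it by returning the last known-good reading, so the caller sees no failure.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    celsius: float


def read_sensor(raw: str) -> Reading:
    # A fault (e.g. a glitch corrupting `raw`) enters here.
    # float() raises ValueError on unparseable text.
    return Reading(celsius=float(raw))


def validate(reading: Reading) -> Reading:
    # Error detection: a parsed but implausible value is incorrect state.
    if not -90.0 <= reading.celsius <= 60.0:
        raise ValueError(f"implausible temperature: {reading.celsius}")
    return reading


def current_temperature(raw: str, last_good: Reading) -> Reading:
    # Containment: the error stops here, so the caller never sees a failure.
    try:
        return validate(read_sensor(raw))
    except ValueError:
        return last_good
```

Whether falling back to a stale value is acceptable depends on the service contract; the point is only that the decision is made at a defined boundary.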

Erlang Philosophy

  • Let it crash: instead of defensive coding, let the process die and have a supervisor restart it
  • Process isolation: one lightweight process per actor, shared-nothing
  • Hot code reload: upgrades without downtime
  • Often cited as how WhatsApp served roughly 900 million users with about 50 engineers (as reported in 2015)
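The "let it crash" idea can be approximated outside Erlang. The following is a minimal Python sketch, not OTP: `supervise`, `RestartLimitExceeded`, and the `clock` parameter are hypothetical names, and the worker runs in the calling thread rather than in an isolated process. It illustrates only the restart-intensity rule: if more than `intensity` restarts happen within `period` seconds, the supervisor gives up instead of restarting forever.

```python
import time
from collections import deque
from typing import Callable


class RestartLimitExceeded(RuntimeError):
    """Raised when the worker crashes too often; the supervisor gives up."""


def supervise(
    worker: Callable[[], None],
    intensity: int = 5,
    period: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    restarts = deque()  # timestamps of recent restarts
    while True:
        try:
            worker()
            return  # worker finished normally
        except Exception:
            now = clock()
            restarts.append(now)
            # Forget restarts that fall outside the sliding window.
            while now - restarts[0] > period:
                restarts.popleft()
            if len(restarts) > intensity:
                raise RestartLimitExceeded(
                    f"more than {intensity} restarts in {period}s"
                )
            # Otherwise loop around: "restart" the worker with no repair code.
```

The worker itself contains no recovery logic. It simply raises, and the policy for what happens next lives in one place.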

Applications

  1. Erlang/OTP supervisor tree (telecom, WhatsApp, Discord).
  2. Kubernetes pod restart + liveness probes.
  3. Circuit breaker (Hystrix, resilience4j).
  4. Distributed databases (Cassandra hinted handoff, Spanner).
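Item 4 mentions hinted handoff. As a rough, in-memory illustration of the idea (not Cassandra's actual implementation), the sketch below has a coordinator write to every replica, keep a "hint" for any replica that is down, and replay the hints later. `Replica`, `Coordinator`, and `replay_hints` are hypothetical names, and the sketch ignores timestamps and conflict resolution.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value


class Coordinator:
    def __init__(self, replicas, write_quorum: int):
        self.replicas = replicas
        self.write_quorum = write_quorum
        self.hints = []  # (replica, key, value) writes still owed

    def write(self, key, value) -> bool:
        acks = 0
        for replica in self.replicas:
            try:
                replica.write(key, value)
                acks += 1
            except ConnectionError:
                # Remember the missed write instead of failing outright.
                self.hints.append((replica, key, value))
        # The write succeeds if enough replicas acknowledged it.
        return acks >= self.write_quorum

    def replay_hints(self) -> None:
        remaining = []
        for replica, key, value in self.hints:
            try:
                replica.write(key, value)
            except ConnectionError:
                remaining.append((replica, key, value))  # still down
        self.hints = remaining
```

With three replicas and a write quorum of two, a write still succeeds while one replica is down, and that replica catches up once `replay_hints()` runs after it recovers.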

💻 Patterns

Erlang Supervisor Tree

-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

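init([]) ->
    %% one_for_one: restart only the child that crashed.
    %% Give up if more than 5 restarts happen within 10 seconds.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [
        #{id => worker1,
          %% assumes a module named worker that exports start_link/0
          start => {worker, start_link, []},
          restart => permanent,   % always restart this child
          shutdown => 5000,       % ms allowed for a graceful stop
          type => worker}
    ],
    {ok, {SupFlags, Children}}.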

Circuit Breaker (Python)

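from pybreaker import CircuitBreaker, CircuitBreakerError

# Open the circuit after 5 consecutive failures; allow a trial call after 60 s.
db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def query_db(sql: str):
    return db.execute(sql)  # db: an existing database handle (not shown)

def get_users():
    try:
        return query_db("SELECT * FROM users")
    except CircuitBreakerError:
        return cached_response()  # fallback while the circuit is open (not shown)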

Retry with Exponential Backoff

import asyncio
import random

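async def retry_with_backoff(fn, max_retries=5, base=1.0):
    """Await fn() up to max_retries times, sleeping between failed attempts."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base, 2*base, 4*base, ...) plus up to 1 s of jitter
            delay = base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)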

Bulkhead Pattern (Go)

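import (
    "context"

    "golang.org/x/sync/semaphore"
)

// Service gives each dependency its own concurrency budget (bulkhead).
type Service struct {
    dbSem  *semaphore.Weighted // at most 10 concurrent DB calls
    apiSem *semaphore.Weighted // at most 50 concurrent API calls
}

func NewService() *Service {
    return &Service{
        dbSem:  semaphore.NewWeighted(10),
        apiSem: semaphore.NewWeighted(50),
    }
}

// CallDB blocks until a DB slot is free or ctx is cancelled.
func (s *Service) CallDB(ctx context.Context) error {
    if err := s.dbSem.Acquire(ctx, 1); err != nil {
        return err
    }
    defer s.dbSem.Release(1)
    return doDBWork() // doDBWork is the caller's own DB routine (not shown)
}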

Kubernetes Liveness/Readiness

apiVersion: v1
kind: Pod
metadata:
  name: app                  # placeholder name
spec:
  containers:
  - name: app
    image: example/app:1.0   # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5

Chaos Engineering (Litmus)

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: PODS_AFFECTED_PERC
          value: '50'

Saga Pattern (Compensation)

class OrderSaga:
    async def execute(self, order):
        steps = []
        try:
            payment = await charge_card(order)
            steps.append(("refund", payment.id))
            inventory = await reserve_stock(order)
            steps.append(("release", inventory.id))
            await ship_order(order)
        except Exception:
            for action, ref in reversed(steps):
                await compensate(action, ref)
            raise

Decision criteria

| Situation | Approach |
| --- | --- |
| Telecom-grade uptime (5 nines) | Erlang/OTP supervisor tree |
| Microservices REST | Circuit breaker + retry + timeout |
| Stateful distributed DB | Quorum + hinted handoff |
| Container orchestration | K8s liveness/readiness + PodDisruptionBudget |
| Cross-service transactions | Saga + compensation |

Default: the timeout + retry + circuit breaker trio, plus chaos testing.
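How the three defaults fit together can be sketched with the standard library alone. This is a simplified illustration, not the pybreaker API: the `CircuitBreaker` class here is hand-rolled, and `resilient_call`, `CircuitOpenError`, and all parameter defaults are assumptions chosen for the example. Each attempt is bounded by a timeout, failed attempts back off exponentially with jitter, and the breaker makes later attempts fail fast once too many failures have accumulated.

```python
import asyncio
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After reset_timeout, let a trial call through (half-open).
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.fail_max:
            self._opened_at = self._clock()


async def resilient_call(fn, breaker, timeout=1.0, max_retries=3, base=0.1):
    for attempt in range(max_retries):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            # Timeout: never wait on a hung dependency indefinitely.
            result = await asyncio.wait_for(fn(), timeout)
        except Exception:
            breaker.record_failure()
            if attempt == max_retries - 1:
                raise
            # Retry: exponential backoff with jitter.
            await asyncio.sleep(base * 2 ** attempt + random.uniform(0, base))
        else:
            breaker.record_success()
            return result
```

Sharing one breaker per dependency across callers is what lets a struggling service shed load; a breaker created per call would never open.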

🔗 Graph

🤖 LLM usage

When to use: during distributed system design, for enumerating failure modes, designing supervisor trees, and recommending retry strategies.
When not to use: single-process scripts, where the overhead of fault tolerance outweighs its value.

Anti-patterns

  • Catch-all exception swallowing: errors are only logged and then ignored, leading to silent corruption.
  • Infinite retry: retrying without backoff causes thundering herds and cascading failures.
  • Shared fate: every service depends on a single DB, which becomes a single point of failure.
  • No timeout: a hung dependency exhausts the caller's resources.
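The first anti-pattern is easiest to see side by side with a narrower handler. In this sketch the function names and the port-parsing scenario are hypothetical. The swallowing version logs and returns a default that the caller cannot distinguish from real data; the strict version catches only the errors it expects and re-raises them with context.

```python
import json
import logging

logger = logging.getLogger("config")


def parse_port_swallowing(text: str) -> int:
    """Anti-pattern: any problem is logged and replaced by a default."""
    try:
        return int(json.loads(text)["port"])
    except Exception:
        logger.exception("bad config")
        return 0  # the caller cannot tell this from a real value


def parse_port_strict(text: str) -> int:
    """Handle only the expected errors and surface them with context."""
    try:
        port = int(json.loads(text)["port"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid config: {exc}") from exc
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

Anything the strict version does not list, such as a `MemoryError`, propagates unchanged to a layer that can decide whether to restart.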

🧪 Verification / duplicates

  • Verified (Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003).
  • Verified (Netflix Chaos Engineering principles, principlesofchaos.org).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Erlang/OTP + modern resilience patterns |