| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-fault-tolerance | Fault Tolerance | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
Fault Tolerance
One-line summary
"Every system fails: the question is when, not if." Fault tolerance means designing a system so that it keeps operating when individual components fail. The idea runs from the "let it crash" philosophy of Erlang/OTP through to Kubernetes self-healing. In cloud-native systems as of 2026, chaos engineering, circuit breakers, and bulkheads are common defaults.
Core concepts
Fault vs Error vs Failure
- Fault: the root cause (bug, hardware glitch, network partition)
- Error: the manifestation of a fault (incorrect internal state)
- Failure: the service violates its contract (user-visible)
- Goal: contain each fault before it becomes an error, and stop each error before it becomes a failure
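The fault → error → failure chain can be made concrete with a small sketch. Every name here (`parse_quantity`, `reserve`, `DEFAULT_QUANTITY`) is hypothetical and chosen only for illustration, and substituting a default is just one possible containment policy.

```python
DEFAULT_QUANTITY = 1  # hypothetical safe fallback value


def parse_quantity(raw: str) -> int:
    # Fault: the root cause. This parser has a bug: it accepts negative numbers.
    return int(raw)


def reserve(raw: str) -> int:
    quantity = parse_quantity(raw)
    # Error: the fault has manifested as an incorrect internal state.
    if quantity <= 0:
        # Containment: detect the error and substitute a safe value so it
        # never becomes a failure (a user-visible contract violation).
        return DEFAULT_QUANTITY
    return quantity
```

Without the check in `reserve`, the negative quantity would propagate to callers and surface as a failure.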
Erlang Philosophy
- Let it crash: instead of coding defensively, let the process fail and have a supervisor restart it
- Process isolation: one lightweight process per actor, shared-nothing
- Hot code reload: zero-downtime upgrades
- This is a large part of how WhatsApp served roughly 900 million users with about 50 engineers (2015)
Applications
- Erlang/OTP supervisor tree (telecom, WhatsApp, Discord).
- Kubernetes pod restart + liveness probes.
- Circuit breaker (Hystrix, resilience4j).
- Distributed databases (Cassandra hinted handoff, Spanner).
💻 Patterns
Erlang Supervisor Tree
-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart only the crashed child; give up after more than
    %% 5 restarts within 10 seconds.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [
        #{id => worker1,
          start => {worker, start_link, []},
          restart => permanent,
          shutdown => 5000,
          type => worker}
    ],
    {ok, {SupFlags, Children}}.
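The `intensity` and `period` flags above bound how often the supervisor will restart children before it gives up and terminates itself. A minimal Python sketch of that sliding-window rule follows; `RestartBudget` and `allow_restart` are hypothetical names, and this models only the counting logic, not OTP's actual implementation.

```python
from collections import deque


class RestartBudget:
    """Tracks restarts in a sliding window, like intensity/period above."""

    def __init__(self, intensity=5, period=10.0):
        self.intensity = intensity
        self.period = period
        self._restarts = deque()  # timestamps of recent restarts

    def allow_restart(self, now: float) -> bool:
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] > self.period:
            self._restarts.popleft()
        self._restarts.append(now)
        # More than `intensity` restarts inside `period` means give up.
        return len(self._restarts) <= self.intensity
```

With the defaults, five restarts in quick succession are tolerated and a sixth within the same 10 seconds is refused, which in OTP terms escalates the problem to the parent supervisor.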
Circuit Breaker (Python)
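from pybreaker import CircuitBreaker, CircuitBreakerError

# Open the circuit after 5 consecutive failures; allow a trial call after 60 s.
db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def query_db(sql: str):
    return db.execute(sql)  # `db` is an existing connection object (not shown)

def get_users():
    try:
        return query_db("SELECT * FROM users")
    except CircuitBreakerError:
        return cached_response()  # fallback while the circuit is open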
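The decorator hides the state machine behind the breaker. As a rough illustration of the closed → open → half-open cycle, here is a hand-rolled sketch; `SimpleBreaker` and its `clock` parameter are hypothetical, and this is not how pybreaker itself is implemented.

```python
import time


class SimpleBreaker:
    """Opens after fail_max consecutive failures; retries after reset_timeout."""

    def __init__(self, fail_max=5, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, do not touch the dependency
            # Half-open: the timeout elapsed, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```

While the circuit is open the dependency is never called, which gives it room to recover instead of being hit by more failing requests.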
Retry with Exponential Backoff
import asyncio
import random

async def retry_with_backoff(fn, max_retries=5, base=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential delay (base * 1, 2, 4, ...) plus up to 1 s of jitter.
            delay = base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
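The delay in the retry loop above grows without an upper bound and adds at most one second of jitter. A common variant caps the delay and randomizes across the whole interval ("full jitter"). The sketch below only computes the schedule; `backoff_delays`, `cap`, and `rng` are assumed names that do not appear in the original.

```python
import random


def backoff_delays(max_retries=5, base=1.0, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry (no sleep follows the final attempt)."""
    for attempt in range(max_retries - 1):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly below the ceiling so clients desynchronize.
        yield ceiling * rng()
```

Passing `rng=lambda: 1.0` exposes the un-jittered ceilings, which makes it easy to check a configuration before deploying it.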
Bulkhead Pattern (Go)
import (
	"context"

	"golang.org/x/sync/semaphore"
)

type Service struct {
	dbSem  *semaphore.Weighted // 10 concurrent DB calls
	apiSem *semaphore.Weighted // 50 concurrent API calls
}

func (s *Service) CallDB(ctx context.Context) error {
	// Blocks until a slot is free or ctx is cancelled.
	if err := s.dbSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer s.dbSem.Release(1)
	return doDBWork() // doDBWork is defined elsewhere
}
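The same bulkhead idea can be sketched in Python with `asyncio.Semaphore`. The `Bulkhead` class, its `peak` counter, and `fake_db_call` are hypothetical and exist only to make the concurrency cap observable.

```python
import asyncio


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust the others."""

    def __init__(self, limit: int):
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for demonstration only

    async def run(self, coro_fn):
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn()
            finally:
                self.in_flight -= 1


async def demo():
    db = Bulkhead(limit=2)

    async def fake_db_call():
        await asyncio.sleep(0.01)  # stands in for real I/O
        return "row"

    results = await asyncio.gather(*(db.run(fake_db_call) for _ in range(6)))
    return results, db.peak
```

Running `asyncio.run(demo())` issues six calls but never lets more than two run at once; the other four wait for a free slot.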
Kubernetes Liveness/Readiness
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
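On the application side, the two probes need endpoints that answer different questions: a failed liveness probe gets the container restarted, while a failed readiness probe only removes the pod from Service endpoints. A minimal standard-library sketch follows; `STATE`, `probe_status`, `ProbeHandler`, and `serve` are hypothetical names, and a real service would check its actual dependencies.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE = {"ready": False}  # set to True once startup work (e.g. cache warm-up) is done


def probe_status(path: str) -> int:
    if path == "/healthz":
        return 200  # liveness: the process is up and able to answer
    if path == "/ready":
        return 200 if STATE["ready"] else 503  # readiness: safe to receive traffic?
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080):
    # Matches the port used by the probes above; blocks forever.
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the liveness check cheap avoids restart loops when a downstream dependency is slow; dependency checks belong in readiness.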
Chaos Engineering (Litmus)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: PODS_AFFECTED_PERC
              value: '50'
Saga Pattern (Compensation)
class OrderSaga:
    async def execute(self, order):
        steps = []  # compensations for the steps completed so far
        try:
            payment = await charge_card(order)
            steps.append(("refund", payment.id))
            inventory = await reserve_stock(order)
            steps.append(("release", inventory.id))
            await ship_order(order)
        except Exception:
            # Undo completed steps in reverse order, then re-raise.
            for action, ref in reversed(steps):
                await compensate(action, ref)
            raise
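To see the compensation order in action, the saga can be run against stubbed steps. Everything below is a hypothetical stand-in: the stub functions, the `log` list, and the simulated shipping failure are assumptions for demonstration, not real payment or inventory APIs.

```python
import asyncio

log = []  # records what happened, for illustration


async def charge_card(order):
    log.append("charge")
    return "pay-1"


async def reserve_stock(order):
    log.append("reserve")
    return "inv-1"


async def ship_order(order):
    raise RuntimeError("carrier unavailable")  # simulated failure at the last step


async def compensate(action, ref):
    log.append(f"{action}:{ref}")


async def run_saga(order):
    steps = []
    try:
        steps.append(("refund", await charge_card(order)))
        steps.append(("release", await reserve_stock(order)))
        await ship_order(order)
    except Exception:
        for action, ref in reversed(steps):  # undo in reverse order
            await compensate(action, ref)
        raise
```

When shipping fails, the stock is released before the card is refunded, mirroring the order in which the steps were taken, and the original error still reaches the caller.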
Decision criteria
| Situation | Approach |
|---|---|
| Telecom-grade uptime (5 nines) | Erlang/OTP supervisor tree |
| Microservices REST | Circuit breaker + retry + timeout |
| Stateful distributed DB | Quorum + hinted handoff |
| Container orchestration | K8s liveness/readiness + PodDisruptionBudget |
| Cross-service transactions | Saga + compensation |
Default: the trio of timeout + retry + circuit breaker, plus chaos testing.
🔗 Graph
- Parent: Distributed Systems
- Variant: Circuit Breaker
- Applications: Kubernetes · Microservices
- Adjacent: Chaos Engineering · Eventual Consistency · CAP Theorem & PACELC
🤖 LLM usage
When to use: enumerating failure modes, designing supervisor trees, and recommending retry strategies during distributed system design.
When not to use: single-process scripts, where the overhead of fault tolerance outweighs its value.
❌ Anti-patterns
- Catch-all exception swallowing: logging an error and then ignoring it leads to silent corruption.
- Infinite retry: retrying without backoff causes thundering herds and cascading failures.
- Shared fate: when every service depends on a single database, that database is a single point of failure.
- No timeout: a hung dependency exhausts its callers' resources (threads, connections).
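The "no timeout" anti-pattern has a short remedy in asyncio. The sketch below wraps a call in `asyncio.wait_for`; `call_with_timeout` and `hung_dependency` are hypothetical names, and returning a fallback value is one policy among several (re-raising is equally valid).

```python
import asyncio


async def call_with_timeout(coro_fn, timeout: float, fallback):
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout)
    except asyncio.TimeoutError:
        return fallback  # the caller is freed instead of waiting forever


async def hung_dependency():
    await asyncio.sleep(3600)  # simulates a dependency that never answers
    return "fresh"
```

`asyncio.wait_for` also cancels the pending call when the deadline passes, so the hung request does not keep holding resources.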
🧪 Verification / duplicates
- Verified (Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003).
- Verified (Netflix Chaos Engineering principles, principlesofchaos.org).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Erlang/OTP + modern resilience patterns |