---
id: wiki-2026-0508-fault-tolerance
title: Fault Tolerance
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Fault Tolerance, 장애 내성, Resilience Engineering]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [architecture, distributed-systems, resilience, erlang]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: erlang
  framework: otp
---
# Fault Tolerance
## In One Line
> **"Every system fails: the question is 'when', not 'if'."** Fault tolerance means designing a system so that it keeps operating when individual components fail. The idea has evolved from Erlang/OTP's "let it crash" philosophy to modern Kubernetes self-healing. In 2026 cloud-native systems, chaos engineering, circuit breakers, and bulkheads are the default.
## Core Concepts
### Fault vs Error vs Failure
- **Fault**: the root cause (bug, hardware glitch, network partition)
- **Error**: the manifestation of a fault (incorrect internal state)
- **Failure**: the service violates its contract (user-visible)
- Goal: contain faults before they become errors, and stop errors before they become failures
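
The chain above can be illustrated with a small sketch. The names `parse_quantity` and `handle_order` are hypothetical and exist only for this example: the fault is a parser that accepts negative numbers, the error is the invalid quantity it produces, and a validation check contains that error before it turns into a user-visible failure.

```python
def parse_quantity(raw: str) -> int:
    # FAULT: the parser accepts negative numbers, which the order contract forbids.
    return int(raw)


def handle_order(raw_quantity: str) -> dict:
    quantity = parse_quantity(raw_quantity)
    # ERROR detection: the invalid state is caught before it propagates further.
    if quantity <= 0:
        # Containment: reject explicitly instead of processing a bad order,
        # so the service still honors its contract (no FAILURE).
        return {"status": "rejected", "reason": "quantity must be positive"}
    return {"status": "accepted", "quantity": quantity}
```

The fault is still present in `parse_quantity`; the point is only that the error it produces never reaches the user as a broken order.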
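### Erlang Philosophy
- **Let it crash**: instead of coding defensively, let the process die and have a supervisor restart it
- **Process isolation**: one lightweight process per actor, shared-nothing
- **Hot code reload**: zero-downtime upgrades
- Often cited as the reason WhatsApp could serve hundreds of millions of users (reportedly about 900 million around 2015) with roughly 50 engineers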
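
The "let it crash" loop can be sketched outside Erlang as well. The following is a rough Python analogue, not OTP itself: `supervise` is a hypothetical helper, it runs the worker in the same thread (so there is no real process isolation), and it only imitates the restart-intensity rule of giving up when more than `intensity` restarts happen within `period` seconds.

```python
import time
from typing import Callable


def supervise(worker: Callable[[], None], intensity: int = 5, period: float = 10.0) -> None:
    """Restart `worker` when it crashes; give up after too many restarts in a window."""
    restarts: list[float] = []
    while True:
        try:
            worker()
            return  # worker finished normally
        except Exception:
            now = time.monotonic()
            # Keep only the restarts that fall inside the sliding window.
            restarts = [t for t in restarts if now - t < period]
            restarts.append(now)
            if len(restarts) > intensity:
                # Too many crashes in a short time: stop restarting and escalate.
                raise
```

The worker itself contains no error handling; recovery policy lives entirely in the supervising loop, which is the design choice the bullets above describe.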
### Applications
1. Erlang/OTP supervisor trees (telecom, WhatsApp; Discord via Elixir on the same VM).
2. Kubernetes pod restart + liveness probes.
3. Circuit breaker (Hystrix, resilience4j).
4. Distributed databases (Cassandra hinted handoff, Spanner).
## 💻 Patterns
### Erlang Supervisor Tree
```erlang
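-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

%% Start the supervisor and register it locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart only the child that crashed.
    %% Give up if more than 5 restarts occur within 10 seconds.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [
        #{id => worker1,
          start => {worker, start_link, []},  % `worker` module is assumed to exist
          restart => permanent,               % always restart on exit
          shutdown => 5000,                   % ms allowed for graceful shutdown
          type => worker}
    ],
    {ok, {SupFlags, Children}}.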
```
### Circuit Breaker (Python)
```python
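from pybreaker import CircuitBreaker, CircuitBreakerError

# Open the circuit after 5 consecutive failures; allow a trial call after 60 s.
db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def query_db(sql: str):
    # `db` is a database handle assumed to be defined elsewhere.
    return db.execute(sql)

def get_users():
    try:
        return query_db("SELECT * FROM users")
    except CircuitBreakerError:
        # Circuit is open: skip the database and serve the fallback.
        return cached_response()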
```
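
To show what such a breaker does internally, here is a minimal hand-rolled sketch of the closed / open / half-open state machine. It is not pybreaker's implementation; `SimpleBreaker` and `CircuitOpenError` are hypothetical names, and the injectable `clock` parameter is an assumption added to make the behavior easy to test.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class SimpleBreaker:
    def __init__(self, fail_max=5, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.fail_max:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already known to be failing, which is what protects the caller's threads and connections.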
### Retry with Exponential Backoff
```python
import asyncio
import random

async def retry_with_backoff(fn, max_retries=5, base=1.0):
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus up to 1 s of jitter to avoid synchronized retries.
            delay = base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```
### Bulkhead Pattern (Go)
```go
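package service

import (
	"context"

	"golang.org/x/sync/semaphore"
)

// Service gives each dependency its own concurrency pool (bulkhead),
// so a slow database cannot starve API calls.
type Service struct {
	dbSem  *semaphore.Weighted // 10 concurrent DB calls
	apiSem *semaphore.Weighted // 50 concurrent API calls
}

// CallDB blocks until a DB slot is free or ctx is cancelled.
func (s *Service) CallDB(ctx context.Context) error {
	if err := s.dbSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer s.dbSem.Release(1)
	return doDBWork() // doDBWork is assumed to be defined elsewhere
}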
```
### Kubernetes Liveness/Readiness
```yaml
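apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      # Liveness: restart the container if this check keeps failing.
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      # Readiness: remove the Pod from Service endpoints while this fails.
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5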
```
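
The application side of these probes might look like the following sketch, written with Python's standard library only. The paths and port mirror the manifest above; `probe_status`, `ProbeHandler`, `serve`, and the `ready` flag are hypothetical names, and a real service would decide readiness from its own dependencies.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (cache warm-up, DB connection, ...) has finished.
ready = threading.Event()


def probe_status(path: str, is_ready: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        return 200  # liveness: the process is up and able to answer
    if path == "/ready":
        return 200 if is_ready else 503  # readiness: safe to receive traffic?
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, ready.is_set()))
        self.end_headers()


def serve(port: int = 8080) -> None:
    # Port 8080 matches the probe configuration above.
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the two checks separate matters: a failing readiness probe only takes the Pod out of rotation, while a failing liveness probe restarts the container, so a slow startup should fail `/ready` but not `/healthz`.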
### Chaos Engineering (Litmus)
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds
              value: '60'
            - name: PODS_AFFECTED_PERC    # percentage of target pods
              value: '50'
```
### Saga Pattern (Compensation)
```python
class OrderSaga:
    async def execute(self, order):
        # Each completed step records the compensation that undoes it.
        steps = []
        try:
            payment = await charge_card(order)
            steps.append(("refund", payment.id))
            inventory = await reserve_stock(order)
            steps.append(("release", inventory.id))
            await ship_order(order)
        except Exception:
            # Undo completed steps in reverse order, then re-raise.
            for action, ref in reversed(steps):
                await compensate(action, ref)
            raise
```
## Decision Criteria
| Situation | Approach |
|---|---|
| Telecom-grade uptime (5 nines) | Erlang/OTP supervisor tree |
| Microservices REST | Circuit breaker + retry + timeout |
| Stateful distributed DB | Quorum + hinted handoff |
| Container orchestration | K8s liveness/readiness + PodDisruptionBudget |
| Cross-service transactions | Saga + compensation |
**Default**: the trio of timeout + retry + circuit breaker, plus chaos testing.
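
As a rough sketch of how the first two parts of that default fit together, the helper below bounds every attempt with a timeout and retries transient errors with exponential backoff and jitter. `call_with_timeout_and_retry` and its parameter values are hypothetical; a circuit breaker would normally wrap the whole call and is left out here for brevity.

```python
import asyncio
import random


async def call_with_timeout_and_retry(fn, timeout=2.0, max_retries=3, base=0.5):
    """Bound each attempt with a timeout; retry with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            # The timeout applies per attempt, so a hung call cannot block forever.
            return await asyncio.wait_for(fn(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, base))
```

Only timeouts and connection errors are retried in this sketch; other exceptions propagate immediately, on the assumption that retrying them would not help.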
## 🔗 Graph
- Parent: [[Distributed Systems]]
- Variant: [[Circuit Breaker]]
- Applications: [[Kubernetes]] · [[Microservices]]
- Adjacent: [[Chaos Engineering]] · [[Eventual Consistency]] · [[CAP Theorem & PACELC]]
## 🤖 LLM Usage
**When to use**: enumerating failure modes, designing supervisor trees, and recommending retry strategies during distributed system design.
**When not to use**: single-process scripts, where the overhead of fault tolerance outweighs its value.
## ❌ Anti-patterns
- **Catch-all exception swallow**: logging an error and then ignoring it → silent corruption.
- **Infinite retry**: unbounded retries without backoff → thundering herd, cascading failure.
- **Shared fate**: every service depends on a single DB → single point of failure.
- **No timeout**: a hung dependency exhausts its callers' resources (threads, connections).
## 🧪 Verification / Duplicates
- Verified (Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003).
- Verified (Netflix Chaos Engineering principles, principlesofchaos.org).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Erlang/OTP + modern resilience patterns |