---
id: wiki-2026-0508-fault-tolerance
title: Fault Tolerance
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Fault Tolerance, 장애 내성, Resilience Engineering]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [architecture, distributed-systems, resilience, erlang]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: erlang
  framework: otp
---

# Fault Tolerance

## In One Line

> **"Every system fails; the question is 'when', not 'if'."** Fault tolerance means designing a system so that it keeps operating when individual components fail. The idea has evolved from Erlang/OTP's "let it crash" philosophy to modern Kubernetes self-healing. In 2026 cloud-native practice, chaos engineering, circuit breakers, and bulkheads are the default.

## Core Concepts

### Fault vs Error vs Failure

- **Fault**: the root cause (bug, hardware glitch, network partition)
- **Error**: the manifestation of a fault (incorrect state)
- **Failure**: the service violates its contract (user-visible)
- Goal: contain a fault before it spreads as an error, and stop an error from becoming a failure

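The fault/error/failure chain can be illustrated with a small sketch. All names here (`PriceService`, `PriceResult`, the injected `fetch` callable) are hypothetical and not taken from any library. A network fault surfaces as an error inside `get_price`, and the error is contained by serving the last known-good value, so the caller does not see a failure unless no fallback exists.

```python
from dataclasses import dataclass


@dataclass
class PriceResult:
    value: float
    stale: bool  # True when served from the last known-good value


class PriceService:
    """Hypothetical service that contains fetch errors with a cached fallback."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that may raise ConnectionError
        self._last_good = None   # last successfully fetched price

    def get_price(self) -> PriceResult:
        try:
            value = self._fetch()
        except ConnectionError:
            # The fault (e.g. a network partition) shows up here as an error.
            if self._last_good is None:
                raise  # nothing to fall back on: the error becomes a failure
            # Containment: serve the stale value and mark it as such.
            return PriceResult(self._last_good, stale=True)
        self._last_good = value
        return PriceResult(value, stale=False)
```

Marking the result as `stale` keeps the degradation visible to the caller rather than hiding it, which is one possible design choice among several.
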
### Erlang Philosophy

- **Let it crash**: instead of coding defensively, let the process fail and have a supervisor restart it
- **Process isolation**: one lightweight process per actor, shared-nothing
- **Hot code reload**: zero-downtime upgrades
- Often cited as a reason WhatsApp could serve roughly 900 million users with about 50 engineers (as reported in 2015)

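For readers who do not know Erlang, the "let it crash" idea can be approximated in Python. This is a minimal sketch, not OTP's implementation: `supervise` and `RestartLimitExceeded` are hypothetical names, and `intensity` and `period` only loosely mirror OTP's restart intensity (give up once more than `intensity` restarts happen within `period` seconds). Unlike an OTP `permanent` child, the worker here is not restarted after a normal return.

```python
import time


class RestartLimitExceeded(RuntimeError):
    """Raised when the worker crashes more often than the restart budget allows."""


def supervise(worker, intensity=5, period=10.0, clock=time.monotonic):
    """Run `worker`, restarting it on any crash until the budget is exhausted."""
    restarts = []  # timestamps of recent restarts
    while True:
        try:
            return worker()  # normal exit: stop supervising
        except Exception:
            now = clock()
            # Keep only the restarts that fall inside the sliding window.
            restarts = [t for t in restarts if now - t < period]
            restarts.append(now)
            if len(restarts) > intensity:
                raise RestartLimitExceeded(
                    f"{len(restarts)} restarts within {period}s"
                )
```

The worker itself contains no error handling. Recovery policy lives entirely in the supervisor, which is the separation the "let it crash" approach aims for.
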
### Applications

1. Erlang/OTP supervisor trees (telecom, WhatsApp, Discord).
2. Kubernetes pod restarts + liveness probes.
3. Circuit breakers (Hystrix, resilience4j).
4. Distributed databases (Cassandra hinted handoff, Spanner).

## 💻 Patterns

### Erlang Supervisor Tree

```erlang
-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Restart only the crashed child; give up after more than
    %% 5 restarts within 10 seconds.
    SupFlags = #{strategy => one_for_one,
                 intensity => 5,
                 period => 10},
    Children = [
        %% Assumes a `worker` module that exports start_link/0.
        #{id => worker1,
          start => {worker, start_link, []},
          restart => permanent,   % always restart this child
          shutdown => 5000,       % ms to wait before killing it on shutdown
          type => worker}
    ],
    {ok, {SupFlags, Children}}.
```

### Circuit Breaker (Python)

```python
from pybreaker import CircuitBreaker, CircuitBreakerError

# Open the circuit after 5 consecutive failures; allow a trial call after 60 s.
db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def query_db(sql: str):
    # `db` is assumed to be a database handle defined elsewhere.
    return db.execute(sql)

def get_users():
    try:
        return query_db("SELECT * FROM users")
    except CircuitBreakerError:
        # `cached_response` is assumed to be defined elsewhere.
        return cached_response()  # fallback
```

### Retry with Exponential Backoff

```python
import asyncio
import random

async def retry_with_backoff(fn, max_retries=5, base=1.0):
    """Await `fn()` up to `max_retries` times, backing off between attempts."""
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential delay plus up to 1 s of jitter to spread out retries.
            delay = base * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
```

### Bulkhead Pattern (Go)

```go
import (
	"context"

	"golang.org/x/sync/semaphore"
)

// Service gives each dependency its own concurrency budget.
type Service struct {
	dbSem  *semaphore.Weighted // 10 concurrent DB calls
	apiSem *semaphore.Weighted // 50 concurrent API calls
}

func NewService() *Service {
	return &Service{
		dbSem:  semaphore.NewWeighted(10),
		apiSem: semaphore.NewWeighted(50),
	}
}

func (s *Service) CallDB(ctx context.Context) error {
	// Blocks until a slot is free or ctx is cancelled.
	if err := s.dbSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer s.dbSem.Release(1)
	return doDBWork() // assumed to be defined elsewhere
}
```

### Kubernetes Liveness/Readiness

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                      # placeholder name
spec:
  containers:
    - name: app
      image: example/app:latest  # placeholder image
      livenessProbe:             # on failure the container is restarted
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      readinessProbe:            # on failure the pod stops receiving traffic
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
```

### Chaos Engineering (Litmus)

```yaml
# Fragment: a complete ChaosEngine also needs metadata and target application details.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
spec:
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds
              value: '60'
            - name: PODS_AFFECTED_PERC    # percentage of target pods
              value: '50'
```

### Saga Pattern (Compensation)

```python
# charge_card, reserve_stock, ship_order, and compensate are assumed
# to be async functions defined elsewhere.
class OrderSaga:
    async def execute(self, order):
        # Each completed step records the action that undoes it.
        steps = []
        try:
            payment = await charge_card(order)
            steps.append(("refund", payment.id))
            inventory = await reserve_stock(order)
            steps.append(("release", inventory.id))
            await ship_order(order)
        except Exception:
            # Undo completed steps in reverse order, then re-raise.
            for action, ref in reversed(steps):
                await compensate(action, ref)
            raise
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Telecom-grade uptime (5 nines) | Erlang/OTP supervisor tree |
| Microservices REST | Circuit breaker + retry + timeout |
| Stateful distributed DB | Quorum + hinted handoff |
| Container orchestration | K8s liveness/readiness + PodDisruptionBudget |
| Cross-service transactions | Saga + compensation |

**Default**: the trio of timeout + retry + circuit breaker, plus chaos testing.

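The three default mechanisms are usually layered around a single call. The following is a minimal standard-library sketch of one way to combine them. The `CircuitBreaker` class, `CircuitOpenError`, and `resilient_call` are hypothetical names written for this illustration (they are not the pybreaker API), and the thresholds are arbitrary.

```python
import asyncio
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=60.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def before_call(self):
        if self._opened_at is None:
            return
        if self._clock() - self._opened_at < self.reset_timeout:
            raise CircuitOpenError("circuit is open")
        # Reset timeout elapsed: let one trial call through (half-open).

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.fail_max:
            self._opened_at = self._clock()


async def resilient_call(fn, breaker, timeout=1.0, max_retries=3, base=0.1):
    """Await `fn()` with a per-attempt timeout, bounded retries, and a breaker."""
    for attempt in range(max_retries):
        breaker.before_call()  # fail fast while the circuit is open
        try:
            result = await asyncio.wait_for(fn(), timeout)
        except Exception:
            breaker.record_failure()
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            await asyncio.sleep(base * (2 ** attempt) + random.uniform(0, base))
        else:
            breaker.record_success()
            return result
```

The ordering matters in this sketch: the timeout bounds each attempt, the retry loop bounds the number of attempts, and the breaker check at the top of each iteration stops retries from hammering a dependency that is already known to be down.
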
## 🔗 Graph

- Parent: [[Distributed Systems]]
- Variant: [[Circuit Breaker]]
- Applications: [[Kubernetes]] · [[Microservices]]
- Adjacent: [[Chaos Engineering]] · [[Eventual Consistency]] · [[CAP Theorem]]

## 🤖 LLM Usage

**When to use**: distributed system design, for enumerating failure modes, designing supervisor trees, and recommending retry strategies.
**When not to use**: single-process scripts, where the overhead of fault tolerance outweighs its value.

## ❌ Anti-patterns

- **Catch-all exception swallow**: errors are logged and then ignored → silent corruption.
- **Infinite retry**: retrying without backoff → thundering herd, cascading failure.
- **Shared fate**: every service depends on a single DB → single point of failure.
- **No timeout**: a hung dependency exhausts its callers' resources.

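The first anti-pattern is easiest to see side by side with a stricter alternative. This is an illustrative sketch only: `validate`, `save_swallowing`, and `save_strict` are hypothetical helpers, and a plain `dict` stands in for a real data store.

```python
import logging

logger = logging.getLogger(__name__)


def validate(value: int) -> int:
    """Hypothetical validator: rejects negative values."""
    if value < 0:
        raise ValueError(f"negative value: {value}")
    return value


def save_swallowing(store: dict, key: str, value: int) -> None:
    """Anti-pattern: the error is logged and dropped, so the caller assumes success."""
    try:
        store[key] = validate(value)
    except Exception:
        logger.exception("save failed")


def save_strict(store: dict, key: str, value: int) -> None:
    """Handle only the expected error, then propagate it to the caller."""
    try:
        store[key] = validate(value)
    except ValueError:
        logger.warning("rejected invalid value for key %r", key)
        raise
```

With `save_swallowing`, a rejected write looks identical to a successful one from the caller's side. `save_strict` still logs, but re-raises so that a caller or supervisor can decide what to do.
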
## 🧪 Verification / Duplicates

- Verified (Joe Armstrong, "Making Reliable Distributed Systems in the Presence of Software Errors", 2003).
- Verified (Netflix Chaos Engineering principles, principlesofchaos.org).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Erlang/OTP + modern resilience patterns |