| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| testing-chaos-engineering | Chaos Engineering: intentional fault injection | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Chaos Engineering
Prepares a system for the real failures it will meet in prod by injecting faults on purpose (kill, slow, error) during normal operation. The idea comes from Netflix's Simian Army. Tools: Litmus / Chaos Mesh / Gremlin.
📖 Core concepts
- "Hope is not a strategy".
- Failures will happen, so verify the response during normal operation.
- Hypothesis-driven (if X is killed, Y must happen).
- Start with a small blast radius (staging → small slice of prod).
💻 Code patterns
The Chaos Monkey idea
"Kill a random EC2 instance."
→ Assumption: HA works.
→ What is actually verified: is the service still OK after 1 node goes down?
Built by Netflix (2010).
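The random-kill idea can be sketched in a few lines. This is only an illustration, not Netflix's implementation: `pickVictim`, `runChaosMonkey`, and the `terminate` callback are hypothetical names, and `terminate` stands in for whatever cloud API call would actually stop an instance.

```javascript
// Hypothetical sketch: `instances` is a plain array of instance ids.
// `random` is injectable so the choice can be made deterministic in tests.
function pickVictim(instances, random = Math.random) {
  if (instances.length === 0) return null; // nothing to kill
  return instances[Math.floor(random() * instances.length)];
}

// `terminate` is an assumed callback wrapping the real "kill instance" call.
function runChaosMonkey(instances, terminate, random = Math.random) {
  // Refuse to kill the only instance: the HA assumption needs a survivor.
  if (instances.length < 2) return null;
  const victim = pickVictim(instances, random);
  terminate(victim);
  return victim;
}
```

Refusing to act on a single-instance group keeps the experiment inside the "HA works" assumption instead of guaranteeing an outage.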
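Hypothesis-based experiment
"Hypothesis: if the Redis cache dies, p99 latency rises from 200ms to at most 500ms (degraded but OK)."
Experiment:
1. Measure steady-state metrics (normal)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the result
→ Result matches the expectation = OK. Result differs = fix the bug.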
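The final comparison step can be made explicit in code. A minimal sketch, assuming the hypothesis is expressed as an upper limit on one metric; `evaluateHypothesis` and its field names are hypothetical, and the observed value is made up for illustration.

```javascript
// Hypothetical helper: compares an observed metric against the limit
// stated in the hypothesis and reports how far it drifted from baseline.
function evaluateHypothesis({ name, baseline, limit, observed }) {
  const holds = observed <= limit;
  return {
    name,
    holds,
    // How far the metric moved from steady state, as a multiple of baseline.
    degradation: observed / baseline,
    verdict: holds ? 'OK: degraded within expectation' : 'BUG: fix and re-run',
  };
}

// Illustrative numbers: the hypothesis allowed 500ms, the run measured 1500ms.
const result = evaluateHypothesis({
  name: 'Redis down, p99 latency (ms)',
  baseline: 200,
  limit: 500,
  observed: 1500,
});
console.log(result.verdict);
```

Writing the limit down before the run is the point: the verdict is then a comparison, not an opinion formed after seeing the graphs.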
Chaos Mesh (K8s)
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-redis
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: redis
  # scheduler is Chaos Mesh 1.x syntax; 2.x moves recurring runs to a separate Schedule resource
  scheduler:
    cron: '@every 1h'
kubectl apply -f pod-kill.yaml
→ Kills one random Redis pod every hour.
Network latency
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'
→ The DB is about 500ms slower (with 100ms jitter) for 5 minutes.
Network partition
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
spec:
  action: partition
  mode: all            # required; here, every matching Redis pod
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all          # required on the target side as well
    selector:
      labelSelectors: { app: api }
→ The API cannot reach Redis.
CPU stress
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }
  duration: '10m'
Memory leak
  stressors:
    memory:
      workers: 1
      size: '1GB'
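IO chaos (disk)
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  delay: '500ms'
  volumePath: '/var/lib/postgresql'   # must be the volume's mount root (assumed here)
  path: '/var/lib/postgresql/**/*'    # files affected, as a glob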
HTTP chaos (fake responses)
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
spec:
  mode: one
  target: Request
  port: 80
  delay: '5s'
  abort: false
  selector: { labelSelectors: { app: api } }
→ HTTP requests are delayed by 5 seconds.
Litmus Chaos
# litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  appinfo:
    appns: production
    applabel: 'app=api'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
→ Deletes a pod every 10 seconds for 60 seconds.
Gremlin (managed SaaS)
# CLI (flags are illustrative; verify against the Gremlin CLI docs for your version)
gremlin attack-container --target docker --type cpu --cpu-percent 80 --length 60
gremlin attack --type latency --length 60 --ms 500
→ UI + scenario library.
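Application-level chaos
// Toxiproxy, driven through its HTTP API (default port 8474; the host is an assumption)
const TOXIPROXY = 'http://127.0.0.1:8474';
const post = (path, body) =>
  fetch(`${TOXIPROXY}${path}`, { method: 'POST', body: JSON.stringify(body) });
// Route the DB connection through the proxy
await post('/proxies', {
  name: 'postgres',
  listen: '127.0.0.1:5433',
  upstream: 'postgres:5432',
});
// Add 500ms of latency
await post('/proxies/postgres/toxics', { type: 'latency', attributes: { latency: 500 } });
→ Fine-grained control.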
Game day
One day devoted to chaos.
- Plan: 5 hypotheses
- Execute: 1 experiment per hour
- Observe: in real time
- Document: post-mortem style
→ The team learns how to operate through incidents.
Steady state
Define "what is normal?".
Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200ms
- Active user count
→ Measure the baseline before the experiment starts.
Compare values during the experiment against the baseline.
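A steady-state definition is easier to enforce once it is data rather than prose. A minimal sketch, using the two thresholds listed above; the metric names (`errorRate`, `p99LatencyMs`) and the `steadyStateViolations` helper are assumptions for illustration.

```javascript
// Thresholds taken from the list above; metric names are hypothetical.
const steadyState = {
  errorRate: { max: 0.001 },  // error rate < 0.1%
  p99LatencyMs: { max: 200 }, // p99 latency < 200ms
};

// Returns the names of metrics that violate the steady-state definition.
// A metric missing from the sample counts as a violation, since an
// unmeasured metric cannot be shown to be healthy.
function steadyStateViolations(sample, definition = steadyState) {
  return Object.entries(definition)
    .filter(([metric, { max }]) => !(sample[metric] < max))
    .map(([metric]) => metric);
}
```

Running the same check on the pre-experiment baseline and on samples taken during the experiment gives a like-for-like comparison.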
Blast radius
Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod
→ Verify at each stage.
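The staged rollout can be written as a small gate: the radius only grows after the current stage passes. A sketch under that assumption; the `stages` table and `nextStage` are hypothetical names, and the percentages follow the five stages listed above.

```javascript
// Stages as listed above; prodPercent is the share of prod traffic exposed.
const stages = [
  { name: 'dev', prodPercent: 0 },
  { name: 'staging', prodPercent: 0 },
  { name: 'prod canary', prodPercent: 1 },
  { name: 'prod partial', prodPercent: 10 },
  { name: 'prod full', prodPercent: 100 },
];

// Advance only when the current stage passed; otherwise stay put
// so the fix can be verified at the same radius before expanding.
function nextStage(currentIndex, passed) {
  if (!passed) return currentIndex;
  return Math.min(currentIndex + 1, stages.length - 1);
}
```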
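Auto-rollback
// Chaos experiment with an abort guard.
// killPod, errorRate and log are assumed helpers defined elsewhere.
async function chaosWithGuard(experiment, abortIf) {
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor); // abort once, not on every later tick
      experiment.stop();
      log.warn('chaos aborted');
    }
  }, 5000); // check every 5 seconds
  try {
    await experiment.run();
  } finally {
    clearInterval(monitor); // stop polling even if run() throws
  }
}
await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1, // abort above a 10% error rate
);
→ Error rate > 10% = the experiment stops at the next 5-second check.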
Common experiments
1. Kill 1 instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB master: verifies failover
4. Cache down: does the DB hold up?
5. Slow network: do timeouts / retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: alert + graceful degradation?
8. Dependency down: is there a fallback?
Failure injection in code
// Chaos middleware (test environments only)
app.use((req, res, next) => {
  if (process.env.CHAOS && Math.random() < 0.01) {
    return res.status(503).end(); // 1% random 503
  }
  next();
});
→ Chaos runs continuously in staging.
Service mesh chaos (Istio)
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
spec:
  hosts: [backend]   # required; matches the route destination
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }
→ 10% of requests get a 5s delay, 1% get HTTP 500.
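LLM chaos
// The LLM API is unreliable, so every call goes through retries.
// retryWithBackoff, sleep and llm are assumed helpers defined elsewhere.
const r = await retryWithBackoff(async () => {
  // Chaos: simulate an API timeout on 5% of attempts
  if (Math.random() < 0.05) await sleep(60_000);
  return llm.complete(prompt);
});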
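The snippet above calls `retryWithBackoff` and `sleep` without defining them. One possible shape for those helpers is sketched below; the option names (`retries`, `baseMs`, `timeoutMs`) and their defaults are assumptions, not part of any library. A per-attempt timeout is included because an injected 60-second stall only exercises the retry path if the caller gives up on the slow attempt.

```javascript
// Hypothetical helpers; names and defaults are assumptions for illustration.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject if fn() has not settled within timeoutMs.
function withTimeout(fn, timeoutMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('timeout')), timeoutMs);
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// Try fn once, then up to `retries` more times, doubling the wait after each failure.
async function retryWithBackoff(fn, { retries = 3, baseMs = 100, timeoutMs = 10_000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await withTimeout(fn, timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(baseMs * 2 ** attempt);
    }
  }
  throw lastError; // every attempt failed or timed out
}
```

The timeout only stops waiting; it does not cancel the underlying request, so a real client would also want an abort signal.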
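Result analysis
After the experiment:
- Did steady state recover?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What was learned?
→ Document. Fix bugs. Re-run.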
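The "how long did recovery take?" question can be answered from the metric samples collected after the fault is cleared. A minimal sketch, assuming samples are sorted by time and shaped as `{ t, p99Ms }`; both the shape and the `recoveryTime` helper are hypothetical.

```javascript
// samples: [{ t: secondsSinceFaultCleared, p99Ms }] sorted by t (assumed shape).
// Returns the first t from which every later sample stays under the limit,
// or null if the metric never settled within the samples given.
function recoveryTime(samples, limitMs) {
  let recoveredAt = null;
  for (const { t, p99Ms } of samples) {
    if (p99Ms < limitMs) {
      if (recoveredAt === null) recoveredAt = t;
    } else {
      recoveredAt = null; // relapse: recovery has to start over
    }
  }
  return recoveredAt;
}
```

Resetting on a relapse avoids reporting a recovery time for a metric that dipped under the limit once and then degraded again.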
Postmortem style (for your own experiments)
## Chaos: 2026-05-09 Redis kill
Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.
Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timeout)
- Recovery: 30 sec after Redis up
Action items:
- [ ] Increase DB query timeout (5s → 10s)
- [ ] Raise connection pool max
- [ ] Re-test
When to start
Pre-req:
- Monitoring (RED metrics)
- Alerting (PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team able to operate it
→ Without these, chaos becomes a real incident.
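The prerequisite list can double as a gate that runs before any experiment is scheduled. A small sketch, assuming readiness is tracked as a plain object of booleans; the key names and `missingPrerequisites` are hypothetical.

```javascript
// Keys correspond to the pre-req list above; the names are assumptions.
const prerequisites = ['monitoring', 'alerting', 'ha', 'runbook', 'teamReady'];

// Returns the prerequisites not explicitly marked true.
// Anything unset is treated as missing rather than assumed present.
function missingPrerequisites(status) {
  return prerequisites.filter((item) => status[item] !== true);
}

const missing = missingPrerequisites({ monitoring: true, alerting: true, ha: true });
if (missing.length > 0) {
  console.log(`Do not start chaos yet, missing: ${missing.join(', ')}`);
}
```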
Adoption story
- Netflix: Simian Army (2011+).
- Amazon: GameDay (started by Jesse Robbins).
- Slack: Disasterpiece Theater (recurring exercises).
- LinkedIn: Waterbear.
- Gremlin (company): Failure-as-a-Service.
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual |
| K8s | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | One dedicated day |
| App-level | In-app middleware |
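❌ Anti-patterns
- First experiment in prod: an incident.
- No steady-state definition: results cannot be interpreted.
- Large blast radius: a real incident.
- No auto-abort: chaos turns into an incident.
- No documentation: nothing is learned.
- No monitoring: no results to read.
- System without HA: chaos breaks it.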
🤖 Hints for LLM use
- Chaos = hypothesis-based experiments.
- Pre-req: monitoring + HA + runbook.
- Expand the blast radius gradually.
- Auto-abort + steady state are mandatory.