---
id: testing-chaos-engineering
title: "Chaos Engineering: intentional fault injection"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [testing, chaos, resilience, vibe-coding]
tech_stack: { language: "any", applicable_to: ["Backend", "DevOps"] }
applied_in: []
aliases: [chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus]
---

# Chaos Engineering

> Prepare the system for real production failures by **injecting faults on purpose (kill, slow, error) during normal operation**. The idea comes from Netflix's Simian Army. Tools: Litmus, Chaos Mesh, Gremlin.

## 📖 Core concepts
- "Hope is not a strategy".
- Failures will happen, so verify resilience during normal operation.
- Hypothesis-driven: if X is killed, Y should happen.
- Start with a small blast radius (staging → a small slice of prod).

## 💻 Code patterns

### The Chaos Monkey idea
```
"Kill a random EC2 instance."

→ Assumption: HA works.
→ Actual check: is the service still OK after 1 node goes down?

Built by Netflix (2010).
```

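### Hypothesis-based experiment
```
"Hypothesis: if the Redis cache dies, p99 latency rises from 200ms to at most 500ms (degraded but acceptable)."

Experiment:
1. Measure steady-state metrics (normal operation)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the results
```

→ If the result matches the hypothesis, the system behaves as designed. If it differs, you have found a bug to fix.
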
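
The pass/fail decision in step 5 can be made mechanical. The following is a minimal sketch, not part of any chaos tool: the `Hypothesis`, `Observation`, and `evaluateHypothesis` names are hypothetical, and the 0.1% error bound is an assumption borrowed from the steady-state targets used elsewhere in this note.

```typescript
// Hypothetical types for illustration; not part of any chaos tool's API.
interface Hypothesis {
  description: string;
  maxP99Ms: number;     // upper bound on p99 latency while the fault is active
  maxErrorRate: number; // upper bound on error rate (0..1) while the fault is active
}

interface Observation {
  p99Ms: number;
  errorRate: number;
}

interface Verdict {
  holds: boolean;
  violations: string[];
}

// Compare what was measured during the fault against the stated bounds.
function evaluateHypothesis(h: Hypothesis, observed: Observation): Verdict {
  const violations: string[] = [];
  if (observed.p99Ms > h.maxP99Ms) {
    violations.push(`p99 ${observed.p99Ms}ms exceeds ${h.maxP99Ms}ms`);
  }
  if (observed.errorRate > h.maxErrorRate) {
    violations.push(`error rate ${observed.errorRate} exceeds ${h.maxErrorRate}`);
  }
  return { holds: violations.length === 0, violations };
}

const redisDown: Hypothesis = {
  description: "Redis down: p99 stays within 500ms, errors stay under 0.1%",
  maxP99Ms: 500,
  maxErrorRate: 0.001, // assumed bound
};

// Example measurement taken during the 5-minute fault window.
const verdict = evaluateHypothesis(redisDown, { p99Ms: 1500, errorRate: 0.003 });
console.log(verdict.holds, verdict.violations);
```

Writing the bounds down before the experiment keeps the analysis honest: the verdict depends only on numbers agreed in advance.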
### Chaos Mesh (K8s)
```yaml
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-redis
spec:
  action: pod-kill
  mode: one  # pick one matching pod at random
  selector:
    namespaces: [production]
    labelSelectors:
      app: redis
  # Chaos Mesh 1.x syntax; 2.x moves scheduling to a separate Schedule resource.
  scheduler:
    cron: '@every 1h'
```

```bash
kubectl apply -f pod-kill.yaml
```

→ Kills one random Redis pod every hour. Point `namespaces` at staging before trying this in production.

### Network latency
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'
```

→ Adds about 500ms (±100ms jitter) of network delay on one Postgres pod for 5 minutes.

### Network partition
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-redis-api  # illustrative name
spec:
  action: partition
  mode: all                  # added: mode is a required field
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all
    selector:
      labelSelectors: { app: api }
```

→ The API pods cannot talk to Redis in either direction.

### CPU stress
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api  # illustrative name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }  # 4 workers, each at 80% load
  duration: '10m'
```

### Memory pressure (simulated leak)
```yaml
# Fragment: goes under `spec` of a StressChaos resource
stressors:
  memory:
    workers: 1
    size: '1GB'
```

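### IO chaos (disk)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: slow-disk-postgres  # illustrative name
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  # Assumption: the data volume is mounted here; Chaos Mesh 2.x requires volumePath.
  volumePath: /var/lib/postgresql
  path: '/var/lib/postgresql/**/*'
  delay: '500ms'
```
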
### HTTP chaos (delayed or faked traffic)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: slow-api-requests  # illustrative name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  target: Request
  port: 80
  delay: '5s'
  abort: false
```

→ HTTP requests on port 80 are delayed by 5 seconds.

### Litmus Chaos
```yaml
# Litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  engineState: active
  appinfo:
    appns: production
    applabel: 'app=api'
    appkind: deployment               # assumption: the target is a Deployment
  chaosServiceAccount: pod-delete-sa  # assumption: a service account with pod-delete RBAC exists
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
```

→ Deletes a pod every 10 seconds for 60 seconds.

### Gremlin (managed SaaS)
```bash
# CLI. Flag names here are illustrative; check the Gremlin CLI reference
# for the exact syntax of your installed version.
gremlin attack-container --target docker --type cpu --cpu-percent 80 --length 60

gremlin attack --type latency --length 60 --ms 500
```

→ Also offers a web UI and a scenario library.

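### Application-level chaos
```ts
// Toxiproxy sits between the app and the database.
// Assumption: the toxiproxy-node-client package and a Toxiproxy server on its
// default API port 8474; method names may differ between client versions.
import { Toxiproxy } from 'toxiproxy-node-client';

const toxiproxy = new Toxiproxy('http://localhost:8474');

// The app connects to 127.0.0.1:5433 instead of postgres:5432.
const proxy = await toxiproxy.createProxy({
  name: 'postgres',
  listen: '127.0.0.1:5433',
  upstream: 'postgres:5432',
});

// Add 500ms of latency to data flowing back through the proxy.
await proxy.addToxic({
  name: 'db-latency',
  type: 'latency',
  stream: 'downstream',
  toxicity: 1.0,
  attributes: { latency: 500, jitter: 0 },
});
```

→ Fine-grained control per connection, without touching the cluster.
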
### Game day
```
One day dedicated to chaos experiments.
- Plan: 5 hypotheses
- Execute: 1 experiment per hour
- Observe: in real time
- Document: postmortem style

→ The team practices incident response.
```

### Steady state
```
Define what "normal" means.

Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200ms
- Active user count

→ Measure the baseline before the experiment starts.
Compare measurements during the experiment against the baseline.
```

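
The baseline comparison can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `Sample` shape and the function names are hypothetical, the percentile uses the nearest-rank method (one of several common definitions), and the thresholds are the 200ms and 0.1% targets listed above.

```typescript
// Hypothetical sample shape: one entry per observed request.
interface Sample {
  latencyMs: number;
  ok: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Reduce raw samples to the two metrics used in the steady-state definition.
function summarize(samples: Sample[]): { p99Ms: number; errorRate: number } {
  const p99Ms = percentile(samples.map((s) => s.latencyMs), 99);
  const errorRate = samples.filter((s) => !s.ok).length / samples.length;
  return { p99Ms, errorRate };
}

// Thresholds taken from the metric list above.
function withinSteadyState(s: { p99Ms: number; errorRate: number }): boolean {
  return s.p99Ms < 200 && s.errorRate < 0.001;
}

// Example: 100 successful requests with latencies 1ms..100ms.
const demoSamples: Sample[] = Array.from({ length: 100 }, (_, i) => ({
  latencyMs: i + 1,
  ok: true,
}));
console.log(summarize(demoSamples), withinSteadyState(summarize(demoSamples)));
```

Run the same summary once before the fault (baseline) and again during it; the experiment report then compares two small objects instead of two dashboards.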
### Blast radius
```
Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod

→ Verify at each stage before moving to the next.
```

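
One way to implement the prod percentages is deterministic bucketing: hash each target ID into a bucket from 0 to 99 and include it when the bucket is below the stage percentage. The sketch below assumes string target IDs and uses an FNV-1a hash; the `fnv1a` and `inBlastRadius` names are hypothetical and not taken from any chaos tool.

```typescript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// A target is in scope when its stable bucket (0..99) falls below the stage percentage.
function inBlastRadius(targetId: string, percent: number): boolean {
  return fnv1a(targetId) % 100 < percent;
}

// Example: how many of 200 hypothetical pods each prod stage would touch.
const pods = Array.from({ length: 200 }, (_, i) => `api-${i}`);
for (const percent of [1, 10, 100]) {
  const selected = pods.filter((id) => inBlastRadius(id, percent));
  console.log(`${percent}% stage: ${selected.length} of ${pods.length} pods`);
}
```

Because the bucket depends only on the ID, the stages are nested: every target hit at 1% is also hit at 10%, so widening the radius never swaps in an untested set of targets.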
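### Auto-rollback
```ts
// Chaos experiment with an abort guard.
interface Experiment {
  run(): Promise<void>;
  stop(): void;
}

async function chaosWithGuard(experiment: Experiment, abortIf: () => boolean): Promise<void> {
  // Poll the abort condition every 5 seconds while the experiment runs.
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor); // stop polling so stop() is called only once
      experiment.stop();
      console.warn('chaos aborted');
    }
  }, 5000);

  try {
    await experiment.run();
  } finally {
    clearInterval(monitor); // always release the timer, even if run() throws
  }
}

// killPod and errorRate are placeholders for your own tooling.
await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1,
);
```

→ Stops the experiment once the error rate exceeds 10%, within one 5-second polling interval.
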
### Common experiments
```
1. Kill 1 instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB primary: verifies failover
4. Cache down: can the DB absorb the load?
5. Slow network: do timeouts and retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: does it alert and degrade gracefully?
8. Dependency down: is there a fallback?
```

### Failure injection in code
```ts
// Chaos middleware (test environments only).
// Enabled only when the CHAOS environment variable is set.
app.use((req, res, next) => {
  if (process.env.CHAOS && Math.random() < 0.01) {
    return res.status(503).end(); // fail 1% of requests with a 503
  }
  next();
});
```

→ Chaos runs continuously in staging.

### Service mesh chaos (Istio)
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: backend-faults  # illustrative name
spec:
  hosts: [backend]
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }
```

→ 10% of requests get a 5-second delay; 1% are aborted with HTTP 500.

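### LLM chaos
```ts
// LLM APIs are unreliable, so wrap every call in retry with backoff.
// retryWithBackoff, sleep, and llm are placeholders for your own helpers.
const r = await retryWithBackoff(async () => {
  // Chaos: 5% of calls stall for 60 seconds to simulate an API timeout.
  if (Math.random() < 0.05) await sleep(60_000);
  return llm.complete(prompt);
});
```
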
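
The snippet above leaves `retryWithBackoff` undefined. One possible shape is sketched below, assuming exponential backoff with a cap and a per-attempt timeout so that a stalled call counts as a failure. All names and default values here (`backoffDelays`, `withTimeout`, 4 attempts, a 30-second timeout) are assumptions for illustration, not a library API.

```typescript
// Delays between attempts: base * 2^i, capped. N attempts need N-1 delays.
function backoffDelays(attempts: number, baseMs: number, capMs: number): number[] {
  return Array.from({ length: Math.max(attempts - 1, 0) }, (_, i) =>
    Math.min(baseMs * 2 ** i, capMs),
  );
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reject if the work does not settle within `ms`; always clear the timer.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

interface RetryOptions {
  attempts: number; // total tries, at least 1
  baseMs: number;
  capMs: number;
  timeoutMs: number;
}

// Try fn up to `attempts` times, sleeping between failures; rethrow the last error.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  opts: RetryOptions = { attempts: 4, baseMs: 500, capMs: 8000, timeoutMs: 30_000 },
): Promise<T> {
  const delays = backoffDelays(opts.attempts, opts.baseMs, opts.capMs);
  let lastError: unknown;
  for (let i = 0; i < opts.attempts; i++) {
    try {
      return await withTimeout(fn(), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      if (i < delays.length) await sleep(delays[i]);
    }
  }
  throw lastError;
}

console.log(backoffDelays(4, 500, 8000));
```

With a 30-second per-attempt timeout, the injected 60-second stall is cut short and retried, which is the behaviour the chaos line is meant to exercise.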
### Result analysis
```
After the experiment:
- Did steady state recover?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What did we learn?

→ Document. Fix bugs. Re-run.
```

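
The "how long did recovery take?" question can be answered from a metric time series. The sketch below is one possible definition, stated as an assumption: recovery is the start of the first run of `stablePoints` consecutive samples below the threshold after the fault ends. The `Point` shape and `recoveryTimeMs` name are hypothetical.

```typescript
// One p99 measurement at a point in time (milliseconds since an arbitrary origin).
interface Point {
  tMs: number;
  p99Ms: number;
}

// Time from the end of the fault to the first sample of a stable run below the
// threshold, or null if the series never stabilizes.
function recoveryTimeMs(
  series: Point[],
  faultEndMs: number,
  thresholdMs: number,
  stablePoints = 3,
): number | null {
  const after = series
    .filter((p) => p.tMs >= faultEndMs)
    .sort((a, b) => a.tMs - b.tMs);
  let run = 0;
  for (let i = 0; i < after.length; i++) {
    run = after[i].p99Ms < thresholdMs ? run + 1 : 0;
    if (run === stablePoints) {
      // Date the recovery to the first point of the stable run.
      return after[i - stablePoints + 1].tMs - faultEndMs;
    }
  }
  return null;
}

// Example: p99 sampled every 10 seconds after the fault is lifted at t = 0.
const series: Point[] = [1500, 900, 400, 180, 170, 190, 185].map((p99Ms, i) => ({
  tMs: i * 10_000,
  p99Ms,
}));
console.log(recoveryTimeMs(series, 0, 200)); // 30000
```

Requiring several consecutive good samples avoids reporting a recovery on a single lucky measurement while the system is still flapping.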
### Postmortem-style write-up (for your own experiments)
```markdown
## Chaos: 2026-05-09 Redis kill

Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.

Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timed out)
- Recovery: 30 sec after Redis came back up

Action items:
- [ ] Increase DB query timeout (5s → 10s)
- [ ] Raise the connection pool max
- [ ] Re-test
```

### When to start
```
Prerequisites:
- Monitoring (RED metrics)
- Alerting (e.g. PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team that can operate the system

→ Without these, chaos turns into a real incident.
```

### Adoption story
- **Netflix**: Simian Army (2011+).
- **Amazon**: GameDay (started by Jesse Robbins).
- **Slack**: Disasterpiece Theater (recurring failure exercises).
- **LinkedIn**: Waterbear.
- **Gremlin (company)**: Failure-as-a-Service.

## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual experiments |
| K8s | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | Dedicated one-day event |
| App-level | In-app middleware |

## ❌ Anti-patterns
- **First experiment in prod**: it becomes an incident.
- **No steady-state definition**: you cannot interpret the results.
- **Large blast radius**: it becomes a real incident.
- **No auto-abort**: chaos becomes an incident.
- **No documentation**: nothing is learned.
- **No monitoring**: there are no results to analyze.
- **System without HA**: chaos simply breaks it.

## 🤖 LLM usage hints
- Chaos = hypothesis-based experiments.
- Prerequisites: monitoring + HA + runbook.
- Widen the blast radius gradually.
- Auto-abort and a steady-state definition are mandatory.

## 🔗 Related documents
- [[Backend_Circuit_Breaker]]
- [[Productivity_Postmortem]]
- [[DevOps_Disaster_Recovery]]