2nd/10_Wiki/Topics/Coding/Testing_Chaos_Engineering.md

---
id: testing-chaos-engineering
title: "Chaos Engineering: intentional fault injection"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [testing, chaos, resilience, vibe-coding]
tech_stack: { language: "any", applicable_to: ["Backend", "DevOps"] }
applied_in: []
aliases: [chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus]
---

# Chaos Engineering

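> Prepare the system for the failures it will actually see in production by **injecting faults on purpose (kill, slow down, return errors) during normal operation**. The idea comes from Netflix's Simian Army. Common tools: Litmus, Chaos Mesh, Gremlin.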

## 📖 Core concepts
- "Hope is not a strategy."
- Failures will happen, so verify the response during normal operation.
- Hypothesis-driven: "if we kill X, Y should happen."
- Start with a small blast radius (staging → a small slice of prod).

## 💻 코드 패턴

### The idea behind Chaos Monkey
```
"Terminate a random EC2 instance."

→ Assumption: HA works.
→ What it verifies: is the service still OK after one node goes down?

Built by Netflix (2010).
```
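
The selection step itself is small. The following is a minimal sketch, not Chaos Monkey's actual implementation: `Instance`, `pickVictim`, and the opt-out rule are illustrative assumptions. The random source is injectable so the choice can be made deterministic in tests.

```ts
// Hypothetical instance record; field names are illustrative.
interface Instance {
  id: string;
  group: string;      // e.g. an auto-scaling group name
  optedOut: boolean;  // owners can exclude instances from chaos
}

// Picks one random eligible instance, or undefined if none is eligible.
function pickVictim(
  instances: Instance[],
  random: () => number = Math.random,
): Instance | undefined {
  const eligible = instances.filter((i) => !i.optedOut);
  if (eligible.length === 0) return undefined;
  return eligible[Math.floor(random() * eligible.length)];
}
```

The caller would pass the result to whatever terminates instances in its environment; that part is deliberately left out here.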

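### Hypothesis-driven experiments
```
"Hypothesis: if the Redis cache dies, p99 latency rises from 200ms but stays within 500ms (degraded but acceptable)."

Experiment:
1. Measure steady-state metrics (normal operation)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the results
```

→ Result matches the hypothesis: OK. Result differs: you found a bug to fix.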
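
The five steps can be wrapped in a small harness. This is a minimal sketch under stated assumptions, not a real tool: `Experiment`, `runExperiment`, and the callbacks are hypothetical names, and the 5-minute observation window is reduced to a single probe for brevity.

```ts
// Hypothetical harness for one hypothesis-driven chaos experiment.
interface Experiment {
  name: string;
  probe: () => Promise<number>;   // returns the metric under test, e.g. p99 latency in ms
  inject: () => Promise<void>;    // introduces the fault, e.g. kills Redis
  restore: () => Promise<void>;   // undoes the fault
  hypothesis: (baseline: number, observed: number) => boolean;
}

interface ExperimentResult {
  name: string;
  baseline: number;
  observed: number;
  hypothesisHeld: boolean;
}

async function runExperiment(exp: Experiment): Promise<ExperimentResult> {
  const baseline = await exp.probe();        // 1. measure steady state
  await exp.inject();                        // 2. inject the fault
  try {
    const observed = await exp.probe();      // 3. measure under fault
    return {                                 // 5. analyze
      name: exp.name,
      baseline,
      observed,
      hypothesisHeld: exp.hypothesis(baseline, observed),
    };
  } finally {
    await exp.restore();                     // 4. always restore, even if probing throws
  }
}
```

Putting `restore` in `finally` is the main design choice: a failed measurement should never leave the fault in place.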

### Chaos Mesh (K8s)
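```yaml
# pod-kill.yaml
# Chaos Mesh 2.x: recurring experiments use the Schedule kind
# (the inline `scheduler` field only exists in 1.x).
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: kill-redis
spec:
  schedule: '@every 1h'
  type: PodChaos
  concurrencyPolicy: Forbid
  historyLimit: 5
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces: [production]
      labelSelectors:
        app: redis
```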

```bash
kubectl apply -f pod-kill.yaml
```

→ Kills one random Redis pod every hour.

### Network latency
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'
```

→ Traffic to the DB is about 500ms slower (±100ms jitter) for 5 minutes.

### Network partition
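```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-api-redis
spec:
  action: partition
  mode: all              # required: which of the selected pods to affect
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all
    selector:
      labelSelectors: { app: api }
```

→ The API and Redis cannot reach each other.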

### CPU stress
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }
  duration: '10m'
```

### Memory pressure (simulated leak)
```yaml
# Fragment: goes under `spec` of a StressChaos resource
stressors:
  memory:
    workers: 1
    size: '1GB'
```

### IO chaos (disk)
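```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: slow-disk-postgres
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  volumePath: '/var/lib/postgresql'   # required: volume mount point (assumed here)
  path: '/var/lib/postgresql/**/*'    # optional glob of files to affect
  delay: '500ms'
```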

### HTTP chaos (tampering with HTTP traffic)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: slow-http-api
spec:
  mode: one
  target: Request
  port: 80
  delay: '5s'
  abort: false
  selector: { labelSelectors: { app: api } }
```

→ Delays HTTP requests on port 80 by 5 seconds.

### Litmus Chaos
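```yaml
# Litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
  namespace: production
spec:
  engineState: active
  chaosServiceAccount: pod-delete-sa   # assumed name; must exist with the experiment's RBAC
  appinfo:
    appns: production
    applabel: 'app=api'
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
```

→ Deletes a pod every 10 seconds for 60 seconds.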

### Gremlin (managed SaaS)
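```bash
# CLI (flags can differ between Gremlin versions; confirm with `gremlin help attack`)
# CPU attack against one container: 80% load for 60 seconds
gremlin attack-container <container-id> cpu --length 60 --percent 80

# Host-level latency attack: add 500ms for 60 seconds
gremlin attack latency --length 60 --ms 500
```

→ Also provides a web UI and a scenario library.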

### Application-level chaos
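```ts
// Toxiproxy, or in-app fault injection.
// Assumes the toxiproxy-node-client package and a Toxiproxy server on its
// default API port 8474; exact typings vary by client version.
import { Toxiproxy } from 'toxiproxy-node-client';

const toxiproxy = new Toxiproxy('http://localhost:8474');

// Route DB connections through Toxiproxy: the app connects to 127.0.0.1:5433
const proxy = await toxiproxy.createProxy({
  name: 'postgres',
  listen: '127.0.0.1:5433',
  upstream: 'postgres:5432',
});

// Add 500ms of latency
await proxy.addToxic({ type: 'latency', attributes: { latency: 500, jitter: 0 } });
```

→ Gives fine-grained control.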
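
For the in-app variant, the same latency fault can be injected by wrapping the call itself. This is a minimal sketch assuming the dependency is reached through an async function; `withInjectedLatency` is a hypothetical helper, not part of any library.

```ts
// Wraps an async function so every call waits latencyMs (± jitterMs) before running.
function withInjectedLatency<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  latencyMs: number,
  jitterMs = 0,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const offset = (Math.random() * 2 - 1) * jitterMs;   // uniform in [-jitterMs, +jitterMs]
    const delay = Math.max(0, latencyMs + offset);
    await new Promise<void>((resolve) => setTimeout(resolve, delay));
    return fn(...args);
  };
}
```

Usage would look like `const slowQuery = withInjectedLatency(runQuery, 500, 100)`, where `runQuery` stands in for the application's own DB call. Unlike a proxy, this only affects code paths that go through the wrapper.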

### Game day
```
Dedicate one day to chaos.
- Plan: 5 hypotheses
- Execute: one experiment per hour
- Observe: in real time
- Document: postmortem style

→ The team practices incident response.
```

### Steady state
```
Define what "normal" means.

Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200ms
- Active user count

→ Measure the baseline before the experiment starts.
Compare values during the experiment against that baseline.
```
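
A steady-state definition becomes useful once it is checkable in code. The sketch below assumes metrics arrive as a plain snapshot object; `SteadyState`, `steadyStateViolations`, and the 20% traffic-drop tolerance are illustrative assumptions, while the error-rate and latency thresholds come from the list above.

```ts
// Hypothetical steady-state snapshot; field names are illustrative.
interface SteadyState {
  requestsPerMinute: number;
  errorRate: number;       // fraction: 0.001 means 0.1%
  p99LatencyMs: number;
}

// Returns the violated conditions; an empty list means the system is in steady state.
function steadyStateViolations(
  baseline: SteadyState,
  observed: SteadyState,
  maxTrafficDrop = 0.2,    // assumption: tolerate up to a 20% drop vs baseline
): string[] {
  const violations: string[] = [];
  if (observed.errorRate >= 0.001) violations.push('error rate >= 0.1%');
  if (observed.p99LatencyMs >= 200) violations.push('p99 latency >= 200ms');
  if (observed.requestsPerMinute < baseline.requestsPerMinute * (1 - maxTrafficDrop)) {
    violations.push('request rate dropped more than allowed vs baseline');
  }
  return violations;
}
```

Returning the list of violations rather than a boolean makes the experiment log say which condition broke.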

### Blast radius
```
Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod

→ Verify at each stage before moving on.
```
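
The staged expansion can be expressed as a loop that refuses to advance after a failed verification. This is a minimal sketch; `Stage`, `rollout`, and `expandBlastRadius` are hypothetical names, and `runAndVerify` stands in for whatever runs the experiment and checks steady state in a given environment.

```ts
type Stage = { env: 'dev' | 'staging' | 'prod'; percent: number };

const rollout: Stage[] = [
  { env: 'dev', percent: 100 },
  { env: 'staging', percent: 100 },
  { env: 'prod', percent: 1 },     // canary
  { env: 'prod', percent: 10 },
  { env: 'prod', percent: 100 },
];

// Runs stages in order and stops at the first one whose verification fails.
// Returns the stages that passed.
async function expandBlastRadius(
  stages: Stage[],
  runAndVerify: (stage: Stage) => Promise<boolean>,
): Promise<Stage[]> {
  const passed: Stage[] = [];
  for (const stage of stages) {
    if (!(await runAndVerify(stage))) break;
    passed.push(stage);
  }
  return passed;
}
```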

### Auto-rollback
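```ts
// Run a chaos experiment under a guard that aborts it when abortIf() turns true.
// killPod() and errorRate() are app-specific helpers, not shown here.
async function chaosWithGuard(experiment, abortIf) {
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor);          // stop polling so stop() runs only once
      experiment.stop();
      console.warn('chaos aborted');
    }
  }, 5000);

  try {
    await experiment.run();
  } finally {
    clearInterval(monitor);            // always stop the monitor, even if run() throws
  }
}

await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1,
);
```

→ Error rate above 10%: the experiment stops at the next 5-second check.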
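
The guard depends on an `errorRate()` function that the snippet leaves undefined. One possible shape is a sliding window over recent requests; `ErrorRateWindow` below is a hypothetical helper, and in practice the number would more likely come from the monitoring system.

```ts
// Hypothetical sliding-window error-rate tracker.
class ErrorRateWindow {
  private samples: { at: number; ok: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Record the outcome of one request; `now` is injectable for tests.
  record(ok: boolean, now: number = Date.now()): void {
    this.samples.push({ at: now, ok });
  }

  // Fraction of failed requests within the last windowMs; 0 when there was no traffic.
  rate(now: number = Date.now()): number {
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
    if (this.samples.length === 0) return 0;
    const failures = this.samples.filter((s) => !s.ok).length;
    return failures / this.samples.length;
  }
}
```

With `const errors = new ErrorRateWindow(30_000)`, the abort condition becomes `() => errors.rate() > 0.1`. Note that returning 0 on no traffic means a total outage upstream of the recorder would not trigger the abort, so a request-rate check is a sensible companion.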

### Common experiments
```
1. Kill one instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB primary: verifies failover
4. Cache down: does the DB hold up?
5. Slow network: do timeouts and retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: does it alert and degrade gracefully?
8. Dependency down: is there a fallback?
```

### Failure injection in code
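```ts
// Chaos middleware (test environments only)
app.use((req, res, next) => {
  // Compare explicitly: any non-empty string, even "false", is truthy
  if (process.env.CHAOS === 'true' && Math.random() < 0.01) {
    return res.status(503).end();  // 1% random 503
  }
  next();
});
```

→ In staging, chaos like this can run continuously.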
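
Pulling the decision out of the middleware makes it testable and lets health checks stay clean. This is a sketch under assumptions: `ChaosConfig`, `shouldInjectFault`, and the excluded-path rule are illustrative additions, not something the middleware above requires.

```ts
// Hypothetical chaos configuration; field names are illustrative.
interface ChaosConfig {
  enabled: boolean;
  failureRate: number;       // fraction of requests to fail, e.g. 0.01 for 1%
  excludedPaths: string[];   // path prefixes that never receive faults
}

// Decides whether one request should receive an injected fault.
function shouldInjectFault(
  path: string,
  config: ChaosConfig,
  random: () => number = Math.random,
): boolean {
  if (!config.enabled) return false;
  if (config.excludedPaths.some((prefix) => path.startsWith(prefix))) return false;
  return random() < config.failureRate;
}
```

The middleware body would then reduce to `if (shouldInjectFault(req.path, config)) return res.status(503).end();`.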

### Service mesh chaos (Istio)
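```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: backend-fault
spec:
  hosts: [backend]        # required: the service this rule applies to
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }
```

→ 10% of requests are delayed by 5 seconds; 1% are aborted with HTTP 500.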

### LLM chaos
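```ts
// LLM APIs are unreliable, so calls go through retry with backoff.
// llm, retryWithBackoff, and sleep are app-level helpers, not shown here.
const r = await retryWithBackoff(async () => {
  // Chaos: 5% of calls stall for 60s to simulate an API timeout
  if (process.env.CHAOS === 'true' && Math.random() < 0.05) await sleep(60_000);
  return llm.complete(prompt);
});
```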
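
The snippet above relies on `retryWithBackoff` without defining it. One possible shape, as a sketch: each attempt gets a timeout so a stalled call counts as a failure, and the wait between attempts doubles. The option names and defaults are assumptions, not taken from any particular library.

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects if `work` does not settle within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Retries fn with exponential backoff; rethrows the last error when all attempts fail.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  opts = { attempts: 3, baseDelayMs: 200, timeoutMs: 10_000 },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.attempts; attempt++) {
    try {
      return await withTimeout(fn(), opts.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < opts.attempts - 1) {
        await sleep(opts.baseDelayMs * 2 ** attempt);   // 200ms, 400ms, ...
      }
    }
  }
  throw lastError;
}
```

The timeout only stops waiting; it does not cancel the underlying request. A production version would likely pass an `AbortSignal` to the client and add jitter to the backoff.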

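### Analyzing results
```
After the experiment:
- Did the system return to steady state?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What did we learn?

→ Document. Fix bugs. Re-run.
```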
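
Recovery time can be computed from the metric samples collected after the fault is removed. A minimal sketch, assuming samples are sorted by time; `Sample` and `recoveryTimeSeconds` are hypothetical names, and "recovered" is taken to mean healthy with no later relapse.

```ts
interface Sample {
  t: number;          // seconds since the fault was removed
  errorRate: number;  // fraction
  p99Ms: number;
}

// Returns the time of the first sample from which every later sample is healthy,
// or null if the system had not stably recovered by the end of the series.
function recoveryTimeSeconds(
  samples: Sample[],
  isHealthy: (s: Sample) => boolean,
): number | null {
  let recoveredAt: number | null = null;
  for (const s of samples) {
    if (isHealthy(s)) {
      if (recoveredAt === null) recoveredAt = s.t;
    } else {
      recoveredAt = null;   // relapse: recovery has to start over
    }
  }
  return recoveredAt;
}
```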

### Postmortem-style write-up (for your own experiments)
```markdown
## Chaos: 2026-05-09 Redis kill

Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.

Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timeout)
- Recovery: 30 sec after Redis up

Action items:
- [ ] Increase DB query timeout (5s → 10s)
- [ ] Raise the connection pool max
- [ ] Re-test
```

### When to start
```
Prerequisites:
- Monitoring (RED metrics)
- Alerting (PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team able to operate the system

→ Without these, a chaos experiment turns into a real incident.
```
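
The checklist can act as a gate in whatever script launches experiments. A small sketch; `Readiness` and `missingPrerequisites` are hypothetical names, and the flags would be filled in by the team rather than detected automatically.

```ts
// One flag per prerequisite from the list above.
interface Readiness {
  monitoring: boolean;
  alerting: boolean;
  highAvailability: boolean;
  runbook: boolean;
  teamCanOperate: boolean;
}

// Returns the names of unmet prerequisites; an empty list means it is safe to start.
function missingPrerequisites(readiness: Readiness): string[] {
  return (Object.keys(readiness) as (keyof Readiness)[]).filter((key) => !readiness[key]);
}
```

A launcher could refuse to run when the returned list is non-empty and print it as the reason.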

### Adoption story
- **Netflix**: Simian Army (2011+).
- **Amazon**: GameDay (started by Jesse Robbins in the early 2000s).
- **Slack**: Disasterpiece Theater (recurring exercises).
- **LinkedIn**: WaterBear.
- **Gremlin (company)**: Failure-as-a-Service.

## 🤔 Decision guide
| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual experiments |
| K8s | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | One dedicated day per session |
| App-level | In-app middleware |

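## ❌ Anti-patterns
- **First experiment in prod**: you cause an incident.
- **No steady-state definition**: you cannot interpret the results.
- **Large blast radius**: a real incident.
- **No auto-abort**: the chaos becomes the incident.
- **No documentation**: nothing is learned.
- **No monitoring**: no results to analyze.
- **System without HA**: chaos simply breaks it.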

## 🤖 Hints for LLM use
- Chaos = hypothesis-driven experiments.
- Prerequisites: monitoring + HA + runbook.
- Expand the blast radius gradually.
- Auto-abort and a steady-state definition are mandatory.

## 🔗 Related documents
- [[Backend_Circuit_Breaker]]
- [[Productivity_Postmortem]]
- [[DevOps_Disaster_Recovery]]