---
id: testing-chaos-engineering
title: Chaos Engineering - intentional fault injection
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [testing, chaos, resilience, vibe-coding]
tech_stack: { language: "any", applicable_to: ["Backend", "DevOps"] }
applied_in: []
aliases: [chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus]
---
# Chaos Engineering
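> Prepare systems for the real failures they will face in prod by **intentionally injecting faults (kill, slow, error) during normal operation**. The idea comes from Netflix's Simian Army. Tools: Litmus / Chaos Mesh / Gremlin.
## 📖 Core concepts
- "Hope is not a strategy".
- Failures will happen, so verify the response during normal operation.
- Hypothesis-driven ("if X dies, Y should happen").
- Start with a small blast radius (staging → small slice of prod).
## 💻 Code patterns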
### The Chaos Monkey idea
```
"Kill a random EC2 instance."
→ Assumption: HA works.
→ What gets verified: is the service still OK after one node goes down?
Built by Netflix (2010).
```
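### Hypothesis-driven experiment
```
"Hypothesis: if the Redis cache dies, p99 latency rises from 200ms but stays within 500ms (degraded but OK)."
Experiment:
1. Measure the steady-state metrics (normal)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the results
```
→ Result matches the expectation = OK. Result differs = a bug to fix.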
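The comparison in step 5 can be made mechanical. The following is a minimal sketch, assuming you already collect latency samples during the fault window; the `Hypothesis` and `ExperimentResult` shapes and the `evaluateHypothesis` helper are hypothetical names for illustration, not part of any chaos tool.

```typescript
// Hypothetical shapes for illustration; not part of any chaos tool's API.
interface Hypothesis {
  description: string;
  metric: string;         // e.g. "p99_latency_ms"
  maxDuringFault: number; // upper bound that still counts as "degraded but OK"
}

interface ExperimentResult {
  hypothesis: Hypothesis;
  baseline: number;
  worstObserved: number;
  holds: boolean;
}

// Compare the worst sample seen during the fault against the hypothesis bound.
function evaluateHypothesis(
  hypothesis: Hypothesis,
  baseline: number,
  samplesDuringFault: number[],
): ExperimentResult {
  if (samplesDuringFault.length === 0) {
    throw new Error('no samples collected during the fault window');
  }
  const worstObserved = Math.max(...samplesDuringFault);
  return {
    hypothesis,
    baseline,
    worstObserved,
    holds: worstObserved <= hypothesis.maxDuringFault,
  };
}

// The Redis hypothesis from the note: p99 may rise from 200ms but must stay within 500ms.
const redisDown: Hypothesis = {
  description: 'Redis down keeps p99 latency within 500ms',
  metric: 'p99_latency_ms',
  maxDuringFault: 500,
};

const redisResult = evaluateHypothesis(redisDown, 200, [310, 420, 480]);
console.log(redisResult.holds ? 'hypothesis holds' : 'hypothesis violated: file a bug');
```

Using the worst sample is a deliberately strict choice; a percentile over the window would be more forgiving of single outliers.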
### Chaos Mesh (K8s)
```yaml
# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-redis
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: redis
  # `scheduler` is the Chaos Mesh 1.x field; 2.x uses a separate `Schedule` resource instead
  scheduler:
    cron: '@every 1h'
```
```bash
kubectl apply -f pod-kill.yaml
```
→ Kills one random Redis pod every hour.
### Network latency
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'
```
→ Network traffic for one postgres pod gets about 500ms of extra latency for 5 minutes.
### Network partition
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-redis-api   # example name
spec:
  action: partition
  mode: all
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all
    selector:
      labelSelectors: { app: api }
```
→ The API cannot talk to Redis.
### CPU stress
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api   # example name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }
  duration: '10m'
```
### Memory leak
```yaml
# Replaces the `stressors` section of the StressChaos spec above
stressors:
  memory:
    workers: 1
    size: '1GB'
```
### IO chaos (disk)
```yaml
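apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: slow-disk-postgres   # example name
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  volumePath: '/var/lib/postgresql'   # required: mount point of the target volume (assumed here)
  path: '/var/lib/postgresql/**/*'    # files affected inside that volume
  delay: '500ms'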
```
### HTTP chaos (tampered requests and responses)
```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: slow-http-api   # example name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  target: Request
  port: 80
  delay: '5s'
  abort: false
```
→ Delays each HTTP request by 5 seconds.
### Litmus Chaos
```yaml
# litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  # In practice a chaosServiceAccount with the required RBAC is also set here
  appinfo:
    appns: production
    applabel: 'app=api'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
```
→ Deletes a pod every 10 seconds for 60 seconds.
### Gremlin (managed SaaS)
```bash
# CLI sketch; exact flags vary by Gremlin version, so check `gremlin help attack` first
gremlin attack-container "$CONTAINER_ID" cpu --length 60 --percent 80
gremlin attack latency --length 60 --ms 500
```
→ UI + scenario library.
### Application-level chaos
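```ts
// Toxiproxy, driven through its HTTP API (default port 8474) instead of a client library
const TOXIPROXY = 'http://127.0.0.1:8474';
const post = (path: string, body: unknown) =>
  fetch(`${TOXIPROXY}${path}`, { method: 'POST', body: JSON.stringify(body) });
// Route DB connections through the proxy: the app connects to 127.0.0.1:5433
await post('/proxies', {
  name: 'postgres',
  listen: '127.0.0.1:5433',
  upstream: 'postgres:5432',
});
// Add 500ms of latency to data coming back from the DB
await post('/proxies/postgres/toxics', {
  type: 'latency',
  stream: 'downstream',
  attributes: { latency: 500 },
});
```
→ Fine-grained control.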
### Game day
```
A full day devoted to chaos.
- Plan: 5 hypotheses
- Execute: one experiment per hour
- Observe: in real time
- Document: postmortem style
→ The team learns how to operate during incidents.
```
### Steady state
```
Define "what is normal?"
Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200ms
- Active user count
→ Measure the baseline before starting the experiment.
Compare measurements during the experiment against the baseline.
```
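The baseline comparison can be sketched as a pair of small functions. This is an illustrative sketch only: the `MetricSnapshot` shape, `isSteady`, and `relativeChange` are hypothetical names, and the sample numbers are made up; the thresholds are the ones listed above.

```typescript
// Hypothetical metric snapshot; field names are illustrative.
interface MetricSnapshot {
  requestsPerMinute: number;
  errorRate: number;    // fraction, so 0.001 means 0.1%
  p99LatencyMs: number;
}

// Steady-state thresholds from the list above: error rate < 0.1%, p99 < 200ms.
function isSteady(m: MetricSnapshot): boolean {
  return m.errorRate < 0.001 && m.p99LatencyMs < 200;
}

// Relative change of each metric versus the baseline (0.5 means +50%).
function relativeChange(before: MetricSnapshot, after: MetricSnapshot): MetricSnapshot {
  const change = (b: number, a: number): number =>
    b === 0 ? (a === 0 ? 0 : Infinity) : (a - b) / b;
  return {
    requestsPerMinute: change(before.requestsPerMinute, after.requestsPerMinute),
    errorRate: change(before.errorRate, after.errorRate),
    p99LatencyMs: change(before.p99LatencyMs, after.p99LatencyMs),
  };
}

// Made-up numbers: a baseline taken before the experiment and a snapshot during it.
const baselineSnapshot: MetricSnapshot = { requestsPerMinute: 1200, errorRate: 0.0005, p99LatencyMs: 150 };
const faultSnapshot: MetricSnapshot = { requestsPerMinute: 900, errorRate: 0.003, p99LatencyMs: 450 };

console.log('steady before:', isSteady(baselineSnapshot));
console.log('steady during fault:', isSteady(faultSnapshot));
console.log('change vs baseline:', relativeChange(baselineSnapshot, faultSnapshot));
```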
### Blast radius
```
Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod
→ Verify at every stage before moving on.
```
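The staged expansion can be encoded so that promotion only happens after a stage has been verified. A minimal sketch follows; the stage names and the `nextStageIndex` helper are hypothetical, and only the percentages come from the list above.

```typescript
// Stages mirror the list above; the names are illustrative.
interface Stage {
  label: string;
  prodTrafficPercent: number;
}

const stages: Stage[] = [
  { label: 'dev', prodTrafficPercent: 0 },
  { label: 'staging', prodTrafficPercent: 0 },
  { label: 'prod-canary', prodTrafficPercent: 1 },
  { label: 'prod-partial', prodTrafficPercent: 10 },
  { label: 'prod-full', prodTrafficPercent: 100 },
];

// Promote to the next stage only when the current one was verified.
// On a failed verification, stay put so the bug can be fixed and the stage re-run.
function nextStageIndex(current: number, verified: boolean): number {
  if (current < 0 || current >= stages.length) {
    throw new RangeError(`unknown stage index ${current}`);
  }
  if (!verified) return current;
  return Math.min(current + 1, stages.length - 1);
}

const afterStaging = nextStageIndex(1, true);
console.log(`next: ${stages[afterStaging].label} (${stages[afterStaging].prodTrafficPercent}% of prod)`);
```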
### Auto-rollback
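```ts
// Run a chaos experiment and abort as soon as the guard condition trips.
// `killPod` and `errorRate` are placeholders for your own tooling.
async function chaosWithGuard(experiment, abortIf) {
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor); // stop polling once we abort
      experiment.stop();
      console.warn('chaos aborted');
    }
  }, 5000);
  try {
    await experiment.run();
  } finally {
    clearInterval(monitor); // always stop polling, even if run() throws
  }
}
await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1,
);
```
→ Error rate > 10% = stop immediately.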
### Common experiments
```
1. Kill one instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB primary: verifies failover
4. Cache down: does the DB hold up?
5. Slow network: do timeouts and retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: alert and graceful degradation?
8. Dependency down: does the fallback work?
```
### Failure injection in code
```ts
// Chaos middleware (test environments only)
app.use((req, res, next) => {
  // Fail 1% of requests with a 503 when the CHAOS env var is set
  if (process.env.CHAOS && Math.random() < 0.01) {
    return res.status(503).end();
  }
  next();
});
```
→ In staging, chaos can run continuously.
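The middleware above hard-codes both the switch and the rate, which makes it hard to test. One possible refactor, sketched here under assumptions, pulls the decision into a pure function with an injectable random source. `ChaosConfig`, `shouldInjectFault`, `chaosConfigFromEnv`, and the `CHAOS_FAILURE_RATE` variable are hypothetical names; unlike the original, this sketch enables chaos only when `CHAOS` is exactly `'true'`.

```typescript
// Hypothetical config shape for illustration.
interface ChaosConfig {
  enabled: boolean;
  failureRate: number; // fraction of requests to fail, 0..1
}

// Decide whether to fail this request. The random source is injectable
// so tests can be deterministic.
function shouldInjectFault(config: ChaosConfig, random: () => number = Math.random): boolean {
  if (!config.enabled) return false;
  return random() < config.failureRate;
}

// Read the config from environment variables. CHAOS_FAILURE_RATE is an assumed
// variable name; invalid or out-of-range values fall back to 1%.
function chaosConfigFromEnv(env: Record<string, string | undefined>): ChaosConfig {
  const parsed = Number(env.CHAOS_FAILURE_RATE ?? '0.01');
  const failureRate = Number.isFinite(parsed) && parsed >= 0 && parsed <= 1 ? parsed : 0.01;
  return { enabled: env.CHAOS === 'true', failureRate };
}

const stagingConfig = chaosConfigFromEnv({ CHAOS: 'true', CHAOS_FAILURE_RATE: '0.05' });
console.log(stagingConfig);
```

Inside the middleware, the condition would then become `shouldInjectFault(config)`, with `config` built once at startup.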
### Service mesh chaos (Istio)
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: backend-faults   # example name
spec:
  hosts: [backend]
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }
```
→ 10% of requests get a 5s delay; 1% are aborted with HTTP 500.
### LLM chaos
```ts
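// LLM APIs are unreliable, so wrap calls in retry with backoff.
// Chaos: 5% of calls hang for 60s first, to exercise timeout and retry handling.
// (`retryWithBackoff`, `llm`, and `sleep` are assumed helpers.)
const r = await retryWithBackoff(async () => {
  if (Math.random() < 0.05) await sleep(60_000); // simulated API timeout
  return llm.complete(prompt);
});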
```
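The snippet above relies on a `retryWithBackoff` helper that the note does not define. One possible shape is sketched below, assuming exponential backoff and a per-attempt timeout; the option names and defaults are illustrative, not taken from any library.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reject if `work` does not settle within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // do not leave the timer running after the race settles
  }
}

// Retry `fn` with exponential backoff: baseDelayMs, then 2x, 4x, ...
// Each attempt is bounded by timeoutMs, so a hung call counts as a failure.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  { retries = 3, baseDelayMs = 100, timeoutMs = 1000 } = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

The timeout only stops waiting; it does not cancel the underlying request. Real cancellation would need something like an `AbortSignal` passed into the client call.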
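### Analyzing results
```
After the experiment:
- Did the steady state recover?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What did we learn?
→ Document. Fix bugs. Re-run.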
```
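The "how long did recovery take?" question can be answered from the metric samples collected after the fault is removed. The following is a rough sketch, assuming error-rate samples with timestamps relative to the restore; `Sample` and `recoveryTimeSeconds` are hypothetical names.

```typescript
// One measurement taken after the fault was removed.
interface Sample {
  secondsSinceRestore: number;
  errorRate: number; // fraction, so 0.001 means 0.1%
}

// Return the time of the first sample from which the error rate stays below
// the threshold for the rest of the window, or null if it never settles.
// Samples are assumed to be ordered by time.
function recoveryTimeSeconds(samples: Sample[], threshold: number): number | null {
  let recoveredAt: number | null = null;
  for (const sample of samples) {
    if (sample.errorRate < threshold) {
      if (recoveredAt === null) recoveredAt = sample.secondsSinceRestore;
    } else {
      recoveredAt = null; // a relapse resets the clock
    }
  }
  return recoveredAt;
}

// Made-up samples: errors subside 20 seconds after the restore.
const afterRestore: Sample[] = [
  { secondsSinceRestore: 0, errorRate: 0.003 },
  { secondsSinceRestore: 10, errorRate: 0.002 },
  { secondsSinceRestore: 20, errorRate: 0.0005 },
  { secondsSinceRestore: 30, errorRate: 0.0004 },
];
console.log('recovered after (s):', recoveryTimeSeconds(afterRestore, 0.001));
```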
### Postmortem-style write-up (for your own experiments)
```markdown
## Chaos: 2026-05-09 Redis kill
Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.
Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timeout)
- Recovery: 30 sec after Redis up
Action items:
- [ ] Increase DB query timeout (5s → 10s)
- [ ] Raise the connection pool max
- [ ] Re-test
```
### When to start
```
Prerequisites:
- Monitoring (RED metrics)
- Alerting (PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team able to operate the system
→ Without these, chaos turns into a real incident.
```
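The prerequisite list can double as a gate that an experiment runner checks before doing anything destructive. A minimal sketch, assuming the team records readiness as simple booleans; the `Readiness` shape and both helpers are hypothetical.

```typescript
// Hypothetical readiness checklist mirroring the prerequisites above.
interface Readiness {
  monitoring: boolean;
  alerting: boolean;
  highAvailability: boolean;
  runbook: boolean;
  teamCanOperate: boolean;
}

// Return the names of every prerequisite that is still missing.
function missingPrerequisites(readiness: Readiness): string[] {
  return (Object.keys(readiness) as (keyof Readiness)[]).filter((key) => !readiness[key]);
}

// Refuse to start an experiment while anything is missing.
function assertReadyForChaos(readiness: Readiness): void {
  const missing = missingPrerequisites(readiness);
  if (missing.length > 0) {
    throw new Error(`not ready for chaos, missing: ${missing.join(', ')}`);
  }
}

const currentReadiness: Readiness = {
  monitoring: true,
  alerting: true,
  highAvailability: false,
  runbook: false,
  teamCanOperate: true,
};
console.log('missing:', missingPrerequisites(currentReadiness));
```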
### Adoption stories
- **Netflix**: Simian Army (2011+).
- **Amazon**: GameDay (started by Jesse Robbins).
- **Slack**: Disasterpiece Theater (recurring exercises).
- **LinkedIn**: Waterbear.
- **Gremlin (company)**: Failure-as-a-Service.
## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual |
| K8s | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | One dedicated day at a time |
| App-level | In-app middleware |
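## ❌ Anti-patterns
- **Running the first experiment in prod**: causes an incident.
- **No steady-state definition**: the results cannot be interpreted.
- **Large blast radius**: becomes a real incident.
- **No auto-abort**: chaos escalates into an incident.
- **No documentation**: nothing is learned.
- **No monitoring**: there are no results to observe.
- **System without HA**: chaos simply breaks it.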
## 🤖 Hints for LLM use
- Chaos = hypothesis-driven experiments.
- Prerequisites: monitoring + HA + runbook.
- Expand the blast radius gradually.
- Auto-abort and a steady-state definition are required.
## 🔗 Related documents
- [[Backend_Circuit_Breaker]]
- [[Productivity_Postmortem]]
- [[DevOps_Disaster_Recovery]]