---
id: testing-chaos-engineering
title: "Chaos Engineering: deliberate fault injection"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [testing, chaos, resilience, vibe-coding]
tech_stack: { language: "any", applicable_to: ["Backend", "DevOps"] }
applied_in: []
aliases: [chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus]
---

# Chaos Engineering

> Prepare the system for real production failures by **injecting faults deliberately (kill, slow, error) during normal operation**. The idea comes from Netflix's Simian Army. Tools: Litmus / Chaos Mesh / Gremlin.

## 📖 Core concepts

- "Hope is not a strategy".
- Failures will happen, so verify the response during normal operation.
- Hypothesis-driven (if X is killed, Y must happen).
- Start with a small blast radius (staging → small slice of prod).

## 💻 Code patterns

### The Chaos Monkey idea

```
"Kill a random EC2 instance."
→ Assumption: HA works.
→ Actual check: is the service still OK after 1 node goes down?

Built by Netflix (2010).
```

### Hypothesis-based experiment

```
"Hypothesis: if the Redis cache dies, p99 latency rises from 200ms
 but stays within 500ms (degraded but OK)."

Experiment:
1. Measure steady-state metrics (normal operation)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the results
```

→ Result matches the expectation = OK. Result differs = fix the bug.

### Chaos Mesh (K8s)

```yaml
# pod-kill.yaml
# Note: the inline `scheduler` field is Chaos Mesh 1.x syntax;
# 2.x moved recurring runs to a separate `Schedule` resource.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-redis
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: redis
  scheduler:
    cron: '@every 1h'
```

```bash
kubectl apply -f pod-kill.yaml
```

→ Kills one random Redis pod every hour.

### Network latency

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'
```

→ Adds about 500 ms (±100 ms jitter) of network delay on one Postgres pod for 5 minutes.

### Network partition

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-redis-api   # illustrative name
spec:
  action: partition
  mode: all                   # `mode` is required on both sides
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all
    selector:
      labelSelectors: { app: api }
```

→ The API cannot reach Redis.

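
The hypothesis-based flow described earlier (measure the steady state, inject the fault, measure again, compare) applies to the partition experiment as well. The comparison step could be sketched as below. The `Metrics`, `Hypothesis`, and `Verdict` shapes and the `evaluateHypothesis` helper are hypothetical names introduced for illustration, not part of Chaos Mesh or any other tool, and the sample numbers are only example values.

```typescript
// Hypothetical metric snapshot; field names are assumptions for illustration.
interface Metrics {
  p99LatencyMs: number;
  errorRate: number; // fraction, e.g. 0.001 = 0.1%
}

// Upper bounds the system is expected to stay within while the fault is active.
interface Hypothesis {
  maxP99LatencyMs: number;
  maxErrorRate: number;
}

interface Verdict {
  holds: boolean;
  violations: string[];
}

// Compare metrics observed during the fault against the hypothesis bounds.
// The baseline is only used to make the violation messages readable.
function evaluateHypothesis(baseline: Metrics, observed: Metrics, h: Hypothesis): Verdict {
  const violations: string[] = [];
  if (observed.p99LatencyMs > h.maxP99LatencyMs) {
    violations.push(
      `p99 ${baseline.p99LatencyMs}ms -> ${observed.p99LatencyMs}ms exceeds ${h.maxP99LatencyMs}ms`,
    );
  }
  if (observed.errorRate > h.maxErrorRate) {
    violations.push(
      `error rate ${baseline.errorRate} -> ${observed.errorRate} exceeds ${h.maxErrorRate}`,
    );
  }
  return { holds: violations.length === 0, violations };
}

// Example: p99 may rise from 200 ms but must stay under 500 ms, errors under 0.1%.
const baseline: Metrics = { p99LatencyMs: 200, errorRate: 0 };
const observed: Metrics = { p99LatencyMs: 1500, errorRate: 0.003 };
const verdict = evaluateHypothesis(baseline, observed, {
  maxP99LatencyMs: 500,
  maxErrorRate: 0.001,
});
console.log(verdict.holds ? "hypothesis holds" : verdict.violations.join("\n"));
```

Keeping the bounds in a separate `Hypothesis` object means the expectation is written down before the fault is injected, so the result cannot be rationalized afterwards.
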
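### CPU stress

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api   # illustrative name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }
  duration: '10m'
```

### Memory stress (simulated leak)

```yaml
# Replaces the `stressors` section of the StressChaos above
stressors:
  memory:
    workers: 1
    size: '1GB'
```

### IO chaos (disk)

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: slow-disk   # illustrative name
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  # volumePath (the volume mount point inside the container) is required;
  # the value below is assumed to match the Postgres data volume.
  volumePath: '/var/lib/postgresql'
  path: '/var/lib/postgresql/**/*'
  delay: '500ms'
```

### HTTP chaos (delayed or faked responses)

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: slow-http   # illustrative name
spec:
  mode: one
  target: Request
  port: 80
  delay: '5s'
  abort: false
  selector: { labelSelectors: { app: api } }
```

→ HTTP requests are delayed by 5 seconds.

### Litmus Chaos

```yaml
# Litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  engineState: active
  chaosServiceAccount: pod-delete-sa   # assumed service account name
  appinfo:
    appns: production
    applabel: 'app=api'
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
```

→ Kills a pod every 10 seconds for 60 seconds.

### Gremlin (managed SaaS)

```bash
# CLI (illustrative: flag names differ between Gremlin CLI versions,
# so verify them against the CLI help before running)
gremlin attack-container <container-id> cpu --length 60 --percent 80
gremlin attack latency --length 60 --ms 500
```

→ UI + scenario library.

### Application-level chaos

```ts
// Toxiproxy or in-app injection.
// Assumes the toxiproxy-node-client package and a Toxiproxy server on
// localhost:8474; check the client README for the exact API of your version.
import { Toxiproxy } from 'toxiproxy-node-client';

const toxiproxy = new Toxiproxy('http://localhost:8474');

// The DB connection goes through Toxiproxy (127.0.0.1:5433 instead of postgres:5432)
const proxy = await toxiproxy.createProxy({
  name: 'postgres',
  listen: '127.0.0.1:5433',
  upstream: 'postgres:5432',
});

// Add 500 ms of latency
await proxy.addToxic({
  type: 'latency',
  attributes: { latency: 500, jitter: 0 },
});
```

→ Fine-grained control.

### Game day

```
One day dedicated to chaos.
- Plan: 5 hypotheses
- Execute: 1 experiment per hour
- Observe: in real time
- Document: postmortem style

→ The team practices incident response.
```

### Steady state

```
Define "what is normal?".

Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200ms
- Active user count

→ Measure the baseline before the experiment starts.
Compare measurements during the experiment against the baseline.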
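```

### Blast radius

```
Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod

→ Verify at each stage before moving on.
```

### Auto-rollback

```ts
// Run a chaos experiment and abort it when the guard condition trips.
// `killPod`, `errorRate`, and `log` are application helpers not shown here.
interface Experiment {
  run(): Promise<void>;
  stop(): void;
}

async function chaosWithGuard(experiment: Experiment, abortIf: () => boolean) {
  // Poll the guard every 5 s while the experiment runs
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor); // abort only once
      experiment.stop();
      log.warn('chaos aborted');
    }
  }, 5000);
  try {
    await experiment.run();
  } finally {
    clearInterval(monitor); // stop polling even if run() throws
  }
}

await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1,
);
```

→ Error rate > 10% = the experiment stops within one polling interval (5 s).

### Common experiments

```
1. Kill 1 instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB master: verifies failover
4. Cache down: does the DB hold up?
5. Slow network: do timeouts / retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: alert + graceful degradation?
8. Dependency down: is there a fallback?
```

### Failure injection in code

```ts
import express from 'express';

const app = express();

// Chaos middleware (test environments only): enabled with CHAOS=true
app.use((req, res, next) => {
  if (process.env.CHAOS === 'true' && Math.random() < 0.01) {
    return res.status(503).end(); // 1% of requests get a random 503
  }
  next();
});
```

→ Chaos runs continuously in staging.

### Service mesh chaos (Istio)

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: backend-fault   # illustrative name
spec:
  hosts: [backend]
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }
```

→ 10% of requests get a 5 s delay, 1% get a 500.

### LLM chaos

```ts
// The LLM API is unreliable, so calls go through retry with backoff.
// `llm`, `retryWithBackoff`, and `sleep` are application helpers not shown here.
async function flakyComplete(prompt: string) {
  // Chaos: 5% of calls simulate an API timeout (60 s stall)
  if (Math.random() < 0.05) await sleep(60_000);
  return llm.complete(prompt);
}

const r = await retryWithBackoff(() => flakyComplete(prompt));
```

### Result analysis

```
After the experiment:
- Did the steady state recover?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What was learned?

→ Document. Fix bugs. Re-run.
```

### Postmortem style (self-review)

```markdown
## Chaos: 2026-05-09 Redis kill

Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.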

Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timeout)
- Recovery: 30 sec after Redis up

Action items:
- [ ] Raise the DB query timeout (5s → 10s)
- [ ] Increase the connection pool max
- [ ] Re-test
```

### When to start

```
Prerequisites:
- Monitoring (RED metrics: rate, errors, duration)
- Alerting (PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team able to operate the system

→ Without these, chaos becomes a real incident.
```

### Adoption story

- **Netflix**: Simian Army (2011+).
- **Amazon**: GameDay (started by Jesse Robbins).
- **Slack**: Disasterpiece Theater (recurring exercises).
- **LinkedIn**: WaterBear.
- **Gremlin (company)**: Failure-as-a-Service.

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual |
| K8s | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | One dedicated day per session |
| App-level | In-app middleware |

## ❌ Anti-patterns

- **First experiment in prod**: becomes an incident.
- **No steady-state definition**: the results cannot be interpreted.
- **Large blast radius**: a real incident.
- **No auto-abort**: chaos turns into an incident.
- **No documentation**: nothing is learned.
- **No monitoring**: there are no results to analyze.
- **System without HA**: chaos simply breaks it.

## 🤖 LLM usage hints

- Chaos = hypothesis-based experiments.
- Prerequisites: monitoring + HA + runbook.
- Expand the blast radius gradually.
- Auto-abort and a steady-state definition are mandatory.

## 🔗 Related documents

- [[Backend_Circuit_Breaker]]
- [[Productivity_Postmortem]]
- [[DevOps_Disaster_Recovery]]