---
id: wiki-2026-0508-intentional-failure-induction
title: Intentional Failure Induction
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Chaos Engineering, Fault Injection, Chaos Monkey, Game Day]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [chaos-engineering, sre, resilience, fault-injection, observability]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: yaml
  framework: chaos-mesh
---

# Intentional Failure Induction

## In one line

> **"Every system pretends it cannot break until we break it."** Intentional failure injection began with Netflix's Chaos Monkey (2011) and has evolved into a core SRE tool: controlled failures in production expose hidden coupling and fragile assumptions.

## Core

### Principles (Principles of Chaos)

- **Hypothesis first**: state an explicit hypothesis such as "the SLO holds even if X dies".
- **Production-like**: prefer production (or a production mirror) over staging.
- **Minimize blast radius**: start small, then expand.
- **Automate**: run chaos continuously, not as a one-off.
- **Stop button**: every experiment must be abortable immediately.

### Four steps

1. **Define steady state**: SLI/SLO baselines (latency p99, error rate).
2. **Hypothesis**: steady state holds under failure X.
3. **Inject**: inject the real failure.
4. **Verify / Learn**: test the hypothesis and review the findings.

### Tool ecosystem (2026)

- **Chaos Mesh** (CNCF project): Kubernetes-native.
- **LitmusChaos** (CNCF project): GitOps-friendly.
- **Gremlin**: commercial SaaS.
- **AWS FIS** (Fault Injection Service, formerly Fault Injection Simulator): AWS-native.
- **Azure Chaos Studio**: Azure-native.
- **Steadybit**: integrated platform with OTel integration.

### Applications

1. Verifying cell isolation in a cell-based architecture.
2. Multi-region failover drills (Game Day).
3. Verifying application behavior under database replica lag.
4. Robustness of ML pipelines to missing or delayed data.

## 💻 Patterns

### 1. Chaos Mesh: Pod kill

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-checkout
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [shop]
    labelSelectors:
      app: checkout-service
```

This manifest kills one pod once. Chaos Mesh 2.x dropped the inline `scheduler` field, so recurrence (for example every 10 minutes) is expressed with a `Schedule` object, as in pattern 7.

### 2. Chaos Mesh: Network latency

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency
spec:
  action: delay
  mode: all
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"
```

### 3. AWS FIS: EC2 termination

```json
{
  "description": "Terminate 1 EC2 in prod ASG",
  "targets": {
    "PaymentInstances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": { "Service": "payment", "Env": "prod" }
    }
  },
  "actions": {
    "TerminateOne": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "PaymentInstances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:...:high-error-rate" }
  ]
}
```

A complete template also requires a `roleArn` for the IAM role that FIS assumes; it is omitted here.

### 4. Application-level fault injection (Go)

```go
package chaos

import (
	"math/rand"
	"net/http"
	"time"
)

// ChaosMiddleware is custom middleware that fails about 0.1% of requests
// with a 500 after a 2-second delay; all other requests pass through.
func ChaosMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < 0.001 { // 0.1% failure
			time.Sleep(2 * time.Second)
			http.Error(w, "chaos", http.StatusInternalServerError)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

### 5. Toxiproxy: TCP-level latency and bandwidth limits

```bash
# proxy DB through toxiproxy
toxiproxy-cli create -l 0.0.0.0:5433 -u postgres:5432 db
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 db   # milliseconds
toxiproxy-cli toxic add -t bandwidth -a rate=1024 db                 # rate is in KB/s, so about 1 MB/s
```

### 6. Game Day runbook (Markdown)

```markdown
## Game Day: Region us-east-1 outage
- Hypothesis: failover to us-west-2 within 90s, no data loss.
- Steady state: p99 < 300ms, 5xx < 0.1%.
- Inject: aws fis start-experiment --experiment-template-id EXTxxx
- Stop condition: 5xx > 1% for > 60s.
- Owner: @sre-oncall · Comms: #incident-game-day
```

### 7. Continuous chaos (cron)

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata: { name: weekly-chaos }
spec:
  schedule: "0 14 * * MON" # every Monday at 14:00
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: random-max-percent
    value: "10"
    selector: { labelSelectors: { tier: stateless } }
    duration: "30s"
```

### 8. SLO-aware halt (Go + Prometheus)

```go
// promQuery is an assumed helper returning one float from an instant query.
func shouldHaltChaos(ctx context.Context) bool {
	val, err := promQuery(ctx, `sum(rate(http_5xx[1m])) / sum(rate(http_total[1m]))`)
	if err != nil {
		return true // fail safe: halt when the SLI cannot be read
	}
	return val > 0.01 // halt once the 5xx ratio exceeds 1%
}
```

### 9. ML pipeline chaos (Python)

```python
import random


class FeatureMissing(KeyError):
    """Raised when chaos drops a feature lookup."""


# inject missing features randomly to test fallback
class ChaosFeatureStore:
    def __init__(self, real, p=0.001):
        self.real, self.p = real, p

    def get(self, key):
        if random.random() < self.p:
            raise FeatureMissing(key)
        return self.real.get(key)
```

### 10. eBPF-based syscall fault (chaos-mesh KernelChaos)

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata: { name: mount-fail-api } # hypothetical name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  failKernRequest:
    callchain:
      - { funcname: "__x64_sys_mount" }
    failtype: 0
    headers: ["linux/mount.h"]
    probability: 100
    times: 1
```

## Decision criteria

| Situation | Approach |
|---|---|
| New service | Start in staging, then move gradually to prod |
| Many stateless services | Start with pod kill / network chaos |
| Stateful (DB) | Start with replica failover / disk fill |
| Multi-region | Quarterly Game Day |
| Regulated industry | Tabletop, then an isolated cell, then prod |

**Default**: start with a hypothesis, an SLO halt condition, and a 1% blast radius; on each pass, widen in stages to 10%, 50%, and 100%.

## 🔗 Graph

- Parent: [[SRE]] · [[Resilience-Engineering]]
- Variant: [[Game-Day]]
- Application: [[Cell-Architecture]]
- Adjacent: [[Observability]] · [[SLO]] · [[Postmortem]]

## 🤖 LLM usage

**When to use**: designing hypotheses and experiments, drafting runbooks, generating FIS/Chaos Mesh YAML, and deriving action items from result analysis.

**When not to**: the actual prod injection must be reviewed and approved by a human. LLMs must never execute it automatically.

## ❌ Anti-patterns

- **Running without a steady state**: there is no way to measure what broke.
- **Unattended chaos without a halt condition**: it escalates into a real incident.
- **Staging only**: coupling that exists only in prod is never found.
- **Experiments without a retrospective**: nothing is learned. An experiment cycle ends only when the results are written up.
- **Too large a first experiment**: starting by taking a whole region down is a real outage, not an experiment.
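
The first two anti-patterns and the staged default (1% → 10% → 50% → 100%) can be enforced in code rather than by convention. The following is a minimal sketch, not a reference implementation: `Experiment`, `run`, and the `error_rate`, `inject`, and `rollback` callables are hypothetical names introduced here for illustration, and the thresholds simply mirror the 0.1% steady-state and 1% halt figures used in the Game Day runbook.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Experiment:
    """A chaos experiment that cannot run without a baseline and a halt rule."""
    hypothesis: str
    error_rate: Callable[[], float]   # hypothetical SLI probe, returns a ratio in [0, 1]
    inject: Callable[[int], None]     # hypothetical injector, takes blast radius in percent
    rollback: Callable[[], None]      # hypothetical stop button
    steady_threshold: float = 0.001   # steady state: error ratio below 0.1%
    halt_threshold: float = 0.01      # halt once the error ratio exceeds 1%
    stages: Tuple[int, ...] = (1, 10, 50, 100)


def run(exp: Experiment) -> List[int]:
    """Run the stages in order and return the blast radii that passed."""
    # Refuse to inject anything if the system is not in steady state to begin with.
    if exp.error_rate() >= exp.steady_threshold:
        raise RuntimeError("not in steady state; refusing to inject")

    passed: List[int] = []
    for pct in exp.stages:
        exp.inject(pct)
        try:
            observed = exp.error_rate()
        finally:
            exp.rollback()  # always undo the fault, even if the probe itself fails
        if observed > exp.halt_threshold:
            break  # hypothesis rejected at this blast radius; do not widen further
        passed.append(pct)
    return passed
```

In this sketch the halt check runs once per stage. A real runner would more likely poll the SLI throughout the injection window and leave the final go/no-go for production to a human reviewer.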

## 🧪 Verification / Duplicates

- Verified against Principles of Chaos Engineering, the Netflix Tech Blog, and the Chaos Mesh docs (2026).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Chaos Mesh/AWS FIS patterns + Game Day runbook |