id: wiki-2026-0508-intentional-failure-induction
title: Intentional Failure Induction
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Chaos Engineering, Fault Injection, Chaos Monkey, Game Day
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: chaos-engineering, sre, resilience, fault-injection, observability
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language yaml, framework chaos-mesh

Intentional Failure Induction

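In one line

"Every system pretends it cannot break until we break it." Intentional failure injection, which began with Netflix's Chaos Monkey (2011), has evolved into a core SRE tool for uncovering hidden coupling and fragile assumptions through controlled failure in production environments.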

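Core concepts

Principles (Principles of Chaos)

  • Hypothesis first: state an explicit hypothesis such as "the SLO holds even if X dies".
  • Production-like: prefer production (or a production mirror) over staging.
  • Minimize blast radius: start small, then expand.
  • Automate: run chaos continuously, not as a one-off.
  • Stop button: the experiment must be haltable immediately.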

Four steps

  1. Define steady state: an SLI/SLO baseline (latency p99, error rate).
  2. Hypothesis: steady state holds under failure X.
  3. Inject: introduce the real failure.
  4. Verify / Learn: test the hypothesis and review the findings in a retrospective.
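
The four steps can be sketched as a small experiment runner. This is a minimal sketch under stated assumptions, not the API of any chaos tool: `SteadyState`, `run_experiment`, and the three callables passed in are hypothetical names, and the SLIs are the p99 latency and error rate mentioned in step 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]  # e.g. {"p99_ms": 180.0, "error_rate": 0.0004}


@dataclass
class SteadyState:
    """Step 1: the SLI baseline the system must stay within."""
    max_p99_ms: float
    max_error_rate: float

    def holds(self, m: Metrics) -> bool:
        return m["p99_ms"] <= self.max_p99_ms and m["error_rate"] <= self.max_error_rate


def run_experiment(
    steady: SteadyState,
    read_metrics: Callable[[], Metrics],  # hypothetical: reads current SLIs
    inject: Callable[[], None],           # hypothetical: starts the fault
    rollback: Callable[[], None],         # hypothetical: stops the fault
) -> str:
    # Never inject into a system that is already outside steady state.
    if not steady.holds(read_metrics()):
        return "aborted: steady state not met before injection"
    inject()                              # Step 3: inject the real fault
    try:
        during = read_metrics()           # observe while the fault is active
    finally:
        rollback()                        # always remove the fault
    # Step 4: verify the hypothesis (step 2) that steady state still holds.
    return "hypothesis held" if steady.holds(during) else "hypothesis refuted"
```

A refuted hypothesis is a useful result, not a failed run: it is the input to the retrospective in step 4.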

Tool ecosystem (2026)

  • Chaos Mesh (CNCF project): Kubernetes-native.
  • LitmusChaos (CNCF): GitOps-friendly.
  • Gremlin: commercial SaaS.
  • AWS FIS (Fault Injection Service, formerly Fault Injection Simulator): AWS-native.
  • Azure Chaos Studio: Azure-native.
  • Steadybit: integrated platform with OTel integration.

Applications

  1. Verifying cell isolation in a cell-based architecture.
  2. Multi-region failover drills (Game Day).
  3. Verifying application behavior under database replica lag.
  4. Robustness of ML pipelines to missing or delayed data.

💻 Patterns

1. Chaos Mesh — Pod kill

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-checkout
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [shop]
    labelSelectors:
      app: checkout-service
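  # Chaos Mesh 2.x dropped the inline `scheduler` field that 1.x accepted.
  # To repeat this kill (e.g. "@every 10m"), wrap the spec in a Schedule resource (see pattern 7).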

2. Chaos Mesh — Network latency

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency
spec:
  action: delay
  mode: all
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"

3. AWS FIS — EC2 termination

{
  "description": "Terminate 1 EC2 in prod ASG",
  "targets": {
    "PaymentInstances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": { "Service": "payment", "Env": "prod" }
    }
  },
  "actions": {
    "TerminateOne": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "PaymentInstances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:...:high-error-rate" }
  ]
}

4. Application-level fault injection (Go)

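// Custom middleware sketch. Note: github.com/Netflix/chaosmonkey terminates
// instances; it is not an HTTP middleware library.
package chaos

import (
    "math/rand"
    "net/http"
    "time"
)

// ChaosMiddleware delays and then fails roughly 0.1% of requests.
func ChaosMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if rand.Float64() < 0.001 { // 0.1% failure
            time.Sleep(2 * time.Second)
            http.Error(w, "chaos", http.StatusInternalServerError)
            return
        }
        next.ServeHTTP(w, r)
    })
}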

5. Toxiproxy: TCP-level latency and bandwidth limits

# proxy DB through toxiproxy
toxiproxy-cli create -l 0.0.0.0:5433 -u postgres:5432 db
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 db  # both values in ms
toxiproxy-cli toxic add -t bandwidth -a rate=1024 db  # rate is in KB/s: 1024 KB/s (about 1 MB/s); use rate=1 for 1 KB/s

6. Game Day runbook (Markdown)

## Game Day: Region us-east-1 outage
- Hypothesis: failover to us-west-2 within 90s, no data loss.
- Steady state: p99 < 300ms, 5xx < 0.1%.
- Inject: aws fis start-experiment --experiment-template-id EXPxxx
- Stop condition: 5xx > 1% for > 60s.
- Owner: @sre-oncall · Comms: #incident-game-day
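
The stop condition in this runbook ("5xx > 1% for > 60s") is a sustained-breach rule, not a single-sample check. A minimal sketch of that rule, assuming samples arrive as a timestamp in seconds plus a 5xx ratio; `SustainedBreach` is a hypothetical name, not part of AWS FIS or any other tool:

```python
class SustainedBreach:
    """Trips when the value stays above `threshold` for longer than `hold_s` seconds."""

    def __init__(self, threshold: float = 0.01, hold_s: float = 60.0):
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_start = None  # timestamp of the first sample in the current breach

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample; return True when the experiment should stop."""
        if value <= self.threshold:
            self._breach_start = None   # recovered: reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = ts     # breach just began
        return ts - self._breach_start > self.hold_s
```

Resetting the timer on recovery means a brief spike does not end the Game Day, while a continuous breach longer than the hold time does.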

7. Continuous chaos (cron)

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata: { name: weekly-chaos }
spec:
  schedule: "0 14 * * MON" # every Monday at 14:00
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: random-max-percent
    value: "10"
    selector: { labelSelectors: { tier: stateless } }
    duration: "30s"

8. SLO-aware halt (Go + Prometheus)

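// promQuery is a hypothetical helper that returns a scalar PromQL result.
func shouldHaltChaos(ctx context.Context) bool {
    val, err := promQuery(ctx, `sum(rate(http_5xx[1m])) / sum(rate(http_total[1m]))`)
    if err != nil {
        return true // fail closed: halt when the SLI cannot be read
    }
    return val > 0.01 // halt immediately once the 5xx ratio exceeds 1%
}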

9. ML pipeline chaos (Python)

import random


class FeatureMissing(KeyError):
    """Raised when a feature cannot be served."""


# Inject missing features at random to test the fallback path.
class ChaosFeatureStore:
    def __init__(self, real, p=0.001):
        self.real, self.p = real, p

    def get(self, key):
        if random.random() < self.p:
            raise FeatureMissing(key)
        return self.real.get(key)
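
What this wrapper exercises is the caller's fallback path. A minimal sketch of such a caller, assuming a per-feature default is an acceptable degradation; `get_with_fallback`, `DEFAULTS`, `AlwaysMissing`, and the feature name are hypothetical, and `FeatureMissing` is defined locally so the block runs on its own:

```python
class FeatureMissing(KeyError):
    """Signals that a feature could not be fetched."""


# Hypothetical per-feature defaults used when the store cannot answer.
DEFAULTS = {"user_avg_basket": 0.0}


def get_with_fallback(store, key, stats):
    """Return the feature value, or its default while counting the fallback."""
    try:
        return store.get(key)
    except FeatureMissing:
        stats["fallbacks"] = stats.get("fallbacks", 0) + 1
        return DEFAULTS[key]


class AlwaysMissing:
    """Stand-in for a chaos store with p=1.0: every lookup fails."""

    def get(self, key):
        raise FeatureMissing(key)
```

Counting fallbacks makes the degradation measurable: during an experiment the fallback rate should roughly track the injected probability, and model quality can be compared against it.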

10. eBPF-based syscall fault (chaos-mesh KernelChaos)

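apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata: { name: mount-fail }  # hypothetical name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  failKernRequest:
    callchain:
      - { funcname: "__x64_sys_mount" }
    failtype: 0        # 0 = slab allocation failure (should_failslab)
    headers: ["linux/mount.h"]
    probability: 100   # percent
    times: 1           # maximum number of injected failures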

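Decision criteria

| Situation | Approach |
| --- | --- |
| New service | Start in staging → move gradually to prod |
| Many stateless services | Start with pod kill / network chaos |
| Stateful (DB) | Start with replica failover / disk fill |
| Multi-region | Quarterly Game Day |
| Regulated industry | Tabletop → isolated cell → prod |

Default: start with a hypothesis, an SLO halt condition, and a 1% blast radius; after each pass, expand stepwise to 10% → 50% → 100%.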
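
The default progression can be written as a small gate function. A minimal sketch, assuming each stage reports a simple pass/fail; `STAGES` and `next_blast_radius` are hypothetical names, and dropping back to the smallest stage after a failure is an assumption rather than something the text prescribes:

```python
STAGES = [1, 10, 50, 100]  # blast radius in percent, from the default above


def next_blast_radius(current: int, passed: bool) -> int:
    """Return the blast radius (percent) for the next run."""
    if current not in STAGES:
        raise ValueError(f"unknown stage: {current}")
    if not passed:
        return STAGES[0]  # assumption: restart small after a failure
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]  # stay at 100 once reached
```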

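🤖 LLM usage

When to use: designing hypotheses and experiments, drafting runbooks, generating FIS templates and Chaos Mesh YAML, and deriving action items from result analysis.
When not to use: the prod injection itself must be reviewed and approved by a human; automatic execution by an LLM is prohibited.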

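Anti-patterns

  • Running without a steady state: there is no way to measure what broke.
  • Unattended chaos with no halt condition: it escalates into a real incident.
  • Staging only: coupling that exists only in prod is never found.
  • Experiments without a retrospective: nothing is learned; a cycle is complete only once the results are written up.
  • A first experiment that is too large: starting by taking down an entire region is a real outage.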
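
Several of these anti-patterns can be caught mechanically before an experiment starts. A minimal sketch of such a preflight check, assuming an experiment is described by a plain dict; the field names (`steady_state`, `halt_condition`, `retro_owner`, `blast_radius_pct`) and the 10% ceiling for a first run are hypothetical:

```python
def preflight(exp: dict, first_run: bool = True, max_first_pct: float = 10.0) -> list:
    """Return a list of problems; an empty list means the experiment may proceed."""
    problems = []
    if not exp.get("steady_state"):
        problems.append("no steady state: breakage cannot be measured")
    if not exp.get("halt_condition"):
        problems.append("no halt condition: unattended chaos can become an incident")
    if not exp.get("retro_owner"):
        problems.append("no retrospective owner: nothing will be learned")
    pct = exp.get("blast_radius_pct", 100.0)  # a missing value is treated as full blast
    if first_run and pct > max_first_pct:
        problems.append(f"first experiment too large: {pct}% > {max_first_pct}%")
    return problems
```

Treating a missing blast radius as 100% is deliberate: an experiment that does not state its scope is rejected on a first run instead of being waved through.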

🧪 Verification / Duplicates

  • Verified (Principles of Chaos Engineering, Netflix Tech Blog, Chaos Mesh docs 2026).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Chaos Mesh/AWS FIS patterns + Game Day runbook |