2nd/10_Wiki/Topics/Coding/Testing_Chaos_Engineering.md
2026-05-10 22:08:15 +09:00

id: testing-chaos-engineering
title: Chaos Engineering: intentional fault injection
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: testing, chaos, resilience, vibe-coding
tech_stack: language: any; applicable_to: Backend, DevOps
applied_in: (none)
aliases: chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus

Chaos Engineering

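Chaos engineering prepares a system for the failures it will actually meet in production by injecting faults on purpose (killing processes, adding latency, returning errors) during normal operation. The idea comes from Netflix's Simian Army. Common tools: Litmus, Chaos Mesh, Gremlin.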

📖 Core concepts

  • "Hope is not a strategy."
  • Failures will happen, so verify the response during normal operation.
  • Hypothesis-driven: "if X is killed, Y should happen."
  • Start with a small blast radius (staging → a small slice of prod).

💻 Code patterns

The Chaos Monkey idea

"Kill a random EC2 instance."

→ Assumption: HA works.
→ Actual check: is the service still OK after one node goes down?

Built by Netflix (2010).
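
A minimal sketch of the random-selection idea, not Netflix's implementation. `pickVictim` and the instance IDs are hypothetical, and refusing to touch a group with fewer than two instances is an added safety assumption. The random source is injectable so the choice can be made deterministic in tests.

```javascript
// Pick one instance at random from a group (hypothetical helper).
// Returns null when the group has fewer than two instances, so the
// last remaining instance is never selected.
function pickVictim(instances, random = Math.random) {
  if (instances.length < 2) return null;
  const index = Math.floor(random() * instances.length);
  return instances[index];
}

// Instance IDs are made up for illustration
const group = ['i-aaa', 'i-bbb', 'i-ccc'];
console.log(pickVictim(group));
```

The returned ID would then be handed to whatever actually terminates the instance; that step is deliberately left out here.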

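Hypothesis-driven experiments

"Hypothesis: if the Redis cache dies, p99 latency rises from 200 ms but stays within 500 ms (degraded but acceptable)."

Experiment:
1. Measure the steady-state metrics (normal operation)
2. Kill Redis
3. Measure for 5 minutes
4. Restore
5. Analyze the results

→ Result matches the expectation = OK. Result differs = a bug to fix.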
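
The analysis step can be sketched as a percentile comparison. This is an illustration under assumptions: `percentile` and `evaluateHypothesis` are hypothetical helpers, the nearest-rank method is one of several percentile definitions, and the latency samples are invented, not measurements.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Compare latency measured during the fault with the hypothesis bound.
function evaluateHypothesis({ baselineMs, chaosMs, maxP99Ms }) {
  const baselineP99 = percentile(baselineMs, 99);
  const chaosP99 = percentile(chaosMs, 99);
  return { baselineP99, chaosP99, holds: chaosP99 <= maxP99Ms };
}

// Invented samples for illustration only
const result = evaluateHypothesis({
  baselineMs: [120, 150, 180, 200],
  chaosMs: [300, 420, 480, 650],
  maxP99Ms: 500,
});
console.log(result);
```

With these samples the hypothesis does not hold (p99 during the fault is 650 ms against a 500 ms bound), which is the "differs = bug to fix" branch.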

Chaos Mesh (K8s)

# pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-redis
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [production]
    labelSelectors:
      app: redis
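  # `scheduler` is the Chaos Mesh 1.x field; 2.x moved scheduling to a separate Schedule resource
  scheduler:
    cron: '@every 1h'

kubectl apply -f pod-kill.yaml

→ Kills one random Redis pod every hour.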

Network latency

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db
spec:
  action: delay
  mode: one
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: '500ms'
    correlation: '50'
    jitter: '100ms'
  duration: '5m'

→ Adds about 500 ms of latency to the database's network traffic for 5 minutes.

Network partition

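apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: partition-api-redis  # illustrative name
spec:
  action: partition
  mode: all                  # apply to every pod matched by the selector
  direction: both
  selector:
    labelSelectors: { app: redis }
  target:
    mode: all
    selector:
      labelSelectors: { app: api }

→ The API pods can no longer reach Redis.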

CPU stress

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-api  # illustrative name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  stressors:
    cpu: { workers: 4, load: 80 }  # 4 stress workers at 80% load each
  duration: '10m'

Memory leak

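# Fragment: swap this in for the `stressors` block of the StressChaos above.
# It keeps memory allocated, which approximates the pressure a leak would cause.
stressors:
  memory:
    workers: 1
    size: '1GB'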

IO chaos (disk)

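apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: slow-disk-postgres  # illustrative name
spec:
  action: latency
  mode: one
  selector: { labelSelectors: { app: postgres } }
  volumePath: '/var/lib/postgresql'  # volume mount point inside the pod (assumed)
  path: '/var/lib/postgresql/**/*'   # files affected, as a glob under volumePath
  delay: '500ms'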

HTTP chaos (fake responses)

apiVersion: chaos-mesh.org/v1alpha1
kind: HTTPChaos
metadata:
  name: slow-http-api  # illustrative name
spec:
  mode: one
  target: Request
  port: 80
  delay: '5s'
  abort: false
  selector: { labelSelectors: { app: api } }

→ Delays HTTP requests by 5 seconds.

Litmus Chaos

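# litmus experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: api-chaos
spec:
  engineState: 'active'
  appinfo:
    appns: production
    applabel: 'app=api'
    appkind: deployment  # assumes the API runs as a Deployment
  chaosServiceAccount: pod-delete-sa  # assumed name; needs RBAC for pod-delete
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds
              value: '60'
            - name: CHAOS_INTERVAL        # seconds
              value: '10'

→ Deletes a pod every 10 seconds for 60 seconds.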

Gremlin (managed SaaS)

# CLI (illustrative syntax; verify subcommands and flags against the Gremlin CLI docs)
gremlin attack-container --target docker --type cpu --cpu-percent 80 --length 60

gremlin attack --type latency --length 60 --ms 500

→ UI + scenario library.

Application-level chaos

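// Toxiproxy, driven through its HTTP API (default port 8474), or in-app injection.
// Requires Node 18+ for the built-in fetch.
const TOXIPROXY = 'http://127.0.0.1:8474';

// Route DB connections through Toxiproxy: the app connects to 127.0.0.1:5433
await fetch(`${TOXIPROXY}/proxies`, {
  method: 'POST',
  body: JSON.stringify({ name: 'postgres', listen: '127.0.0.1:5433', upstream: 'postgres:5432' }),
});

// Add 500 ms of latency
await fetch(`${TOXIPROXY}/proxies/postgres/toxics`, {
  method: 'POST',
  body: JSON.stringify({ type: 'latency', attributes: { latency: 500 } }),
});

→ Fine-grained control over individual connections.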

Game day

A full day dedicated to chaos.
- Plan: 5 hypotheses
- Execute: one experiment per hour
- Observe: in real time
- Document: postmortem style

→ The team practices incident response.

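Steady state

"What is normal?" Define it first.

Metrics:
- Requests per minute
- Error rate < 0.1%
- p99 latency < 200 ms
- Active user count

→ Measure the baseline before the experiment starts.
Compare measurements taken during the experiment against that baseline.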
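
A small sketch of how the steady-state definition and the baseline comparison could be encoded. The thresholds come from the list above; the function names, metric keys, and sample values are assumptions made for illustration.

```javascript
// Thresholds from the steady-state definition: error rate < 0.1%, p99 < 200 ms.
const STEADY_STATE = { maxErrorRate: 0.001, maxP99Ms: 200 };

// Returns the names of violated conditions; an empty list means "steady".
function checkSteadyState(metrics, limits = STEADY_STATE) {
  const violations = [];
  if (metrics.errorRate >= limits.maxErrorRate) violations.push('errorRate');
  if (metrics.p99Ms >= limits.maxP99Ms) violations.push('p99Ms');
  return violations;
}

// Relative change of each metric versus the baseline (0.5 means +50%).
function compareToBaseline(baseline, current) {
  const delta = {};
  for (const key of Object.keys(baseline)) {
    delta[key] = (current[key] - baseline[key]) / baseline[key];
  }
  return delta;
}

// Invented numbers for illustration only
const baseline = { requestsPerMinute: 1200, errorRate: 0.0005, p99Ms: 180 };
const duringChaos = { requestsPerMinute: 900, errorRate: 0.003, p99Ms: 450 };

console.log(checkSteadyState(baseline));
console.log(checkSteadyState(duringChaos));
console.log(compareToBaseline(baseline, duringChaos));
```

Checking the baseline itself first is deliberate: if the system is not steady before the fault is injected, the experiment should not start.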

Blast radius

Stage 1: dev environment
Stage 2: staging
Stage 3: 1% of prod (canary)
Stage 4: 10% of prod
Stage 5: 100% of prod

→ Verify at each stage before moving on.
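
The staged expansion can be sketched as a table of stages plus a gate that advances only after a stage has been verified. The stage names, the 100% figure for dev and staging, and both helper functions are assumptions; the prod percentages are the ones listed above.

```javascript
// Stage plan; dev and staging are assumed to target the whole environment.
const STAGES = [
  { name: 'dev', env: 'dev', percent: 100 },
  { name: 'staging', env: 'staging', percent: 100 },
  { name: 'prod-1', env: 'prod', percent: 1 },
  { name: 'prod-10', env: 'prod', percent: 10 },
  { name: 'prod-100', env: 'prod', percent: 100 },
];

// Number of instances to target: at least one, never more than the fleet.
function targetCount(totalInstances, percent) {
  if (totalInstances <= 0) return 0;
  const wanted = Math.ceil((totalInstances * percent) / 100);
  return Math.min(totalInstances, Math.max(1, wanted));
}

// Advance only when the current stage confirmed the hypothesis; otherwise stay.
function nextStage(currentIndex, hypothesisHeld) {
  if (!hypothesisHeld) return currentIndex;
  return Math.min(currentIndex + 1, STAGES.length - 1);
}

// Example: a 200-instance prod fleet at the canary stage
console.log(targetCount(200, STAGES[2].percent));
```

Rounding up with a floor of one instance means a small fleet still gets a real experiment at the 1% stage instead of targeting nothing.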

Auto-rollback

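// Chaos experiment with an abort guard.
// `experiment` must expose run() and stop(); `abortIf` returns true when it must stop.
async function chaosWithGuard(experiment, abortIf, intervalMs = 5000) {
  const monitor = setInterval(() => {
    if (abortIf()) {
      clearInterval(monitor); // stop polling so stop() is called only once
      experiment.stop();
      console.warn('chaos aborted');
    }
  }, intervalMs);

  try {
    await experiment.run();
  } finally {
    clearInterval(monitor); // always clean up, even if run() throws
  }
}

// killPod() and errorRate() are placeholders for your own tooling
await chaosWithGuard(
  killPod('redis'),
  () => errorRate() > 0.1,
);

→ Stops the experiment once the error rate exceeds 10% (checked every 5 seconds).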

Common experiments

1. Kill one instance: verifies HA
2. AZ down: verifies multi-AZ
3. Kill the DB primary (master): verifies failover
4. Cache down: does the DB hold up?
5. Slow network: do timeouts and retries work?
6. CPU spike: does autoscaling kick in?
7. Disk full: does it alert and degrade gracefully?
8. Dependency down: is there a fallback?

Failure injection in code

// Chaos middleware (test environments only)
app.use((req, res, next) => {
  if (process.env.CHAOS && Math.random() < 0.01) {
    return res.status(503).end();  // 1% random 503
  }
  next();
});

→ Keeps chaos running continuously in staging.

Service mesh chaos (Istio)

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: backend-fault  # illustrative name
spec:
  hosts:
    - backend
  http:
    - fault:
        delay:
          fixedDelay: 5s
          percentage: { value: 10.0 }
        abort:
          httpStatus: 500
          percentage: { value: 1.0 }
      route:
        - destination: { host: backend }

→ 10% of requests get a 5 s delay; 1% are aborted with HTTP 500.

LLM chaos

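// LLM APIs are unreliable, so wrap every call in retry with backoff.
// retryWithBackoff, llm, and sleep are placeholders for your own helpers.
const r = await retryWithBackoff(async () => {
  // Chaos: make 5% of calls hang for 60 s to simulate an API timeout
  if (process.env.CHAOS && Math.random() < 0.05) await sleep(60_000);
  return llm.complete(prompt);
});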
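
The `retryWithBackoff` helper is not defined in this note. One possible shape, as a sketch under assumptions: exponential backoff with a per-attempt timeout, so that an injected 60 s hang is cut short and retried instead of blocking the caller. The option names and defaults are illustrative.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Retry `fn` with exponential backoff; each attempt gets its own timeout.
async function retryWithBackoff(
  fn,
  { retries = 3, baseDelayMs = 200, timeoutMs = 10_000 } = {},
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      // Wait 200 ms, 400 ms, 800 ms, ... before the next attempt
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

Note that the timeout only stops waiting; it does not cancel the underlying request. A real client would also pass an abort signal to the API call if the SDK supports one.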

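Analyzing results

After the experiment:
- Did the system return to steady state?
- How long did recovery take?
- Was there user-facing impact?
- Did alerts fire?
- Was the response automatic?
- What was learned?

→ Document. Fix the bugs. Re-run.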
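
Recovery time can be derived from the metric samples collected after the restore step. A hedged sketch: `recoveryTimeMs`, the sample shape, and the numbers are assumptions, and "recovered" is taken to mean steady from some point until the end of the observation window.

```javascript
// samples: [{ t: epoch ms, p99Ms }], sorted by t.
// Returns the ms from restore until metrics become steady and stay steady
// for the rest of the series, or null if they never do.
function recoveryTimeMs(samples, restoredAt, isSteady) {
  let recoveredAt = null;
  for (const sample of samples) {
    if (sample.t < restoredAt) continue; // ignore samples taken during the fault
    if (isSteady(sample)) {
      if (recoveredAt === null) recoveredAt = sample.t;
    } else {
      recoveredAt = null; // relapse: recovery has not happened yet
    }
  }
  return recoveredAt === null ? null : recoveredAt - restoredAt;
}

// Invented samples: one every 10 s after the restore at t = 0
const samples = [
  { t: 0, p99Ms: 1500 },
  { t: 10_000, p99Ms: 900 },
  { t: 20_000, p99Ms: 190 },
  { t: 30_000, p99Ms: 180 },
];
const steady = (s) => s.p99Ms < 200;

console.log(recoveryTimeMs(samples, 0, steady));
```

Resetting on a relapse avoids reporting a recovery time from a single good sample that is followed by another degradation.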

Postmortem-style write-up (for yourself)

## Chaos: 2026-05-09 Redis kill

Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.

Observed:
- p99 latency: 200ms → 1.5s (worse than expected)
- Error rate: 0% → 0.3% (some queries timeout)
- Recovery: 30 sec after Redis up

Action items:
- [ ] Increase the DB query timeout (5s → 10s)
- [ ] Raise the connection pool maximum
- [ ] Re-test

When to start

Prerequisites:
- Monitoring (RED metrics)
- Alerting (PagerDuty)
- HA (multi-instance, multi-AZ)
- Runbook
- A team able to operate the system

→ Without these, a chaos experiment becomes a real incident.

Adoption story

  • Netflix: Simian Army (2011+).
  • Amazon: GameDay (started by Jesse Robbins).
  • Slack: Disasterpiece Theater (recurring exercises).
  • LinkedIn: WaterBear.
  • Gremlin (company): Failure-as-a-Service.

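🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Small team / getting started | Toxiproxy + manual experiments |
| Kubernetes | Chaos Mesh / Litmus |
| Managed | Gremlin |
| Service mesh | Istio fault injection |
| Continuous | Chaos Monkey-style cron |
| Game day | One dedicated day at a time |
| App-level | In-app middleware |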

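Anti-patterns

  • Running the first experiment in prod: causes an incident.
  • No steady-state definition: results cannot be interpreted.
  • Large blast radius: a real incident.
  • No auto-abort: chaos turns into an incident.
  • No documentation: nothing is learned.
  • No monitoring: no results to analyze.
  • System without HA: chaos simply breaks it.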

🤖 Hints for LLM use

  • Chaos = hypothesis-driven experiments.
  • Prerequisites: monitoring + HA + runbook.
  • Expand the blast radius gradually.
  • Auto-abort and a steady-state definition are mandatory.

🔗 Related documents