[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -0,0 +1,401 @@
+---
+id: testing-chaos-engineering
+title: "Chaos Engineering: intentional fault injection"
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [testing, chaos, resilience, vibe-coding]
+tech_stack: { language: "any", applicable_to: ["Backend", "DevOps"] }
+applied_in: []
+aliases: [chaos engineering, fault injection, Netflix Simian Army, Chaos Monkey, Gremlin, Litmus]
+---
+
+# Chaos Engineering
+
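+> Prepare a system for the failures it will actually meet in production by **injecting faults on purpose (kill, slow down, return errors) during normal operation**. The idea comes from Netflix's Simian Army; current tools include Litmus, Chaos Mesh, and Gremlin.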
+
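+## 📖 Core concepts
+- "Hope is not a strategy."
+- Failures will happen, so verify the response to them during normal operation.
+- Hypothesis-driven: "if we kill X, then Y should happen."
+- Start with a small blast radius (staging → a small slice of prod).
+
+## 💻 Code patterns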
+
+### The Chaos Monkey idea
+```
+"Terminate a random EC2 instance."
+
+→ Assumption: high availability (HA) works.
+→ What is actually verified: is the service still OK after one node goes down?
+
+Built by Netflix (around 2010).
+```
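
The random-termination idea can be sketched as a victim-selection function. This is a minimal sketch, not Chaos Monkey's actual implementation: the `Instance` shape, the opt-out flag, and the "one victim per group" rule are all assumptions made for illustration, and the random source is injectable so the choice can be tested.

```typescript
// Hypothetical instance record; not an AWS SDK type.
interface Instance {
  id: string;
  group: string;     // e.g. an auto scaling group name
  optedOut: boolean; // teams can exclude instances from chaos
}

// Picks at most one victim per group, so a single run removes
// no more than one node of redundancy from any service.
function pickVictims(
  instances: Instance[],
  rng: () => number = Math.random,
): Instance[] {
  const byGroup = new Map<string, Instance[]>();
  for (const inst of instances) {
    if (inst.optedOut) continue;
    const members = byGroup.get(inst.group) ?? [];
    members.push(inst);
    byGroup.set(inst.group, members);
  }

  const victims: Instance[] = [];
  byGroup.forEach((members) => {
    // A group with a single instance has no redundancy to verify:
    // killing it is a guaranteed outage, not an HA test.
    if (members.length < 2) return;
    const index = Math.min(members.length - 1, Math.floor(rng() * members.length));
    victims.push(members[index]);
  });
  return victims;
}
```

Skipping single-instance groups reflects the assumption stated above: the experiment verifies HA, so it only makes sense where redundancy is supposed to exist.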
+
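+### Hypothesis-driven experiments
+```
+"Hypothesis: if the Redis cache dies, p99 latency rises from 200 ms but stays under 500 ms (degraded but acceptable)."
+
+Experiment:
+1. Measure the steady-state metrics (normal operation)
+2. Kill Redis
+3. Measure for 5 minutes
+4. Restore
+5. Analyze the results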
+```
+
+→ If the result matches the expectation, the hypothesis holds. If it differs, you have found a bug to fix.
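
The comparison between expected and observed behavior can be made mechanical. The sketch below assumes the hypothesis is expressed as an upper bound on one metric during the fault window; the `Hypothesis` and `Verdict` types and the function name are illustrative, not part of any chaos tool.

```typescript
// Hypothetical types for illustration only.
interface Hypothesis {
  description: string;
  metric: string;         // e.g. "p99 latency (ms)"
  maxDuringFault: number; // upper bound that must hold while the fault is active
}

interface Verdict {
  holds: boolean;
  worst: number | null; // worst sample seen, or null when there is no data
  summary: string;
}

// Compares the worst sample from the fault window against the bound.
function evaluateHypothesis(h: Hypothesis, samples: number[]): Verdict {
  if (samples.length === 0) {
    // No data means the hypothesis was not verified, which is treated as a failure.
    return { holds: false, worst: null, summary: `${h.metric}: no samples, cannot confirm` };
  }
  const worst = Math.max(...samples);
  const holds = worst <= h.maxDuringFault;
  const summary = holds
    ? `${h.metric}: worst ${worst} within bound ${h.maxDuringFault}`
    : `${h.metric}: worst ${worst} exceeded bound ${h.maxDuringFault}`;
  return { holds, worst, summary };
}
```

For the Redis example above, the bound would be 500 for p99 latency in milliseconds; a worst sample of 1500 would mark the hypothesis as failed and point to a fix.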
+
+### Chaos Mesh (K8s)
+```yaml
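+# pod-kill.yaml
+# Chaos Mesh 2.x: recurring experiments use the Schedule kind
+# (the inline `scheduler.cron` field is 1.x syntax).
+apiVersion: chaos-mesh.org/v1alpha1
+kind: Schedule
+metadata:
+  name: kill-redis
+spec:
+  schedule: '0 * * * *'        # at the top of every hour
+  type: PodChaos
+  historyLimit: 5
+  concurrencyPolicy: Forbid
+  podChaos:
+    action: pod-kill
+    mode: one
+    selector:
+      namespaces: [production]
+      labelSelectors:
+        app: redis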
+```
+
+```bash
+kubectl apply -f pod-kill.yaml
+```
+
+→ Kills one random Redis pod every hour.
+
+### Network latency
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: slow-db
+spec:
+  action: delay
+  mode: one
+  selector:
+    labelSelectors: { app: postgres }
+  delay:
+    latency: '500ms'
+    correlation: '50'
+    jitter: '100ms'
+  duration: '5m'
+```
+
+→ One Postgres pod gets about 500 ms (with 100 ms jitter) of added network latency for 5 minutes.
+
+### Network partition
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+spec:
+  action: partition
+  mode: all
+  direction: both
+  selector:
+    labelSelectors: { app: redis }
+  target:
+    mode: all
+    selector:
+      labelSelectors: { app: api }
+```
+
+→ The API cannot communicate with Redis.
+
+### CPU stress
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: StressChaos
+spec:
+  mode: one
+  selector: { labelSelectors: { app: api } }
+  stressors:
+    cpu: { workers: 4, load: 80 }
+  duration: '10m'
+```
+
+### Memory pressure (simulating a leak)
+```yaml
+# Fragment: replaces `stressors` in the StressChaos spec above
+stressors:
+  memory:
+    workers: 1
+    size: '1GB'
+```
+
+### IO chaos (disk)
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: IOChaos
+spec:
+  action: latency
+  mode: one
+  selector: { labelSelectors: { app: postgres } }
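+  volumePath: '/var/lib/postgresql'   # mount point of the volume to inject into
+  path: '/var/lib/postgresql/**/*'    # files affected (glob)
+  delay: '500ms'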
+```
+
+### HTTP chaos (tampering with requests and responses)
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: HTTPChaos
+spec:
+  mode: one
+  target: Request
+  port: 80
+  delay: '5s'
+  abort: false
+  selector: { labelSelectors: { app: api } }
+```
+
+→ HTTP requests to one API pod on port 80 are delayed by 5 seconds.
+
+### Litmus Chaos
+```yaml
+# litmus experiment
+apiVersion: litmuschaos.io/v1alpha1
+kind: ChaosEngine
+metadata:
+  name: api-chaos
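+spec:
+  engineState: active
+  chaosServiceAccount: pod-delete-sa   # assumed name; must exist with pod-delete permissions
+  appinfo:
+    appns: production
+    applabel: 'app=api'
+    appkind: deployment                # assumes the API runs as a Deployment
+  experiments: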
+    - name: pod-delete
+      spec:
+        components:
+          env:
+            - name: TOTAL_CHAOS_DURATION
+              value: '60'
+            - name: CHAOS_INTERVAL
+              value: '10'
+```
+
+→ Deletes a pod every 10 seconds for 60 seconds.
+
+### Gremlin (managed SaaS)
+```bash
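+# CLI sketch: confirm the exact flags with `gremlin help attack` for your version
+# CPU attack inside a container: 1 core at 80% for 60 seconds
+gremlin attack-container <container-id> cpu -l 60 -c 1 -p 80
+
+# Network latency attack on the host: 500 ms for 60 seconds
+gremlin attack latency -l 60 -m 500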
+```
+
+→ Also provides a web UI and a scenario library.
+
+### Application-level chaos
+```ts
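+// Toxiproxy sits between the app and Postgres; the app connects to 127.0.0.1:5433.
+// This uses the Toxiproxy HTTP API (default port 8474) through fetch (Node 18+).
+const TOXIPROXY = 'http://127.0.0.1:8474';
+
+// Create the proxy in front of Postgres
+await fetch(`${TOXIPROXY}/proxies`, {
+  method: 'POST',
+  body: JSON.stringify({ name: 'postgres', listen: '127.0.0.1:5433', upstream: 'postgres:5432' }),
+});
+
+// Add 500 ms of latency to data coming back from Postgres
+await fetch(`${TOXIPROXY}/proxies/postgres/toxics`, {
+  method: 'POST',
+  body: JSON.stringify({ type: 'latency', stream: 'downstream', attributes: { latency: 500 } }),
+});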
+```
+
+→ Fine-grained control over exactly which connection is affected.
+
+### Game day
+```
+A full day dedicated to chaos experiments.
+- Plan: 5 hypotheses
+- Execute: one experiment per hour
+- Observe: in real time
+- Document: in postmortem format
+
+→ The team practices incident response.
+```
+
+### Steady state
+```
+Define what "normal" means.
+
+Metrics:
+- Requests per minute
+- Error rate < 0.1%
+- p99 latency < 200 ms
+- Active user count
+
+→ Measure the baseline before the experiment starts.
+During the experiment, compare against the baseline.
+```
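
A steady-state check along these lines could be written as a pure function. The error-rate and p99 bounds come from the list above; the traffic-drop tolerance and all type and function names are assumptions added for illustration.

```typescript
// Hypothetical metric snapshot; field names are illustrative.
interface Snapshot {
  requestsPerMinute: number;
  errorRate: number;    // fraction, so 0.001 means 0.1%
  p99LatencyMs: number;
}

interface Tolerance {
  maxErrorRate: number;        // 0.001 per the list above
  maxP99LatencyMs: number;     // 200 per the list above
  maxTrafficDropRatio: number; // assumption: allowed drop relative to baseline
}

// Returns a human-readable violation per broken bound; empty means steady state.
function steadyStateViolations(baseline: Snapshot, current: Snapshot, tol: Tolerance): string[] {
  const violations: string[] = [];
  if (current.errorRate >= tol.maxErrorRate) {
    violations.push(`error rate ${current.errorRate} >= ${tol.maxErrorRate}`);
  }
  if (current.p99LatencyMs >= tol.maxP99LatencyMs) {
    violations.push(`p99 ${current.p99LatencyMs} ms >= ${tol.maxP99LatencyMs} ms`);
  }
  const minTraffic = baseline.requestsPerMinute * (1 - tol.maxTrafficDropRatio);
  if (current.requestsPerMinute < minTraffic) {
    violations.push(`traffic ${current.requestsPerMinute} rpm < ${minTraffic} rpm`);
  }
  return violations;
}
```

Running the same function before the experiment (baseline against itself) is a cheap way to confirm the system is actually in steady state before any fault is injected.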
+
+### Blast radius
+```
+Stage 1: dev environment
+Stage 2: staging
+Stage 3: 1% of prod (canary)
+Stage 4: 10% of prod
+Stage 5: 100% of prod
+
+→ Verify at each stage before moving on.
+```
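
The staged expansion can be sketched as a small state machine that only advances after a stage has been verified. The stage labels mirror the list above; staying on the current stage after a failed verification is an assumption (a stricter policy could step back instead).

```typescript
// Stage labels mirror the five stages listed above.
const STAGES = ['dev', 'staging', 'prod-1%', 'prod-10%', 'prod-100%'] as const;
type Stage = typeof STAGES[number];

// Advances one stage only when the current stage was verified.
// Assumption: a failed verification keeps the experiment where it is.
function nextStage(current: Stage, verified: boolean): Stage {
  if (!verified) return current;
  const index = STAGES.indexOf(current);
  const nextIndex = Math.min(index + 1, STAGES.length - 1);
  return STAGES[nextIndex];
}
```

For example, `nextStage('staging', true)` moves to the 1% canary, while `nextStage('staging', false)` keeps the experiment in staging until the hypothesis is confirmed there.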
+
+### Auto-rollback
+```ts
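+// Run a chaos experiment with an abort guard.
+// `experiment` needs run() and stop(); killPod, errorRate and log are hypothetical helpers.
+async function chaosWithGuard(experiment, abortIf) {
+  const monitor = setInterval(() => {
+    if (abortIf()) {
+      clearInterval(monitor);   // stop polling so stop() is called only once
+      experiment.stop();
+      log.warn('chaos aborted');
+    }
+  }, 5000);                     // check every 5 seconds
+
+  try {
+    await experiment.run();
+  } finally {
+    clearInterval(monitor);     // always clean up, even if run() throws
+  }
+}
+
+await chaosWithGuard(
+  killPod('redis'),
+  () => errorRate() > 0.1,
+);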
+```
+
+→ If the error rate exceeds 10%, the experiment stops immediately.
+
+### Common experiments
+```
+1. Kill one instance: verifies HA
+2. AZ outage: verifies multi-AZ
+3. Kill the DB primary: verifies failover
+4. Cache down: can the DB take the load?
+5. Slow network: do timeouts and retries work?
+6. CPU spike: does autoscaling kick in?
+7. Disk full: do alerts fire and does the service degrade gracefully?
+8. Dependency down: is there a fallback?
+```
+
+### Failure injection in code
+```ts
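+// Chaos middleware (test environments only)
+app.use((req, res, next) => {
+  // Require an explicit 'true' so that CHAOS=false does not enable it by accident
+  if (process.env.CHAOS === 'true' && Math.random() < 0.01) {
+    return res.status(503).end();  // fail 1% of requests with a 503
+  }
+  next();
+});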
+```
+
+→ Lets chaos run continuously in staging.
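
The decision inside such a middleware can be pulled out into a pure function, which makes the failure rate testable without sending real requests. This is a sketch under assumptions: `ChaosConfig`, the exempt-path list (for example health checks, so the orchestrator does not restart the pod), and the function name are all illustrative.

```typescript
// Hypothetical configuration; not tied to any framework.
interface ChaosConfig {
  enabled: boolean;
  failureRate: number;   // fraction of requests to fail, e.g. 0.01 for 1%
  exemptPaths: string[]; // paths that must never be failed, e.g. health checks
}

type Decision = { inject: false } | { inject: true; status: number };

// Decides whether to fail this request. `rng` is injectable for deterministic tests.
function decideFault(
  path: string,
  cfg: ChaosConfig,
  rng: () => number = Math.random,
): Decision {
  if (!cfg.enabled) return { inject: false };
  if (cfg.exemptPaths.indexOf(path) !== -1) return { inject: false };
  if (rng() < cfg.failureRate) return { inject: true, status: 503 };
  return { inject: false };
}
```

A middleware would call `decideFault(req.path, cfg)` and respond with the returned status when `inject` is true, otherwise pass the request on.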
+
+### Service mesh chaos (Istio)
+```yaml
+apiVersion: networking.istio.io/v1alpha3
+kind: VirtualService
+spec:
+  hosts: [backend]
+  http:
+    - fault:
+        delay:
+          fixedDelay: 5s
+          percentage: { value: 10.0 }
+        abort:
+          httpStatus: 500
+          percentage: { value: 1.0 }
+      route:
+        - destination: { host: backend }
+```
+
+→ 10% of requests get a 5 s delay; 1% are aborted with HTTP 500.
+
+### LLM chaos
+```ts
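+// LLM APIs are unreliable, so wrap calls in retry with backoff.
+// Chaos: 5% of calls stall for 60 s first, simulating an API timeout.
+// retryWithBackoff, sleep, and llm are assumed helpers defined elsewhere.
+const r = await retryWithBackoff(async () => {
+  if (Math.random() < 0.05) await sleep(60_000);
+  return llm.complete(prompt);
+});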
+```
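
The snippet above relies on `retryWithBackoff` and `sleep`, which the note does not define. One possible shape is sketched below, assuming exponential backoff with a cap and a per-attempt timeout (without a timeout, a 60-second stall would simply be waited out instead of retried). The default values are illustrative, not recommendations.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Exponential backoff: base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Rejects if the promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Retries `fn` up to maxAttempts times, sleeping between failed attempts.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  attemptTimeoutMs = 10_000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await withTimeout(fn(), attemptTimeoutMs);
    } catch (error) {
      lastError = error;
      // No sleep after the final attempt; the error is rethrown below.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying request. A real client would also pass an abort signal to the HTTP call if the LLM SDK supports one.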
+
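+### Analyzing results
+```
+After the experiment:
+- Did the steady state recover?
+- How long did recovery take?
+- What was the user-facing impact?
+- Did alerts fire?
+- Was the response automatic?
+- What did we learn?
+
+→ Document. Fix bugs. Re-run.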
+```
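
The "how long did recovery take?" question can be answered from the latency samples collected during the experiment. The sketch below assumes recovery means the first sample after which p99 stays below the steady-state bound for the rest of the window; the `Sample` type and function name are illustrative.

```typescript
// Hypothetical time-series sample.
interface Sample {
  tSec: number;         // seconds since the experiment started
  p99LatencyMs: number;
}

// Returns seconds from the end of the fault until latency is back under
// the threshold and stays there, or null if it never recovers in the window.
function recoveryTimeSec(
  samples: Sample[],
  faultEndSec: number,
  thresholdMs: number,
): number | null {
  const after = samples
    .filter((s) => s.tSec >= faultEndSec)
    .sort((a, b) => a.tSec - b.tSec);

  // Find the last sample that still violates the threshold.
  let lastBad = -1;
  after.forEach((s, i) => {
    if (s.p99LatencyMs >= thresholdMs) lastBad = i;
  });

  const recoveredAt = lastBad + 1;
  if (recoveredAt >= after.length) return null; // no data, or still degraded at the end
  return after[recoveredAt].tSec - faultEndSec;
}
```

Using the last violating sample rather than the first good one avoids reporting a recovery that was followed by another spike.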
+
+### Postmortem-style write-up (for your own experiments)
+```markdown
+## Chaos: 2026-05-09 Redis kill
+
+Hypothesis: Redis down → cache miss → DB latency ↑ but no errors.
+
+Observed:
+- p99 latency: 200ms → 1.5s (worse than expected)
+- Error rate: 0% → 0.3% (some queries timeout)
+- Recovery: 30 sec after Redis up
+
+Action items:
+- [ ] Increase the DB query timeout (5s → 10s)
+- [ ] Raise the connection pool maximum
+- [ ] Re-test
+```
+
+### When to start
+```
+Prerequisites:
+- Monitoring (RED metrics)
+- Alerting (PagerDuty)
+- HA (multi-instance, multi-AZ)
+- Runbook
+- A team able to operate the system
+
+→ Without these, chaos turns into a real incident.
+```
+
+### Adoption story
+- **Netflix**: Simian Army (2011+).
+- **Amazon**: GameDay (started by Jesse Robbins).
+- **Slack**: Disasterpiece Theater (recurring exercises).
+- **LinkedIn**: WaterBear.
+- **Gremlin (company)**: Failure-as-a-Service.
+
+## 🤔 Decision guide
+| Situation | Recommendation |
+|---|---|
+| Small team / getting started | Toxiproxy + manual experiments |
+| K8s | Chaos Mesh / Litmus |
+| Managed | Gremlin |
+| Service mesh | Istio fault injection |
+| Continuous | Chaos Monkey-style cron |
+| Game day | A dedicated full day |
+| App-level | In-app middleware |
+
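+## ❌ Anti-patterns
+- **Running the first experiment in prod**: you cause an incident.
+- **No steady-state definition**: you cannot interpret the results.
+- **Large blast radius**: a real incident.
+- **No auto-abort**: the chaos becomes the incident.
+- **Not documenting**: nothing is learned.
+- **No monitoring**: there are no results to analyze.
+- **System without HA**: chaos simply breaks it.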
+
+## 🤖 Hints for LLM use
+- Chaos = hypothesis-driven experiments.
+- Prerequisites: monitoring + HA + runbook.
+- Expand the blast radius gradually.
+- Auto-abort and a steady-state definition are mandatory.
+
+## 🔗 Related documents
+- [[Backend_Circuit_Breaker]]
+- [[Productivity_Postmortem]]
+- [[DevOps_Disaster_Recovery]]