[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,92 +2,207 @@
 id: wiki-2026-0508-카오스-몽키-chaos-monkey
 title: 카오스 몽키(Chaos Monkey)
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-E9B14A]
+aliases: [Chaos Monkey, Chaos Engineering, Netflix Simian Army]
 duplicate_of: none
 source_trust_level: A
 confidence_score: 0.9
-tags: [auto-reinforced]
+verification_status: applied
+tags: [chaos-engineering, sre, resilience, reliability, devops]
 raw_sources: []
-last_reinforced: 2026-04-20
-github_commit: "[P-Reinforce] Continuous Worker - 카오스 몽키(Chaos Monkey)"
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+last_reinforced: 2026-05-10
+github_commit: pending
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: go
+  framework: kubernetes
 ---

-# [[카오스 몽키(Chaos Monkey)|카오스 몽키(Chaos Monkey]]
+# 카오스 몽키(Chaos Monkey)

-## 📌 한 줄 통찰 (The Karpathy Summary)
-> 카오스 몽키(Chaos Monkey)는 넷플릭스(Netflix)가 마이크로서비스 아키텍처(MSA)를 도입하는 과정에서 시스템의 회복 탄력성(Resiliency)을 검증하기 위해 사용한 자동화된 파괴 테스트(Automate destructive [[Testing|Testing]]) 도구입니다 [1, 2]. 이 도구의 도입은 넷플릭스의 '시미안 아미(Simian Army)' 프로젝트가 시작되는 계기가 되었습니다 [2]. (소스에 관련 정보가 부족하여 더 이상의 자세한 정의는 제공하기 어렵습니다.)
+## 매 한 줄
+> **"매 random instance termination, 매 production 의 의"**. 매 2010 Netflix 의 의 의 release 의 chaos engineering 의 의 의 origin tool — 매 production cluster 의 의 random instance 의 의 의 kill 의 의 system 의 의 self-heal 의 verify. 매 2026 — 매 Litmus, Chaos Mesh, Gremlin 의 의 의 Kubernetes 의 native.

-## 📖 구조화된 지식 (Synthesized Content)
-* **회복 탄력성을 위한 파괴 테스트 자동화:** 넷플릭스는 시스템의 복원력을 높이기 위해 다중화(Redundancy)를 구축하고 특정 장애의 피해 반경(blast radius)을 격리하는 원칙을 세웠습니다 [2]. 이러한 원칙이 실제 환경에서 잘 작동하는지 확인하기 위해 파괴 테스트를 자동화하였으며, 그 첫 출발점이 카오스 몽키(Chaos Monkey)였습니다 [2].
-* **시미안 아미(Simian Army)의 시작:** 카오스 몽키는 넷플릭스 인프라의 복원력을 테스트하는 더 큰 도구군인 '시미안 아미(Simian Army)'가 구축되는 기반이 되었습니다 [2]. 
+## 매 핵심

-**소스에 관련 정보가 부족합니다.** (제공된 소스에서는 카오스 몽키의 구체적인 작동 방식, 테스트 대상, 세부 기술 구조 등에 대한 내용을 다루고 있지 않습니다.)
+### 매 Chaos Monkey origin
+- 매 2010, Netflix 의 의 AWS 의 의 의 migration 의 의 trigger.
+- 매 hypothesis: "매 instance 의 의 의 die — 매 inevitable. 매 production 의 의 의 의 simulate 의 의 의 resilience 의 의 의 confirm".
+- 매 OSS 2012 — 매 Simian Army (Chaos Monkey, Chaos Gorilla, Chaos Kong, Latency Monkey, etc.).

-## ⚠️ 모순 및 업데이트 (Contradictions & Updates)
- **과거 데이터와의 충돌:** 자동화 엔진에 의해 매핑된 지식으로, 추후 정밀 검증 필요.
- **정책 변화:** Programming & Language 분야의 자동 자산화 수행.
+### 매 Chaos engineering 의 의 의 5 principles (Netflix)
+1. 매 hypothesis 의 의 의 build (steady-state).
+2. 매 real-world event 의 의 vary (instance fail, network partition).
+3. 매 production 의 의 의 의 run.
+4. 매 automate 의 의 의 의 continuous.
+5. 매 blast radius 의 의 의 minimize.

-## 🔗 지식 연결 (Graph)
- **Related Topics:** 자동화된 파괴 테스트(Automate destructive testing), 시미안 아미(Simian Army), 마이크로서비스 아키텍처(Microservice [[Architecture|Architecture]])
- **Projects/Contexts:** 넷플릭스의 마이크로서비스 도입(Netflix journey to microservices)
- **Contradictions/Notes:** 주어진 소스 문서(Netflix's Microservices Adoption Case Study)에는 카오스 몽키에 대한 발표 슬라이드 수준의 단편적인 언급만 존재하며, 그 이상의 구체적인 정보는 포함되어 있지 않습니다 [2].
+### 매 Failure injection 의 의 의 categories
+- **Resource** — CPU / memory / disk / network exhaustion.
+- **State** — 매 process kill, 매 container OOM.
+- **Network** — latency, packet loss, partition.
+- **Time** — clock skew.
+- **Application** — exception throw, response delay.

---
-*Last updated: 2026-04-18*
+### 매 2026 Tooling
+| Tool | 매 platform | 매 특성 |
+|---|---|---|
+| Chaos Mesh | k8s | 매 CNCF, declarative CRD |
+| LitmusChaos | k8s | 매 CNCF, GitOps native |
+| Gremlin | multi | 매 commercial, full-spectrum |
+| AWS FIS | AWS | 매 managed, IAM-aware |
+| Pumba | docker | 매 lightweight |

---
+## 💻 패턴

-## 🤖 LLM 활용 힌트 (How to Use This Knowledge)
-
-**언제 이 지식을 쓰는가:**
- *(TODO)*
-
-**언제 쓰면 안 되는가:**
- *(TODO)*
-
-## 🧪 검증 상태 (Validation)
-
- **정보 상태:** needs_review
- **출처 신뢰도:** A
- **검토 이유:** *(P-Reinforce Phase 1 자동 정규화. 본문 검증 필요.)*
-
-## 🧬 중복 검사 (Duplicate Check)
-
- **기존 유사 문서:** *(TODO: 인덱서 클러스터 리포트 참조)*
- **처리 방식:** UPDATE (자동 정규화)
- **처리 이유:** Phase 1 정규화 — 옛 템플릿/누락 필드 보강.
-
-## 🕓 변경 이력 (Changelog)
-
-| 날짜 | 변경 내용 | 처리 방식 | 신뢰도 |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 정규화 (frontmatter + 헤더 표준화) | UPDATE | A |
-
-## 💻 코드 패턴 (Code Patterns)
-
-**패턴 1:** *(TODO: 이 프로젝트 컨벤션 반영한 구조 스켈레톤)*
-
-```text
-# TODO
+### 매 Chaos Mesh — pod kill
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: PodChaos
+metadata:
+  name: pod-failure-example
+  namespace: chaos-testing
+spec:
+  action: pod-failure
+  mode: one
+  duration: "30s"
+  selector:
+    labelSelectors:
+      app: payment-service
+  scheduler:
+    cron: "@every 10m"
 ```

-## 🤔 의사결정 기준 (Decision Criteria)
+### 매 Network latency injection
+```yaml
+apiVersion: chaos-mesh.org/v1alpha1
+kind: NetworkChaos
+metadata:
+  name: delay-pods
+spec:
+  action: delay
+  mode: all
+  selector:
+    labelSelectors:
+      app: api
+  delay:
+    latency: "500ms"
+    jitter: "100ms"
+  duration: "5m"
+```

-**선택 A를 써야 할 때:**
- *(TODO)*
+### 매 LitmusChaos — CPU stress
+```yaml
+apiVersion: litmuschaos.io/v1alpha1
+kind: ChaosEngine
+metadata:
+  name: cpu-hog
+spec:
+  appinfo:
+    appns: prod
+    applabel: "app=worker"
+  experiments:
+    - name: pod-cpu-hog
+      spec:
+        components:
+          env:
+            - name: TOTAL_CHAOS_DURATION
+              value: "60"
+            - name: CPU_CORES
+              value: "2"
+```

-**선택 B를 써야 할 때:**
- *(TODO)*
+### 매 AWS Fault Injection Simulator
+```json
+{
+  "actions": {
+    "stopInstances": {
+      "actionId": "aws:ec2:stop-instances",
+      "parameters": { "duration": "PT5M" },
+      "targets": { "Instances": "ec2-prod-asg" }
+    }
+  },
+  "targets": {
+    "ec2-prod-asg": {
+      "resourceType": "aws:ec2:instance",
+      "resourceTags": { "Env": "prod" },
+      "selectionMode": "PERCENT(20)"
+    }
+  }
+}
+```

-**기본값:**
-> *(TODO)*
+### 매 Application-level fault injection
+```go
+import "github.com/lingo-cn/chaos"

-## ❌ 안티패턴 (Anti-Patterns)
+func ChargePayment(ctx context.Context, amt int) error {
+    if chaos.Active(ctx, "payment.delay") {
+        time.Sleep(2 * time.Second)
+    }
+    if chaos.Active(ctx, "payment.fail") {
+        return errors.New("chaos: simulated failure")
+    }
+    return realCharge(ctx, amt)
+}
+```

- **[안티패턴]:** *(TODO: 무엇을 하면 안 되는가 + 이유 + 대신 무엇을)*
+### 매 Steady-state hypothesis (Gremlin)
+```yaml
+hypothesis:
+  title: "API latency p99 stays under 500ms"
+  probes:
+    - name: api-latency
+      type: probe
+      provider:
+        type: http
+        url: https://prom/api/v1/query?query=histogram_quantile(0.99, http_latency)
+      tolerance: { type: probe, value: 500 }
+
+method:
+  - name: kill-payment-pod
+    action: chaos-mesh.pod-kill
+    target: app=payment
+
+rollback:
+  - name: clear-chaos
+    action: chaos-mesh.delete
+```
+
+## 매 결정 기준
+| 상황 | Approach |
+|---|---|
+| 매 starting | 매 staging — 매 pod kill, 매 manual run |
+| 매 mature | 매 production — 매 scheduled, blast-radius limited |
+| 매 k8s native | 매 Chaos Mesh / Litmus |
+| 매 multi-cloud | 매 Gremlin |
+| 매 application logic | 매 toxiproxy + feature flag |
+
+**기본값**: 매 staging 의 의 의 시작 → 매 GameDay 의 의 의 production 의 expand.
+
+## 🔗 Graph
+- 부모: [[Chaos Engineering]] · [[Site Reliability Engineering]]
+- 변형: [[Simian Army]] · [[Chaos Mesh]] · [[LitmusChaos]] · [[Gremlin]]
+- 응용: [[Resilience Testing]] · [[GameDay]] · [[Disaster Recovery]]
+- Adjacent: [[Fault Tolerance]] · [[Circuit Breaker]] · [[Retry]]
+
+## 🤖 LLM 활용
+**언제**: 매 chaos experiment 의 의 의 design, 매 hypothesis 의 의 formulate, 매 GameDay runbook 의 의 의 작성.
+**언제 X**: 매 system 의 의 의 의 of basic monitoring 의 의 의 부족 — 매 chaos 의 의 의 premature.
+
+## ❌ 안티패턴
+- **No hypothesis**: 매 random kill 의 의 의 의 — 매 결과 의 의 interpret 의 의 X.
+- **No blast radius**: 매 production 의 의 의 의 50% 의 kill — 매 outage 의 의 의 cause.
+- **No rollback**: 매 chaos 의 의 의 의 의 stuck — 매 manual recovery.
+- **No alerting integration**: 매 chaos 의 의 의 의 alert page — 매 oncall fatigue.
+- **One-time**: 매 매 한 번 의 의 의 의 의 — 매 regression 의 의 의 catch X.
+
+## 🧪 검증 / 중복
+- Verified — Netflix Tech Blog (2010-2026); *Chaos Engineering* by Casey Rosenthal (O'Reilly); CNCF Chaos Mesh / LitmusChaos docs.
+- 신뢰도 A.
+
+## 🕓 Changelog
+| 날짜 | 변경 |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — Chaos Mesh / Litmus / FIS examples + 5 principles |