Intentional Failure Induction

One-liner

"Every system pretends it cannot break until we break it." Intentional failure injection, which began with Netflix's Chaos Monkey (2011), has evolved into a core SRE tool: controlled failures in production expose hidden coupling and fragile assumptions.

Core

Principles (Principles of Chaos)

Hypothesis first: state an explicit hypothesis such as "the SLO holds even if X dies".
Production-like: prefer production (or a production mirror) over staging.
Minimize blast radius: start small → expand.
Automate: continuous chaos, not a one-off exercise.
Stop button: the experiment must be abortable immediately.

The 4 steps

Define steady state: SLI/SLO baseline (latency p99, error rate).
Hypothesis: steady state is maintained under failure X.
Inject: inject the actual failure.
Verify / Learn: test the hypothesis and review the findings in a retrospective.
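
The four steps can be sketched as a small harness. This is a minimal sketch under stated assumptions: `Experiment`, `within_steady_state`, and `run_experiment` are hypothetical names, and the `read_sli`, `inject`, and `rollback` callables stand in for a real monitoring query and a real chaos tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    hypothesis: str
    steady_state: Dict[str, float]            # metric name -> maximum allowed value
    read_sli: Callable[[], Dict[str, float]]  # stand-in for a monitoring query
    inject: Callable[[], None]                # stand-in for the fault injection
    rollback: Callable[[], None]              # the stop button


def within_steady_state(sli: Dict[str, float], limits: Dict[str, float]) -> bool:
    """True when every tracked metric is at or below its limit."""
    return all(sli.get(name, float("inf")) <= limit for name, limit in limits.items())


def run_experiment(exp: Experiment) -> dict:
    # Step 1: confirm steady state before touching anything.
    baseline = exp.read_sli()
    if not within_steady_state(baseline, exp.steady_state):
        return {"status": "skipped", "reason": "not in steady state", "baseline": baseline}

    # Step 2 is the hypothesis carried by `exp`; steps 3 and 4 follow.
    try:
        exp.inject()               # Step 3: inject the fault
        observed = exp.read_sli()  # Step 4: measure while the fault is active
    finally:
        exp.rollback()             # always undo the fault, even on error

    held = within_steady_state(observed, exp.steady_state)
    return {
        "status": "confirmed" if held else "refuted",
        "hypothesis": exp.hypothesis,
        "baseline": baseline,
        "observed": observed,
    }
```

A refuted hypothesis is still a useful result: the returned `baseline` and `observed` values are the raw material for the retrospective.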

Tool ecosystem (2026)

Chaos Mesh (CNCF graduated): Kubernetes-native.
LitmusChaos (CNCF): GitOps-friendly.
Gremlin: commercial SaaS.
AWS FIS (Fault Injection Service, formerly Fault Injection Simulator): AWS-native.
Azure Chaos Studio: Azure-native.
Steadybit: integrated platform with OTel integration.

Applications

Verifying cell isolation in a cell-based architecture.
Multi-region failover drills (Game Day).
Verifying application behavior under database replica lag.
Robustness of ML pipelines to missing or delayed data.

💻 Patterns

1. Chaos Mesh — Pod kill

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-checkout
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [shop]
    labelSelectors:
      app: checkout-service
  # One-shot kill. Chaos Mesh 2.x dropped the inline `scheduler` field;
  # recurring runs go through a separate Schedule resource.

2. Chaos Mesh — Network latency

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: db-latency
spec:
  action: delay
  mode: all
  selector:
    labelSelectors: { app: postgres }
  delay:
    latency: "200ms"
    correlation: "25"
    jitter: "50ms"
  duration: "5m"

3. AWS FIS — EC2 termination

{
  "description": "Terminate 1 EC2 in prod ASG",
  "targets": {
    "PaymentInstances": {
      "resourceType": "aws:ec2:instance",
      "selectionMode": "COUNT(1)",
      "resourceTags": { "Service": "payment", "Env": "prod" }
    }
  },
  "actions": {
    "TerminateOne": {
      "actionId": "aws:ec2:terminate-instances",
      "targets": { "Instances": "PaymentInstances" }
    }
  },
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "arn:...:high-error-rate" }
  ]
}

4. Application-level fault injection (Go)

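package middleware

import (
    "math/rand"
    "net/http"
    "time"
)

// ChaosMiddleware is a custom middleware (no external library required) that
// delays and then fails a small random fraction of requests.
func ChaosMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if rand.Float64() < 0.001 { // fail about 0.1% of requests
            time.Sleep(2 * time.Second) // simulate a slow dependency first
            http.Error(w, "chaos", http.StatusInternalServerError)
            return
        }
        next.ServeHTTP(w, r)
    })
}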

5. Toxiproxy - TCP-level latency and bandwidth limits

# proxy the DB through toxiproxy
toxiproxy-cli create -l 0.0.0.0:5433 -u postgres:5432 db
toxiproxy-cli toxic add -t latency -a latency=500 -a jitter=100 db  # 500 ms +/- 100 ms
toxiproxy-cli toxic add -t bandwidth -a rate=1024 db  # rate is in KB/s, so about 1 MB/s

6. Game Day runbook (Markdown)

## Game Day: Region us-east-1 outage
- Hypothesis: failover to us-west-2 within 90s, no data loss.
- Steady state: p99 < 300ms, 5xx < 0.1%.
- Inject: aws fis start-experiment --experiment-template-id EXPxxx
- Stop condition: 5xx > 1% for > 60s.
- Owner: @sre-oncall · Comms: #incident-game-day
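
A stop condition such as "5xx > 1% for > 60s" needs a timer as well as a threshold, so that a single noisy sample does not abort the drill. The sketch below is one way to express that, assuming samples arrive with a timestamp in seconds; `SustainedBreach` and its parameters are hypothetical names, not part of AWS FIS or any other tool.

```python
from typing import Optional


class SustainedBreach:
    """Trips once a value stays above `threshold` for longer than `hold_seconds`."""

    def __init__(self, threshold: float, hold_seconds: float):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._breach_started: Optional[float] = None  # time of first breaching sample

    def observe(self, value: float, now: float) -> bool:
        """Feed one sample; return True when the experiment should stop."""
        if value <= self.threshold:
            self._breach_started = None  # recovered: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now   # first sample above the threshold
        return (now - self._breach_started) > self.hold_seconds
```

With `SustainedBreach(threshold=0.01, hold_seconds=60)`, a 2% error ratio observed at t=0 and t=30 does not trip the condition, while the same ratio still present at t=61 does. Any sample at or below 1% in between resets the timer.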

7. Continuous chaos (cron)

apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata: { name: weekly-chaos }
spec:
  schedule: "0 14 * * MON" # every Monday at 14:00
  type: PodChaos
  podChaos:
    action: pod-failure
    mode: random-max-percent
    value: "10"
    selector: { labelSelectors: { tier: stateless } }
    duration: "30s"

8. SLO-aware halt (Go + Prometheus)

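// promQuery is a hypothetical helper that runs an instant PromQL query and returns a scalar.
func shouldHaltChaos(ctx context.Context) bool {
    val, err := promQuery(ctx, `sum(rate(http_5xx[1m])) / sum(rate(http_total[1m]))`)
    if err != nil {
        return true // fail safe: halt when the SLI cannot be read
    }
    return val > 0.01 // halt immediately once the 5xx ratio exceeds 1%
}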

9. ML pipeline chaos (Python)

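import random


class FeatureMissing(KeyError):
    """Raised when a feature lookup is deliberately failed."""


# Randomly inject missing features to test the caller's fallback path.
class ChaosFeatureStore:
    def __init__(self, real, p=0.001):
        self.real, self.p = real, p  # wrapped store, failure probability

    def get(self, key):
        if random.random() < self.p:
            raise FeatureMissing(key)
        return self.real.get(key)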
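
Injecting missing features is only informative if the consumer has a fallback to exercise. The following self-contained sketch shows one possible consumer side, assuming per-feature default values are acceptable for the model. `FlakyStore`, `features_with_fallback`, and the `defaults` mapping are hypothetical, and `FeatureMissing` is redefined here so the block runs on its own.

```python
import random


class FeatureMissing(KeyError):
    """Hypothetical error type for a feature that could not be fetched."""


class FlakyStore:
    """Stand-in store that fails a fraction `p` of lookups."""

    def __init__(self, data, p, rng=None):
        self.data, self.p = data, p
        self.rng = rng or random.Random()

    def get(self, key):
        if self.rng.random() < self.p:
            raise FeatureMissing(key)
        return self.data[key]


def features_with_fallback(store, keys, defaults):
    """Fetch each feature, substituting a default when the lookup fails.

    Returns the feature dict plus the list of keys that fell back, so the
    fallback rate can be reported as a metric during the experiment.
    """
    values, fell_back = {}, []
    for key in keys:
        try:
            values[key] = store.get(key)
        except FeatureMissing:
            values[key] = defaults[key]
            fell_back.append(key)
    return values, fell_back
```

Returning the fallen-back keys keeps the degradation visible: a pipeline that silently substitutes defaults passes the chaos test while hiding how often it ran on degraded inputs.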

10. eBPF-based syscall fault (chaos-mesh KernelChaos)

apiVersion: chaos-mesh.org/v1alpha1
kind: KernelChaos
metadata:
  name: fail-mount-syscall # hypothetical name
spec:
  mode: one
  selector: { labelSelectors: { app: api } }
  failKernRequest:
    callchain:
      - { funcname: "__x64_sys_mount" }
    failtype: 0
    headers: ["linux/mount.h"]
    probability: 100
    times: 1

Decision criteria

Situation	Approach
New service	Start in staging → move to prod gradually
Many stateless services	Start with pod kill / network chaos
Stateful (DB)	Start with replica failover / disk fill
Multi-region	Quarterly Game Day
Regulated industry	Tabletop → isolated cell → prod

Default: a hypothesis plus an SLO halt condition, starting at a 1% blast radius and widening in stages (10% → 50% → 100%) as each stage passes.
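
The staged expansion can be sketched as a loop that refuses to widen after a failed stage. This is a minimal sketch: `targets_for`, `expand_blast_radius`, and the `run_stage` callable are hypothetical, with `run_stage` standing in for "inject into this share of targets and report whether the hypothesis held and no halt condition fired".

```python
from typing import Callable, Iterable, List, Tuple


def targets_for(total: int, percent: int) -> int:
    """Number of targets in a stage: percent of total, rounded up, at least 1."""
    return max(1, (total * percent + 99) // 100)


def expand_blast_radius(
    run_stage: Callable[[int], bool],
    stages: Iterable[int] = (1, 10, 50, 100),
) -> Tuple[bool, List[int]]:
    """Run stages in order and stop at the first stage that fails.

    Returns (all_passed, list_of_passed_percentages).
    """
    passed: List[int] = []
    for percent in stages:
        if not run_stage(percent):
            return False, passed  # do not widen after a failure
        passed.append(percent)
    return True, passed
```

For a fleet of 200 targets, `targets_for(200, 1)` gives 2 and `targets_for(200, 50)` gives 100. Rounding up with a floor of one keeps the 1% stage meaningful on small fleets.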

🔗 Graph

Parent: SRE · Resilience-Engineering
Variant: Game-Day
Application: Cell-Architecture
Adjacent: Observability · SLO · Postmortem

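🤖 LLM usage

When: hypothesis and experiment design, runbook drafts, generating FIS/Chaos Mesh YAML, deriving action items from result analysis.
When not: running the actual prod injection. A human must review and approve it; automatic execution by an LLM is prohibited.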

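❌ Anti-patterns

Running without a steady state: there is no way to measure what broke.
Unattended chaos without a halt condition: it escalates into a real incident.
Staging only: coupling that exists only in prod is never found.
Experiments without a retrospective: zero learning. A cycle is not complete until the results are written up.
Too large a first experiment: starting with an entire region down is simply a real outage.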
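
Four of these anti-patterns can be caught mechanically before an experiment runs (the staging-only one is a program-level choice, not a per-experiment property). The pre-flight check below is a minimal sketch; the spec layout and every field name (`steady_state`, `halt_condition`, `review_owner`, `first_run`, `blast_radius_percent`) are hypothetical.

```python
from typing import Dict, List


def preflight(spec: Dict) -> List[str]:
    """Return a list of problems; an empty list means the spec may proceed."""
    problems: List[str] = []
    if not spec.get("steady_state"):
        problems.append("no steady state defined")
    if not spec.get("halt_condition"):
        problems.append("no halt condition")
    if not spec.get("review_owner"):
        problems.append("no owner for the retrospective")
    # A missing blast radius is treated as 100%, the worst case.
    if spec.get("first_run") and spec.get("blast_radius_percent", 100) > 1:
        problems.append("first run exceeds 1% blast radius")
    return problems
```

Such a check fits naturally in CI or in the review step before a human approves the injection.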

🧪 Verification / duplicates

Verified (Principles of Chaos Engineering, Netflix Tech Blog, Chaos Mesh docs 2026).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: Chaos Mesh/AWS FIS patterns + Game Day runbook
