---
id: wiki-2026-0508-sre
title: SRE
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Site Reliability Engineering, production engineering]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [sre, reliability, slo, observability, devops]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: multi
  framework: prometheus-grafana-opentelemetry
---

# SRE

## In one line

> **"Reliability is a feature, and it is the first feature."**

SRE (Site Reliability Engineering) is a discipline that originated at Google and applies software engineering to operations. Its core practices: define SLOs, enforce error budgets, eliminate toil, and run blameless postmortems.

## Core

### Core SRE concepts

- **SLI**: a measurement (e.g., 200-OK rate over 5min).
- **SLO**: a target for that measurement (e.g., 99.9% over 28d rolling).
- **SLA**: a customer contract (with a $ penalty).
- **Error budget**: 100% minus the SLO (0.1% for a 99.9% SLO). Once the budget is burned, releases freeze.

### Four golden signals (Google)

- Latency, Traffic, Errors, Saturation.

### Applications

1. SLO-driven alerting (multi-window burn rate).
2. Toil budget (≤50% of SRE time).
3. Blameless postmortem culture.
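The error-budget and burn-rate arithmetic behind these concepts can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the burn-rate example assumes a 30-day SLO window with a 1-hour alert window, which is the combination that yields the 14.4 factor commonly used for fast-burn paging on a 99.9% SLO.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage the budget permits over the SLO window."""
    return error_budget(slo) * window_days * 24 * 60


def burn_rate(budget_fraction: float, alert_window_hours: float, window_days: int) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    return budget_fraction * window_days * 24 / alert_window_hours


def is_fast_burn(short_error_ratio: float, long_error_ratio: float,
                 slo: float, rate: float) -> bool:
    """Multi-window check: page only if both windows exceed rate * budget."""
    threshold = rate * error_budget(slo)
    return short_error_ratio > threshold and long_error_ratio > threshold


if __name__ == "__main__":
    # Assumed inputs: 99.9% SLO, 28-day window for downtime, 30-day window for burn rate
    print(allowed_downtime_minutes(0.999, 28))   # roughly 40 minutes
    print(burn_rate(0.02, 1, 30))                # roughly 14.4
    print(is_fast_burn(0.02, 0.016, 0.999, 14.4))
```

Requiring both the short and the long window to exceed the threshold keeps a brief spike from paging while still resetting quickly once the errors stop.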
## 💻 Patterns

### Prometheus SLO recording rules

```yaml
groups:
  - name: slo.rules
    interval: 30s
    rules:
      # Share of non-5xx requests over the short (5m) window
      - record: api:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="api"}[5m]))
      # Same ratio over the long (1h) window
      - record: api:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="api"}[1h]))
```

### Multi-window multi-burn-rate alert

```yaml
# 0.001 is the error budget of a 99.9% SLO; a 14.4x burn uses 2% of a 30-day budget in 1h
- alert: ApiErrorBudgetFastBurn
  expr: |
    (1 - api:availability:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - api:availability:ratio_rate1h) > (14.4 * 0.001)
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Fast burn: 2% of the error budget consumed in 1h"
```

### OpenTelemetry instrumentation (Node)

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Export traces over OTLP/HTTP and auto-instrument supported libraries
new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();
```

### Runbook automation (Python)

```python
from kubernetes import client, config

def notify_slack(message: str) -> None:
    print(message)  # hypothetical placeholder: swap in a real Slack webhook call

def remediate(pod_name: str, ns: str) -> None:
    """Delete the pod so its controller recreates it."""
    config.load_kube_config()  # use config.load_incluster_config() inside a cluster
    client.CoreV1Api().delete_namespaced_pod(pod_name, ns)
    notify_slack(f"auto-restart {ns}/{pod_name} (high mem)")
```

### Postmortem template

```markdown
# Incident YYYY-MM-DD:

**Status**: resolved
**Impact**: <users affected, $ lost, duration>
**Severity**: SEV-2

## Timeline (UTC)
- 14:02 alert fired
- 14:05 oncall paged
- 14:18 root cause identified
- 14:31 mitigated

## Root Cause
<technical>

## Action Items
- [ ] (P0) Fix race in checkout-svc, owner: @x
- [ ] (P1) Add SLO alert for queue depth, owner: @y
```

### Toil tracking

```typescript
// One record per piece of operational work that qualifies as toil
type Toil = { repetitive: boolean; manual: boolean; automatable: boolean; ts: Date };
// dashboard: toil hours / total hours per quarter, target ≤50%
```

## Decision criteria

| Situation | SLO |
|---|---|
| user-facing read API | 99.9% availability, p99 <300ms |
| user-facing write API | 99.95% availability, p99 <500ms |
| internal batch | 99.5% job completion within window |
| free-tier feature | 99% (larger error budget = ship faster) |

**Default**: 99.9% availability, multi-burn-rate alerts, weekly error-budget review.

## 🔗 Graph

- Parent: [[DevOps]] · [[Production Engineering]]
- Variants: [[Platform Engineering]] · [[CI_CD 파이프라인 및 IDE 통합 보안|DevSecOps]]
- Applications: [[Observability]] · [[Chaos Engineering]]
- Adjacent: [[Prometheus]] · [[OpenTelemetry]]

## 🤖 LLM usage

**When**: postmortem drafting from a timeline, log anomaly summarization, runbook generation, oncall question answering.

**When not**: LLM-only auto-remediation. A hallucinated kubectl command can destroy prod.

## ❌ Anti-patterns

- **No SLO**: alert noise, where every blip pages someone.
- **100% uptime goal**: unattainable, and a zero error budget leaves no room for innovation.
- **Blame culture**: postmortems turn into finger-pointing, so engineers hide incidents.
- **Toil unbounded**: SREs burn out and quit within 12 months.

## 🧪 Verification / duplicates

- Verified (Google SRE Book, SRE Workbook, Prometheus docs, Sloth SLO generator).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SLO + burn-rate + OTel patterns |