[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,21 +2,154 @@
 id: wiki-2026-0508-sre
 title: SRE
 category: 10_Wiki/Topics
-status: merged
-redirect_to: 보안_및_시스템_신뢰성_표준
-canonical_id: wiki-2026-0507-039
-aliases: []
+status: verified
+canonical_id: self
+aliases: [Site Reliability Engineering, production engineering]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 0.92
-tags: [uncategorized]
+confidence_score: 0.95
+verification_status: applied
+tags: [sre, reliability, slo, observability, devops]
 raw_sources: []
-last_reinforced: 2026-05-08
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: multi
+  framework: prometheus-grafana-opentelemetry
 ---

-# Redirect
+# SRE

-This document has been merged into the canonical document [[보안_및_시스템_신뢰성_표준]].
-For the latest knowledge and details, see the link above.
+## One-line summary
+> **"Reliability is a feature, and it is the first feature."** SRE (Site Reliability Engineering) is a discipline that originated at Google and applies software engineering to operations. Core practices: define SLOs, enforce error budgets, eliminate toil, and run blameless postmortems.
+
+## Core
+
+### Core SRE concepts
+- **SLI**: the measurement (e.g., 200-OK rate over 5 min).
+- **SLO**: the target (e.g., 99.9% over a 28-day rolling window).
+- **SLA**: the customer contract (with financial penalties).
+- **Error budget**: 100% minus the SLO. When the budget is burned, releases freeze.
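The error-budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original note: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the sketch assumes the SLO is given as a fraction (0.999 for 99.9%).

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the window for a time-based availability SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left for a request-based SLO.

    Returns 1.0 when nothing has been spent and goes negative once overspent.
    """
    if total == 0:
        return 1.0  # no traffic yet, so no budget has been consumed
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 99.9% over a 28-day window leaves roughly 40 minutes of budget.
    print(round(error_budget_minutes(0.999, 28), 2))
    # 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 3))
```

A release-freeze policy like the one described above could then be expressed as a check on `budget_remaining(...) <= 0`, though the exact policy threshold is a team decision.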
+
+### Four golden signals (Google)
+- Latency, Traffic, Errors, Saturation.
+
+### Applications
+1. SLO-driven alerting (multi-window burn rate).
+2. Toil budget (≤50% of SRE time).
+3. Blameless postmortem culture.
+
+## 💻 Patterns
+
+### Prometheus SLO recording rules
+```yaml
+groups:
+  - name: slo.rules
+    interval: 30s
+    rules:
+      - record: api:availability:ratio_rate5m
+        expr: |
+          sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
+          / sum(rate(http_requests_total{job="api"}[5m]))
+      - record: api:availability:ratio_rate1h
+        expr: |
+          sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
+          / sum(rate(http_requests_total{job="api"}[1h]))
+```
+
+### Multi-window multi-burn-rate alert
+```yaml
+- alert: ApiErrorBudgetFastBurn
+  expr: |
+    (1 - api:availability:ratio_rate5m) > (14.4 * 0.001)
+    and
+    (1 - api:availability:ratio_rate1h) > (14.4 * 0.001)
+  for: 2m
+  labels: { severity: page }
+  annotations:
+    summary: "Fast burn: 2% of the error budget consumed in 1h"
+```
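The factor 14.4 in the alert expression corresponds to spending 2% of the error budget in one hour when the SLO window is 30 days (0.02 × 720 h / 1 h). The sketch below shows that derivation; the function names are hypothetical and the 30-day default is an assumption. With the 28-day window mentioned earlier in this note, the same calculation gives 13.44.

```python
def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_days: int = 30,  # assumption: 30-day SLO window
) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio that corresponds to a burn rate for a given SLO fraction."""
    return burn_rate * (1.0 - slo)


if __name__ == "__main__":
    rate = burn_rate_threshold(0.02, 1)                 # about 14.4
    print(round(rate, 2))
    print(round(error_ratio_threshold(0.999, rate), 4)) # about 0.0144
    print(round(burn_rate_threshold(0.02, 1, 28), 2))   # about 13.44 for a 28-day window
```

The second function yields the right-hand side of the alert comparison, `14.4 * 0.001`, for a 99.9% SLO.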
+
+### OpenTelemetry instrumentation (Node)
+```typescript
+import { NodeSDK } from '@opentelemetry/sdk-node';
+import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
+import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
+
+new NodeSDK({
+  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
+  instrumentations: [getNodeAutoInstrumentations()],
+}).start();
+```
+
+### Runbook automation (Python)
+```python
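+from kubernetes import client, config
+
+def notify_slack(message: str) -> None:
+    print(message)  # placeholder (assumption); swap in a real Slack webhook call
+
+def remediate(pod_name: str, ns: str) -> None:
+    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
+    api = client.CoreV1Api()
+    api.delete_namespaced_pod(pod_name, ns)  # the owning controller recreates the pod
+    notify_slack(f"auto-restart {ns}/{pod_name} (high mem)")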
+```
+
+### Postmortem template
+```markdown
+# Incident YYYY-MM-DD: <title>
+**Status**: resolved
+**Impact**: <users affected, $ lost, duration>
+**Severity**: SEV-2
+
+## Timeline (UTC)
+- 14:02 alert fired
+- 14:05 oncall paged
+- 14:18 root cause identified
+- 14:31 mitigated
+
+## Root Cause
+<technical>
+
+## Action Items
+- [ ] (P0) Fix race in checkout-svc — owner: @x
+- [ ] (P1) Add SLO alert for queue depth — owner: @y
+```
+
+### Toil tracking
+```typescript
+type Toil = { repetitive: boolean; manual: boolean; automatable: boolean; ts: Date };
+// dashboard: toil hours / total hours per quarter, target ≤50%
+```
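The dashboard metric in the comment above (toil hours divided by total hours, target ≤50%) could be computed roughly as follows. This is a sketch under assumptions: `WorkEntry`, its `hours` and `is_toil` fields, and both function names are hypothetical, since the `Toil` type in this note does not record hours.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WorkEntry:
    hours: float
    is_toil: bool  # True if the work was repetitive, manual, and automatable


def toil_ratio(entries: Iterable[WorkEntry]) -> float:
    """Toil hours divided by total hours; 0.0 when no hours are logged."""
    entries = list(entries)
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def within_toil_budget(entries: Iterable[WorkEntry], limit: float = 0.5) -> bool:
    """True when the toil ratio is at or below the limit (50% by default)."""
    return toil_ratio(entries) <= limit


if __name__ == "__main__":
    quarter = [WorkEntry(3.0, True), WorkEntry(5.0, False)]
    print(toil_ratio(quarter), within_toil_budget(quarter))
```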
+
+## Decision criteria
+| Scenario | SLO |
+|---|---|
+| user-facing read API | 99.9% availability, p99 <300ms |
+| user-facing write API | 99.95% availability, p99 <500ms |
+| internal batch | 99.5% job completion within window |
+| free-tier feature | 99% (larger error budget = ship faster) |
+
+**Default**: 99.9% availability, multi-burn-rate alerts, weekly error-budget review.
+
+## 🔗 Graph
+- Parent: [[DevOps]] · [[Production Engineering]]
+- Variants: [[Platform Engineering]] · [[DevSecOps]]
+- Applications: [[Observability]] · [[Incident Response]] · [[Chaos Engineering]]
+- Adjacent: [[Prometheus]] · [[OpenTelemetry]] · [[PagerDuty]]
+
+## 🤖 LLM usage
+**When to use**: postmortem drafting from a timeline, log anomaly summarization, runbook generation, on-call question answering.
+**When not to use**: LLM-only auto-remediation. A hallucinated kubectl command can destroy production.
+
+## ❌ Anti-patterns
+- **No SLO**: alert noise, where every blip pages someone.
+- **100% uptime goal**: unattainable, and a zero error budget leaves no room for innovation.
+- **Blame culture**: postmortems turn into finger-pointing, so engineers hide incidents.
+- **Unbounded toil**: SREs burn out and quit within 12 months.
+
+## 🧪 Verification / duplicates
+- Verified (Google SRE Book, SRE Workbook, Prometheus docs, Sloth SLO generator).
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — SLO + burn-rate + OTel patterns |