[G1-Sync] Manual knowledge update
@@ -2,21 +2,154 @@
id: wiki-2026-0508-sre
title: SRE
category: 10_Wiki/Topics
status: merged
redirect_to: 보안_및_시스템_신뢰성_표준
canonical_id: wiki-2026-0507-039
aliases: []
status: verified
canonical_id: self
aliases: [Site Reliability Engineering, production engineering]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
tags: [uncategorized]
confidence_score: 0.95
verification_status: applied
tags: [sre, reliability, slo, observability, devops]
raw_sources: []
last_reinforced: 2026-05-08
last_reinforced: 2026-05-10
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: multi
  framework: prometheus-grafana-opentelemetry
---

# Redirect
# SRE

This document has been merged into the canonical document [[보안_및_시스템_신뢰성_표준]].
For all current knowledge and details, see the link above.

## One-line summary
> **"Reliability is a feature, and it is the first feature."** SRE (Site Reliability Engineering) is a discipline that originated at Google and applies software engineering to operations. The core: define SLOs, enforce error budgets, eliminate toil, and run blameless postmortems.

## Core

### Core SRE concepts
- **SLI**: the measurement (e.g., 200-OK rate over 5 min).
- **SLO**: the target (e.g., 99.9% over a 28-day rolling window).
- **SLA**: the customer contract (with a financial penalty).
- **Error budget**: 100% minus the SLO. When the budget is burned, releases freeze.
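
The error-budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not part of any library: the function names are hypothetical, and it assumes the SLO is expressed as a ratio (0.999 for 99.9%) and that the SLI is a simple good/total event count.

```python
def error_budget(slo: float) -> float:
    """Fraction of events (or time) allowed to fail: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage the budget permits over the SLO window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Share of the error budget still unspent for the given event counts.

    Returns 1.0 when nothing has been spent and goes negative once the
    budget is exhausted.
    """
    if total == 0:
        return 1.0
    allowed_bad = error_budget(slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad
```

For a 99.9% SLO over 28 days, `allowed_downtime_minutes(0.999, 28)` comes to roughly 40 minutes of full outage. A release-freeze policy could be keyed to `budget_remaining` reaching zero.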

### Four golden signals (Google)
- Latency, Traffic, Errors, Saturation.

### Applications
1. SLO-driven alerting (multi-window burn rate).
2. Toil budget (≤50% of SRE time).
3. Blameless postmortem culture.

## 💻 Patterns

### Prometheus SLO recording rules
```yaml
groups:
  - name: slo.rules
    interval: 30s
    rules:
      # Share of non-5xx requests over a short (5m) window.
      - record: api:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m]))
      # Same ratio over a long (1h) window.
      - record: api:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api"}[1h]))
```

### Multi-window multi-burn-rate alert
```yaml
# 14.4 = burn rate, 0.001 = error budget of a 99.9% SLO.
# Both the short and the long window must exceed the threshold.
- alert: ApiErrorBudgetFastBurn
  expr: |
    (1 - api:availability:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - api:availability:ratio_rate1h) > (14.4 * 0.001)
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Fast burn: 2% of the error budget consumed in 1h"
```
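
The `14.4 * 0.001` threshold in the alert can be derived rather than memorized. The sketch below is an illustration with hypothetical function names. It assumes a 30-day SLO period, which is the period for which "2% of the budget in 1h" yields a burn rate of 14.4; a 28-day period would give 13.44 instead.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_period_days: float = 30) -> float:
    """Burn rate at which `budget_fraction` of the error budget is spent
    within `alert_window_hours`, for an SLO measured over `slo_period_days`."""
    return budget_fraction * (slo_period_days * 24) / alert_window_hours


def error_rate_threshold(slo: float, budget_fraction: float,
                         alert_window_hours: float,
                         slo_period_days: float = 30) -> float:
    """Error ratio above which the alert condition holds."""
    rate = burn_rate(budget_fraction, alert_window_hours, slo_period_days)
    return rate * (1.0 - slo)


def should_page(short_window_error: float, long_window_error: float,
                threshold: float) -> bool:
    """Multi-window check: both windows must exceed the threshold,
    mirroring the `and` in the alert expression."""
    return short_window_error > threshold and long_window_error > threshold
```

With these definitions, `burn_rate(0.02, 1)` is 14.4 and `error_rate_threshold(0.999, 0.02, 1)` is about 0.0144, matching the alert expression. Requiring the short window as well means the alert stops firing soon after the error rate recovers.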

### OpenTelemetry instrumentation (Node)
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// OTEL_ENDPOINT is this project's own variable; the HTTP exporter expects
// the full traces URL, e.g. http://localhost:4318/v1/traces.
new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();
```
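
### Runbook automation (Python)
```python
from kubernetes import client, config


def notify_slack(message: str) -> None:
    print(message)  # placeholder: replace with a real Slack webhook call


def remediate(pod_name: str, ns: str) -> None:
    config.load_kube_config()  # use load_incluster_config() inside a cluster
    api = client.CoreV1Api()
    api.delete_namespaced_pod(pod_name, ns)  # the controller recreates the pod
    notify_slack(f"auto-restart {ns}/{pod_name} (high mem)")
```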

### Postmortem template
```markdown
# Incident YYYY-MM-DD: <title>
**Status**: resolved
**Impact**: <users affected, $ lost, duration>
**Severity**: SEV-2

## Timeline (UTC)
- 14:02 alert fired
- 14:05 oncall paged
- 14:18 root cause identified
- 14:31 mitigated

## Root Cause
<technical>

## Action Items
- [ ] (P0) Fix race in checkout-svc — owner: @x
- [ ] (P1) Add SLO alert for queue depth — owner: @y
```

### Toil tracking
```typescript
type Toil = { repetitive: boolean; manual: boolean; automatable: boolean; ts: Date };
// dashboard: toil hours / total hours per quarter, target ≤50%
```
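
The dashboard ratio in the comment above (toil hours over total hours) could be computed along these lines. This is a sketch under assumptions: the `hours` field does not appear in the original `Toil` type and is added here for illustration, and an entry is counted as toil only when all three flags are set, which is one possible reading of the type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkEntry:
    hours: float        # hypothetical field, not in the original Toil type
    repetitive: bool
    manual: bool
    automatable: bool


def is_toil(entry: WorkEntry) -> bool:
    """Assumed rule: work is toil when it is repetitive, manual, and automatable."""
    return entry.repetitive and entry.manual and entry.automatable


def toil_ratio(entries: list[WorkEntry]) -> float:
    """Toil hours divided by total hours; 0.0 when no hours are logged."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if is_toil(e))
    return toil / total


def within_toil_budget(entries: list[WorkEntry], limit: float = 0.5) -> bool:
    """True when the toil share stays at or below the limit (50% by default)."""
    return toil_ratio(entries) <= limit
```

Feeding one quarter of entries into `within_toil_budget` gives a single yes/no signal against the ≤50% target.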

## Decision criteria
| Situation | SLO |
|---|---|
| user-facing read API | 99.9% availability, p99 <300ms |
| user-facing write API | 99.95% availability, p99 <500ms |
| internal batch | 99.5% job completion within window |
| free-tier feature | 99% (looser target = larger error budget = ship faster) |

**Default**: 99.9% availability, multi-burn-rate alerts, weekly error-budget review.
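
To make the table's targets concrete, the sketch below checks measured data against an availability plus p99-latency SLO such as the read-API row. The `Slo` type and function names are hypothetical, and the percentile uses the nearest-rank method; a metrics backend may compute quantiles differently (for example from histogram buckets).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    availability: float  # target ratio, e.g. 0.999
    p99_ms: float        # p99 latency must stay strictly below this


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def meets_slo(slo: Slo, good: int, total: int, latencies_ms: list[float]) -> bool:
    """True when both the availability and the latency targets hold."""
    availability = good / total if total else 1.0
    return availability >= slo.availability and p99(latencies_ms) < slo.p99_ms


READ_API = Slo(availability=0.999, p99_ms=300)
WRITE_API = Slo(availability=0.9995, p99_ms=500)
```

For example, `meets_slo(READ_API, 9995, 10000, samples)` passes on availability (99.95%) and then depends on whether the p99 of `samples` is under 300 ms.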

## 🔗 Graph
- Parent: [[DevOps]] · [[Production Engineering]]
- Variants: [[Platform Engineering]] · [[DevSecOps]]
- Applications: [[Observability]] · [[Incident Response]] · [[Chaos Engineering]]
- Adjacent: [[Prometheus]] · [[OpenTelemetry]] · [[PagerDuty]]

## 🤖 LLM use
**When**: postmortem drafting from a timeline, log anomaly summarization, runbook generation, on-call question answering.
**When not**: LLM-only auto-remediation; a hallucinated kubectl command can destroy prod.

## ❌ Anti-patterns
- **No SLO**: alert noise; every blip pages someone.
- **100% uptime goal**: unattainable, and a zero error budget leaves no room for innovation.
- **Blame culture**: postmortems turn into finger-pointing, so engineers hide incidents.
- **Unbounded toil**: SREs burn out and quit within 12 months.

## 🧪 Verification / duplicates
- Verified (Google SRE Book, SRE Workbook, Prometheus docs, Sloth SLO generator).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SLO + burn-rate + OTel patterns |