id: wiki-2026-0508-sre
title: SRE
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Site Reliability Engineering, production engineering
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: sre, reliability, slo, observability, devops
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: multi, framework: prometheus-grafana-opentelemetry

SRE

In one line

"Reliability is a feature, and it is the first feature." SRE (Site Reliability Engineering) is a Google-originated discipline that applies software engineering to operations. Core practices: define SLOs, enforce error budgets, eliminate toil, and run blameless postmortems.

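Core

Core SRE concepts

  • SLI: the measurement (e.g., share of requests returning 200 OK over 5 min).
  • SLO: the target for that measurement (e.g., 99.9% over a 28-day rolling window).
  • SLA: the customer contract (with a financial penalty when it is missed).
  • Error budget: 100% - SLO (a 99.9% SLO leaves a 0.1% budget). When the budget is burned, freeze releases.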
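
The error-budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation; the function names `error_budget` and `budget_remaining` are hypothetical, and the numbers reuse the 99.9% / 28-day example from the list.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for a time-based availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Unspent fraction of a request-based error budget (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# A 99.9% SLO over 28 days leaves roughly 40 minutes of budget.
budget = error_budget(0.999, timedelta(days=28))
print(f"budget: {budget.total_seconds() / 60:.1f} min")

# 500 failed requests out of 1,000,000 spends half of a 99.9% budget.
print(f"remaining: {budget_remaining(0.999, good=999_500, total=1_000_000):.2f}")
```

A release-freeze policy could then be a simple check that `budget_remaining(...)` is at or below zero, though the exact threshold is a team decision.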

Four golden signals (Google)

  • Latency, Traffic, Errors, Saturation.
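
As a rough sketch of what each signal measures, the snippet below derives all four from a window of request samples. The `Request` record, the `golden_signals` helper, and the use of CPU as the saturation resource are assumptions made for illustration; in practice these values usually come from a metrics backend rather than raw samples.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # server-side latency of one request
    status: int         # HTTP status code


def golden_signals(requests: list[Request], window_s: float,
                   cpu_used: float, cpu_limit: float) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    p99_index = (99 * n + 99) // 100 - 1  # nearest-rank p99 using integer math
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": durations[p99_index],
        "traffic_rps": n / window_s,
        "error_ratio": errors / n,
        "saturation": cpu_used / cpu_limit,  # assumed resource: CPU
    }


sample = [Request(float(i), 500 if i <= 2 else 200) for i in range(1, 101)]
print(golden_signals(sample, window_s=10.0, cpu_used=1.5, cpu_limit=3.0))
```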

Applications

  1. SLO-driven alerting (multi-window burn rate).
  2. Toil budget (≤50% of SRE time).
  3. Blameless postmortem culture.

💻 Patterns

Prometheus SLO recording rules

groups:
  - name: slo.rules
    interval: 30s
    rules:
      - record: api:availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[5m]))
          / sum(rate(http_requests_total{job="api"}[5m]))
      - record: api:availability:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="api",code!~"5.."}[1h]))
          / sum(rate(http_requests_total{job="api"}[1h]))

Multi-window multi-burn-rate alert

- alert: ApiErrorBudgetFastBurn
  expr: |
    (1 - api:availability:ratio_rate5m) > (14.4 * 0.001)
    and
    (1 - api:availability:ratio_rate1h) > (14.4 * 0.001)
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Fast burn: 2% of the error budget consumed in 1h"
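
The `14.4 * 0.001` threshold in the alert can be reproduced with a short calculation. The sketch below assumes a 30-day SLO period, which is where 14.4 comes from (2% of the budget in 1 hour); with a 28-day window the same burn rate spends about 2.1% per hour. The helper names are hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_h: float,
                        slo_period_h: float) -> float:
    """Burn rate at which budget_fraction of the budget is spent in alert_window_h."""
    return budget_fraction * slo_period_h / alert_window_h


def should_page(err_short: float, err_long: float, slo: float,
                burn_rate: float) -> bool:
    """Multi-window check: both the short and long window must exceed the threshold."""
    threshold = burn_rate * (1.0 - slo)
    return err_short > threshold and err_long > threshold


rate = burn_rate_threshold(0.02, alert_window_h=1, slo_period_h=30 * 24)
print(f"burn rate: {rate:.1f}")  # 14.4 for 2% in 1h over 30 days

# A 2% error ratio in the 5m window and 1.6% in the 1h window would page.
print(should_page(err_short=0.02, err_long=0.016, slo=0.999, burn_rate=rate))
```

Requiring both windows keeps the alert from firing on a short spike that has already ended, while the short window lets it clear quickly once the error rate recovers.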

OpenTelemetry instrumentation (Node)

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_ENDPOINT }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();

Runbook automation (Python)

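from kubernetes import client, config

def notify_slack(msg: str) -> None:  # hypothetical notifier; swap in a real webhook call
    print(msg)

def remediate(pod_name: str, ns: str) -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running inside a pod
    api = client.CoreV1Api()
    api.delete_namespaced_pod(pod_name, ns)  # the owning controller recreates the pod
    notify_slack(f"auto-restart {ns}/{pod_name} (high mem)")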

Postmortem template

# Incident YYYY-MM-DD: <title>
**Status**: resolved
**Impact**: <users affected, $ lost, duration>
**Severity**: SEV-2

## Timeline (UTC)
- 14:02 alert fired
- 14:05 oncall paged
- 14:18 root cause identified
- 14:31 mitigated

## Root Cause
<technical>

## Action Items
- [ ] (P0) Fix race in checkout-svc — owner: @x
- [ ] (P1) Add SLO alert for queue depth — owner: @y

Toil tracking

type Toil = { repetitive: boolean; manual: boolean; automatable: boolean; ts: Date };
// dashboard: toil hours / total hours per quarter, target ≤50%
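
The dashboard ratio in the comment above could be computed along these lines. This is a sketch under assumptions: the `hours` field is added for illustration, and counting an item as toil only when all three flags are set is one possible reading of the type, not a rule stated in the source.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    hours: float
    repetitive: bool
    manual: bool
    automatable: bool


def is_toil(item: WorkItem) -> bool:
    # Assumption: an item counts as toil only if all three flags are set.
    return item.repetitive and item.manual and item.automatable


def toil_ratio(items: list[WorkItem]) -> float:
    """Toil hours divided by total hours; 0.0 when nothing was logged."""
    total = sum(i.hours for i in items)
    if total == 0:
        return 0.0
    return sum(i.hours for i in items if is_toil(i)) / total


quarter = [
    WorkItem(hours=10, repetitive=True, manual=True, automatable=True),
    WorkItem(hours=30, repetitive=False, manual=False, automatable=False),
]
ratio = toil_ratio(quarter)
print(f"toil: {ratio:.0%}, within target: {ratio <= 0.5}")
```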

Decision criteria

| Situation | SLO |
|---|---|
| user-facing read API | 99.9% availability, p99 <300ms |
| user-facing write API | 99.95% availability, p99 <500ms |
| internal batch | 99.5% job completion within window |
| free-tier feature | 99% (lower target = larger error budget = ship faster) |

Default: 99.9% availability, multi-burn-rate alerts, weekly error-budget review.

🔗 Graph

🤖 LLM usage

When to use: postmortem drafting from a timeline, log anomaly summarization, runbook generation, on-call question answering. When not to use: LLM-only auto-remediation, where a hallucinated kubectl command can destroy prod.
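
One way to keep an LLM out of the LLM-only remediation path is to gate whatever command it proposes. The sketch below is illustrative only: the `classify` function, its three outcomes, and the read-only verb list are assumptions, and a real gate would need a stricter parser and an audit trail.

```python
import shlex

# Assumption: these kubectl verbs are treated as read-only and safe to auto-run.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
SHELL_METACHARS = set(";|&$`<>")


def classify(command: str) -> str:
    """Return 'auto', 'needs_human', or 'reject' for an LLM-proposed command."""
    if any(ch in SHELL_METACHARS for ch in command):
        return "reject"  # no chaining, piping, or substitution
    try:
        parts = shlex.split(command)
    except ValueError:
        return "reject"  # unbalanced quotes
    if len(parts) < 2 or parts[0] != "kubectl":
        return "reject"
    if parts[1] in READ_ONLY_VERBS:
        return "auto"
    return "needs_human"  # any mutating verb waits for a human approval


for cmd in ("kubectl get pods -n prod", "kubectl delete ns prod", "rm -rf /"):
    print(cmd, "->", classify(cmd))
```

The check is deliberately conservative: a command with flags before the verb falls through to `needs_human` rather than being parsed more cleverly.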

Anti-patterns

  • No SLO: alert noise, because every blip pages someone.
  • 100% uptime goal: unattainable, and a zero error budget leaves no room for innovation.
  • Blame culture: postmortems turn into finger-pointing, so engineers hide incidents.
  • Unbounded toil: SREs burn out and quit within 12 months.

🧪 Verification / duplicates

  • Verified (Google SRE Book, SRE Workbook, Prometheus docs, Sloth SLO generator).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SLO + burn-rate + OTel patterns |