"매 production system 의 이상 발생 시 가설 → 측정 → 수정 의 disciplined loop". 매 Brendan Gregg USE method, 매 Google SRE Workbook 의 incident protocol, 매 2026 OpenTelemetry + LLM-assisted 의 standard. 매 ad-hoc 디버깅 vs 체계적 protocol 의 결정적 차이.
Key points
Phases
Detect: alert 또는 user report.
Triage: severity (P0-P4), scope, blast radius.
Stabilize: roll back or circuit-break; do not wait for the root cause.
Diagnose: USE / RED / TSA method.
Fix: patch + verify.
Post-mortem: blameless writeup.
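The phase order above can be made explicit in code. The following is a minimal sketch, not a prescribed tool: the `Phase` enum mirrors the list, while the `Incident` class and its `advance` method are hypothetical names introduced for illustration. It refuses to skip a phase, so diagnosis cannot start before stabilization has at least been recorded (possibly as a no-op when nothing user-facing is affected).

```python
from enum import IntEnum


class Phase(IntEnum):
    """Incident phases in the order listed above."""
    DETECT = 1
    TRIAGE = 2
    STABILIZE = 3
    DIAGNOSE = 4
    FIX = 5
    POST_MORTEM = 6


class Incident:
    """Tracks the current phase and refuses to skip ahead (hypothetical helper)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self, target: Phase) -> None:
        # Only the immediately following phase is allowed, so DIAGNOSE
        # cannot begin before STABILIZE has been recorded.
        if target != self.phase + 1:
            raise ValueError(
                f"cannot go from {self.phase.name} to {target.name}"
            )
        self.phase = target
        self.history.append(target)
```

Calling `advance(Phase.DIAGNOSE)` straight after triage raises `ValueError`, which is the point: the ordering is enforced rather than remembered.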
Diagnostic methods
USE (Brendan Gregg): for every resource, check Utilization, Saturation, and Errors.
RED: for every service, check Rate, Errors, and Duration.
TSA (Thread State Analysis): where threads spend their time, via on-CPU and off-CPU profiling.
4 golden signals (Google): latency, traffic, errors, saturation.
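To make the RED method concrete, here is a small sketch that derives Rate, Errors, and Duration from raw request records. The `Request` type, the `red_summary` function, and the choice of a nearest-rank p99 for Duration are assumptions made for illustration; in practice these numbers usually come from a metrics backend rather than from in-process lists.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float   # time taken to serve the request
    ok: bool            # False when the request ended in an error


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Return Rate, Errors, and Duration (p99) for one observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_s for r in requests)
    n = len(durations)
    # Nearest-rank percentile: rank = ceil(0.99 * n), done in integer math.
    rank = (99 * n + 99) // 100
    return {
        "rate_per_s": n / window_s,
        "error_ratio": sum(not r.ok for r in requests) / n,
        "p99_duration_s": durations[rank - 1],
    }
```

A tail percentile is used for Duration because averages tend to hide the slow requests that users actually notice.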
Applications
Web service slowness diagnosis.
Memory leak hunt.
DB lock investigation.
Distributed tracing.
💻 Patterns
USE method checklist
# CPU
mpstat -P ALL 1
# Memory
free -m; vmstat 1
# Disk
iostat -xz 1
# Network
sar -n DEV 1
# All saturation in one pass (quit each tool with q or Ctrl-C to move to the next)
top; vmstat 1; iostat -xz 1
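Once the numbers from the checklist are collected, the USE pass itself is mechanical. The sketch below assumes the readings have already been extracted by hand or by a collector; `ResourceReading`, `use_findings`, and the 0.8 utilization limit are hypothetical and should be tuned per resource.

```python
from dataclasses import dataclass


@dataclass
class ResourceReading:
    name: str
    utilization: float   # fraction of capacity in use, 0.0 to 1.0
    saturation: float    # queued or waiting work, e.g. run-queue length
    errors: int          # error events seen in the interval


def use_findings(readings, util_limit=0.8):
    """Return (resource, reason) pairs worth investigating.

    util_limit is an assumed threshold, not a universal constant.
    """
    findings = []
    for r in readings:
        # Errors are checked first: they tend to be the quickest to interpret.
        if r.errors > 0:
            findings.append((r.name, "errors"))
        # Any queued work means the resource cannot keep up right now.
        if r.saturation > 0:
            findings.append((r.name, "saturation"))
        if r.utilization >= util_limit:
            findings.append((r.name, "utilization"))
    return findings
```

Each finding is a starting hypothesis to measure further, not a verdict.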
-- top slow queries
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- live locks
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_start;
-- explain
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
Hypothesis-driven debug log
## Incident #2026-05-09-01
- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
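A log like the one above can also be kept as structured data and rendered on demand, so that every hypothesis carries an explicit check and verdict. This is only a sketch: `Hypothesis`, `DebugLog`, and their methods are hypothetical names, and the rendered format simply imitates the example entries.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    check: str
    verdict: str = "OPEN"      # OPEN, CONFIRMED, or REJECTED
    evidence: str = ""


@dataclass
class DebugLog:
    incident_id: str
    hypotheses: list = field(default_factory=list)

    def propose(self, claim, check):
        """Record a hypothesis together with the measurement that tests it."""
        h = Hypothesis(claim, check)
        self.hypotheses.append(h)
        return h

    def resolve(self, h, confirmed, evidence):
        """Close a hypothesis with the evidence that decided it."""
        h.verdict = "CONFIRMED" if confirmed else "REJECTED"
        h.evidence = evidence

    def render(self):
        lines = [f"## Incident #{self.incident_id}"]
        for i, h in enumerate(self.hypotheses, start=1):
            suffix = f" ({h.evidence})" if h.evidence else ""
            lines.append(f"- H{i}: {h.claim} → {h.check} → {h.verdict}{suffix}")
        return "\n".join(lines)
```

Requiring a `check` at proposal time is the design choice that matters: a hypothesis without a planned measurement cannot be rejected.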
Rollback protocol
# Kubernetes
kubectl rollout undo deployment/api
kubectl rollout status deployment/api
# Feature flag (placeholder endpoint; substitute your flag provider's actual API)
curl -X POST "$LD_API/flags/$KEY/off"
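Rollback handles a bad deploy of your own code. When the failing piece is a dependency, the other stabilization tool named in the phases is a circuit breaker. The following is a minimal, single-threaded sketch under assumed defaults (`max_failures`, `reset_after_s`); `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and production services would normally rely on a maintained library or a service-mesh feature instead.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency that is currently circuit-broken."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast: do not spend latency on a known-bad dependency.
                raise CircuitOpenError("dependency is circuit-broken")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Callers would catch `CircuitOpenError` and fall back, for example by queueing the work for a later retry.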
Decision criteria
| Situation | Approach |
|---|---|
| User-facing impact | Stabilize first (rollback), then diagnose |
| Internal dev env | Diagnose deeply; no rollback urgency |
| Repro available | Local debug + bisect |
| Heisenbug | Production tracing + sampling |
| Memory leak | Heap snapshot diff |
| Latency spike | Distributed trace + flame graph |
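As one illustration of the "heap snapshot diff" approach for memory leaks, the sketch below uses Python's standard `tracemalloc` module to compare two snapshots and list the allocation sites that grew. The `top_growth` helper is a hypothetical name, and the list of byte strings merely stands in for a real leak; other runtimes have their own snapshot tooling.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the allocation sites whose live size grew between two snapshots."""
    # compare_to returns StatisticDiff entries, largest absolute change first.
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    leaked = [bytes(1024) for _ in range(10_000)]  # stand-in for a leak
    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
```

Taking the two snapshots around a repeatable workload, rather than at arbitrary times, makes the growing sites easier to attribute.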
Default: Stabilize → Hypothesis → Measure → Fix → Post-mortem.