---
id: wiki-2026-0508-system-debugging-protocol
title: System Debugging Protocol
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Debug Workflow, Incident Debugging, SRE Debug Protocol]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [debugging, sre, incident, observability]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: shell/python
  framework: opentelemetry
---

# System Debugging Protocol

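## One-line summary
> **"When a production system misbehaves, run a disciplined loop of hypothesis → measurement → fix."** The protocol draws on Brendan Gregg's USE method and the incident protocol in the Google SRE Workbook; as of 2026, OpenTelemetry plus LLM-assisted analysis is the standard tooling. That discipline is the decisive difference between ad-hoc debugging and a systematic protocol.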

## Core concepts

### Phases
1. **Detect**: alert 또는 user report.
2. **Triage**: severity (P0-P4), scope, blast radius.
3. **Stabilize**: rollback / circuit-break; do not wait for the root cause.
4. **Diagnose**: USE / RED / TSA method.
5. **Fix**: patch + verify.
6. **Post-mortem**: blameless writeup.

### Diagnostic methods
- **USE (Brendan Gregg)**: for every resource, check Utilization / Saturation / Errors.
- **RED**: for every service, track Rate / Errors / Duration.
- **TSA (Thread State Analysis)**: on-CPU / off-CPU profiling of where threads spend their time.
- **4 golden signals (Google)**: latency, traffic, errors, saturation.
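
As a rough illustration of RED, the sketch below derives rate, error ratio, and a duration percentile from a batch of request records. The `Request` record shape, the `red_summary` function, the "5xx counts as an error" rule, and the nearest-rank percentile are all assumptions made for this example; they do not come from any monitoring library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape: one entry per handled request.
    duration_s: float
    status: int


def red_summary(requests: list[Request], window_s: float, pct: float = 99.0) -> dict:
    """Return Rate (req/s), Errors (share of 5xx), Duration (nearest-rank percentile)."""
    if not requests or window_s <= 0:
        return {"rate": 0.0, "error_ratio": 0.0, "duration": 0.0}
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: smallest sample with at least pct% of samples <= it.
    rank = max(1, math.ceil(pct / 100.0 * len(durations)))
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "duration": durations[rank - 1],
    }
```

In practice these numbers usually come from a metrics backend; the function only shows what each letter of RED measures.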

### Applications
1. Web service slowness diagnosis.
2. Memory leak hunt.
3. DB lock investigation.
4. Distributed tracing.

## 💻 Patterns

### USE method checklist
```bash
# CPU
mpstat -P ALL 1
# Memory
free -m; vmstat 1
# Disk
iostat -xz 1
# Network
sar -n DEV 1
# Quick saturation overview (bounded runs, so the chained commands actually terminate)
top -b -n 1 | head -20; vmstat 1 5; iostat -xz 1 5
```
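
Part of the checklist above can be scripted. The following is a minimal sketch of a CPU saturation check using the standard library's `os.getloadavg()` and `os.cpu_count()`. The function names and the threshold of 1.0 load per core are assumptions for illustration, and load average is only a coarse signal (on Linux it also counts tasks in uninterruptible sleep).

```python
import os


def cpu_saturation(load1: float, ncpu: int) -> float:
    """1-minute load per CPU; values above 1.0 suggest runnable work is queuing."""
    if ncpu <= 0:
        raise ValueError("ncpu must be positive")
    return load1 / ncpu


def check_cpu(threshold: float = 1.0) -> str:
    # os.getloadavg() is available on Unix-like systems only.
    load1, _, _ = os.getloadavg()
    ratio = cpu_saturation(load1, os.cpu_count() or 1)
    state = "SATURATED" if ratio > threshold else "ok"
    return f"cpu load/core={ratio:.2f} {state}"
```

Calling `print(check_cpu())` on a Unix host gives a one-line verdict that can be dropped into an incident log before moving on to `mpstat` for per-core detail.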

### Distributed trace via OpenTelemetry
```python
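from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `db` and `charge` are application-specific placeholders, not OpenTelemetry APIs.
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    # The decorator opens the span; tag it so traces can be searched by order.
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Child spans isolate where the time goes.
    with tracer.start_as_current_span("db.fetch"):
        order = db.get(order_id)
    with tracer.start_as_current_span("external.payment"):
        charge(order)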
```

### Bisection (find regression commit)
```bash
git bisect start
git bisect bad HEAD
git bisect good v2.5.0
# Run test at each step
git bisect run pytest tests/regression.py
git bisect reset
```
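
To show why bisection needs only about log2(N) test runs, here is a sketch of the search that `git bisect` performs conceptually. It assumes a linear history where the first commit is good, the last is bad, and every commit after the first bad one is also bad. `first_bad` and `is_bad` are hypothetical names for this example; real `git bisect` also handles branching history and skipped commits.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in an ordered history.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]
```

Here `is_bad` plays the role of the test command passed to `git bisect run`: it must give a reliable verdict for any commit it is handed.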

### Memory leak hunt (heap snapshot diff)
```bash
# Node.js
node --inspect app.js
# Chrome devtools → Memory → Heap snapshot at T0, T1
# Compare → "Comparison" view → grow-only objects

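# Python (tracemalloc), run through a heredoc so the block stays valid shell
python3 - <<'PY'
import tracemalloc
tracemalloc.start()
# ... run workload ...
snap1 = tracemalloc.take_snapshot()
# ... run more ...
snap2 = tracemalloc.take_snapshot()
# Top 10 allocation sites by growth between the two snapshots
for stat in snap2.compare_to(snap1, 'lineno')[:10]:
    print(stat)
PY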
```

### Flame graph
```bash
# Linux
perf record -F 99 -p $PID -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# async-profiler (JVM): the default event is CPU; add `-e wall` for wall-clock time, which includes off-CPU
./profiler.sh -d 30 -f flame.html $PID
```

### SQL slow query (Postgres)
```sql
-- top queries by total execution time (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;

-- active sessions and what they are waiting on (lock waits show wait_event_type = 'Lock')
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY xact_start;

-- explain
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;
```

### Hypothesis-driven debug log
```markdown
## Incident #2026-05-09-01
- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
```
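
A log like the one above is easier to keep consistent when each hypothesis is recorded as structured data. The sketch below is one possible shape, not a prescribed format: `Hypothesis`, `IncidentLog`, and the three verdict strings are assumptions for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    check: str
    verdict: str = "OPEN"  # OPEN, REJECTED, or CONFIRMED
    evidence: str = ""


@dataclass
class IncidentLog:
    incident_id: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def add(self, claim: str, check: str) -> Hypothesis:
        """Register a hypothesis before measuring, so the verdict is filled in later."""
        h = Hypothesis(claim, check)
        self.hypotheses.append(h)
        return h

    def to_markdown(self) -> str:
        lines = [f"## Incident #{self.incident_id}"]
        for i, h in enumerate(self.hypotheses, start=1):
            note = f" ({h.evidence})" if h.evidence else ""
            lines.append(f"- H{i}: {h.claim} → {h.check} → {h.verdict}{note}")
        return "\n".join(lines)
```

Writing the claim and the check down before running the measurement is the point: it keeps the verdict from being rationalized after the fact.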

### Rollback protocol
```bash
# Kubernetes
kubectl rollout undo deployment/api
kubectl rollout status deployment/api

# Feature flag (placeholder endpoint; substitute your flag service's real API call)
curl -X POST $LD_API/flags/$KEY/off
```
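
Besides rollback, the Stabilize phase names circuit-breaking as an option. The following is a minimal in-process sketch of that pattern, assuming a consecutive-failure threshold and a fixed cool-down. `CircuitBreaker`, `CircuitOpenError`, and the parameter defaults are hypothetical; production systems would more likely rely on a service mesh or an established resilience library.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Wrapping a slow dependency as `breaker.call(charge, order)` makes callers fail fast while the circuit is open, which buys time for diagnosis without holding request threads.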

## Decision criteria
| Situation | Approach |
|---|---|
| User-facing impact | Stabilize first (rollback) → diagnose |
| Internal dev env | Diagnose deeply, no rollback urgency |
| Repro available | Local debug + bisect |
| Heisenbug | Production tracing + sampling |
| Memory leak | Heap snapshot diff |
| Latency spike | Distributed trace + flame graph |

**Default**: Stabilize → Hypothesis → Measure → Fix → Post-mortem.

## 🔗 Graph
- Parent: [[SRE]] · [[Observability]]
- Variants: [[Four Golden Signals]]
- Applications: [[Post-Mortem]] · [[Performance_Profiling_and_Memory|Performance Profiling]]
- Adjacent: [[OpenTelemetry]] · [[Flame Graphs]] · [[Distributed Tracing]]

## 🤖 Using LLMs
**When**: log pattern analysis, hypothesis generation, post-mortem drafts.
**When not**: live incident command (requires human judgment).

## ❌ Anti-patterns
- **No stabilization**: the service stays down while debugging continues.
- **Multiple changes at once**: the effect of any single fix cannot be attributed.
- **Skip post-mortem**: the same incident recurs.
- **Blame culture**: has a chilling effect on honest disclosure.

## 🧪 Verification / Duplicates
- Verified (Brendan Gregg USE, Google SRE Book ch.12-14).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Debug protocol full coverage |