id: wiki-2026-0508-system-debugging-protocol
title: System Debugging Protocol
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Debug Workflow, Incident Debugging, SRE Debug Protocol
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: debugging, sre, incident, observability
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language shell/python, framework opentelemetry

System Debugging Protocol

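One-line summary

"A disciplined loop of hypothesis → measurement → fix, applied whenever a production system misbehaves." It builds on Brendan Gregg's USE method and the incident protocol in the Google SRE Workbook, and as of 2026 it is typically practiced with OpenTelemetry instrumentation and LLM assistance. This loop is the decisive difference between a systematic protocol and ad-hoc debugging.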

Core

Phases

  1. Detect: alert 또는 user report.
  2. Triage: severity (P0-P4), scope, blast radius.
  3. Stabilize: roll back or circuit-break; do not wait for the root cause.
  4. Diagnose: USE / RED / TSA method.
  5. Fix: patch + verify.
  6. Post-mortem: blameless writeup.
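
The triage step can be made repeatable by writing the severity rule down. The following is a minimal sketch; the `Impact` fields and every threshold are hypothetical and would need to match a team's own severity definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    users_affected_pct: float  # share of users seeing the problem, 0-100
    revenue_path: bool         # True if checkout or payment style flows are hit
    data_loss: bool            # True if data is being lost or corrupted


def triage(impact: Impact) -> str:
    """Map an impact estimate to a P0-P4 severity (illustrative thresholds)."""
    if impact.data_loss or impact.users_affected_pct >= 50:
        return "P0"
    if impact.revenue_path or impact.users_affected_pct >= 10:
        return "P1"
    if impact.users_affected_pct >= 1:
        return "P2"
    if impact.users_affected_pct > 0:
        return "P3"
    return "P4"
```

Encoding the rule keeps the severity call consistent between responders; scope and blast radius still need a human estimate as input.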

Diagnostic methods

  • USE (Brendan Gregg): Utilization / Saturation / Errors, checked for every resource.
  • RED: Rate / Errors / Duration, checked for every service.
  • TSA (Thread State Analysis): where threads spend their time, on-CPU versus off-CPU.
  • 4 golden signals (Google): latency, traffic, errors, saturation.
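
As an illustration of the RED method, the sketch below computes rate, error ratio, and duration percentiles for one service from a window of request records. The `Request` record, the choice of HTTP 5xx as "error", and the nearest-rank percentile are assumptions made for the example, not part of any particular metrics library.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile of an ascending list; p is in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, Duration for one service over one time window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p50_ms": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }
```

Reporting a tail percentile next to the median matters here: a latency incident often leaves the median untouched while p99 degrades.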

Applications

  1. Web service slowness diagnosis.
  2. Memory leak hunt.
  3. DB lock investigation.
  4. Distributed tracing.

💻 Patterns

USE method checklist

# CPU
mpstat -P ALL 1
# Memory
free -m; vmstat 1
# Disk
iostat -xz 1
# Network
sar -n DEV 1
# Saturation at a glance (bounded runs so each command exits and the next one starts)
top -b -n 1 | head -n 15; vmstat 1 5; iostat -xz 1 5
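
When the checklist needs to run from a script rather than a terminal, the "U" for CPU can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. This is a Linux-only sketch covering utilization only; the function names are illustrative, and saturation (run queue) and errors still need their own checks.

```python
import time


def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user nice system idle iowait irq softirq steal.
    # Guest time is already counted inside user/nice, so only these eight are summed.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + values[4]  # idle + iowait
    total = sum(values)
    return total - idle, total


def cpu_utilization(before: str, after: str) -> float:
    """Fraction of CPU time spent busy between two 'cpu' line samples."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def sample_cpu_utilization(interval_s: float = 1.0) -> float:
    """Read /proc/stat twice, interval_s apart (Linux only)."""
    def read() -> str:
        with open("/proc/stat") as f:
            return f.readline()
    first = read()
    time.sleep(interval_s)
    return cpu_utilization(first, read())
```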

Distributed trace via OpenTelemetry

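from opentelemetry import trace
tracer = trace.get_tracer(__name__)

# `db` and `charge` stand for the application's own data layer and payment client.
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    # The decorator has already opened the "process_order" span; annotate it.
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Child spans show how the total time splits between the two calls.
    with tracer.start_as_current_span("db.fetch"):
        order = db.get(order_id)
    with tracer.start_as_current_span("external.payment"):
        charge(order)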
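
To make concrete what nested spans record and how a trace points at the slow step, here is a stdlib-only toy recorder. It is explicitly not the OpenTelemetry API: `SpanRecorder` and `slowest_child` are hypothetical names, and a real backend would also track trace IDs, parent links, and attributes.

```python
import time
from contextlib import contextmanager


class SpanRecorder:
    """Toy stand-in for a tracer: records (name, depth, duration) per span."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._depth = 0
        self.finished = []  # list of (name, depth, duration_seconds)

    @contextmanager
    def span(self, name: str):
        start = self._clock()
        self._depth += 1
        try:
            yield
        finally:
            self._depth -= 1
            self.finished.append((name, self._depth, self._clock() - start))

    def slowest_child(self):
        """Return (name, duration) of the slowest span below the root, or None."""
        children = [(n, d) for n, depth, d in self.finished if depth > 0]
        return max(children, key=lambda item: item[1]) if children else None
```

Wrapping the two calls of `process_order` in `rec.span(...)` blocks and asking for `rec.slowest_child()` gives the same kind of answer a trace view does: which child accounts for most of the parent's time.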

Bisection (find regression commit)

git bisect start
git bisect bad HEAD
git bisect good v2.5.0
# Run test at each step
git bisect run pytest tests/regression.py
git bisect reset
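
`git bisect run` interprets the command's exit code: 0 marks the commit good, 125 skips it, any other code from 1 to 127 marks it bad, and anything else aborts the bisection. A small wrapper can normalize build failures and crashes into those codes. The sketch below assumes a hypothetical build step (`compileall`) and reuses the test path from the commands above; both should be replaced with the project's real commands.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def verdict(build_rc: int, test_rc: int) -> int:
    """Translate build and test return codes into a `git bisect run` exit code."""
    if build_rc != 0:
        return SKIP  # commit cannot be tested; let bisect try a neighbour
    # Any non-zero test result becomes 1, so a crash code cannot abort bisect.
    return GOOD if test_rc == 0 else BAD


def main() -> int:
    # Hypothetical build step; replace with the project's own.
    build = subprocess.run([sys.executable, "-m", "compileall", "-q", "."])
    if build.returncode != 0:
        return verdict(build.returncode, 0)
    test = subprocess.run([sys.executable, "-m", "pytest", "-q", "tests/regression.py"])
    return verdict(0, test.returncode)

# When saved as a standalone script, finish with: raise SystemExit(main())
```

Saved under a name such as `bisect_check.py` (an assumed filename), it would be invoked as `git bisect run python bisect_check.py`.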

Memory leak hunt (heap snapshot diff)

# Node.js
node --inspect app.js
# Chrome DevTools → Memory → take heap snapshots at T0 and T1
# Open the "Comparison" view and look for objects that only grow

# Python
import tracemalloc
tracemalloc.start()
# ... run workload ...
snap1 = tracemalloc.take_snapshot()
# ... run more ...
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:10]: print(stat)

Flame graph

# Linux
perf record -F 99 -p $PID -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# async-profiler (JVM); samples CPU by default, add -e wall for wall-clock (on-CPU + off-CPU) profiling
./profiler.sh -d 30 -f flame.html $PID
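
The intermediate "folded" format that `stackcollapse-perf.pl` emits and `flamegraph.pl` consumes is one line per unique stack: frames joined by semicolons, root first, followed by a sample count. The sketch below reproduces that aggregation for stack samples already parsed into lists; the `collapse` function and the sample data are illustrative.

```python
from collections import Counter


def collapse(samples: list[list[str]]) -> list[str]:
    """Turn stack samples (root frame first) into folded 'a;b;c count' lines."""
    counts = Counter(";".join(frames) for frames in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "db_query"],
    ["main", "idle"],
]
for line in collapse(samples):
    print(line)
```

The width of a frame in the resulting flame graph is proportional to the summed counts of all stacks that pass through it, which is why identical stacks are merged first.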

SQL slow query (Postgres)

-- top queries by total execution time (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;

-- active sessions and what they wait on (lock waits show wait_event_type = 'Lock')
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY xact_start;

-- explain
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;

Hypothesis-driven debug log

## Incident #2026-05-09-01
- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
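
If the log is kept by a script or chat bot rather than by hand, a small structure is enough to enforce the "one hypothesis, one check, one verdict" discipline. This is a sketch with hypothetical class names; it renders hypothesis lines in the same format as the log above.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    check: str
    verdict: str = "OPEN"  # OPEN, REJECTED, or CONFIRMED
    evidence: str = ""


@dataclass
class IncidentLog:
    incident_id: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def propose(self, claim: str, check: str) -> Hypothesis:
        """Record a new hypothesis together with the measurement that tests it."""
        h = Hypothesis(claim, check)
        self.hypotheses.append(h)
        return h

    def render(self) -> str:
        lines = [f"## Incident #{self.incident_id}"]
        for i, h in enumerate(self.hypotheses, start=1):
            line = f"- H{i}: {h.claim} → {h.check} → {h.verdict}"
            if h.evidence:
                line += f" ({h.evidence})"
            lines.append(line)
        return "\n".join(lines)
```

Requiring a `check` at proposal time is the point of the design: a hypothesis without a planned measurement cannot be entered.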

Rollback protocol

# Kubernetes
kubectl rollout undo deployment/api
kubectl rollout status deployment/api

# Feature flag (placeholder endpoint; substitute the flag provider's real API call)
curl -X POST $LD_API/flags/$KEY/off
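
Besides rollback and flags, stabilization can happen in-process with a circuit breaker around a failing dependency. The following is a minimal single-threaded sketch; the class names, thresholds, and half-open rule are assumptions for illustration, and a production service would more likely rely on an existing resilience library.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled; failing fast")
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # success closes the circuit fully
        return result
```

Callers catch `CircuitOpenError` and fall back, for example by queueing the work for retry, so the slow dependency stops holding request threads while diagnosis continues.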

Decision criteria

| Situation | Approach |
|---|---|
| User-facing impact | Stabilize first (rollback) → diagnose |
| Internal dev env | Diagnose deeply, no rollback urgency |
| Repro available | Local debug + bisect |
| Heisenbug | Production tracing + sampling |
| Memory leak | Heap snapshot diff |
| Latency spike | Distributed trace + flame graph |

Default: Stabilize → Hypothesis → Measure → Fix → Post-mortem.

🔗 Graph

🤖 LLM usage

When to use: log pattern analysis, hypothesis generation, post-mortem drafts. When not to use: live incident command, which requires human judgment.

Anti-patterns

  • No stabilization: the service stays down while debugging continues.
  • Multiple changes at once: no way to attribute the recovery to a specific fix.
  • Skipping the post-mortem: the same incident repeats.
  • Blame culture: a chilling effect on honest disclosure.

🧪 Verification / Duplicates

  • Verified (Brendan Gregg USE, Google SRE Book ch.12-14).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Debug protocol full coverage |