---
id: wiki-2026-0508-system-debugging-protocol
title: System Debugging Protocol
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Debug Workflow, Incident Debugging, SRE Debug Protocol]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [debugging, sre, incident, observability]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: shell/python
  framework: opentelemetry
---

# System Debugging Protocol

## In one line

> **"When a production system misbehaves, run a disciplined loop of hypothesis → measurement → fix."** The protocol draws on Brendan Gregg's USE method and the incident protocol in the Google SRE Workbook, combined with the 2026 standard of OpenTelemetry plus LLM-assisted analysis. This loop is the decisive difference between ad-hoc debugging and a systematic protocol.

## Core

### Phases

1. **Detect**: alert or user report.
2. **Triage**: severity (P0-P4), scope, blast radius.
3. **Stabilize**: rollback / circuit-break. Do not wait for the root cause.
4. **Diagnose**: USE / RED / TSA method.
5. **Fix**: patch + verify.
6. **Post-mortem**: blameless writeup.

### Diagnostic methods

- **USE (Brendan Gregg)**: Utilization / Saturation / Errors for every resource.
- **RED**: Rate / Errors / Duration for every service.
- **TSA (Thread State Analysis)**: CPU / off-CPU profiling.
- **4 golden signals (Google)**: latency, traffic, errors, saturation.

### Applications

1. Web service slowness diagnosis.
2. Memory leak hunt.
3. DB lock investigation.
4. Distributed tracing.
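
The RED method from the diagnostic methods can be sketched as a small computation over request records. This is a minimal illustration, not a production metrics pipeline: the `Request` record, its field names, the `red_summary` helper, and the choice to count only 5xx statuses as errors are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request. Field names are illustrative assumptions."""
    timestamp_s: float  # seconds since an arbitrary epoch
    duration_ms: float
    status: int         # HTTP-style status code


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate / Errors / Duration for one service over a time window."""
    if not requests or window_s <= 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p99 using integer arithmetic: rank = ceil(0.99 * n).
    rank = -(-99 * len(durations) // 100)
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p99_ms": durations[rank - 1],
    }


# Hypothetical sample: 97 fast successes and 3 slow failures in a 100 s window.
sample = [Request(float(i), 100.0, 200) for i in range(97)]
sample += [Request(97.0 + i, 2500.0, 503) for i in range(3)]
summary = red_summary(sample, window_s=100.0)
print(summary)
```

In practice these three numbers usually come from a metrics backend rather than raw records; the sketch only shows what each letter of RED measures.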

## 💻 Patterns

### USE method checklist

```bash
# CPU: per-core utilization, refreshed every second
mpstat -P ALL 1

# Memory: usage in MiB, then run queue / swap activity
free -m
vmstat 1

# Disk: extended per-device stats, skipping idle devices
iostat -xz 1

# Network: per-interface throughput
sar -n DEV 1

# All saturation in one pass (bounded sample counts so each command
# exits and the next one runs)
top -b -n 1 | head -n 20; vmstat 1 5; iostat -xz 1 5
```

### Distributed trace via OpenTelemetry

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


# `db` and `charge` are application-specific placeholders, not OpenTelemetry APIs.
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    # Tag the span opened by the decorator with the order being processed.
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Child spans isolate the time spent in each dependency.
    with tracer.start_as_current_span("db.fetch"):
        order = db.get(order_id)
    with tracer.start_as_current_span("external.payment"):
        charge(order)
```

### Bisection (find regression commit)

```bash
git bisect start
git bisect bad HEAD
git bisect good v2.5.0
# Run the test at each step; a non-zero exit marks the commit as bad
git bisect run pytest tests/regression.py
git bisect reset
```

### Memory leak hunt (heap snapshot diff)

```bash
# Node.js
node --inspect app.js
# Chrome DevTools → Memory → take heap snapshots at T0 and T1
# Compare them in the "Comparison" view and look for grow-only objects

# Python (shown as a heredoc; in practice place these calls inside the
# process under investigation)
python3 - <<'PY'
import tracemalloc

tracemalloc.start()
# ... run workload ...
snap1 = tracemalloc.take_snapshot()
# ... run more ...
snap2 = tracemalloc.take_snapshot()
# Print the ten source lines whose allocations grew the most
for stat in snap2.compare_to(snap1, "lineno")[:10]:
    print(stat)
PY
```

### Flame graph

```bash
# Linux: sample stacks at 99 Hz for 30 seconds
perf record -F 99 -p "$PID" -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Async profiler (JVM, on-CPU + off-CPU)
./profiler.sh -d 30 -f flame.html "$PID"
```

### SQL slow query (Postgres)

```sql
-- top slow (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- live locks: non-idle sessions and what they are waiting on
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_start;

-- explain
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
;
```

### Hypothesis-driven debug log

```markdown
## Incident #2026-05-09-01

- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
```

### Rollback protocol

```bash
# Kubernetes: revert to the previous revision and wait for it to settle
kubectl rollout undo deployment/api
kubectl rollout status deployment/api

# Feature flag (the endpoint shape depends on your flag provider)
curl -X POST "$LD_API/flags/$KEY/off"
```

## Decision criteria

| Situation | Approach |
|---|---|
| User-facing impact | Stabilize first (rollback) → diagnose |
| Internal dev env | Diagnose deeply, no rollback urgency |
| Repro available | Local debug + bisect |
| Heisenbug | Production tracing + sampling |
| Memory leak | Heap snapshot diff |
| Latency spike | Distributed trace + flame graph |

**Default**: Stabilize → Hypothesis → Measure → Fix → Post-mortem.

## 🔗 Graph

- Parent: [[SRE]] · [[Observability]]
- Variants: [[Four Golden Signals]]
- Applications: [[Post-Mortem]] · [[Performance_Profiling_and_Memory|Performance Profiling]]
- Adjacent: [[OpenTelemetry]] · [[Flame Graphs]] · [[Distributed Tracing]]

## 🤖 LLM usage

**When to use**: log pattern analysis, hypothesis generation, post-mortem drafts.
**When not to use**: live incident command (requires human judgment).

## ❌ Anti-patterns

- **No stabilization**: the service stays down while debugging continues.
- **Multiple changes at once**: the effect cannot be attributed to any single fix.
- **Skip post-mortem**: the same incident recurs.
- **Blame culture**: a chilling effect on honest disclosure.

## 🧪 Verification / Duplicates

- Verified (Brendan Gregg USE, Google SRE Book ch.12-14).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: debug protocol full coverage |