| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-system-debugging-protocol | System Debugging Protocol | 10_Wiki/Topics | verified | self | | none | A | 0.85 | applied | | | 2026-05-10 | pending | |
System Debugging Protocol
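In one line
"A disciplined loop of hypothesis → measurement → fix, applied whenever a production system misbehaves." The protocol draws on Brendan Gregg's USE method, the incident protocol in the Google SRE Workbook, and the 2026 practice of combining OpenTelemetry with LLM-assisted analysis. Following this loop is what separates a systematic protocol from ad-hoc debugging.
Core
Phases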
- Detect: alert 또는 user report.
- Triage: severity (P0-P4), scope, blast radius.
- Stabilize: roll back or circuit-break first; do not wait for the root cause.
- Diagnose: USE / RED / TSA method.
- Fix: patch + verify.
- Post-mortem: blameless writeup.
Diagnostic methods
- USE (Brendan Gregg): Utilization / Saturation / Errors for every resource.
- RED: Rate / Errors / Duration for every service.
- TSA (Thread State Analysis): on-CPU / off-CPU profiling of each thread's time.
- 4 golden signals (Google): latency, traffic, errors, saturation.
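As an illustration of the RED method above, the following is a minimal sketch that derives Rate, Errors, and Duration from a batch of request records. The `Request` record shape, the `red_metrics` helper, the nearest-rank percentile, and the choice to count only 5xx responses as errors are all assumptions made for this example; a real service would normally read these values from its metrics backend.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    duration_s: float  # wall-clock duration in seconds
    status: int        # HTTP status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration over one observation window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p50_s": 0.0, "p99_s": 0.0}
    durations = [r.duration_s for r in requests]
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }


sample = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 500), Request(2.0, 200)]
metrics = red_metrics(sample, window_s=2.0)
print(metrics)
```

Reporting a high percentile next to the median matters because a single slow request (2.0 s here) dominates p99 while leaving p50 almost unchanged.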
Applications
- Web service slowness diagnosis.
- Memory leak hunt.
- DB lock investigation.
- Distributed tracing.
💻 Patterns
USE method checklist
# CPU
mpstat -P ALL 1
# Memory
free -m; vmstat 1
# Disk
iostat -xz 1
# Network
sar -n DEV 1
# All saturation in one (batch mode and bounded sample counts so each command exits)
top -b -n 1 | head -20; vmstat 1 5; iostat -xz 1 5
Distributed trace via OpenTelemetry
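from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `db` and `charge` are application-specific placeholders, not defined here.
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    # Attach the order ID to the span opened by the decorator.
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Child spans time each downstream call separately.
    with tracer.start_as_current_span("db.fetch"):
        order = db.get(order_id)
    with tracer.start_as_current_span("external.payment"):
        charge(order)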
Bisection (find regression commit)
git bisect start
git bisect bad HEAD
git bisect good v2.5.0
# Run test at each step
git bisect run pytest tests/regression.py
git bisect reset
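The search that `git bisect` performs can be sketched as a binary search over an ordered commit list. This is an illustration of the idea, not git's implementation: the `first_bad` helper, the commit names, and the `check` function are hypothetical, and the sketch assumes a linear history in which every commit after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    history flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested: list[str] = []


def check(commit: str) -> bool:
    """Stand-in for running the regression test at one commit."""
    tested.append(commit)
    return int(commit[1:]) >= 5  # pretend the regression landed in c5


culprit = first_bad(history, check)
print(culprit, tested)
```

Each test halves the remaining range, so the number of test runs grows with the logarithm of the number of commits between the good and bad endpoints.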
Memory leak hunt (heap snapshot diff)
# Node.js
node --inspect app.js
# Chrome devtools → Memory → Heap snapshot at T0, T1
# Compare → "Comparison" view → grow-only objects
# Python
import tracemalloc
tracemalloc.start()
# ... run workload ...
snap1 = tracemalloc.take_snapshot()
# ... run more ...
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:10]: print(stat)
Flame graph
# Linux
perf record -F 99 -p $PID -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# Async profiler (JVM); the default event is on-CPU, add -e wall for wall-clock time that includes off-CPU
./profiler.sh -d 30 -f flame.html $PID
SQL slow query (Postgres)
-- top slow
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;
-- live locks
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY xact_start;
-- explain
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;
Hypothesis-driven debug log
## Incident #2026-05-09-01
- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
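A log like the one above can also be kept as structured data, so that no hypothesis is left without a verdict. The following is a minimal sketch; the `IncidentLog`, `Hypothesis`, and `Verdict` names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    OPEN = "OPEN"
    CONFIRMED = "CONFIRMED"
    REJECTED = "REJECTED"


@dataclass
class Hypothesis:
    claim: str   # what we think is wrong
    check: str   # the measurement that can confirm or reject it
    verdict: Verdict = Verdict.OPEN
    evidence: str = ""

    def resolve(self, verdict: Verdict, evidence: str) -> None:
        """Record the outcome of the measurement."""
        self.verdict = verdict
        self.evidence = evidence


@dataclass
class IncidentLog:
    incident_id: str
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def propose(self, claim: str, check: str) -> Hypothesis:
        """Register a hypothesis together with the check that tests it."""
        hypothesis = Hypothesis(claim, check)
        self.hypotheses.append(hypothesis)
        return hypothesis

    def unresolved(self) -> list[Hypothesis]:
        """Hypotheses that still lack a measurement result."""
        return [h for h in self.hypotheses if h.verdict is Verdict.OPEN]

    def render(self) -> str:
        """Render the log in the same bullet style as the example above."""
        lines = [f"## Incident #{self.incident_id}"]
        for n, h in enumerate(self.hypotheses, start=1):
            detail = f" ({h.evidence})" if h.evidence else ""
            lines.append(f"- H{n}: {h.claim} → {h.check} → {h.verdict.value}{detail}")
        return "\n".join(lines)


log = IncidentLog("2026-05-09-01")
h1 = log.propose("DB slow", "check pg_stat_statements")
h1.resolve(Verdict.REJECTED, "no slow queries")
h2 = log.propose("External payment API", "inspect trace")
h2.resolve(Verdict.CONFIRMED, "1.8s in stripe.charge()")
report = log.render()
print(report)
```

Requiring a `check` for every `claim` is the point of the design: a hypothesis without a planned measurement cannot be entered.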
Rollback protocol
# Kubernetes
kubectl rollout undo deployment/api
kubectl rollout status deployment/api
# Feature flag (illustrative endpoint; $LD_API and $KEY are placeholders, use your flag service's actual API)
curl -X POST $LD_API/flags/$KEY/off
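Besides rollback, the Stabilize phase mentions circuit-breaking a failing dependency. The following is a minimal single-threaded sketch of that idea, assuming a consecutive-failure threshold and a fixed cool-down; the `CircuitBreaker` class, its parameters, and the injected `clock` are hypothetical, and a production system would more likely rely on an existing resilience library or a service mesh.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])


def failing() -> str:
    raise TimeoutError("payment API timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has passed
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered)
```

Once the circuit is open, callers fail fast instead of waiting on the slow dependency, which is what lets latency recover while diagnosis continues.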
Decision criteria
| Situation | Approach |
|---|---|
| User-facing impact | Stabilize first (rollback) → diagnose |
| Internal dev env | Diagnose deeply, no rollback urgency |
| Repro available | Local debug + bisect |
| Heisenbug | Production tracing + sampling |
| Memory leak | Heap snapshot diff |
| Latency spike | Distributed trace + flame graph |
Default: Stabilize → Hypothesis → Measure → Fix → Post-mortem.
🔗 Graph
- Parent: SRE · Observability
- Variant: Four Golden Signals
- Application: Post-Mortem · Performance_Profiling_and_Memory
- Adjacent: OpenTelemetry · Flame Graphs · Distributed Tracing
🤖 LLM usage
When to use: log pattern analysis, hypothesis generation, post-mortem drafts. When not to use: live incident command (requires human judgment).
❌ Anti-patterns
- No stabilization: the service stays down while debugging continues.
- Multiple changes at once: no way to attribute the recovery to any single fix.
- Skip post-mortem: the same incident recurs.
- Blame culture: has a chilling effect on honest disclosure.
🧪 Verification / Duplicates
- Verified (Brendan Gregg USE, Google SRE Book ch.12-14).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Debug protocol full coverage |