Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

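10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: ~2,400 broken-link fixes, direct redirect links, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections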

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


System Debugging Protocol

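In one line

"A disciplined loop of hypothesis → measurement → fix whenever a production system misbehaves." It draws on Brendan Gregg's USE method, the incident protocol in the Google SRE Workbook, and the 2026 standard practice of OpenTelemetry plus LLM-assisted analysis. This loop is the decisive difference between ad-hoc debugging and a systematic protocol.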

Core

Phases

Detect: alert or user report.
Triage: severity (P0-P4), scope, blast radius.
Stabilize: roll back or circuit-break; do not wait for the root cause.
Diagnose: USE / RED / TSA method.
Fix: patch + verify.
Post-mortem: blameless writeup.

Diagnostic methods

USE (Brendan Gregg): for every resource, check Utilization / Saturation / Errors.
RED: for every service, check Rate / Errors / Duration.
TSA (Thread State Analysis): on-CPU / off-CPU profiling of where threads spend their time.
4 golden signals (Google): latency, traffic, errors, saturation.
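
As a rough illustration of the RED method, the sketch below derives Rate, Errors, and Duration from a window of request records. The `Request` record shape, the nearest-rank percentile, and the sample data are assumptions made for this example, not part of any monitoring library.

```python
from dataclasses import dataclass
import math


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_s: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests, window_s):
    """Return Rate (req/s), Errors (ratio), and Duration (p50/p99) for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }


# Made-up sample: 98 fast successes and 2 slow failures in a 10-second window.
sample = [Request(0.1, True)] * 98 + [Request(2.5, False)] * 2
metrics = red_metrics(sample, window_s=10)
print(metrics)
```

In this sample the median stays low while the p99 exposes the slow failures, which is why Duration is usually tracked as percentiles rather than as a mean.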

Applications

Web service slowness diagnosis.
Memory leak hunt.
DB lock investigation.
Distributed tracing.

💻 Patterns

USE method checklist

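# CPU: per-core utilization (5 one-second samples, then exit)
mpstat -P ALL 1 5
# Memory: usage, then swap activity (si/so columns) as a saturation signal
free -m
vmstat 1 5
# Disk: per-device utilization and queueing
iostat -xz 1 5
# Network: per-interface throughput
sar -n DEV 1 5
# Quick overview: one batch-mode snapshot instead of interactive top
top -b -n 1 | head -n 20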

Distributed trace via OpenTelemetry

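from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `db` and `charge` stand in for the application's own data-access and payment helpers.
@tracer.start_as_current_span("process_order")
def process_order(order_id: str):
    # The decorator opens the parent span; annotate it with the order ID.
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Child spans time each downstream dependency separately.
    with tracer.start_as_current_span("db.fetch"):
        order = db.get(order_id)
    with tracer.start_as_current_span("external.payment"):
        charge(order)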

Bisection (find regression commit)

git bisect start
git bisect bad HEAD
git bisect good v2.5.0
# Run test at each step
git bisect run pytest tests/regression.py
git bisect reset
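
To make the idea behind the commands above concrete, here is a minimal sketch of the binary search that bisection performs. The commit names and the `is_bad` test are hypothetical; like `git bisect`, the sketch assumes that every commit after the first bad one is also bad.

```python
def first_bad(commits, is_bad):
    """Return the index of the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and the
    history flips from good to bad exactly once.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # regression is at mid or earlier
        else:
            good = mid     # regression is after mid
    return bad


commits = [f"c{i}" for i in range(16)]  # hypothetical history, oldest first
tested = []


def is_bad(commit):
    """Stand-in for running the regression test on a checked-out commit."""
    tested.append(commit)
    return int(commit[1:]) >= 11  # pretend the regression landed in c11


culprit = commits[first_bad(commits, is_bad)]
print(culprit, "found after", len(tested), "test runs")
```

Each test run halves the remaining range, so 16 commits need about 4 runs rather than up to 14 with a linear scan.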

Memory leak hunt (heap snapshot diff)

# Node.js
node --inspect app.js
# Chrome devtools → Memory → Heap snapshot at T0, T1
# Compare → "Comparison" view → grow-only objects

# Python
import tracemalloc
tracemalloc.start()
# ... run workload ...
snap1 = tracemalloc.take_snapshot()
# ... run more ...
snap2 = tracemalloc.take_snapshot()
# Print the ten source lines whose allocations grew the most
for stat in snap2.compare_to(snap1, 'lineno')[:10]:
    print(stat)
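
The snapshot-diff pattern can be tried end to end with a deliberately leaky function. This is a self-contained sketch: `handle_request` and the module-level `_cache` are invented for the example, and the workload sizes are arbitrary.

```python
import tracemalloc

_cache = []  # deliberately never cleared, to simulate a leak


def handle_request(i):
    """Hypothetical handler that retains one ~10 KB payload per call."""
    _cache.append("x" * 10_000 + str(i))


tracemalloc.start()
for i in range(100):            # warm-up workload
    handle_request(i)
before = tracemalloc.take_snapshot()
for i in range(100, 1100):      # sustained workload
    handle_request(i)
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# compare_to sorts by the size of the difference, largest first,
# so the leaking allocation site should come out on top.
top = after.compare_to(before, "lineno")[0]
print(top)
```

Memory that grows between two snapshots taken under a steady workload, and keeps growing on a third snapshot, is the usual signature of a leak as opposed to a warm cache.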

Flame graph

# Linux
perf record -F 99 -p $PID -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# async-profiler (JVM): CPU profiling by default; add `-e wall` to include off-CPU time
./profiler.sh -d 30 -f flame.html $PID

SQL slow query (Postgres)

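-- Top queries by total execution time (requires the pg_stat_statements extension)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;

-- Active sessions and what they are waiting on (lock waits show wait_event_type = 'Lock')
SELECT pid, usename, query, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state != 'idle' ORDER BY xact_start;

-- Execution plan with actual timings and buffer usage (ANALYZE runs the query)
EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;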

Hypothesis-driven debug log

## Incident #2026-05-09-01
- T0 12:34: Alert: p99 latency > 2s on /api/checkout
- H1: DB slow → check pg_stat_statements → REJECTED (no slow queries)
- H2: External payment API → trace shows 1.8s in stripe.charge() → CONFIRMED
- Action: circuit-break stripe; queue charges for retry
- T0+15min: latency recovered
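
One way to keep such a log consistent during an incident is to record hypotheses as structured data and render the text from it. The sketch below is only an illustration; the `Hypothesis` and `IncidentLog` classes are hypothetical names, not part of any incident tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    claim: str
    check: str
    verdict: str = "OPEN"  # OPEN, REJECTED, or CONFIRMED
    evidence: str = ""


@dataclass
class IncidentLog:
    incident_id: str
    alert: str
    hypotheses: list = field(default_factory=list)

    def add(self, claim, check):
        """Register a hypothesis before measuring, so it cannot be rewritten afterwards."""
        h = Hypothesis(claim, check)
        self.hypotheses.append(h)
        return h

    def render(self):
        lines = [f"## Incident #{self.incident_id}", f"- Alert: {self.alert}"]
        for n, h in enumerate(self.hypotheses, start=1):
            lines.append(f"- H{n}: {h.claim} → {h.check} → {h.verdict} ({h.evidence})")
        return "\n".join(lines)


log = IncidentLog("2026-05-09-01", "p99 latency > 2s on /api/checkout")
h1 = log.add("DB slow", "check pg_stat_statements")
h1.verdict, h1.evidence = "REJECTED", "no slow queries"
h2 = log.add("External payment API", "inspect trace")
h2.verdict, h2.evidence = "CONFIRMED", "1.8s in stripe.charge()"
report = log.render()
print(report)
```

Recording rejected hypotheses alongside the confirmed one gives the post-mortem a ready-made timeline of what was ruled out and why.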

Rollback protocol

# Kubernetes
kubectl rollout undo deployment/api
kubectl rollout status deployment/api

# Feature flag kill switch (illustrative endpoint; substitute your flag provider's real API)
curl -X POST "$LD_API/flags/$KEY/off"
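
Besides rollback, the Stabilize phase names circuit-breaking as an option. The following is a minimal in-process sketch of that idea, not a production implementation: the class names, thresholds, and the injected `clock` parameter are assumptions chosen to keep the example short and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled; failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after_s=30.0, clock=lambda: now[0])


def flaky_charge():
    raise TimeoutError("payment API timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_charge)
    except CircuitOpenError:
        outcomes.append("refused")
    except TimeoutError:
        outcomes.append("timeout")

now[0] = 31.0  # after the reset window, one trial call is let through
recovered = breaker.call(lambda: "ok")
print(outcomes, recovered)
```

Failing fast keeps request threads from piling up behind a slow dependency, which buys time to diagnose without the outage spreading.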

Decision criteria

Situation	Approach
User-facing impact	Stabilize first (rollback) → diagnose
Internal dev env	Diagnose deeply, no rollback urgency
Repro available	Local debug + bisect
Heisenbug	Production tracing + sampling
Memory leak	Heap snapshot diff
Latency spike	Distributed trace + flame graph

Default: Stabilize → Hypothesis → Measure → Fix → Post-mortem.
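
The table can be read as a small rule set. The sketch below encodes it as a function; the parameter names and the order in which rules are checked are assumptions made for illustration, since the table itself does not define a priority between rows.

```python
def choose_approach(user_facing, repro_available, symptom=None):
    """Return an ordered list of steps for a given situation.

    Assumed rule order: stabilize first when users are affected, then pick
    a diagnostic technique by symptom, falling back on reproducibility.
    """
    steps = []
    if user_facing:
        steps.append("stabilize (rollback)")
    if symptom == "memory_leak":
        steps.append("heap snapshot diff")
    elif symptom == "latency_spike":
        steps.append("distributed trace + flame graph")
    elif repro_available:
        steps.append("local debug + bisect")
    else:
        steps.append("production tracing + sampling")  # Heisenbug case
    return steps


plan = choose_approach(user_facing=True, repro_available=False, symptom="latency_spike")
print(plan)
```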

🔗 Graph

Parent: SRE · Observability
Variants: Four Golden Signals
Applications: Post-Mortem · Performance_Profiling_and_Memory
Adjacent: OpenTelemetry · Flame Graphs · Distributed Tracing

🤖 LLM usage

When to use: log pattern analysis, hypothesis generation, post-mortem drafts. When not to use: live incident command (requires human judgment).

❌ Anti-patterns

No stabilization: the service stays down while debugging continues.
Multiple changes at once: no way to attribute the improvement to any single fix.
Skip post-mortem: the same incident recurs.
Blame culture: a chilling effect on honest disclosure.

🧪 Verification / duplicates

Verified (Brendan Gregg USE, Google SRE Book ch.12-14).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: Debug protocol full coverage
