---
id: wiki-2026-0508-distributed-tracing
title: Distributed Tracing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [distributed-tracing, opentelemetry-tracing, request-tracing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [observability, tracing, opentelemetry, jaeger]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: typescript
  framework: opentelemetry/tempo
---

# Distributed Tracing

## In one line

> **"A trace is an X-ray of a cross-service request."** The span tree reveals every service hop, its latency, and its errors, which makes it essential for debugging microservices. As of 2026, OpenTelemetry (OTel) is the universal standard for instrumentation; Tempo, Jaeger, Honeycomb, and Datadog serve as backends; W3C Trace Context handles propagation.

## Core concepts

### Building blocks

- **Trace**: one root request end to end, identified by a unique 128-bit `trace_id`.
- **Span**: a single operation (HTTP call, DB query, function call); every span except the root has a `parent_span_id`.
- **Context propagation**: the W3C `traceparent` HTTP header carries trace_id, span_id, and flags.
- **Baggage**: key/value pairs propagated alongside the trace context (user_id, tenant).
- **Sampling**: head (decide at ingress) vs. tail (decide after seeing the whole trace).

### OTel architecture

1. **SDK**: in-app instrumentation (auto + manual).
2. **Collector**: receive → process (batch, sample, redact) → export.
3. **Backend**: Tempo (Grafana), Jaeger, Honeycomb, Datadog APM.
4. **UI**: Grafana, Jaeger UI, vendor UIs.

### Applications

1. Latency root cause (which span is slow).
2. Error correlation (spans in the trace marked `error=true`).
3. Service dependency map (service graph derived from spans).
4. Capacity planning (RED metrics derived from spans).
5. SLO debugging (attribute SLO budget burn to specific traces).
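
The `traceparent` header listed under building blocks can be decoded by hand, which helps when checking whether a hop dropped or mangled the context. The following is a minimal sketch, assuming only version `00` of the W3C format; `parseTraceparent` and the `TraceParent` type are hypothetical names for illustration and are not part of any OpenTelemetry package. In a real service the OTel SDK's propagators do this for you.

```typescript
// Hypothetical helper for illustration; not an OpenTelemetry API.
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars (128-bit)
  parentId: string; // 16 lowercase hex chars (span id of the caller)
  sampled: boolean; // bit 0 of the trace-flags byte
}

// version-traceid-parentid-flags, all lowercase hex, as laid out for version 00.
const TRACEPARENT_V00 = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (match === null) return null;

  const [, version, traceId, parentId, flags] = match;

  // This sketch only accepts version 00; other versions are rejected.
  if (version !== '00') return null;

  // All-zero trace or parent ids are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  // The lowest bit of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentId, sampled };
}

const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(incoming?.traceId, incoming?.sampled);
// prints: 4bf92f3577b34da6a3ce929d0e0e4736 true
```

Returning `null` for anything malformed mirrors the usual propagation behavior of starting a fresh trace when the incoming context cannot be trusted.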
## 💻 Patterns

### OTel Node.js auto-instrumentation

```ts
// otel.ts (loaded with --import / NODE_OPTIONS so it runs before application code)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';

// Note: `new Resource(...)` is the 1.x API of @opentelemetry/resources;
// on 2.x use `resourceFromAttributes({...})` instead.
const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'orders-api', 'service.version': '1.4.2' }),
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

### Manual span (TypeScript)

```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

// `OrderInput`, `db`, and `kafka` are application-defined and assumed to exist.
const tracer = trace.getTracer('orders');

export async function placeOrder(input: OrderInput) {
  // startActiveSpan makes the span current for everything awaited in the callback.
  return tracer.startActiveSpan('placeOrder', async (span) => {
    span.setAttributes({
      'order.customer_id': input.customerId,
      'order.line_count': input.lines.length,
    });
    try {
      const order = await db.orders.insert(input);
      await kafka.produce('orders.placed', order);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
      throw e;
    } finally {
      // Always end the span, on success and on failure.
      span.end();
    }
  });
}
```

### Python (FastAPI auto-instrument)

```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# insecure=True assumes a plaintext gRPC collector inside the cluster; drop it when using TLS.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)   # server spans for incoming requests
HTTPXClientInstrumentor().instrument()    # client spans + context propagation for outgoing calls
```

### Trace context propagation (W3C)

```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^                                ^                ^
             |  trace_id (32 hex)                parent_id (16)   flags
             version
tracestate: vendor1=value,vendor2=value
baggage: userId=42,tenantId=acme
```

### OTel Collector pipeline

```yaml
# otel-collector.yaml (tail_sampling ships in the Collector contrib distribution)
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: rest, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
  batch:
    timeout: 1s

exporters:
  otlphttp/tempo:
    endpoint: http://tempo:4318
  loki:  # defined but not wired into a pipeline in this snippet
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample first, then batch, so only the kept traces are batched for export.
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]
```

### Trace ↔ Log correlation

```ts
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const log = pino();

function logWithTrace(msg: string, extra: object = {}) {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  // trace_id / span_id are undefined when no span is active.
  log.info({ ...extra, trace_id: ctx?.traceId, span_id: ctx?.spanId }, msg);
}
// A Loki derived field on trace_id lets you jump from a log line to its trace in Tempo.
```

### Frontend → backend trace

```ts
// The browser OTel SDK injects `traceparent` into outgoing fetch calls.
// No span processor/exporter is configured here, so this snippet only propagates context.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

new WebTracerProvider().register();

registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Cross-origin requests only receive trace headers for URLs matched here;
      // the API must also allow the `traceparent` header in its CORS policy.
      propagateTraceHeaderCorsUrls: [/api\.example\.com/],
    }),
  ],
});
```

### eBPF-based zero-instrumentation (Beyla / Pixie)

```bash
# Grafana Beyla: auto-traces Go/Node/Python services by capturing requests via eBPF, with no code changes.
beyla --config beyla.yaml
```

## Decision criteria

| Situation | Approach |
|---|---|
| Greenfield | OTel SDK + Tempo (Grafana stack) |
| Multi-cloud SaaS | Honeycomb / Datadog APM |
| Polyglot legacy | OTel Collector + auto-instrument per language |
| Zero-code start | eBPF (Beyla, Pixie) |
| Cost control | Tail sampling on errors + slow, 5% baseline |
| High-cardinality data | Honeycomb (designed for high-cardinality) |

**Default**: OTel SDK + W3C Trace Context + Collector with tail sampling + Tempo or vendor backend.

## 🔗 Graph

- Parent: [[Observability]] · [[Microservices]]
- Variants: [[Tempo]]
- Applications: [[Service Mesh]]
- Adjacent: [[OpenTelemetry]] · [[Logs]]

## 🤖 LLM usage

**When**: forming root-cause hypotheses from a trace tree, reviewing sampling policies, debugging OTel Collector config, designing a span attribute schema.

**When not**: making binding production sampling decisions (the cost vs. signal tradeoff runs deep); acting as the sole reviewer of PII redaction (a security review is required).

## ❌ Anti-patterns

- **No sampling**: cost and storage explode; tail-sample on errors + slow.
- **High-cardinality on every span**: putting user_id on every span is expensive unless the backend can index high-cardinality attributes.
- **No frontend tracing**: only server-side latency is visible; user-perceived latency is missed.
- **Logs without trace_id**: no trace ↔ log jump.
- **Manual span without `end()`**: the span leaks; it is never finished, so it is never exported.
- **Sync span across async boundary**: context is lost; use `startActiveSpan`.
- **Vendor lock-in via SDK**: use the OTel SDK with a vendor exporter, not the vendor's SDK.

## 🧪 Verification / Duplicates

- Verified (OpenTelemetry spec, W3C Trace Context, Grafana Tempo docs, Honeycomb engineering blog).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: distributed tracing with OpenTelemetry, sampling, propagation |