id: wiki-2026-0508-distributed-tracing
title: Distributed Tracing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: distributed-tracing, opentelemetry-tracing, request-tracing
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: observability, tracing, opentelemetry, jaeger
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language typescript, framework opentelemetry/tempo

Distributed Tracing

In one line

"A trace is an X-ray of a cross-service request." The span tree reveals every service hop along with its latency and errors, which makes it essential for debugging microservices. As of 2026, OpenTelemetry (OTel) is the universal standard, Tempo / Jaeger / Honeycomb / Datadog serve as backends, and W3C Trace Context handles propagation.

Core concepts

Building blocks

  • Trace: all spans belonging to one root request, identified by a unique trace_id (128-bit).
  • Span: a single operation (HTTP call, DB query, function); every non-root span has a parent_span_id.
  • Context propagation: the traceparent HTTP header (W3C) carries version + trace_id + span_id + flags.
  • Baggage: key/value pairs propagated alongside the trace context (user_id, tenant).
  • Sampling: head (decide at ingress) vs tail (decide after seeing the whole trace).
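
To make the trace/span relationship concrete, here is a minimal sketch that rebuilds a span tree from a flat list of spans by following parent_span_id links. The `SpanRecord` shape and the function names are hypothetical and much smaller than a real OTel span; they only illustrate the parent/child structure described above.

```typescript
// Hypothetical minimal span shape; real OTel spans carry many more fields.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
  startMs: number;
  endMs: number;
}

interface SpanNode {
  span: SpanRecord;
  children: SpanNode[];
}

// Group flat spans into a tree by following parentSpanId links.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { span, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parentId = node.span.parentSpanId;
    const parent = parentId ? nodes.get(parentId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // Either a true root, or an orphan whose parent span was never received.
      roots.push(node);
    }
  });
  return roots;
}

// Render the tree as indented lines, one per span, with its duration.
function renderTree(nodes: SpanNode[], depth = 0): string[] {
  const lines: string[] = [];
  const ordered = [...nodes].sort((a, b) => a.span.startMs - b.span.startMs);
  for (const node of ordered) {
    const durationMs = node.span.endMs - node.span.startMs;
    lines.push(`${"  ".repeat(depth)}${node.span.name} (${durationMs} ms)`);
    lines.push(...renderTree(node.children, depth + 1));
  }
  return lines;
}

const exampleSpans: SpanRecord[] = [
  { traceId: "t1", spanId: "a", name: "GET /orders", startMs: 0, endMs: 120 },
  { traceId: "t1", spanId: "b", parentSpanId: "a", name: "SELECT orders", startMs: 10, endMs: 90 },
  { traceId: "t1", spanId: "c", parentSpanId: "a", name: "POST /payments", startMs: 95, endMs: 115 },
];

console.log(renderTree(buildSpanTree(exampleSpans)).join("\n"));
```

Treating spans with a missing parent as extra roots is a deliberate choice in this sketch: with sampling or dropped exports, orphan spans are common and still worth showing.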

OTel architecture

  1. SDK: in-app instrumentation (auto + manual).
  2. Collector: receive → process (batch, sample, redact) → export.
  3. Backend: Tempo (Grafana), Jaeger, Honeycomb, Datadog APM.
  4. UI: Grafana, Jaeger UI, or the vendor's own.
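
The Collector's receive → process → export flow can be pictured as a chain of functions over a list of spans. The sketch below is a toy model, not the Collector's actual implementation: `PipelineSpan`, `redactAttributes`, `dropByName`, `toBatches`, and `runPipeline` are all hypothetical names introduced for illustration.

```typescript
// Hypothetical span shape for illustration only.
interface PipelineSpan {
  name: string;
  attributes: Record<string, string>;
}

// A processor takes the spans it receives and returns the spans it passes on.
type Processor = (spans: PipelineSpan[]) => PipelineSpan[];

// Redact: mask the values of attributes whose key is on a deny list.
function redactAttributes(keys: string[]): Processor {
  const denied = new Set(keys);
  return (spans) =>
    spans.map((span) => {
      const attributes: Record<string, string> = {};
      for (const key of Object.keys(span.attributes)) {
        attributes[key] = denied.has(key) ? "[REDACTED]" : span.attributes[key];
      }
      return { name: span.name, attributes };
    });
}

// Sample (crudely): drop spans with a given name, e.g. health checks.
function dropByName(name: string): Processor {
  return (spans) => spans.filter((span) => span.name !== name);
}

// Batch: split the processed spans into chunks of at most `size`.
function toBatches(spans: PipelineSpan[], size: number): PipelineSpan[][] {
  const batches: PipelineSpan[][] = [];
  for (let i = 0; i < spans.length; i += size) {
    batches.push(spans.slice(i, i + size));
  }
  return batches;
}

// Receive -> process -> export, in that order.
function runPipeline(
  received: PipelineSpan[],
  processors: Processor[],
  batchSize: number,
  exportBatch: (batch: PipelineSpan[]) => void,
): void {
  const processed = processors.reduce((acc, processor) => processor(acc), received);
  for (const batch of toBatches(processed, batchSize)) {
    exportBatch(batch);
  }
}

const exported: PipelineSpan[][] = [];
runPipeline(
  [
    { name: "GET /orders", attributes: { "user.email": "a@example.com" } },
    { name: "GET /healthz", attributes: {} },
    { name: "POST /orders", attributes: { "order.id": "o-1" } },
    { name: "GET /orders", attributes: { "user.email": "b@example.com" } },
  ],
  [dropByName("GET /healthz"), redactAttributes(["user.email"])],
  2,
  (batch) => { exported.push(batch); },
);
console.log(JSON.stringify(exported, null, 2));
```

Batching last means the exporter only ever sees spans that survived the earlier steps.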

Applications

  1. Latency root cause (which span is slow).
  2. Error correlation (spans with error=true within a trace).
  3. Service dependency map (service graph from spans).
  4. Capacity planning (RED metrics derived from spans).
  5. SLO debugging (attribute SLO budget burn to specific traces).
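
As a rough illustration of items 3 and 4, the sketch below derives RED metrics (rate, error ratio, duration percentile) and service graph edges from finished spans. The `FinishedSpan` shape, the `parentService` field, and the nearest-rank percentile are assumptions made for this example; real backends compute these from full span data.

```typescript
// Hypothetical shape: one entry per server-side span that handled a request.
interface FinishedSpan {
  service: string;
  parentService?: string; // service that owns the parent span, if known
  durationMs: number;
  error: boolean;
}

interface RedMetrics {
  ratePerSec: number; // Rate
  errorRatio: number; // Errors
  p95Ms: number;      // Duration
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(0, rank - 1)];
}

function redByService(spans: FinishedSpan[], windowSeconds: number): Record<string, RedMetrics> {
  const grouped: Record<string, FinishedSpan[]> = {};
  for (const span of spans) {
    if (!grouped[span.service]) grouped[span.service] = [];
    grouped[span.service].push(span);
  }
  const result: Record<string, RedMetrics> = {};
  for (const service of Object.keys(grouped)) {
    const group = grouped[service];
    const durations = group.map((s) => s.durationMs).sort((a, b) => a - b);
    result[service] = {
      ratePerSec: group.length / windowSeconds,
      errorRatio: group.filter((s) => s.error).length / group.length,
      p95Ms: percentile(durations, 95),
    };
  }
  return result;
}

// Count caller -> callee edges wherever a span's parent lives in another service.
function serviceEdges(spans: FinishedSpan[]): Record<string, number> {
  const edges: Record<string, number> = {};
  for (const span of spans) {
    if (span.parentService && span.parentService !== span.service) {
      const key = `${span.parentService} -> ${span.service}`;
      edges[key] = (edges[key] || 0) + 1;
    }
  }
  return edges;
}

const windowSpans: FinishedSpan[] = [
  { service: "gateway", durationMs: 120, error: false },
  { service: "orders", parentService: "gateway", durationMs: 80, error: false },
  { service: "orders", parentService: "gateway", durationMs: 200, error: true },
  { service: "payments", parentService: "orders", durationMs: 40, error: false },
];

console.log(redByService(windowSpans, 60));
console.log(serviceEdges(windowSpans));
```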

💻 Patterns

OTel Node.js auto-instrumentation

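// otel.ts (loaded with --import / NODE_OPTIONS so it runs before application code)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
// `new Resource(...)` is the @opentelemetry/resources 1.x API;
// on 2.x, use resourceFromAttributes({...}) from the same package instead.
import { Resource } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'orders-api', 'service.version': '1.4.2' }),
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});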

Manual span (TypeScript)

import { trace, SpanStatusCode } from '@opentelemetry/api';
// `db`, `kafka`, and `OrderInput` are application-defined and not shown here.
const tracer = trace.getTracer('orders');

export async function placeOrder(input: OrderInput) {
  // startActiveSpan makes the span current inside the callback, so child spans nest under it.
  return tracer.startActiveSpan('placeOrder', async (span) => {
    span.setAttributes({ 'order.customer_id': input.customerId, 'order.line_count': input.lines.length });
    try {
      const order = await db.orders.insert(input);
      await kafka.produce('orders.placed', order);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (e) {
      // Record the exception as a span event and mark the span as failed.
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
      throw e;
    } finally {
      span.end(); // always end the span, otherwise it is never exported
    }
  });
}

Python (FastAPI auto-instrument)

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# insecure=True assumes the collector listens on plaintext gRPC; drop it when TLS is enabled.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # server spans for incoming requests
HTTPXClientInstrumentor().instrument()   # client spans + traceparent injection for outgoing httpx calls

Trace context propagation (W3C)

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^                                ^                ^
             |  trace_id (32 hex)                parent_id (16)   flags
             version
tracestate: vendor1=value,vendor2=value
baggage: userId=42,tenantId=acme
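
For readers who want to see the header layout handled in code, here is a minimal sketch of parsing and re-emitting traceparent and parsing baggage. In practice the OTel SDK's propagators do this for you; the function names are hypothetical, and the regular expression only accepts the four-field version 00 layout shown above.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // W3C Trace Context: version ff and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceParent(tp: TraceParent): string {
  return `${tp.version}-${tp.traceId}-${tp.parentId}-${tp.sampled ? "01" : "00"}`;
}

// For an outgoing call, keep the trace_id and put the current span's ID in parent_id.
function forOutgoingCall(incoming: TraceParent, currentSpanId: string): TraceParent {
  return { ...incoming, parentId: currentSpanId };
}

// Parse "k1=v1,k2=v2" and ignore optional ";property" suffixes on each member.
function parseBaggage(header: string): Record<string, string> {
  const entries: Record<string, string> = {};
  for (const member of header.split(",")) {
    const keyValue = member.split(";")[0];
    const eq = keyValue.indexOf("=");
    if (eq <= 0) continue; // skip malformed members
    const key = keyValue.slice(0, eq).trim();
    const value = keyValue.slice(eq + 1).trim();
    entries[key] = decodeURIComponent(value);
  }
  return entries;
}

const incoming = parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (incoming) {
  // "b7ad6b7169203331" stands in for the ID of the span making the outgoing call.
  console.log(formatTraceParent(forOutgoingCall(incoming, "b7ad6b7169203331")));
}
console.log(parseBaggage("userId=42,tenantId=acme"));
```

The key point is that trace_id stays constant across every hop while parent_id changes at each one, which is what lets the backend stitch spans into a tree.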

OTel Collector pipeline

# otel-collector.yaml
receivers:
  otlp: { protocols: { grpc: {}, http: {} } }
processors:
  batch: { timeout: 1s }
  tail_sampling:
    decision_wait: 30s  # how long to buffer a trace before deciding
    policies:           # a trace is kept if any policy matches
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: rest, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
exporters:
  otlphttp/tempo: { endpoint: http://tempo:4318 }
  # Not referenced by any pipeline below; it only takes effect once a logs pipeline uses it.
  loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
  pipelines:
    # tail_sampling runs before batch so that batching happens after spans are dropped.
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
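
The three policies in this config (errors, slow, 5% of the rest) amount to an "any policy matches" decision over a fully buffered trace. The sketch below mimics that logic under simplifying assumptions: the types and function names are hypothetical, and the probabilistic policy buckets on the first 8 hex characters of the trace ID, which is not necessarily how the Collector hashes trace IDs.

```typescript
interface TraceSpan {
  status: "OK" | "ERROR" | "UNSET";
  startMs: number;
  endMs: number;
}

interface BufferedTrace {
  traceId: string; // 32 hex characters
  spans: TraceSpan[];
}

type TailPolicy = (trace: BufferedTrace) => boolean;

// Keep any trace that contains at least one span with ERROR status.
const hasError: TailPolicy = (trace) => trace.spans.some((s) => s.status === "ERROR");

// Keep traces whose end-to-end duration (earliest start to latest end) exceeds the threshold.
function slowerThan(thresholdMs: number): TailPolicy {
  return (trace) => {
    const start = Math.min(...trace.spans.map((s) => s.startMs));
    const end = Math.max(...trace.spans.map((s) => s.endMs));
    return end - start > thresholdMs;
  };
}

// Keep a deterministic fraction of traces, bucketed by trace ID (assumed scheme).
function probabilistic(percentage: number): TailPolicy {
  return (trace) => {
    const bucket = parseInt(trace.traceId.slice(0, 8), 16) / 0x100000000; // in [0, 1)
    return bucket < percentage / 100;
  };
}

const tailPolicies: TailPolicy[] = [hasError, slowerThan(1000), probabilistic(5)];

// A trace is kept as soon as any one policy votes to keep it.
function shouldKeep(trace: BufferedTrace): boolean {
  return tailPolicies.some((policy) => policy(trace));
}

const fastOk: TraceSpan[] = [{ status: "OK", startMs: 0, endMs: 50 }];
console.log(shouldKeep({ traceId: "ffffffff" + "0".repeat(24), spans: fastOk }));
console.log(shouldKeep({ traceId: "ffffffff" + "0".repeat(24), spans: [{ status: "ERROR", startMs: 0, endMs: 50 }] }));
```

Bucketing on the trace ID rather than calling a random number generator keeps the decision stable if the same trace is evaluated more than once.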

Trace ↔ Log correlation

import pino from 'pino';
import { trace } from '@opentelemetry/api';
const log = pino();
function logWithTrace(msg: string, extra: object = {}) {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  log.info({ ...extra, trace_id: ctx?.traceId, span_id: ctx?.spanId }, msg);
}
// With a Loki derived field that points at Tempo, you can jump from a log line straight to its trace.

Frontend → backend trace

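// The browser OTel SDK injects the traceparent header into outgoing fetch calls.
// No exporter is configured here, so this only propagates context;
// add a span processor and exporter to also ship browser spans.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

new WebTracerProvider().register();
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Cross-origin requests only carry trace headers when the URL matches;
      // the API must also allow the traceparent header in its CORS config.
      propagateTraceHeaderCorsUrls: [/api\.example\.com/],
    }),
  ],
});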

eBPF-based zero-instrumentation (Beyla / Pixie)

# Grafana Beyla: auto-traces Go/Node/Python services by capturing requests via eBPF, with no code changes.
beyla --config beyla.yaml

Decision criteria

| Situation | Approach |
| --- | --- |
| Greenfield | OTel SDK + Tempo (Grafana stack) |
| Multi-cloud SaaS | Honeycomb / Datadog APM |
| Polyglot legacy | OTel Collector + auto-instrumentation per language |
| Zero-code start | eBPF (Beyla, Pixie) |
| Cost control | Tail sampling on errors + slow traces, 5% baseline |
| High cardinality | Honeycomb (designed for high-cardinality data) |

Default: OTel SDK + W3C Trace Context + Collector with tail sampling + Tempo or a vendor backend.

🤖 LLM usage

When to use: forming root-cause hypotheses from a trace tree, reviewing sampling policies, debugging OTel Collector configs, designing span attribute schemas. When not to use: making binding production sampling decisions (the cost vs. signal tradeoff runs deep), or acting as the sole reviewer of PII redaction (a security review is required).

Anti-patterns

  • No sampling: cost and storage explode. Tail-sample on errors and slow traces.
  • High-cardinality attributes on every span: putting user_id on every span is expensive unless the backend can index high-cardinality data.
  • No frontend tracing: only server-side latency is visible, so user-perceived latency is missed.
  • Logs without trace_id: no trace ↔ log jump.
  • Manual span without end(): the span leaks and is never exported.
  • Sync span across an async boundary: context is lost. Use startActiveSpan.
  • Vendor lock-in via SDK: use the OTel SDK with a vendor exporter, not the vendor's SDK.
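
One way to catch the high-cardinality anti-pattern early is to count distinct values per attribute key over a sample of spans. The sketch below does that; the function names and the limit are hypothetical, and a sensible threshold depends entirely on the backend and its pricing model.

```typescript
type SpanAttributes = Record<string, string | number | boolean>;

// Count the number of distinct values seen for each attribute key.
function attributeCardinality(spans: SpanAttributes[]): Record<string, number> {
  const seen: Record<string, Set<string>> = {};
  for (const attrs of spans) {
    for (const key of Object.keys(attrs)) {
      if (!seen[key]) seen[key] = new Set<string>();
      seen[key].add(String(attrs[key]));
    }
  }
  const counts: Record<string, number> = {};
  for (const key of Object.keys(seen)) {
    counts[key] = seen[key].size;
  }
  return counts;
}

// Return the keys whose distinct-value count exceeds `limit`, sorted by name.
function highCardinalityKeys(spans: SpanAttributes[], limit: number): string[] {
  const counts = attributeCardinality(spans);
  return Object.keys(counts).filter((key) => counts[key] > limit).sort();
}

const sampleAttributes: SpanAttributes[] = [
  { "http.method": "GET", user_id: "u1" },
  { "http.method": "GET", user_id: "u2" },
  { "http.method": "POST", user_id: "u3" },
  { "http.method": "GET", user_id: "u4" },
];

console.log(attributeCardinality(sampleAttributes));
console.log(highCardinalityKeys(sampleAttributes, 3));
```

Running a check like this against a sample of production spans before adding an attribute everywhere gives a rough sense of which keys will be costly to index.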

🧪 Verification / Duplicates

  • Verified (OpenTelemetry spec, W3C Trace Context, Grafana Tempo docs, Honeycomb engineering blog).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: distributed tracing with OpenTelemetry, sampling, propagation |