chore(brain): sync ASTRA growth assets (feature inventory, growth weakness profile/learning queue, episodic memory, long-term memory, raw meeting minutes)
+208
@@ -0,0 +1,208 @@
---
id: wiki-2026-0508-distributed-tracing
title: Distributed Tracing
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [distributed-tracing, opentelemetry-tracing, request-tracing]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [observability, tracing, opentelemetry, jaeger]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: typescript
  framework: opentelemetry/tempo
---

# Distributed Tracing

## One-liner
> **"A trace is an X-ray of a cross-service request."** The span tree reveals every service hop, its latency, and its errors, which makes it essential for debugging microservices. As of 2026, OpenTelemetry (OTel) is the de facto instrumentation standard; Tempo, Jaeger, Honeycomb, and Datadog serve as backends; and W3C Trace Context is the standard propagation format.

## Core concepts

### Building blocks
- **Trace**: all spans belonging to one root request, identified by a unique 128-bit `trace_id`.
- **Span**: a single operation (HTTP call, DB query, function call); every span except the root carries a `parent_span_id`.
- **Context propagation**: the W3C `traceparent` HTTP header carries `trace_id`, the parent `span_id`, and the trace flags between services.
- **Baggage**: key/value pairs propagated alongside the trace context (for example `user_id`, `tenant`).
- **Sampling**: head sampling decides at ingress; tail sampling decides after the whole trace has been seen.
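
To make the trace/span relationship concrete, the following is a minimal sketch of rebuilding a span tree from a flat list of spans using `parent_span_id`. The `SpanRecord` shape and the helper names are assumptions made for illustration; they are not OTel SDK types.

```ts
// Hypothetical minimal span record; real OTel spans carry many more fields.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
  startMs: number;
  endMs: number;
}

interface SpanNode extends SpanRecord {
  children: SpanNode[];
}

// Rebuild the span tree of one trace from a flat, unordered list of spans.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  spans.forEach((s) => nodes.set(s.spanId, { ...s, children: [] }));

  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    // A span whose parent is missing (dropped or not yet received) is treated as a root.
    if (parent) parent.children.push(node);
    else roots.push(node);
  });
  return roots;
}

// Render the tree as indented lines with each span's duration.
function renderTree(nodes: SpanNode[], depth = 0): string[] {
  const lines: string[] = [];
  nodes.forEach((n) => {
    lines.push(`${'  '.repeat(depth)}${n.name} (${n.endMs - n.startMs} ms)`);
    lines.push(...renderTree(n.children, depth + 1));
  });
  return lines;
}

const exampleSpans: SpanRecord[] = [
  { traceId: 't1', spanId: 'a', name: 'GET /orders', startMs: 0, endMs: 120 },
  { traceId: 't1', spanId: 'b', parentSpanId: 'a', name: 'db.query', startMs: 10, endMs: 90 },
  { traceId: 't1', spanId: 'c', parentSpanId: 'a', name: 'kafka.produce', startMs: 95, endMs: 110 },
];
console.log(renderTree(buildSpanTree(exampleSpans)).join('\n'));
```

Treating orphaned spans as roots is a design choice for this sketch: it keeps partial traces visible instead of silently dropping them.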

### OTel architecture
1. **SDK**: in-app instrumentation (auto + manual).
2. **Collector**: receive → process (batch, sample, redact) → export.
3. **Backend**: Tempo (Grafana), Jaeger, Honeycomb, Datadog APM.
4. **UI**: Grafana, Jaeger UI, or the vendor's UI.
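
The Collector's receive → process → export flow can be pictured as a chain of functions over a batch of spans. The sketch below is only a mental model, not Collector code; `ExportSpan`, `Processor`, `redact`, `dropHealthChecks`, and `runPipeline` are hypothetical names, and the `/healthz` route is an assumed example.

```ts
// Hypothetical, simplified span shape used only for this illustration.
type Attributes = Record<string, string | number | boolean>;
interface ExportSpan {
  name: string;
  attributes: Attributes;
}

// A processor takes a batch and returns a (possibly smaller or modified) batch.
type Processor = (batch: ExportSpan[]) => ExportSpan[];

// Replace the values of sensitive attribute keys before spans leave the pipeline.
const redact = (keys: string[]): Processor => (batch) =>
  batch.map((s) => {
    const attributes: Attributes = { ...s.attributes };
    keys.forEach((k) => {
      if (k in attributes) attributes[k] = '[REDACTED]';
    });
    return { ...s, attributes };
  });

// Drop noisy spans that carry no debugging value (assumed health-check route).
const dropHealthChecks: Processor = (batch) => batch.filter((s) => s.name !== 'GET /healthz');

// Run every processor in order, then hand the result to the exporter.
function runPipeline(
  batch: ExportSpan[],
  processors: Processor[],
  exportFn: (batch: ExportSpan[]) => void,
): void {
  exportFn(processors.reduce((acc, p) => p(acc), batch));
}

runPipeline(
  [
    { name: 'GET /healthz', attributes: {} },
    { name: 'POST /orders', attributes: { 'user.email': 'a@example.com' } },
  ],
  [dropHealthChecks, redact(['user.email'])],
  (sent) => console.log(JSON.stringify(sent)),
);
```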

### Applications
1. Latency root cause (which span is slow).
2. Error correlation (the `error=true` spans in a trace).
3. Service dependency map (service graph derived from spans).
4. Capacity planning (RED metrics derived from spans).
5. SLO debugging (attribute SLO budget burn to specific traces).
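
As an illustration of deriving RED metrics (rate, errors, duration) from spans, here is a minimal sketch. The `FinishedSpan` shape and function names are assumptions, and the result is only accurate when computed over unsampled spans (or before sampling is applied), since sampling skews rates and error ratios.

```ts
// Hypothetical finished-span summary; field names are illustrative.
interface FinishedSpan {
  service: string;
  durationMs: number;
  isError: boolean;
}

interface RedMetrics {
  ratePerSec: number; // Rate: requests per second over the window
  errorRatio: number; // Errors: fraction of spans that failed
  p95Ms: number;      // Duration: 95th percentile latency
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(0, rank - 1)];
}

// Group spans by service and compute RED metrics for a window of `windowSec` seconds.
function redMetrics(spans: FinishedSpan[], windowSec: number): Map<string, RedMetrics> {
  const byService = new Map<string, FinishedSpan[]>();
  spans.forEach((s) => {
    const list = byService.get(s.service) ?? [];
    list.push(s);
    byService.set(s.service, list);
  });

  const out = new Map<string, RedMetrics>();
  byService.forEach((list, service) => {
    const durations = list.map((s) => s.durationMs).sort((a, b) => a - b);
    out.set(service, {
      ratePerSec: list.length / windowSec,
      errorRatio: list.filter((s) => s.isError).length / list.length,
      p95Ms: percentile(durations, 95),
    });
  });
  return out;
}

const windowSpans: FinishedSpan[] = [
  { service: 'orders-api', durationMs: 10, isError: false },
  { service: 'orders-api', durationMs: 20, isError: false },
  { service: 'orders-api', durationMs: 30, isError: false },
  { service: 'orders-api', durationMs: 400, isError: true },
];
console.log(redMetrics(windowSpans, 2).get('orders-api'));
```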

## 💻 Patterns

### OTel Node.js auto-instrumentation
```ts
// otel.ts (loaded with --import / NODE_OPTIONS)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { resourceFromAttributes } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  // resourceFromAttributes is the SDK 2.x API; on SDK 1.x use `new Resource({...})` instead.
  resource: resourceFromAttributes({ 'service.name': 'orders-api', 'service.version': '1.4.2' }),
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Flush pending spans before the process exits.
process.on('SIGTERM', () => { void sdk.shutdown(); });
```

### Manual span (TypeScript)
```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('orders');

// `db`, `kafka`, and `OrderInput` are application-specific and defined elsewhere.
export async function placeOrder(input: OrderInput) {
  return tracer.startActiveSpan('placeOrder', async (span) => {
    span.setAttributes({ 'order.customer_id': input.customerId, 'order.line_count': input.lines.length });
    try {
      const order = await db.orders.insert(input);
      await kafka.produce('orders.placed', order);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (e) {
      // Record the failure on the span, then rethrow so callers still see the error.
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
      throw e;
    } finally {
      // Always end the span, on success and on failure.
      span.end();
    }
  });
}
```

### Python (FastAPI auto-instrument)
```python
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# insecure=True assumes the collector's gRPC port is plaintext (no TLS).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # server spans for incoming requests
HTTPXClientInstrumentor().instrument()   # client spans + header injection for outgoing httpx calls
```

### Trace context propagation (W3C)
```text
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^                                ^                ^
             |  trace_id (32 hex)                parent_id (16)   flags
             version
tracestate: vendor1=value,vendor2=value
baggage: userId=42,tenantId=acme
```
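
The header layout above can be checked with a small parser. This is a minimal sketch that only accepts the version `00` layout; `TraceParent` and `parseTraceparent` are names chosen for illustration, not part of any OTel package (in practice the SDK's propagator does this for you).

```ts
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex chars
  parentId: string; // 16 lowercase hex chars (the caller's span id)
  sampled: boolean; // bit 0 of the flags byte
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 traceparent value.
function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version !== '00') return null;
  // All-zero trace ids and parent ids are invalid per the W3C spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

console.log(parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
```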

### OTel Collector pipeline
```yaml
# otel-collector.yaml
receivers:
  otlp: { protocols: { grpc: {}, http: {} } }
processors:
  batch: { timeout: 1s }
  tail_sampling:
    decision_wait: 30s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 1000 } }
      - { name: rest, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
exporters:
  otlphttp/tempo: { endpoint: http://tempo:4318 }
  # Not referenced by the traces pipeline below; attach it to a logs pipeline if needed.
  loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
  pipelines:
    # Sample first, then batch, so batching only handles traces that are kept.
    traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlphttp/tempo] }
```
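
The three policies in the config above amount to: keep every trace with an error, keep every slow trace, and keep roughly 5% of the rest. The sketch below expresses that decision in TypeScript as an illustration only; it is not the Collector's algorithm (its probabilistic policy uses its own hashing), and `TraceSummary`, `inPercentage`, and `keepTrace` are hypothetical names.

```ts
// Hypothetical summary of a completed trace, available only after all spans arrived.
interface TraceSummary {
  traceId: string;    // 32 hex chars
  hasError: boolean;  // any span ended with status ERROR
  durationMs: number; // duration of the whole trace
}

// Deterministic keep/drop: map the low 32 bits of the trace id to [0, 100).
// The same trace id always yields the same decision.
function inPercentage(traceId: string, percentage: number): boolean {
  const bucket = (parseInt(traceId.slice(-8), 16) / 0x100000000) * 100;
  return bucket < percentage;
}

// Returns the name of the policy that kept the trace, or null if it is dropped.
function keepTrace(
  t: TraceSummary,
  thresholdMs = 1000,
  baselinePct = 5,
): 'errors' | 'slow' | 'rest' | null {
  if (t.hasError) return 'errors';
  if (t.durationMs > thresholdMs) return 'slow';
  if (inPercentage(t.traceId, baselinePct)) return 'rest';
  return null;
}

console.log(keepTrace({ traceId: '4bf92f3577b34da6a3ce929d0e0e4736', hasError: false, durationMs: 1500 }));
```

Because the decision needs the error status and total duration, it can only be made after the trace completes, which is why the Collector buffers spans for `decision_wait` before evaluating policies.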

### Trace ↔ Log correlation
```ts
import pino from 'pino';
import { trace } from '@opentelemetry/api';
const log = pino();

// Attach the active span's ids to each log record (undefined when no span is active).
function logWithTrace(msg: string, extra: object = {}) {
  const span = trace.getActiveSpan();
  const ctx = span?.spanContext();
  log.info({ ...extra, trace_id: ctx?.traceId, span_id: ctx?.spanId }, msg);
}
// A Loki derived field pointing at Tempo lets you jump from a log line to its trace.
```
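
### Frontend → backend trace
```ts
// Browser OTel SDK: injects `traceparent` into outgoing fetch calls.
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { FetchInstrumentation } from '@opentelemetry/instrumentation-fetch';

// No span processor or exporter is configured here, so context is propagated
// but browser spans are not exported.
new WebTracerProvider().register();
registerInstrumentations({
  instrumentations: [
    new FetchInstrumentation({
      // Cross-origin URLs that should receive the header; the backend's CORS
      // policy must also allow `traceparent` in Access-Control-Allow-Headers.
      propagateTraceHeaderCorsUrls: [/api\.example\.com/],
    }),
  ],
});
```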

### eBPF-based zero-instrumentation (Beyla / Pixie)
```bash
# Grafana Beyla: auto-traces Go/Node/Python services by capturing requests with eBPF; no code changes needed.
beyla --config beyla.yaml
```

## Decision criteria
| Situation | Approach |
|---|---|
| Greenfield | OTel SDK + Tempo (Grafana stack) |
| Multi-cloud SaaS | Honeycomb / Datadog APM |
| Polyglot legacy | OTel Collector + per-language auto-instrumentation |
| Zero-code start | eBPF (Beyla, Pixie) |
| Cost control | Tail sampling on errors + slow traces, 5% baseline |
| High-cardinality data | Honeycomb (designed for high-cardinality queries) |

**Default**: OTel SDK + W3C Trace Context + Collector with tail sampling + Tempo or a vendor backend.

## 🔗 Graph
- Parent: [[Observability]] · [[Microservices]]
- Variants: [[Tempo]]
- Applications: [[Service-Mesh]]
- Adjacent: [[OpenTelemetry]] · [[Logs]]

## 🤖 LLM usage
**When**: forming root-cause hypotheses from a trace tree, reviewing sampling policies, debugging OTel Collector configs, designing span attribute schemas.
**When not**: making binding production sampling decisions (the cost vs. signal trade-off runs deep); acting as the sole reviewer of PII redaction (a security review is required).

## ❌ Anti-patterns
- **No sampling**: cost and storage explode. Tail-sample on errors and slow traces.
- **High-cardinality attributes on every span**: putting `user_id` on every span is expensive unless the backend is built to index high-cardinality fields.
- **No frontend tracing**: only server-side latency is visible, so user-perceived latency is missed.
- **Logs without `trace_id`**: there is no way to jump between a trace and its logs.
- **Manual span without `end()`**: the span leaks and is never exported.
- **Sync span across an async boundary**: the context is lost; use `startActiveSpan`.
- **Vendor lock-in via SDK**: use the OTel SDK with a vendor exporter, not the vendor's own SDK.
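
One way to avoid the missing-`end()` anti-pattern is to route manual spans through a helper that ends them in a `finally` block. The sketch below uses a hypothetical `SpanLike` interface that mirrors only the two span methods it needs, so it runs without the OTel packages; `withSpan` is an illustrative name, not an OTel API.

```ts
// Hypothetical minimal span shape; it mirrors only the methods used here.
interface SpanLike {
  recordException(err: Error): void;
  end(): void;
}

// Runs `fn` and guarantees the span is ended exactly once, on success or failure.
async function withSpan<T>(span: SpanLike, fn: () => Promise<T>): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    span.recordException(err instanceof Error ? err : new Error(String(err)));
    throw err; // rethrow so callers still see the failure
  } finally {
    span.end();
  }
}

// Usage with a fake span that logs its lifecycle.
const demoSpan: SpanLike = {
  recordException: (err) => console.log('exception:', err.message),
  end: () => console.log('span ended'),
};
withSpan(demoSpan, async () => 'ok').then((result) => console.log(result));
```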

## 🧪 Verification / duplicates
- Verified (OpenTelemetry spec, W3C Trace Context, Grafana Tempo docs, Honeycomb engineering blog).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: distributed tracing with OpenTelemetry, sampling, propagation |