---
id: wiki-2026-0508-텔레메트리-telemetry
title: 텔레메트리 (Telemetry)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Telemetry, Observability data, Metrics + Traces + Logs]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [observability, otel, metrics, traces, logs, architecture]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: TypeScript
  framework: OpenTelemetry 2.x
---
# Telemetry
## One-line summary
> **"Telemetry is the act of a system emitting its internal state to the outside: the trinity of metrics, traces, and logs."** The word comes from the Greek *tele* (remote) + *metron* (measure). As of 2026, the de facto standard in the modern stack is OpenTelemetry 2.x: a vendor-neutral instrumentation API plus the OTLP wire protocol.
## Core concepts
### Three Pillars
- **Metrics**: numeric aggregations (counter, gauge, histogram). Low cardinality. The source for alerting.
- **Traces**: the causal chain of a distributed request, represented as a span tree. High cardinality.
- **Logs**: discrete event records. Structured (JSON) output is recommended.
### The additional pillar in 2026
- **Profiles** (continuous profiling): sampled CPU / memory flame graphs. Built on an eBPF + pprof stack. Pyroscope / Parca / Grafana Profiles.
### Push vs Pull
- **Push**: agent → collector (OTLP, statsd). Suited to ephemeral workloads.
- **Pull**: scraper → endpoint (Prometheus). Suited to long-running services.
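The difference between the two models can be sketched in a few lines. This is a minimal in-memory illustration, not an OTLP or Prometheus client: `PushAgent`, `PullEndpoint`, and the `Sample` type are hypothetical names introduced here. In the push model the producer decides when data leaves the process; in the pull model the data waits until a scraper asks for it.

```typescript
// Hypothetical types for illustration only; not part of any OpenTelemetry or Prometheus API.
type Sample = { name: string; value: number };

// Push: the workload sends samples to a collector on its own schedule,
// so a short-lived job can flush right before it exits.
class PushAgent {
  private buffer: Sample[] = [];
  private send: (batch: Sample[]) => void;
  constructor(send: (batch: Sample[]) => void) {
    this.send = send;
  }
  record(sample: Sample): void {
    this.buffer.push(sample);
  }
  flush(): void {
    if (this.buffer.length === 0) return;
    this.send(this.buffer);
    this.buffer = [];
  }
}

// Pull: the service only keeps current values; a scraper reads them when it wants.
class PullEndpoint {
  private current = new Map<string, number>();
  set(name: string, value: number): void {
    this.current.set(name, value);
  }
  scrape(): Sample[] {
    return Array.from(this.current.entries()).map(([name, value]) => ({ name, value }));
  }
}

// Usage: a short-lived job pushes before exiting; a long-running service is scraped.
const received: Sample[][] = [];
const agent = new PushAgent((batch) => received.push(batch));
agent.record({ name: 'job.rows_processed', value: 1200 });
agent.flush();

const endpoint = new PullEndpoint();
endpoint.set('queue.depth', 7);
endpoint.set('queue.depth', 3); // a later value overwrites the earlier one
const scraped = endpoint.scrape();
```

The pull side has nothing to scrape once the process is gone, which is why ephemeral workloads tend to favor push.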
### Applications
1. SLO/SLI measurement: calculating the error budget.
2. Distributed debugging: tracking cross-service latency with traces.
3. Capacity planning: forecasting from historical metrics.
4. Security audit: incident reconstruction from logs + traces.
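The error budget in item 1 reduces to simple arithmetic. As a minimal sketch, assuming a request-based availability SLO: the budget is the fraction of requests allowed to fail, `1 - target`, and the remaining budget compares that allowance with the failures observed so far. The function name `errorBudget` and the sample numbers are illustrative and not taken from any library.

```typescript
interface BudgetReport {
  allowedFailures: number;   // failures the SLO tolerates in this window
  remainingFailures: number; // allowance left (negative means the SLO is breached)
  consumedFraction: number;  // share of the budget already spent
}

function errorBudget(sloTarget: number, totalRequests: number, failedRequests: number): BudgetReport {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be strictly between 0 and 1');
  }
  const allowedFailures = (1 - sloTarget) * totalRequests;
  return {
    allowedFailures,
    remainingFailures: allowedFailures - failedRequests,
    consumedFraction: allowedFailures === 0 ? 0 : failedRequests / allowedFailures,
  };
}

// Example: 99.9% SLO, 1,000,000 requests in the window, 250 failures.
const report = errorBudget(0.999, 1_000_000, 250);
```

With these inputs the budget is roughly 1,000 failed requests, of which about a quarter is spent. Alerting on how fast `consumedFraction` grows is one common way to turn this into a burn-rate alert.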
## 💻 Patterns
### Pattern 1 — OpenTelemetry SDK setup (Node)
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { resourceFromAttributes } from '@opentelemetry/resources';

// Resource attributes identify this service in every exported signal.
const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    'service.name': 'order-api',
    'service.version': '1.4.0',
    'deployment.environment': process.env.ENV ?? 'dev',
  }),
  // Traces and metrics both go to the Collector over OTLP/HTTP (port 4318).
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4318/v1/metrics' }),
    exportIntervalMillis: 10_000, // export metrics every 10 s
  }),
});

sdk.start();
```
### Pattern 2 — Manual span
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-api');

// chargeCard is assumed to be defined elsewhere in the service.
async function placeOrder(orderId: string) {
  return tracer.startActiveSpan('placeOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      const result = await chargeCard(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // Record the failure on the span, then rethrow so callers still see it.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // Always end the span; a span that is never ended is never exported.
      span.end();
    }
  });
}
```
### Pattern 3 — Counter / Histogram
```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-api');
const orderCounter = meter.createCounter('orders.placed', {
  description: 'Total orders placed',
});
const latencyHist = meter.createHistogram('order.latency_ms', {
  description: 'Order placement latency',
  unit: 'ms',
});

// placeOrder is the function from Pattern 2; id is assumed to be in scope.
const start = performance.now();
await placeOrder(id);
orderCounter.add(1, { region: 'kr', tier: 'premium' });
latencyHist.record(performance.now() - start, { route: 'POST /orders' });
```
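To make the histogram instrument less opaque, here is a minimal sketch of explicit-bucket aggregation. It is a simplified illustration rather than the SDK's implementation: `bucketize` and the boundary values are hypothetical. Each boundary is treated as an inclusive upper bound, and one extra overflow bucket catches everything above the last boundary.

```typescript
// Simplified sketch of explicit-bucket histogram aggregation.
// bucketize and the boundaries below are illustrative assumptions.
interface HistogramState {
  boundaries: number[]; // sorted upper bounds, each inclusive
  counts: number[];     // boundaries.length + 1 entries (the last one is overflow)
  sum: number;
  count: number;
}

function bucketize(boundaries: number[], values: number[]): HistogramState {
  const counts = new Array<number>(boundaries.length + 1).fill(0);
  let sum = 0;
  for (const v of values) {
    // Find the first boundary that is >= v; if none, use the overflow bucket.
    let idx = boundaries.findIndex((b) => v <= b);
    if (idx === -1) idx = boundaries.length;
    counts[idx] += 1;
    sum += v;
  }
  return { boundaries, counts, sum, count: values.length };
}

// Latencies in ms recorded by a hypothetical handler.
const state = bucketize([10, 50, 100, 500], [3, 10, 42, 99, 120, 800]);
// state.counts is [2, 1, 1, 1, 1]
```

Only the per-bucket counts, the sum, and the count are kept, not the raw values, which is why a histogram stays cheap regardless of request volume.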
### Pattern 4 — Structured logging with trace correlation
```typescript
import { trace } from '@opentelemetry/api';
import pino from 'pino';

const logger = pino({
  // mixin runs on every log call and merges its result into the log record.
  mixin: () => {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return { trace_id: ctx.traceId, span_id: ctx.spanId };
  },
});

logger.info({ orderId: '123' }, 'order placed');
// Logs can now be joined to traces on trace_id.
```
### Pattern 5 — Sampling (head-based)
```typescript
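import { NodeSDK } from '@opentelemetry/sdk-node';
import { TraceIdRatioBasedSampler, ParentBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Follow the parent's sampling decision; for root spans, sample by ratio.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // sample 10% of root traces
  }),
  // ...
});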
```
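The key property of ratio-based head sampling is that the decision is a pure function of the trace ID, so every service that sees the same trace reaches the same verdict without coordination. The sketch below illustrates that idea in simplified form. It is not the algorithm used by `TraceIdRatioBasedSampler`, and `shouldSample` is a hypothetical helper. It assumes trace IDs are 32 lowercase hex characters whose low bits are uniformly distributed.

```typescript
// Simplified illustration of deterministic ratio sampling.
// Not the OpenTelemetry SDK algorithm; shouldSample is a hypothetical helper.
function shouldSample(traceId: string, ratio: number): boolean {
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error('traceId must be 32 lowercase hex characters');
  }
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  // Interpret the last 8 hex characters as an unsigned 32-bit integer.
  const low32 = parseInt(traceId.slice(-8), 16);
  // Sample when that integer falls below ratio * 2^32.
  return low32 < ratio * 2 ** 32;
}

const id = '4bf92f3577b34da6a3ce929d0e0e4736';
const first = shouldSample(id, 0.1);
const second = shouldSample(id, 0.1); // same trace ID, same decision
```

Because the decision is made before the request finishes, head sampling cannot prefer slow or failed traces; that trade-off is what tail-based sampling in a collector addresses.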
### Pattern 6 — Exemplar (metric → trace link)
```typescript
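// An exemplar attaches the trace_id to a metric measurement, which enables
// metric → trace drill-down in Grafana.
// latencyHist is the histogram from Pattern 3; latency and attrs are assumed to be in scope.
latencyHist.record(latency, attrs);
// The exemplar is meant to be taken from the active span by the SDK rather than passed
// by hand; whether it is actually emitted depends on SDK and backend support.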
```
### Pattern 7 — Context propagation (HTTP header)
```typescript
import { propagation, context } from '@opentelemetry/api';

// Outbound: inject the active context into the request headers.
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
fetch('https://api.example.com', { headers });

// Inbound: extract the context from the incoming headers.
// app is assumed to be an Express-style app.
app.use((req, res, next) => {
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, () => next());
});
// The context travels in the W3C traceparent / tracestate headers.
```
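The `traceparent` header has a fixed layout in the W3C Trace Context specification: `version-traceid-parentid-flags`, with 2, 32, 16, and 2 lowercase hex characters respectively. The following is a minimal sketch of parsing and formatting it, assuming the version `00` layout. In practice the propagator does this for you; `parseTraceparent` and `formatTraceparent` are hypothetical helpers written for illustration.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version ff is invalid; all-zero trace or parent IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Only the sampled bit is preserved; other flag bits are dropped in this sketch.
function formatTraceparent(tp: TraceParent): string {
  return `${tp.version}-${tp.traceId}-${tp.parentId}-${tp.sampled ? '01' : '00'}`;
}

const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
```

A downstream service keeps the `traceId`, replaces `parentId` with the ID of its own span, and forwards the header, which is how the span tree is stitched together across processes.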
### Pattern 8 — RED method instrumentation
```typescript
// Rate, Errors, Duration: the service-level minimum.
// meter is the Meter from Pattern 3; app is assumed to be an Express-style app.
const reqCounter = meter.createCounter('http.requests');
const errCounter = meter.createCounter('http.errors');
const durHist = meter.createHistogram('http.duration_ms');

app.use((req, res, next) => {
  const start = performance.now();
  res.on('finish', () => {
    // Use the route template (not the raw URL) to keep label cardinality low.
    const labels = { route: req.route?.path, method: req.method, status: res.statusCode };
    reqCounter.add(1, labels);
    if (res.statusCode >= 500) errCounter.add(1, labels);
    durHist.record(performance.now() - start, labels);
  });
  next();
});
```
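Once requests, errors, and durations are recorded, the RED signals are derived by aggregating over a time window. In production the metrics backend performs this aggregation; the sketch below only shows the arithmetic on raw request records. `RequestRecord`, `summarizeRed`, and the choice of a nearest-rank percentile are assumptions made for the illustration.

```typescript
interface RequestRecord {
  status: number;     // HTTP status code
  durationMs: number; // request duration in milliseconds
}

interface RedSummary {
  ratePerSec: number; // Rate: requests per second
  errorRatio: number; // Errors: share of 5xx responses
  p95Ms: number;      // Duration: 95th percentile (nearest-rank)
}

function summarizeRed(records: RequestRecord[], windowSec: number): RedSummary {
  if (records.length === 0) return { ratePerSec: 0, errorRatio: 0, p95Ms: 0 };
  const errors = records.filter((r) => r.status >= 500).length;
  const sorted = records.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the value at position ceil(95 * n / 100), 1-indexed.
  const rank = Math.ceil((95 * sorted.length) / 100);
  return {
    ratePerSec: records.length / windowSec,
    errorRatio: errors / records.length,
    p95Ms: sorted[rank - 1],
  };
}

// 20 requests in a 10-second window: durations 10..200 ms, two of them 5xx.
const sample: RequestRecord[] = Array.from({ length: 20 }, (_, i) => ({
  status: i < 2 ? 503 : 200,
  durationMs: (i + 1) * 10,
}));
const red = summarizeRed(sample, 10);
```

For this sample the result is 2 requests per second, a 10% error ratio, and a p95 of 190 ms. A backend working from histogram buckets can only approximate the percentile, since it no longer has the raw durations.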
## Decision criteria
| Situation | Telemetry choice |
|---|---|
| Service-level alerting | Metrics (RED / USE) |
| Cross-service latency analysis | Traces |
| Incident forensics | Logs + Traces |
| CPU hotspots | Profiles (continuous) |
| High-cardinality dimensions | Traces (NOT metrics) |
| Cost-sensitive | Sampling ratio 0.01 to 0.1 |

**Default**: OpenTelemetry SDK + OTLP exporter → Collector → Grafana / Datadog / Honeycomb. This avoids vendor lock-in.
## 🔗 Graph
- Parent: [[관측가능성 (Observability)]] · [[SRE 원칙]]
- Variants: [[Metrics (Prometheus)]] · [[Tracing (Jaeger / Tempo)]] · [[Logging (Loki / ELK)]] · [[Continuous Profiling]]
- Applications: [[SLO / SLI]] · [[분산 디버깅]] · [[Capacity Planning]]
- Adjacent: [[OpenTelemetry Collector]] · [[eBPF Observability]]
## 🤖 LLM usage
**When**: designing instrumentation for a production service, OTel migration, cardinality analysis.
**When not**: dev-only scripts. Also avoid putting high-cardinality dimensions into metrics, which causes a cost explosion.
## ❌ Anti-patterns
- **High cardinality on metrics**: using user_id as a metric label makes storage explode.
- **Relying on traces alone**: traces are sampled, so absolute counts cannot be trusted.
- **Unstructured logs**: string concatenation makes logs impossible to query.
- **Vendor SDK lock-in**: using the Datadog SDK directly instead of OTel raises migration cost.
- **No sampling**: sending 100% of traces adds cost and latency overhead.
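The first anti-pattern is easiest to see as arithmetic: in the worst case, the number of time series for one metric is the product of the distinct values of each label. A minimal sketch, with `seriesCount` as a hypothetical helper and the label counts as made-up illustrative values:

```typescript
// Worst-case number of time series for one metric:
// the product of the distinct values each label can take.
function seriesCount(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((product, n) => product * n, 1);
}

// Bounded labels only (illustrative numbers).
const bounded = seriesCount({ route: 20, method: 5, status: 10 });

// The same metric after adding a user_id label with 100,000 distinct users.
const withUserId = seriesCount({ route: 20, method: 5, status: 10, user_id: 100_000 });
```

Here the bounded labels give at most 1,000 series, while adding `user_id` raises the ceiling to 100 million. Real workloads rarely hit every combination, but the upper bound shows why per-user detail belongs on spans or logs instead.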
## 🧪 Verification / duplicates
- Verified (OpenTelemetry 2.x docs 2026, CNCF observability whitepaper).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Three Pillars + Profiles + 8 OTel patterns |