| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| devops-observability-stack | Observability — Logs / Metrics / Traces | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 |
Observability Stack
Three pillars: Logs (what happened) + Metrics (how much) + Traces (where). OpenTelemetry is the vendor-neutral standard. Backends: the Grafana stack (Prometheus + Loki + Tempo), or Datadog / Honeycomb / SigNoz.
📖 Core concepts
- Logs: structured (JSON). Loki / Elastic / Datadog Logs.
- Metrics: numeric time series. Prometheus / Mimir / Datadog.
- Traces: a request's path across services, with timing. Tempo / Jaeger / Honeycomb.
- OTEL: unified SDK + collector. Switching backends means swapping only the exporter.
💻 Code patterns
Node + OTEL
// otel.ts (load this first, at the very top of the entry point)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// When `url` is set in code, OTLP/HTTP exporters expect the full per-signal path.
// Assumption: OTEL_ENDPOINT holds a base URL such as http://localhost:4318.
const endpoint = process.env.OTEL_ENDPOINT ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter({ url: `${endpoint}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${endpoint}/v1/metrics` }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Automatic: auto-instrumentation creates spans for HTTP / DB / Redis / fetch calls.
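When the exporter URL is set in code, OTLP/HTTP uses a separate path per signal (`/v1/traces`, `/v1/metrics`, `/v1/logs`). A minimal sketch of deriving those URLs from one base endpoint follows. `otlpSignalUrl` is a hypothetical helper, not part of any OTEL package, and the fallback `http://localhost:4318` assumes a local collector on the standard OTLP/HTTP port.

```typescript
type Signal = 'traces' | 'metrics' | 'logs';

// Hypothetical helper: joins a base OTLP/HTTP endpoint with the per-signal path.
function otlpSignalUrl(base: string | undefined, signal: Signal): string {
  // Fall back to a local collector (assumption) and strip trailing slashes.
  const root = (base?.trim() || 'http://localhost:4318').replace(/\/+$/, '');
  return `${root}/v1/${signal}`;
}

const tracesUrl = otlpSignalUrl('http://collector:4318/', 'traces');
// tracesUrl === 'http://collector:4318/v1/traces'
```

Passing one base URL through a helper like this keeps the trace and metric exporters from accidentally posting to the same path.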
Manual span
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

async function processOrder(id: string) {
  // startActiveSpan makes the span current, so nested spans become its children.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({ 'order.id': id });
    try {
      const result = await doWork(id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      // Always end the span, otherwise it is never exported.
      span.end();
    }
  });
}
Metrics
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app');
const orderCount = meter.createCounter('orders_total');
const orderLatency = meter.createHistogram('order_latency_ms', { unit: 'ms' });

// Attributes become labels: keep them low-cardinality (region, plan), never per-user ids.
orderCount.add(1, { region: 'us', plan: 'pro' });

const t = Date.now();
await doWork();
orderLatency.record(Date.now() - t);
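Each distinct combination of attribute values (such as `region` and `plan` above) becomes its own time series, so the worst-case series count is the product of the distinct values per attribute. A rough sketch of that arithmetic; `maxSeries` is a hypothetical helper and the value counts are made-up illustration numbers.

```typescript
// Worst-case number of time series for one metric:
// the product of the distinct value counts of its labels.
function maxSeries(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((acc, n) => acc * n, 1);
}

// Illustrative numbers only.
const boundedSeries = maxSeries({ region: 5, plan: 3 });                    // 15
const explodedSeries = maxSeries({ region: 5, plan: 3, userId: 100000 });   // 1500000
```

Adding a single unbounded label multiplies every existing series, which is why identifiers belong in logs or span attributes instead.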
Structured logs (Pino)
import pino from 'pino';

const log = pino({
  level: 'info',
  formatters: {
    level(label) { return { level: label }; },
  },
});

log.info({ userId, orderId }, 'order created');
// {"level":"info","time":...,"userId":"u1","orderId":"o2","msg":"order created"}
Linking logs ↔ traces: OTEL log instrumentation can inject trace_id / span_id automatically. The same link can be added by hand:
import { context, trace } from '@opentelemetry/api';
const span = trace.getSpan(context.active());
log.info({ traceId: span?.spanContext().traceId, ...attrs }, 'event');
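The trace_id and span_id attached to the log line are the same identifiers that travel between services in the W3C `traceparent` header (`version-traceid-spanid-flags`). A minimal sketch of parsing and formatting that header, handling version `00` only; `parseTraceparent` and `formatTraceparent` are hypothetical names, and in practice the OTEL propagator does this work.

```typescript
interface TraceContext {
  traceId: string;  // 32 lowercase hex chars
  spanId: string;   // 16 lowercase hex chars
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Version 00 layout: 00-<trace-id>-<parent-id>-<trace-flags>
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_V00.exec(header.trim());
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero ids are invalid and must be rejected.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedContext = parseTraceparent(incoming);
// parsedContext?.traceId === '4bf92f3577b34da6a3ce929d0e0e4736'
```

Logging `parsedContext.traceId` on both sides of a service call is what lets a log backend join lines from different services into one trace.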
Prometheus metrics (legacy)
import client from 'prom-client';

const counter = new client.Counter({ name: 'orders_total', help: 'orders' });

// `app` is assumed to be an existing Express app.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
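RED method: baseline metrics for every service
- Rate: requests per second.
- Errors: error rate.
- Duration: latency p50/p95/p99.
// Applied to every endpoint. `otelHttpMetricsMiddleware` is a placeholder for
// whichever HTTP metrics middleware or instrumentation the service uses.
app.use(otelHttpMetricsMiddleware);
// Resulting metrics: http_server_duration_milliseconds, http_server_active_requests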
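To make the three RED numbers concrete, here is a small sketch that computes them from raw request records over one time window. `RequestRecord`, `percentile`, and `redSummary` are hypothetical names, and the nearest-rank percentile is a simplification: a metrics backend would normally estimate percentiles from histogram buckets instead.

```typescript
interface RequestRecord {
  status: number;     // HTTP status code
  durationMs: number; // server-side latency
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index];
}

function redSummary(records: RequestRecord[], windowSeconds: number) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const errors = records.filter((r) => r.status >= 500).length;
  return {
    ratePerSecond: records.length / windowSeconds,                  // Rate
    errorRatio: records.length === 0 ? 0 : errors / records.length, // Errors
    p50: percentile(durations, 50),                                 // Duration
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}
```

Only 5xx responses count as errors here, matching the alert rule's `status=~"5.."` selector; whether 4xx should count is a per-service decision.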
Trace + log linking (Grafana)
Grafana Loki → Tempo:
click the trace_id in a log line → view the spans of that trace
Alert (Prometheus)
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_count{status=~"5.."}[5m])) /
          sum(rate(http_server_request_count[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations: { summary: "API 5xx > 5%" }
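The `for: 5m` clause means the expression must stay true across consecutive evaluations for five minutes before the alert fires; until then it is only pending. A simplified sketch of that state machine, replaying evaluation results in time order; `Sample` and `alertState` are hypothetical names, and this models the behavior rather than reproducing Prometheus internals.

```typescript
interface Sample {
  t: number;          // evaluation time, in seconds
  errorRatio: number; // 5xx requests / all requests over the rate window
}

type AlertState = 'inactive' | 'pending' | 'firing';

// Returns the alert state after the last evaluation.
function alertState(samples: Sample[], threshold: number, forSeconds: number): AlertState {
  let activeSince: number | null = null;
  let state: AlertState = 'inactive';
  for (const s of samples) {
    if (s.errorRatio > threshold) {
      if (activeSince === null) activeSince = s.t;
      state = s.t - activeSince >= forSeconds ? 'firing' : 'pending';
    } else {
      // Any healthy evaluation resets the timer.
      activeSince = null;
      state = 'inactive';
    }
  }
  return state;
}
```

A two-minute error burst therefore never pages anyone, while a sustained breach does once the five minutes have elapsed.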
🤔 Decision criteria
| Stage | Recommendation |
|---|---|
| MVP | Sentry (errors) + basic logs |
| Scale | OTEL + Grafana stack (Prom + Loki + Tempo) |
| Complex distributed systems | Honeycomb (strong at high-cardinality queries) |
| ZeroOps | Datadog / New Relic (expensive) |
| Self-hosted | SigNoz (OTEL-native, single tool) |
| Open-source stack (managed) | Grafana Cloud (free tier) |
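❌ Anti-patterns
- Logs only: high-cardinality queries become expensive and slow.
- Metric label cardinality explosion: never use userId / requestId as labels.
- 100% trace sampling in production: costs explode. Use head-based sampling at about 1%, or tail-based sampling.
- Unstructured logs: you can only grep them; querying is hard.
- trace_id not propagated: spans cannot be joined across services. OTEL propagators handle this automatically.
- Alert noise: bursts shorter than 5 minutes trigger alerts too often. Use for: 5m or longer.
- Metrics without a dashboard: the data exists but nobody looks at it.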
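Head-based sampling decides at the start of a trace whether to keep it. One common approach derives the decision from the trace id, so that every service seeing the same id reaches the same answer. The sketch below is a simplified illustration that uses the low 32 bits of the id; `shouldSample` is a hypothetical function and not the exact algorithm of any SDK sampler.

```typescript
// Deterministic sampling decision from a hex trace id.
// ratio = 0.01 keeps roughly 1% of traces, assuming ids are uniformly random.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  const low32 = parseInt(traceId.slice(-8), 16); // 0 .. 2^32 - 1
  return low32 < ratio * 0x100000000;            // threshold = ratio * 2^32
}
```

Tail-based sampling makes the decision after the trace completes (for example, keeping all traces that contain an error), which requires buffering spans in a collector and is not covered by this sketch.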
🤖 Hints for LLM use
- OTEL = vendor-neutral; backends are swappable.
- RED for every service.
- JSON logs + trace_id link.