---
id: devops-observability-stack
title: "Observability: Logs / Metrics / Traces"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, observability, prometheus, grafana, opentelemetry, vibe-coding]
tech_stack: { language: "TS / Prometheus / Grafana / OTEL", applicable_to: ["DevOps", "Backend"] }
applied_in: []
aliases: [observability, OpenTelemetry, OTEL, Prometheus, Grafana, Loki, Tempo, distributed tracing]
---

# Observability Stack

> Three signals: **Logs (what happened) + Metrics (how much) + Traces (where)**. **OpenTelemetry** is the vendor-neutral standard. Back it with the Grafana stack (Prometheus + Loki + Tempo) or with Datadog / Honeycomb / SigNoz.

## 📖 Core concepts

- Logs: structured (JSON). Loki / Elastic / Datadog Logs.
- Metrics: numeric time series. Prometheus / Mimir / Datadog.
- Traces: a request's path across services, with timing. Tempo / Jaeger / Honeycomb.
- OTEL: unified SDK + collector. Changing backends only means swapping the exporter.

## 💻 Code patterns

### Node + OTEL

```ts
// otel.ts (load at the very top of the entry point, before any other module)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// The OTLP/HTTP exporters use `url` verbatim, so each signal needs its own path.
// OTEL_ENDPOINT is treated here as the collector base URL (4318 is the OTLP/HTTP default port).
const base = process.env.OTEL_ENDPOINT ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter({ url: `${base}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${base}/v1/metrics` }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Automatic: spans for HTTP / DB / Redis / fetch are created for you.

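
The HTTP and fetch auto-instrumentation also forwards trace context to downstream services. With OTEL's default propagator, that context travels in the W3C `traceparent` header. The following is a minimal hand-rolled sketch of the header's version-00 layout, for illustration only: `TraceParent`, `parseTraceParent`, and `formatTraceParent` are hypothetical names, not part of the OTEL API, and a real service should leave this to the propagator.

```typescript
// Hand-rolled W3C traceparent handling, for illustration only.
// Layout (version 00): <version>-<trace-id>-<parent-id>-<trace-flags>

interface TraceParent {
  version: string;  // 2 hex chars, "00" today
  traceId: string;  // 32 hex chars, shared by every span in the trace
  parentId: string; // 16 hex chars, the caller's span id
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version "ff" and all-zero ids are invalid per the spec.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? '01' : '00';
  return `${tp.version}-${tp.traceId}-${tp.parentId}-${flags}`;
}

// Usage: the header an instrumented client would attach to an outgoing request.
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsed = parseTraceParent(incoming);
console.log(parsed?.traceId, parsed?.sampled);
```

Every service that receives this header keeps the same `traceId` and records the incoming `parentId` as the parent of its own span, which is what lets a backend such as Tempo or Jaeger stitch one request together across services.
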
### Manual span

```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

// `doWork` is the application's own function (not shown).
async function processOrder(id: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({ 'order.id': id });
    try {
      const result = await doWork(id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e) {
      // Attach the exception to the span and mark it failed before rethrowing.
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      // Always end the span, otherwise it is never exported.
      span.end();
    }
  });
}
```

### Metrics

```ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app');
const orderCount = meter.createCounter('orders_total');
const orderLatency = meter.createHistogram('order_latency_ms', { unit: 'ms' });

// Keep attributes low-cardinality (region, plan), never per-user or per-request ids.
orderCount.add(1, { region: 'us', plan: 'pro' });

const t = Date.now();
await doWork();
orderLatency.record(Date.now() - t);
```

### Structured logs (Pino)

```ts
import pino from 'pino';

const log = pino({
  level: 'info',
  formatters: {
    // Emit the level as a string label instead of Pino's numeric default.
    level(label) {
      return { level: label };
    },
  },
});

log.info({ userId, orderId }, 'order created');
// {"level":"info","time":...,"userId":"u1","orderId":"o2","msg":"order created"}
```

Linking logs ↔ traces: OTEL's Pino instrumentation can inject trace_id / span_id automatically. Doing it by hand looks like this:

```ts
import { context, trace } from '@opentelemetry/api';

// Read the ids of the span that is active in the current context, if any.
const span = trace.getSpan(context.active());
log.info(
  {
    traceId: span?.spanContext().traceId,
    spanId: span?.spanContext().spanId,
    ...attrs,
  },
  'event',
);
```

### Prometheus metrics (legacy)

```ts
import express from 'express';
import client from 'prom-client';

const app = express();
const counter = new client.Counter({ name: 'orders_total', help: 'orders' });

// Scrape endpoint that Prometheus pulls from.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

### RED method: baseline metrics for every service

- **R**ate: requests per second.
- **E**rrors: error rate.
- **D**uration: latency p50/p95/p99.

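
To make the three RED numbers concrete, here is a minimal sketch that derives them from raw request samples. It is not how Prometheus or OTEL compute them (those use counters and histogram buckets). `RequestSample`, `summarizeRed`, and the nearest-rank `percentile` helper are hypothetical names chosen for illustration, and treating only 5xx responses as errors is an assumption.

```typescript
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRatio: number; // Errors, as a fraction of all requests
  p50: number;        // Duration percentiles, in ms
  p95: number;
  p99: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  // Integer arithmetic first, to avoid floating-point drift before ceil().
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

function summarizeRed(samples: RequestSample[], windowSec: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Assumption: only server-side failures (5xx) count as errors.
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSec: samples.length / windowSec,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Usage: 100 requests observed over a 10-second window, 5 of them failing.
const samples: RequestSample[] = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  status: i < 5 ? 500 : 200,
}));
console.log(summarizeRed(samples, 10));
```

In production these values come from the metrics backend rather than from in-process arrays, but the definitions are the same: a count over time, a ratio of failures, and percentiles of the latency distribution.
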
```ts
// Applied to every endpoint automatically.
// `otelHttpMetricsMiddleware` is not defined in this note; it stands for your HTTP metrics middleware.
app.use(otelHttpMetricsMiddleware);
// Result: http_server_duration_milliseconds, http_server_active_requests
```

### Linking traces and logs (Grafana)

```
Grafana Loki → Tempo:
  click the trace_id in a log line → view the spans of that trace
```

### Alert (Prometheus)

```yaml
# Metric and label names vary by instrumentation; adjust them to what you export.
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_count{status=~"5.."}[5m]))
          /
          sum(rate(http_server_request_count[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations: { summary: "API 5xx > 5%" }
```

## 🤔 Decision criteria

| Stage | Recommendation |
|---|---|
| MVP | Sentry (errors) + basic logs |
| Scale | OTEL + Grafana stack (Prom + Loki + Tempo) |
| Complex distributed system | Honeycomb (strong on high-cardinality data) |
| ZeroOps | Datadog / New Relic (expensive) |
| Self-hosted | SigNoz (OTEL native, single tool) |
| Open-source stack, managed | Grafana Cloud (free tier) |

## ❌ Anti-patterns

- **Logs only**: high-cardinality queries over logs are expensive and slow.
- **Metric label cardinality explosion**: never use userId / requestId as a label.
- **100% trace sampling in prod**: costs explode. Use head-based sampling at 1%, or tail-based sampling.
- **Unstructured logs**: you can only grep them; querying is hard.
- **trace_id not propagated**: cross-service correlation breaks. OTEL propagators handle this automatically.
- **Alert noise**: bursts shorter than 5 minutes trigger too often. Use `for: 5m` or longer.
- **Metrics without a dashboard**: the data exists but nobody looks at it.

## 🤖 LLM usage hints

- OTEL = vendor-neutral; backends are swappable.
- Apply RED to every service.
- JSON logs, linked to traces via trace_id.

## 🔗 Related documents

- [[Native_Crash_Reporting]]
- [[Backend_Webhook_Patterns]]