2nd/10_Wiki/Topics/Coding/DevOps_Observability_Stack.md
2026-05-09 21:08:02 +09:00

id: devops-observability-stack
title: Observability - Logs / Metrics / Traces
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: devops, observability, prometheus, grafana, opentelemetry, vibe-coding
tech_stack:
  language: TS / Prometheus / Grafana / OTEL
  applicable_to: DevOps, Backend
applied_in:
aliases: observability, OpenTelemetry, OTEL, Prometheus, Grafana, Loki, Tempo, distributed tracing

Observability Stack

Three signals: Logs (what happened) + Metrics (how much) + Traces (where). OpenTelemetry is the vendor-neutral standard. Backends: the Grafana stack (Prometheus + Loki + Tempo), or Datadog / Honeycomb / SigNoz.

📖 Core concepts

  • Logs: structured (JSON). Loki / Elastic / Datadog Logs.
  • Metrics: numeric time series. Prometheus / Mimir / Datadog.
  • Traces: a request's path across services, with timing. Tempo / Jaeger / Honeycomb.
  • OTEL: unified SDK + collector. Switching backends only means swapping the exporter.

💻 Code patterns

Node + OTEL

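// otel.ts (load this first, at the very top of the entry point)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Assumption: OTEL_ENDPOINT holds the base URL of an OTLP/HTTP receiver (4318 is the default port).
// The exporters' `url` option is the full per-signal URL, so each signal path is appended here.
const endpoint = process.env.OTEL_ENDPOINT ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter({ url: `${endpoint}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${endpoint}/v1/metrics` }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans and metrics before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});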

Automatic: auto-instrumentation creates spans for HTTP / DB / Redis / fetch calls.

Manual span

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

async function processOrder(id: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({ 'order.id': id });
    try {
      const result = await doWork(id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      span.end();
    }
  });
}

Metrics

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app');
const orderCount = meter.createCounter('orders_total');
const orderLatency = meter.createHistogram('order_latency_ms');

orderCount.add(1, { region: 'us', plan: 'pro' });
const t = Date.now();
await doWork();
orderLatency.record(Date.now() - t);
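
The attributes in the counter call above (`region`, `plan`) stay cheap because each has only a handful of distinct values. A minimal sketch of enforcing that before recording, assuming a hypothetical allow-list helper (`lowCardinalityAttrs` and `ALLOWED_KEYS` are illustrative names, not part of the OTEL API):

```typescript
// Hypothetical helper: keep only attribute keys known to have few distinct values.
type Attrs = Record<string, string | number | boolean>;

// Illustrative allow-list; the right set of keys depends on the service.
const ALLOWED_KEYS = new Set(['region', 'plan', 'method', 'status_class']);

function lowCardinalityAttrs(attrs: Attrs, allowed: Set<string> = ALLOWED_KEYS): Attrs {
  const out: Attrs = {};
  for (const [key, value] of Object.entries(attrs)) {
    // Drop anything not explicitly allowed (userId, requestId, ...).
    if (allowed.has(key)) out[key] = value;
  }
  return out;
}

const safe = lowCardinalityAttrs({ region: 'us', plan: 'pro', userId: 'u1', requestId: 'r-9f2' });
console.log(safe); // { region: 'us', plan: 'pro' }
```

With this in place, a call such as `orderCount.add(1, lowCardinalityAttrs(rawAttrs))` cannot create a new time series per user or per request. Per-user detail belongs in span attributes or logs instead.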

Structured logs (Pino)

import pino from 'pino';

const log = pino({
  level: 'info',
  formatters: {
    level(label) { return { level: label }; },
  },
});

log.info({ userId, orderId }, 'order created');
// {"level":"info","time":...,"userId":"u1","orderId":"o2","msg":"order created"}

Linking logs ↔ traces: OTEL instrumentation can inject trace_id / span_id into log records automatically. The manual equivalent:

import { context, trace } from '@opentelemetry/api';

const span = trace.getSpan(context.active());
log.info({ traceId: span?.spanContext().traceId, ...attrs }, 'event');
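
The trace ID logged above is the same ID that travels between services, typically in the W3C `traceparent` HTTP header, which OTEL propagators read and write. As a rough sketch of what that header carries (the `parseTraceparent` function and `TraceParent` type are illustrative, not an OTEL API, and only the version `00` layout is handled):

```typescript
// W3C Trace Context, version 00: 00-<traceId>-<parentSpanId>-<flags>, lowercase hex.
interface TraceParent {
  traceId: string;  // 32 hex chars
  spanId: string;   // 16 hex chars
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(parsed?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parsed?.sampled); // true
```

In practice the propagator does this work. The sketch is only meant to show why a service that drops or rewrites the header breaks the log ↔ trace link downstream.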

Prometheus metrics (legacy)

import client from 'prom-client';

const counter = new client.Counter({ name: 'orders_total', help: 'orders' });

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

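RED method: baseline metrics for every service

  • Rate: requests per second.
  • Errors: error rate.
  • Duration: latency p50/p95/p99.

// Applied to every endpoint automatically.
// otelHttpMetricsMiddleware is a placeholder for whatever HTTP metrics middleware or instrumentation the app uses.
app.use(otelHttpMetricsMiddleware);
// Result: http_server_duration_milliseconds, http_server_active_requests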
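
To make the three RED signals concrete, here is a small sketch that computes them from raw request samples over a time window. The `RequestSample` type and `redSummary` function are hypothetical names. A real stack would derive these from histogram buckets in the metrics backend rather than from raw samples, and the nearest-rank percentile used here is just one of several definitions:

```typescript
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

// Nearest-rank percentile over an ascending-sorted array; p in (0, 100].
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function redSummary(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSec: samples.length / windowSeconds,                     // Rate
    errorRatio: samples.length === 0 ? 0 : errors / samples.length, // Errors
    p50: percentile(durations, 50),                                 // Duration
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// Ten requests in a 5-second window: durations 10..100 ms, the first two fail with 500.
const samples: RequestSample[] = Array.from({ length: 10 }, (_, i) => ({
  durationMs: (i + 1) * 10,
  status: i < 2 ? 500 : 200,
}));

console.log(redSummary(samples, 5));
// { ratePerSec: 2, errorRatio: 0.2, p50: 50, p95: 100, p99: 100 }
```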

Trace + log linking (Grafana)

Grafana Loki → Tempo:
click the trace_id in a log line → view the spans of that trace

Alert (Prometheus)

groups:
- name: api
  rules:
  - alert: HighErrorRate
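    # Uses the _count series of the http_server_duration_milliseconds histogram.
    # The status label name (http_status_code here) depends on the exporter and semconv version.
    expr: |
      sum(rate(http_server_duration_milliseconds_count{http_status_code=~"5.."}[5m])) /
      sum(rate(http_server_duration_milliseconds_count[5m])) > 0.05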
    for: 5m
    labels: { severity: page }
    annotations: { summary: "API 5xx > 5%" }
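
The `for: 5m` clause is what keeps short bursts from paging anyone: the condition has to hold at every evaluation for the full duration before the alert moves from pending to firing. A simplified sketch of that state machine (the `alertStates` function and its types are illustrative, not Prometheus code, and evaluation intervals are assumed to be regular):

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface Evaluation {
  atSeconds: number;     // evaluation timestamp
  conditionMet: boolean; // did the rule expression return a result?
}

function alertStates(evals: Evaluation[], forSeconds: number): AlertState[] {
  let activeSince: number | null = null;
  return evals.map(({ atSeconds, conditionMet }): AlertState => {
    if (!conditionMet) {
      activeSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    const since: number = activeSince ?? atSeconds;
    activeSince = since;
    return atSeconds - since >= forSeconds ? 'firing' : 'pending';
  });
}

// Evaluated every 60 s with for: 300 s.
// A 2-minute burst (t=0, 60), one healthy evaluation (t=120), then a sustained problem.
const conditions = [true, true, false, true, true, true, true, true, true, true];
const evals: Evaluation[] = conditions.map((conditionMet, i) => ({ atSeconds: i * 60, conditionMet }));

console.log(alertStates(evals, 300));
// [ 'pending', 'pending', 'inactive', 'pending', 'pending',
//   'pending', 'pending', 'pending', 'firing', 'firing' ]
```

The early burst never fires. The sustained problem starting at t=180 fires at t=480, once it has been active for 300 seconds.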

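🤔 Decision criteria

| Stage | Recommendation |
| --- | --- |
| MVP | Sentry (errors) + basic logs |
| Scale | OTEL + Grafana stack (Prom + Loki + Tempo) |
| Complex distributed systems | Honeycomb (strong on high-cardinality data) |
| ZeroOps | Datadog / New Relic (expensive) |
| Self-hosted | SigNoz (OTEL native, single tool) |
| Open source | Grafana Cloud (free tier) |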

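Anti-patterns

  • Logs only: high-cardinality queries are expensive and slow.
  • Metric label cardinality explosion: never use userId / requestId as labels.
  • 100% trace sampling in prod: costs explode. Use head-based sampling at 1%, or tail-based sampling.
  • Unstructured logs: only grep works; querying is hard.
  • trace_id not propagated: cross-service linking breaks. OTEL propagators handle this automatically.
  • Alert noise: bursts shorter than 5 minutes trigger too many alerts. Use for: 5m or longer.
  • Metrics without a dashboard: the data exists but nobody looks at it.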
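
As an illustration of the head-based sampling mentioned in the list above, the sketch below makes the keep/drop decision from the trace ID alone, so every service that sees the same trace decides the same way. This is a simplified stand-in, not the algorithm used by OTEL's own ratio sampler (`TraceIdRatioBasedSampler` in `@opentelemetry/sdk-trace-base`), and `shouldSample` is a hypothetical name:

```typescript
// Deterministic head-based sampling sketch: decision depends only on the trace ID.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex chars (32 bits) of the trace ID as an unsigned integer.
  const low32 = parseInt(traceId.slice(-8), 16);
  // Keep the trace if it falls in the lowest `ratio` fraction of the 32-bit range.
  return low32 < Math.floor(ratio * 2 ** 32);
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d00000001', 0.01)); // true
console.log(shouldSample('4bf92f3577b34da6a3ce929dffffffff', 0.01)); // false
```

Because the decision is made when the trace starts, traces that later turn out to contain errors are dropped at the same rate as healthy ones. Tail-based sampling avoids that by deciding after the trace completes, at the cost of buffering spans in the collector.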

🤖 Hints for LLM use

  • OTEL = vendor-neutral; backends are swappable.
  • RED metrics for every service.
  • JSON logs + trace_id link.

🔗 Related documents