---
id: devops-observability-stack
title: Observability - Logs / Metrics / Traces
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, observability, prometheus, grafana, opentelemetry, vibe-coding]
tech_stack: { language: "TS / Prometheus / Grafana / OTEL", applicable_to: ["DevOps", "Backend"] }
applied_in: []
aliases: [observability, OpenTelemetry, OTEL, Prometheus, Grafana, Loki, Tempo, distributed tracing]
---

# Observability Stack

> Three pillars: **Logs (what happened) + Metrics (how much) + Traces (where)**. **OpenTelemetry** is the vendor-neutral standard. Backends: the Grafana stack (Prometheus + Loki + Tempo), or Datadog / Honeycomb / SigNoz.

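## 📖 Core Concepts
- Logs: structured (JSON) event records. Backends: Loki / Elastic / Datadog Logs.
- Metrics: numeric time series. Backends: Prometheus / Mimir / Datadog.
- Traces: a request's path across services, with timing for each hop. Backends: Tempo / Jaeger / Honeycomb.
- OTEL: one SDK plus a collector for all three signals; switching backends means swapping only the exporter.
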
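## 💻 Code Patterns

### Node + OTEL
```ts
// otel.ts: load this before anything else in the entry point,
// so instrumentation can patch modules as they are imported.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// OTEL_ENDPOINT is this project's own variable, treated here as the collector
// base URL (e.g. http://localhost:4318). With an explicit `url`, each
// OTLP/HTTP exporter needs its full signal path.
const endpoint = process.env.OTEL_ENDPOINT ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter({ url: `${endpoint}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${endpoint}/v1/metrics` }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans and metrics before the process exits.
process.on('SIGTERM', () => {
  void sdk.shutdown().finally(() => process.exit(0));
});
```

Auto-instrumentation creates spans for HTTP / DB / Redis / fetch calls with no code changes.
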
### Manual span
```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

// Placeholder for the real business logic.
declare function doWork(id: string): Promise<unknown>;

async function processOrder(id: string) {
  // startActiveSpan makes the span current, so spans created inside
  // doWork become its children.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({ 'order.id': id });
    try {
      const result = await doWork(id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e) {
      // Attach the exception as a span event and mark the span as failed.
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: (e as Error).message });
      throw e;
    } finally {
      // Always end the span, or it is never exported.
      span.end();
    }
  });
}
```

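The try / catch / finally around a manual span tends to repeat for every operation. The sketch below factors it into a helper. `withSpan`, `SpanLike`, and `TracerLike` are hypothetical names for this example, not exports of `@opentelemetry/api`; the two interfaces mirror only the span calls used above so the block runs without the SDK, and a real tracer may need a thin adapter to fit them.

```ts
// Minimal structural types mirroring only the span calls used above.
// Hypothetical names for this sketch, not exports of @opentelemetry/api.
interface SpanLike {
  setAttributes(attrs: Record<string, string | number | boolean>): void;
  setStatus(status: { code: 'OK' | 'ERROR'; message?: string }): void;
  recordException(err: Error): void;
  end(): void;
}
interface TracerLike {
  startActiveSpan<T>(name: string, fn: (span: SpanLike) => T): T;
}

// Runs `work` inside a span and handles status, exception, and end() in one place.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  attrs: Record<string, string | number | boolean>,
  work: () => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttributes(attrs);
    try {
      const result = await work();
      span.setStatus({ code: 'OK' });
      return result;
    } catch (e) {
      const err = e instanceof Error ? e : new Error(String(e));
      span.recordException(err);
      span.setStatus({ code: 'ERROR', message: err.message });
      throw e;
    } finally {
      span.end();
    }
  });
}

// In-memory fake tracer that records what happened, for trying the helper out.
const events: string[] = [];
const fakeTracer: TracerLike = {
  startActiveSpan(name, fn) {
    events.push(`start:${name}`);
    return fn({
      setAttributes: () => {},
      setStatus: (s) => { events.push(`status:${s.code}`); },
      recordException: (err) => { events.push(`exception:${err.message}`); },
      end: () => { events.push('end'); },
    });
  },
};
```

A call site then shrinks to something like `withSpan(tracer, 'processOrder', { 'order.id': id }, () => doWork(id))`.
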
### Metrics
```ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app');
const orderCount = meter.createCounter('orders_total');
const orderLatency = meter.createHistogram('order_latency_ms', { unit: 'ms' });

declare function doWork(): Promise<void>; // placeholder for real logic

async function handleOrder() {
  // Attribute values must stay low-cardinality (region, plan), never userId.
  const attrs = { region: 'us', plan: 'pro' };
  orderCount.add(1, attrs);
  const start = Date.now();
  await doWork();
  orderLatency.record(Date.now() - start, attrs);
}
```

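Each distinct combination of attribute values on a metric such as `orders_total` is stored as its own time series, so the series count is bounded by the product of the distinct values per attribute. A small worked sketch of that arithmetic follows; the counts are illustrative assumptions, not measurements, and `maxSeries` is a hypothetical helper name.

```ts
// Upper bound on time series for one metric: the product of the number of
// distinct values each attribute can take.
function maxSeries(distinctValuesPerAttribute: Record<string, number>): number {
  return Object.values(distinctValuesPerAttribute).reduce((product, n) => product * n, 1);
}

// Illustrative numbers only.
const bounded = maxSeries({ region: 5, plan: 3 });
const withUserId = maxSeries({ region: 5, plan: 3, userId: 100000 });

console.log({ bounded, withUserId }); // 15 series vs 1,500,000 series
```

Adding a single per-user attribute multiplies the series count by the number of users, which is why identifiers belong in logs and traces rather than in metric attributes.
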
### Structured logs (Pino)
```ts
import pino from 'pino';

const log = pino({
  level: 'info',
  formatters: {
    // Emit the level as a label ("info") instead of Pino's numeric default (30).
    level(label) { return { level: label }; },
  },
});

// Example values; in a handler these come from the request.
const userId = 'u1';
const orderId = 'o2';

log.info({ userId, orderId }, 'order created');
// {"level":"info","time":...,"pid":...,"hostname":...,"userId":"u1","orderId":"o2","msg":"order created"}
```

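Linking logs ↔ traces: put trace_id / span_id on every log line. OTEL's Pino instrumentation (part of the Node auto-instrumentations) injects them automatically; the manual version looks like this:
```ts
import { context, trace } from '@opentelemetry/api';

// `log` is the Pino logger from the previous block.
const ctx = trace.getSpan(context.active())?.spanContext();
log.info({ trace_id: ctx?.traceId, span_id: ctx?.spanId }, 'event');
```
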
### Prometheus metrics (legacy)
```ts
import express from 'express';
import client from 'prom-client';

const app = express();
const counter = new client.Counter({ name: 'orders_total', help: 'Total orders processed' });

// Call this where an order completes.
counter.inc();

// Prometheus scrapes this endpoint on its own schedule (pull model).
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

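### RED method: baseline metrics for every service
- **R**ate: requests per second.
- **E**rrors: fraction of requests that fail.
- **D**uration: latency at p50/p95/p99.

```ts
// Applied to every endpoint.
// `otelHttpMetricsMiddleware` is a placeholder name, not a real package export;
// the HTTP auto-instrumentation from the Node + OTEL setup already records request duration.
app.use(otelHttpMetricsMiddleware);
// Resulting series (exact names vary with exporter and semantic-convention version):
// http_server_duration_milliseconds, http_server_active_requests
```
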
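To make the three RED numbers concrete, here is a minimal sketch that computes them from raw request samples. It assumes the samples are held in memory and uses nearest-rank percentiles; production backends usually estimate percentiles from histogram buckets instead. `RequestSample`, `percentile`, and `redSummary` are hypothetical names for this example.

```ts
interface RequestSample {
  durationMs: number; // time taken to serve the request
  status: number;     // HTTP status code
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.max(rank, 1) - 1];
}

// RED summary for the samples observed during a window of `windowSeconds`.
function redSummary(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p50: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// 100 requests in a 10-second window: durations 1..100 ms, every 20th one a 500.
const samples: RequestSample[] = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  status: (i + 1) % 20 === 0 ? 500 : 200,
}));
const summary = redSummary(samples, 10);
console.log(summary);
// { ratePerSecond: 10, errorRatio: 0.05, p50: 50, p95: 95, p99: 99 }
```
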
### Linking traces and logs (Grafana)
```
Grafana Loki → Tempo:
click the trace_id on a log line → view the spans of that trace
```

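The click-through depends on the log line carrying a well-formed trace ID that the UI can extract. The sketch below shows that extraction step for JSON log lines. It assumes the field is named `trace_id` as in this document; `extractTraceId` is a hypothetical helper, not a Grafana or Loki API.

```ts
// Trace IDs are 16 bytes, rendered as 32 lowercase hex characters.
const TRACE_ID_PATTERN = /^[0-9a-f]{32}$/;

// Returns the trace_id carried by a JSON log line, or undefined when the line
// is not JSON, has no trace_id field, or the value is malformed.
function extractTraceId(logLine: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(logLine);
  } catch {
    return undefined;
  }
  if (typeof parsed !== 'object' || parsed === null) return undefined;
  const value = (parsed as Record<string, unknown>)['trace_id'];
  return typeof value === 'string' && TRACE_ID_PATTERN.test(value) ? value : undefined;
}

const line =
  '{"level":"info","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","msg":"order created"}';
console.log(extractTraceId(line)); // 4bf92f3577b34da6a3ce929d0e0e4736
```

Unstructured lines return `undefined`, which is one practical reason the link only works reliably with structured logs.
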
### Alert (Prometheus)
```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        # Metric and label names must match what the service actually exports.
        expr: |
          sum(rate(http_server_request_count{status=~"5.."}[5m])) /
          sum(rate(http_server_request_count[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations: { summary: "API 5xx > 5%" }
```

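The rule pairs a threshold with `for: 5m`: the condition has to stay true for the whole duration before the alert fires, and while it waits the alert is pending. The sketch below is a simplified model of that behaviour, assuming one evaluation per minute; it is not Prometheus's implementation, and `evaluateAlert` is a hypothetical name.

```ts
type AlertState = 'inactive' | 'pending' | 'firing';

// Simplified model of a rule with `for`: the condition has to hold on every
// evaluation for `forSeconds` before the alert moves from pending to firing.
function evaluateAlert(
  errorRatios: number[],   // one error ratio per evaluation, oldest first
  threshold: number,
  forSeconds: number,
  intervalSeconds: number, // time between evaluations
): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | undefined; // time the condition first became true
  for (let i = 0; i < errorRatios.length; i++) {
    const now = i * intervalSeconds;
    if (errorRatios[i] > threshold) {
      if (activeSince === undefined) activeSince = now;
      states.push(now - activeSince >= forSeconds ? 'firing' : 'pending');
    } else {
      activeSince = undefined; // any dip below the threshold resets the timer
      states.push('inactive');
    }
  }
  return states;
}

// A 2-minute burst never fires; a sustained breach fires once it has lasted 5 minutes.
const burst = evaluateAlert([0.01, 0.2, 0.2, 0.01, 0.01, 0.01, 0.01], 0.05, 300, 60);
const sustained = evaluateAlert([0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06], 0.05, 300, 60);
console.log(burst.join(' '));
console.log(sustained.join(' '));
```
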
## 🤔 Decision Criteria
| Stage | Recommendation |
|---|---|
| MVP | Sentry (errors) + basic logs |
| Scale | OTEL + Grafana stack (Prom + Loki + Tempo) |
| Complex distributed systems | Honeycomb (stronger on high-cardinality data) |
| ZeroOps | Datadog / New Relic (expensive) |
| Self-hosted | SigNoz (OTEL-native, single tool) |
| Open source | Grafana Cloud (free tier) |

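## ❌ Anti-patterns
- **Logs only**: high-cardinality queries over raw logs are expensive and slow.
- **Metric label cardinality explosion**: never use userId / requestId as a label.
- **100% trace sampling in prod**: costs explode. Use head-based sampling at around 1%, or tail-based sampling.
- **Unstructured logs**: only grep works; real queries are hard.
- **trace_id not propagated**: cross-service traces cannot be stitched together. OTEL propagators handle this automatically.
- **Alert noise**: bursts shorter than 5 minutes trigger too often. Use `for: 5m` or longer.
- **Metrics without a dashboard**: the data exists but nobody looks at it.
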
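Head-based sampling decides at the start of a trace whether to keep it. If the decision is derived from the trace ID itself, every service that sees the same trace agrees without coordination. The sketch below illustrates that idea, assuming trace IDs are uniformly random hex strings; it is not the OTEL SDK's exact algorithm, and `shouldSample` and `randomTraceId` are hypothetical names.

```ts
// Decide from the trace ID itself, so every service that sees the same
// trace reaches the same keep/drop decision without coordination.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Map the first 8 hex characters (32 bits) of the ID onto [0, 1).
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < ratio;
}

// Rough check on random IDs: about 1% should be kept.
function randomTraceId(): string {
  let id = '';
  for (let i = 0; i < 32; i++) id += Math.floor(Math.random() * 16).toString(16);
  return id;
}
let kept = 0;
for (let i = 0; i < 100000; i++) if (shouldSample(randomTraceId(), 0.01)) kept++;
console.log(`kept ${kept} of 100000`);
```

In a real service the SDK's built-in ratio sampler would normally fill this role; the point here is only that the decision is a pure function of the trace ID.
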
## 🤖 LLM Usage Hints
- OTEL = vendor-neutral; backends are swappable.
- RED metrics on every service.
- JSON logs + trace_id link.

## 🔗 Related Documents
- [[Native_Crash_Reporting]]
- [[Backend_Webhook_Patterns]]
- [[Logging_Structured_Patterns]]