[G1-Sync] Manual knowledge update

This commit is contained in:
Antigravity Agent
2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,182 @@
---
id: devops-observability-stack
title: Observability — Logs / Metrics / Traces
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, observability, prometheus, grafana, opentelemetry, vibe-coding]
tech_stack: { language: "TS / Prometheus / Grafana / OTEL", applicable_to: ["DevOps", "Backend"] }
applied_in: []
aliases: [observability, OpenTelemetry, OTEL, Prometheus, Grafana, Loki, Tempo, distributed tracing]
---
# Observability Stack
> Three signals: **Logs (what happened) + Metrics (how much) + Traces (where)**. **OpenTelemetry** is the vendor-neutral standard. Backends: the Grafana stack (Prometheus + Loki + Tempo), or Datadog / Honeycomb / SigNoz.
## 📖 Core concepts
- Logs: structured (JSON). Loki / Elastic / Datadog Logs.
- Metrics: numeric time series. Prometheus / Mimir / Datadog.
- Traces: a request's path across services, with timing. Tempo / Jaeger / Honeycomb.
- OTEL: a unified SDK + collector. Switching backends means swapping only the exporter.
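To make the three signal types concrete, here is a minimal sketch of what a single request might emit. The type names (`LogRecord`, `MetricPoint`, `SpanRecord`) and the `describeRequest` helper are hypothetical and exist only for illustration; they are not OpenTelemetry's data model.

```ts
// Hypothetical shapes for illustration; not the OpenTelemetry data model.
interface LogRecord { level: 'info' | 'error'; msg: string; traceId: string; time: number }
interface MetricPoint { name: string; value: number; labels: Record<string, string> }
interface SpanRecord { traceId: string; spanId: string; name: string; startMs: number; durationMs: number }
interface Telemetry { log: LogRecord; metric: MetricPoint; span: SpanRecord }

// Build the three signals for one handled request.
function describeRequest(
  traceId: string,
  route: string,
  startMs: number,
  durationMs: number,
  ok: boolean,
): Telemetry {
  return {
    // Log: what happened (a discrete event with context).
    log: { level: ok ? 'info' : 'error', msg: `handled ${route}`, traceId, time: startMs + durationMs },
    // Metric: how much (a number with low-cardinality labels; no traceId here).
    metric: { name: 'http_request_duration_ms', value: durationMs, labels: { route, status: ok ? 'ok' : 'error' } },
    // Span: where the time went (one timed unit of work on the request path).
    span: { traceId, spanId: 'span-1', name: `GET ${route}`, startMs, durationMs },
  };
}

const sample = describeRequest('trace-abc', '/orders', 1000, 42, true);
console.log(JSON.stringify(sample.log));
```

The log and the span share a trace id so they can be joined later, while the metric deliberately carries only bounded labels.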
## 💻 Code patterns
### Node + OTEL
```ts
// otel.ts (import this first, at the very top of the entry point)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Base URL of the OTLP/HTTP collector (4318 is the default OTLP/HTTP port).
// When `url` is passed explicitly, each exporter needs its own signal path.
const endpoint = process.env.OTEL_ENDPOINT ?? 'http://localhost:4318';

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter({ url: `${endpoint}/v1/traces` }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: `${endpoint}/v1/metrics` }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
Auto-instrumentation creates spans for HTTP, DB, Redis, and fetch calls without any manual code.
### Manual span
```ts
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('app');

async function processOrder(id: string) {
  // startActiveSpan makes the span current for everything awaited inside the callback.
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({ 'order.id': id });
    try {
      const result = await doWork(id);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (e) {
      span.recordException(e as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw e;
    } finally {
      span.end(); // always end the span, on success or failure
    }
  });
}
```
### Metrics
```ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('app');
const orderCount = meter.createCounter('orders_total');
const orderLatency = meter.createHistogram('order_latency_ms', { unit: 'ms' });

// Keep attributes low-cardinality (region, plan), never per-user or per-request ids.
orderCount.add(1, { region: 'us', plan: 'pro' });

const t = Date.now();
await doWork();
orderLatency.record(Date.now() - t);
```
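Each distinct combination of attribute values passed to a counter or histogram becomes its own time series, so the attribute choice drives storage cost. A rough sketch of the worst-case series count follows; the `maxSeries` helper and the per-label counts are hypothetical numbers for illustration.

```ts
// Worst-case number of time series for one metric:
// the product of the distinct values each label can take.
function maxSeries(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((acc, n) => acc * n, 1);
}

// Hypothetical counts: 5 regions, 3 plans, 100000 users.
const bounded = maxSeries({ region: 5, plan: 3 });
const withUserId = maxSeries({ region: 5, plan: 3, userId: 100000 });

console.log(bounded);    // 15
console.log(withUserId); // 1500000
```

Adding a single unbounded label multiplies the series count by the number of values it can take, which is why ids belong in logs and traces and not in metric labels.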
### Structured logs (Pino)
```ts
import pino from 'pino';

const log = pino({
  level: 'info',
  formatters: {
    // Emit the level as a label ("info") instead of pino's default number (30).
    level(label) { return { level: label }; },
  },
});

log.info({ userId, orderId }, 'order created');
// {"level":"info","time":...,"userId":"u1","orderId":"o2","msg":"order created"}
```
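Linking logs ↔ traces: OTEL can inject trace_id / span_id into log records automatically (for Pino, via `@opentelemetry/instrumentation-pino`). Done by hand, it looks like this: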
```ts
import { context, trace } from '@opentelemetry/api';
const span = trace.getSpan(context.active());
log.info({ traceId: span?.spanContext().traceId, ...attrs }, 'event');
```
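For the ids in logs to match across services, the trace context has to travel with each request. OTEL's default propagator uses the W3C Trace Context `traceparent` header. The parser below is a simplified sketch of that header's version-00 format; the `parseTraceparent` helper is hypothetical, and in practice the SDK's propagator does this for you.

```ts
interface TraceContext { traceId: string; parentSpanId: string; sampled: boolean }

// traceparent: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
function parseTraceparent(header: string): TraceContext | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (m === null) return null;
  const version = m[1] ?? '';
  const traceId = m[2] ?? '';
  const parentSpanId = m[3] ?? '';
  const flags = m[4] ?? '00';
  // Version ff and all-zero ids are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx?.traceId, ctx?.sampled);
```

The `traceId` extracted here is the same value that should appear in the log line, which is what lets a log query jump to the matching trace.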
### Prometheus metrics (legacy)
```ts
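import express from 'express';
import client from 'prom-client';

const app = express();
const counter = new client.Counter({ name: 'orders_total', help: 'orders' });
counter.inc(); // call wherever an order is created

// Prometheus scrapes this endpoint.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});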
```
### RED method: baseline metrics for every service
- **R**ate: requests per second.
- **E**rrors: error rate.
- **D**uration: latency at p50/p95/p99.
```ts
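// Applied to every endpoint automatically.
// (otelHttpMetricsMiddleware is a placeholder for whatever HTTP metrics
// middleware or instrumentation you use.)
app.use(otelHttpMetricsMiddleware);
// Result: http_server_duration_milliseconds, http_server_active_requests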
```
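The three RED numbers can be sketched from raw request samples as follows. This is an illustrative calculation under simple assumptions: the `summarizeRed` and `percentile` helpers and the sample data are hypothetical, and percentiles use the nearest-rank method.

```ts
interface RequestSample { durationMs: number; isError: boolean }
interface RedSummary { ratePerSec: number; errorRatio: number; p50: number; p95: number; p99: number }

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return 0;
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] ?? 0;
}

function summarizeRed(samples: RequestSample[], windowSec: number): RedSummary {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  const errors = samples.filter((s) => s.isError).length;
  return {
    ratePerSec: samples.length / windowSec,                          // Rate
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,  // Errors
    p50: percentile(durations, 50),                                  // Duration
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
  };
}

// 100 hypothetical requests in a 10 s window: durations 1..100 ms, every 20th fails.
const redSamples: RequestSample[] = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  isError: (i + 1) % 20 === 0,
}));
const red = summarizeRed(redSamples, 10);
console.log(red);
```

In a Prometheus setup the percentiles would normally come from histogram buckets via `histogram_quantile` and not from raw samples; the sketch only shows what each number means.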
### Linking traces and logs (Grafana)
```
Grafana Loki → Tempo:
  click the trace_id in a log line → view the spans of that trace
```
### Alert (Prometheus)
```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_server_request_count{status=~"5.."}[5m])) /
          sum(rate(http_server_request_count[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations: { summary: "API 5xx > 5%" }
```
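The `for: 5m` clause means the expression has to stay true across consecutive evaluations for five minutes before the alert fires; a shorter breach only leaves it pending. A simplified sketch of that logic is below. The `firesAt` helper and the sample data are hypothetical, and real Prometheus evaluates at its configured rule interval.

```ts
interface Evaluation { timeSec: number; errorRatio: number }

// Returns the time at which the alert would start firing, or null if it never does.
function firesAt(evals: Evaluation[], threshold: number, forSec: number): number | null {
  let pendingSince: number | null = null;
  for (const e of evals) {
    if (e.errorRatio > threshold) {
      // Condition true: start (or continue) the pending period.
      if (pendingSince === null) pendingSince = e.timeSec;
      if (e.timeSec - pendingSince >= forSec) return e.timeSec;
    } else {
      // Condition false: the pending state resets.
      pendingSince = null;
    }
  }
  return null;
}

const evalTimes = [0, 60, 120, 180, 240, 300, 360]; // one evaluation per minute

// A 2-minute burst of errors that recovers on its own.
const burst: Evaluation[] = evalTimes.map((s) => ({
  timeSec: s,
  errorRatio: s >= 60 && s <= 120 ? 0.2 : 0.01,
}));
// A breach that starts at 60 s and does not recover.
const sustained: Evaluation[] = evalTimes.map((s) => ({
  timeSec: s,
  errorRatio: s >= 60 ? 0.2 : 0.01,
}));

console.log(firesAt(burst, 0.05, 300));     // null
console.log(firesAt(sustained, 0.05, 300)); // 360
```

The burst never pages anyone, while the sustained breach fires five minutes after it began.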
## 🤔 Decision criteria
| Stage | Recommendation |
|---|---|
| MVP | Sentry (errors) + basic logs |
| Scale | OTEL + Grafana stack (Prom + Loki + Tempo) |
| Complex distributed systems | Honeycomb (strong on high-cardinality data) |
| ZeroOps | Datadog / New Relic (expensive) |
| Self-hosted | SigNoz (OTEL native, single tool) |
| Open source (managed) | Grafana Cloud (free tier) |
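## ❌ Anti-patterns
- **Logs only**: high-cardinality queries over logs are expensive and slow.
- **Metric label cardinality explosion**: never use userId / requestId as labels.
- **100% trace sampling in prod**: costs explode. Use head-based sampling at about 1%, or tail-based sampling.
- **Unstructured logs**: only grep works; querying is hard.
- **trace_id not propagated**: cross-service traces cannot be linked. OTEL propagators handle this automatically.
- **Alert noise**: bursts shorter than 5 minutes trigger too many alerts. Set `for: 5m` or longer.
- **Metrics without dashboards**: the data exists but nobody looks at it.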
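Head-based sampling decides at the start of a trace whether to keep it. One common approach derives the decision from the trace id, so every service reaches the same verdict without coordination. The `shouldSample` and `fakeTraceId` helpers below are a hypothetical sketch of that idea, not the OTEL SDK's sampler implementation.

```ts
// Keep a trace when the first 32 bits of its id fall below ratio * 2^32.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio <= 0) return false;
  if (ratio >= 1) return true;
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;
}

// Deterministic hypothetical trace ids, spread evenly over the id space.
function fakeTraceId(i: number, count: number): string {
  const hex = Math.floor(((i + 0.5) / count) * 0x100000000).toString(16);
  const prefix = ('00000000' + hex).slice(-8);
  return prefix + '0'.repeat(23) + '1'; // 32 hex characters in total
}

const traceCount = 10000;
let kept = 0;
for (let i = 0; i < traceCount; i++) {
  if (shouldSample(fakeTraceId(i, traceCount), 0.01)) kept++;
}
console.log(`${kept} of ${traceCount} traces kept`);
```

Because the decision depends only on the trace id, a downstream service that sees the same id makes the same choice. Tail-based sampling instead decides after the trace completes, which allows keeping every error trace at the cost of buffering spans in a collector.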
## 🤖 Hints for LLM use
- OTEL = vendor-neutral; backends are swappable.
- RED metrics for every service.
- JSON logs + trace_id link.
## 🔗 Related documents
- [[Native_Crash_Reporting]]
- [[Backend_Webhook_Patterns]]
- [[Logging_Structured_Patterns]]