id: observability-red-use-metrics
title: RED / USE Metrics: What to Measure
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: observability, metrics, sli, slo, vibe-coding
tech_stack: language: Prometheus / Grafana; applicable_to: Backend
applied_in:
aliases: Rate, Errors, Duration, Utilization, Saturation, four golden signals

RED / USE Metrics

You cannot operate what you do not measure. RED (for services): Rate / Errors / Duration. USE (for resources): Utilization / Saturation / Errors. Both are standard starting points in SRE practice.

📖 Core concepts

  • RED (request-driven services): the user's point of view.
  • USE (resource-based): CPU / memory / disk / network.
  • Four Golden Signals (Google SRE): Latency / Traffic / Errors / Saturation.

💻 Code patterns

prom-client (Node)

import express from 'express';
import client from 'prom-client';

const app = express();
client.collectDefaultMetrics(); // process CPU, heap, and event-loop lag are collected automatically

// RED — HTTP
const httpReqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

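app.use((req, res, next) => {
  // Start the timer without labels: req.route is only set once routing has matched,
  // so reading it here would always record 'unknown'.
  const end = httpDur.startTimer();
  res.on('finish', () => {
    const labels = {
      method: req.method,
      // Route pattern (e.g. /users/:id), never the raw URL, to keep cardinality bounded.
      route: req.route?.path ?? 'unknown',
      status: String(res.statusCode),
    };
    end(labels); // observes the elapsed seconds with the final labels
    httpReqs.inc(labels);
  });
  next();
});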

app.get('/metrics', async (_, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

USE — DB pool

const dbPoolUtil = new client.Gauge({
  name: 'db_pool_utilization',
  help: 'Active connections / pool size',
});
const dbPoolSat = new client.Gauge({
  name: 'db_pool_waiting',
  help: 'Requests queued waiting for a free connection',
});

// Assumes `pool` is a node-postgres (pg) Pool; other pools expose different counters.
setInterval(() => {
  dbPoolUtil.set((pool.totalCount - pool.idleCount) / pool.options.max);
  dbPoolSat.set(pool.waitingCount);
}, 5000);

Latency distribution: histogram, not average

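# p95 across all routes (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# p99 for a single route
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{route="/api/checkout"}[5m])))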
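
To make the histogram-based percentile less of a black box, here is a minimal sketch of the idea behind histogram_quantile(): locate the bucket that contains the target rank, then interpolate linearly inside it. The function name estimateQuantile and the sample data are hypothetical, and the sketch ignores edge cases that Prometheus itself handles, so treat it as an illustration rather than the actual implementation.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` is sorted by `le`, each `count` is cumulative, and the last `le` is Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical sample: 100 requests, 90 under 100 ms and 10 between 1 s and 5 s.
const sample = [
  { le: 0.1, count: 90 },
  { le: 0.5, count: 90 },
  { le: 1, count: 90 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
const p50 = estimateQuantile(0.5, sample);
const p95 = estimateQuantile(0.95, sample);
```

With this sample, p50 stays well under 100 ms while p95 lands around 3 s, halfway through the 1 s to 5 s bucket. The estimate can only be as precise as the bucket boundaries, which is why bucket layout matters.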

Error rate

sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
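
The ratio above relies on counters being monotonic, with restarts handled on the query side. As a rough sketch of that behaviour (the helper names and scrape values are hypothetical, and the extrapolation that rate() also performs is left out), a drop in a counter's value can be treated as a restart from zero:

```javascript
// Sum the increase of a counter across successive scrapes. A drop in value is
// treated as a process restart: the counter began again from 0, so the whole
// post-reset value counts as new increase.
function counterIncrease(samples) {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    increase += delta >= 0 ? delta : samples[i];
  }
  return increase;
}

// Error ratio over a window = increase of 5xx requests / increase of all requests.
function errorRatio(errorSamples, totalSamples) {
  const total = counterIncrease(totalSamples);
  return total === 0 ? 0 : counterIncrease(errorSamples) / total;
}

// Hypothetical scrapes; the process restarts between the 3rd and 4th sample.
const totalScrapes = [1000, 1100, 1200, 50, 150];
const errorScrapes = [10, 12, 14, 1, 3];
const ratio = errorRatio(errorScrapes, totalScrapes);
```

Here the window contains 350 requests and 7 errors, so the ratio is 2% even though both counters dropped mid-window. The application never has to write a 0 itself after a restart.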

🤔 Decision criteria

| Area | Metrics |
|---|---|
| HTTP API | RED; labels: method, route, status |
| Workers / job queue | RED + queue length / age |
| DB / Redis | USE: connections, latency, error rate |
| External API calls | RED; labels: provider, status |
| Business KPIs | gauge / counter (signups, paid_orders) |
| Infra (Pod CPU) | exposed automatically by Kubernetes |
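
For the worker / job queue row, the two extra signals can be derived from a snapshot of pending jobs. This is a sketch under assumptions: queueGauges and the enqueuedAt field (a millisecond epoch timestamp) are hypothetical names, and a real queue library would expose its own way to list waiting jobs.

```javascript
// Derive queue length and age from a snapshot of pending jobs.
// `enqueuedAt` is a hypothetical field holding a millisecond epoch timestamp.
function queueGauges(pendingJobs, nowMs) {
  const oldest = pendingJobs.reduce((min, job) => Math.min(min, job.enqueuedAt), Infinity);
  return {
    length: pendingJobs.length,
    // Age of the oldest waiting job: grows when workers fall behind,
    // even while the queue length looks stable.
    oldestAgeSeconds: pendingJobs.length === 0 ? 0 : (nowMs - oldest) / 1000,
  };
}

const nowMs = 1000000;
const gauges = queueGauges(
  [{ enqueuedAt: nowMs - 30000 }, { enqueuedAt: nowMs - 5000 }],
  nowMs,
);
```

The resulting values would typically be written to two gauges on a timer, in the same way as the DB pool gauges in the code patterns section.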

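Anti-patterns

  • Averages only: averages lie. p95 / p99 reflect what users actually experience.
  • Label explosion: userId / requestId labels → cardinality explosion → memory and cost blow up. Restrict labels to enum-like values.
  • Ill-fitting histogram buckets: latencies range from 1 ms to 10 s but the buckets are in 10 s steps. The histogram then tells you nothing.
  • Confusing counters with gauges: a counter is monotonic and never decreases.
  • Writing 0 on reset: counter resets are handled automatically by rate() / increase(). Only gauges are set directly.
  • Viewing metrics / logs / traces in isolation: link them with exemplars / trace_id.
  • Absolute alert thresholds: traffic swings cause false alerts. Use ratio + window.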
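
As one way to act on the label-explosion point, label values can be forced into small closed sets before they reach a metric. The helper below is a sketch under assumptions: boundedLabels, ALLOWED_METHODS, and the status_class label are hypothetical names. Collapsing status codes into classes is optional, since full status codes are already a bounded set.

```javascript
// Keep label values to small, closed sets so the number of time series stays bounded.
const ALLOWED_METHODS = new Set(['GET', 'POST', 'PUT', 'PATCH', 'DELETE']);

function boundedLabels(method, routePattern, statusCode) {
  return {
    method: ALLOWED_METHODS.has(method) ? method : 'OTHER',
    // Route pattern such as '/users/:id', never the raw path '/users/8731'.
    route: routePattern ?? 'unknown',
    // Status class (2xx, 4xx, 5xx) instead of every individual code.
    status_class: `${Math.floor(statusCode / 100)}xx`,
  };
}

const exampleLabels = boundedLabels('GET', '/users/:id', 404);
```

With five methods plus OTHER, a fixed set of route patterns, and five status classes, the worst-case series count is known in advance. Identifiers such as userId or requestId belong in logs or traces instead.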

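🤖 Hints for LLM use

  • Instrument every new endpoint with RED automatically (via middleware).
  • Keep separate RED metrics for each external dependency.
  • Define p95 / p99 SLOs first, then alert on them.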

🔗 Related documents