---
id: observability-red-use-metrics
title: "RED / USE Metrics: What to Measure"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [observability, metrics, sli, slo, vibe-coding]
tech_stack: { language: "Prometheus / Grafana", applicable_to: ["Backend"] }
applied_in: []
aliases: [Rate, Errors, Duration, Utilization, Saturation, four golden signals]
---

# RED / USE Metrics

> You cannot operate what you do not measure. **RED (services): Rate / Errors / Duration**, **USE (resources): Utilization / Saturation / Errors**. Both are standard starting points in SRE practice.

## 📖 Core Concepts

- **RED** (request-driven services): the user's perspective.
- **USE** (resource-driven): CPU / memory / disk / network.
- **Four Golden Signals** (Google SRE): Latency / Traffic / Errors / Saturation.

## 💻 Code Patterns

### prom-client (Node)

```ts
import express from 'express';
import client from 'prom-client';

const app = express();

client.collectDefaultMetrics(); // CPU / heap / event loop metrics, collected automatically

// RED for HTTP: the counter gives Rate and Errors, the histogram gives Duration
const httpReqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDur = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

app.use((req, res, next) => {
  const end = httpDur.startTimer();
  res.on('finish', () => {
    // req.route is only populated after routing, so read the labels here
    // rather than at request start. The route pattern (not the raw URL)
    // keeps label cardinality bounded.
    const labels = {
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status: String(res.statusCode),
    };
    end(labels);
    httpReqs.inc(labels);
  });
  next();
});

app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
```

### USE: DB pool

```ts
import client from 'prom-client';
import { Pool } from 'pg';

// Assumption: a node-postgres pool; POOL_MAX is an illustrative size.
const POOL_MAX = 10;
const pool = new Pool({ max: POOL_MAX });

const dbPoolUtil = new client.Gauge({
  name: 'db_pool_utilization',
  help: 'Active connections / pool size',
});
const dbPoolSat = new client.Gauge({
  name: 'db_pool_waiting',
  help: 'Clients waiting for a connection',
});

setInterval(() => {
  // Utilization: share of the pool that is currently checked out
  dbPoolUtil.set((pool.totalCount - pool.idleCount) / POOL_MAX);
  // Saturation: requests queued because no connection is free
  dbPoolSat.set(pool.waitingCount);
}, 5000);
```

### Latency distribution: histogram, not average

Aggregate the bucket series with `sum by (le)` before taking the quantile. Otherwise you get one quantile per label combination instead of one for the service.

```promql
# p95 across the whole service
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# p99 for a single route
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{route="/api/checkout"}[5m])))
```

### Error rate

```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```

## 🤔 Decision Criteria

| Area | Metrics |
|---|---|
| HTTP API | RED, labels: method, route, status |
| Worker / job queue | RED + queue length / age |
| DB / Redis | USE: connections, latency, error rate |
| External API calls | RED, labels: provider, status |
| Business KPI | gauge / counter (signups, paid_orders) |
| Infra (Pod CPU) | Collected by Kubernetes (kubelet / cAdvisor); no application code needed |

## ❌ Anti-patterns

- **Averages only**: an average hides tail latency. p95 / p99 reflect what users actually experience.
- **Label explosion**: userId / requestId labels make cardinality explode, and memory and cost follow. Restrict labels to small, enum-like value sets.
- **Mismatched histogram buckets**: latencies range from 1 ms to 10 s, but the buckets are 10 s wide. The resulting quantiles are meaningless.
- **Confusing counter and gauge**: a counter is monotonic and never decreases.
- **Writing 0 on reset**: counter resets are handled automatically (`rate()` detects them). Only gauges are set directly.
- **Viewing metrics / logs / traces in isolation**: link them with exemplars / trace_id.
- **Absolute alert thresholds**: traffic swings cause false alerts. Alert on a ratio over a window.

## 🤖 LLM Usage Hints

- Apply RED automatically to every new endpoint (via middleware).
- Track a separate RED set for each external dependency.
- Define p95 / p99 SLOs first, then alert on them.

## 🔗 Related Documents

- [[Observability_Structured_Logging]]
- [[Observability_OpenTelemetry]]
- [[Backend_Health_Check_Patterns]]
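
As a worked illustration of the "histogram, not average" and bucket-mismatch points above, the sketch below estimates a quantile from cumulative buckets using linear interpolation. It is a simplified approximation in the spirit of `histogram_quantile`, not Prometheus's actual implementation. The helper names (`toBuckets`, `estimateQuantile`) and the sample data are hypothetical and exist only for this example.

```typescript
// Cumulative bucket, shaped like a Prometheus histogram series:
// `count` is the number of observations with value <= `le`.
interface Bucket {
  le: number;    // upper bound in seconds (Infinity for the +Inf bucket)
  count: number; // cumulative count
}

// Hypothetical helper: build cumulative buckets from raw samples.
function toBuckets(samples: number[], bounds: number[]): Bucket[] {
  const les = [...bounds].sort((a, b) => a - b).concat(Infinity);
  return les.map((le) => ({ le, count: samples.filter((s) => s <= le).length }));
}

// Hypothetical helper: find the bucket containing the target rank and
// interpolate linearly between its lower and upper bound.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank lands in the +Inf bucket: fall back to the highest finite bound.
  if (buckets[i].le === Infinity) return i > 0 ? buckets[i - 1].le : NaN;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Illustrative data: 95 fast requests (20 ms) and 5 slow ones (3 s).
const samples: number[] = [
  ...Array<number>(95).fill(0.02),
  ...Array<number>(5).fill(3),
];

const mean = samples.reduce((sum, s) => sum + s, 0) / samples.length;

// Bucket layout from the prom-client example in this note.
const fineBounds = [0.005, 0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5];
const p99 = estimateQuantile(0.99, toBuckets(samples, fineBounds));

// Anti-pattern: a single 10 s bucket for latencies that live well below it.
const coarseP99 = estimateQuantile(0.99, toBuckets(samples, [10]));

console.log({ mean, p99, coarseP99 });
// mean is roughly 0.17 s and hides the slow requests,
// p99 is roughly 4.4 s (true value 3 s, limited by the 2 s..5 s bucket),
// coarseP99 is roughly 9.9 s, which says nothing about real latency.
```

The estimate can never be more precise than the bucket that contains it, which is why bucket bounds should bracket the latencies you care about (for example, the SLO threshold).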