---
id: devops-otel-collector
title: OTel Collector - Pipeline / Sampling / Routing
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [devops, otel, opentelemetry, observability, vibe-coding]
tech_stack: { language: "YAML / OTel", applicable_to: ["DevOps"] }
applied_in: []
aliases: [OpenTelemetry Collector, OTel, receivers, processors, exporters, tail sampling]
---

# OTel Collector

> Telemetry router: receive → process → export. **Apps emit only OTLP; the Collector is where backends get swapped.** Tail sampling, attribute scrubbing, and multi-export all live in one place.

## 📖 Core concepts

- Receivers: how data gets in (OTLP / Prometheus / Jaeger / Zipkin / Fluent Forward).
- Processors: what happens in between (batch / filter / sampling / attributes).
- Exporters: where data goes (Datadog / Honeycomb / Tempo).
- Pipeline: receivers → processors → exporters, declared per signal (traces, metrics, logs).

## 💻 Code patterns

### Basic config

```yaml
# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 100
  attributes:
    actions:
      - key: deployment.environment
        value: prod
        action: insert
      - key: user.email
        action: delete   # PII
  resource:
    attributes:
      - key: service.namespace
        value: acme
        action: insert

exporters:
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers: { x-honeycomb-team: "${env:HC_KEY}" }   # read from the HC_KEY environment variable
  prometheus:
    endpoint: 0.0.0.0:8889
  debug:
    verbosity: detailed   # defined for troubleshooting; not wired into a pipeline here

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first, batch runs last
      processors: [memory_limiter, resource, attributes, batch]
      exporters: [otlphttp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

### Tail sampling (tail, not head)

```yaml
processors:
  tail_sampling:
    decision_wait: 30s   # wait for the trace's spans to arrive before deciding
    num_traces: 100000
    expected_new_traces_per_sec: 10
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      -
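        name: 1-percent-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
      - name: high-value-customer
        type: string_attribute
        string_attribute:
          key: user.plan
          values: [enterprise]
```

→ Errors, slow traces, and VIP customers are always kept; everything else is sampled at 1%. A trace is kept if any one policy matches.

### Attribute scrubbing (PII)

```yaml
processors:
  redaction:
    allow_all_keys: true
    blocked_values:
      - '\d{3}-\d{2}-\d{4}'         # SSN
      - '4[0-9]{12}(?:[0-9]{3})?'   # credit card (Visa pattern only)
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash   # replaces the value with its hash (SHA-256 in recent releases)
```

### Multi-export (fan-out)

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/datadog, otlphttp/honeycomb]   # every span goes to both
```

### Filter (drop)

```yaml
processors:
  filter:
    error_mode: ignore   # log OTTL evaluation errors and keep processing
    traces:
      span:
        # newer semantic conventions use url.path instead of http.target
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'
```

→ Drops health-check spans, which are pure noise and cost.

### Routing (different backend per condition)

```yaml
processors:
  routing:
    from_attribute: deployment.environment
    attribute_source: resource   # the default is `context` (request metadata), not the resource attribute
    table:
      - value: prod
        exporters: [otlphttp/honeycomb]
      - value: dev
        exporters: [debug]
```

→ Every exporter named in the table must also be listed in the pipeline's `exporters`. Newer collector-contrib releases deprecate the routing processor in favor of the routing connector, so check which one your version ships.

### Sidecar pattern (Kubernetes)

```yaml
# one collector container next to each app container
spec:
  containers:
    - name: app
      image: myapp
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: grpc   # 4317 is the gRPC receiver; use 4318 with http/protobuf
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib
      # the config file must be mounted at this path (for example from a ConfigMap)
      args: [--config=/etc/otel/config.yaml]
```

### Gateway pattern (cluster-level)

```
App → sidecar → gateway collector → backend
```

→ Central aggregation and policy enforcement.

### Self-metrics (the collector itself)

```yaml
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888   # newer releases configure this under `readers` instead
      level: detailed
```

→ Prometheus scrapes this endpoint to monitor the collector itself.
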
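The self-telemetry endpoint above is an ordinary Prometheus scrape target. A minimal sketch of the Prometheus side follows. It assumes the collector is reachable as `otel-collector` (a hypothetical hostname); the job name and scrape interval are illustrative values, not taken from this note.

```yaml
# prometheus.yml (sketch): scrape the collector's own metrics endpoint
scrape_configs:
  - job_name: otel-collector               # hypothetical job name
    scrape_interval: 15s                   # illustrative value
    static_configs:
      - targets: ["otel-collector:8888"]   # hypothetical host; port matches the telemetry address above
```

Counters such as `otelcol_receiver_refused_spans` and `otelcol_exporter_send_failed_spans` are a reasonable first thing to alert on, though exact metric names can vary between collector versions.
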
### Metrics (host / process)

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
```

### Logs (Fluent Bit / Loki)

```yaml
receivers:
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser
```

## 🤔 Decision criteria

| Environment | Pattern |
|---|---|
| Single service | App → Collector → Backend |
| Multiple services on K8s | Sidecar + Gateway |
| High traffic | Gateway only |
| Multi-cloud | Gateway handles routing |
| Cost reduction | Tail sampling + filter |
| Strict privacy requirements | redaction processor |

## ❌ Anti-patterns

- **App sends directly to Datadog / Honeycomb**: vendor lock-in. Use OTLP + Collector.
- **Tail sampling with a small buffer**: meaningful traces are evicted before a decision is made. Size `num_traces` to cover at least `decision_wait` × incoming traces per second.
- **Keeping 100% of traces**: cost explosion. Combine probabilistic and tail sampling.
- **No PII redaction**: risk of GDPR violations.
- **Sampling without a Collector**: only the SDK's head sampling is available, and it decides before the outcome is known, so error traces are lost.
- **No `memory_limiter`**: OOM.
- **Batch too large (10K)**: added latency.

## 🤖 LLM usage hints

- App = OTLP only; the Collector does the routing.
- Tail sampling = prioritize error / slow / VIP traces.
- Always apply PII redaction and a filter for health checks.

## 🔗 Related documents

- [[DevOps_Observability_Stack]]
- [[Native_Crash_Reporting]]
- [[Observability_OpenTelemetry]]