id: devops-otel-collector
title: OTel Collector: Pipeline / Sampling / Routing
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: devops, otel, opentelemetry, observability, vibe-coding
tech_stack: language = YAML / OTel; applicable_to = DevOps
applied_in:
aliases: OpenTelemetry Collector, OTel, receivers, processors, exporters, tail sampling

OTel Collector

Telemetry router: receive → process → export. Apps emit only OTLP, and the Collector is where backends get swapped. Tail sampling, attribute scrubbing, and multi-export are all handled in one place.

📖 Key concepts

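  • Receivers: how data gets in (OTLP / Prometheus / Jaeger / Zipkin / Fluent Forward).
  • Processors: what happens in between (batch / filter / sample / attribute).
  • Exporters: where data is sent (Datadog / Honeycomb / Tempo).
  • Pipeline: receivers → processors → exporters, declared per signal (traces, metrics, logs).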

💻 Code patterns

Basic config

# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 100

  attributes:
    actions:
      - key: deployment.environment
        value: prod
        action: insert
      - key: user.email
        action: delete  # PII

  resource:
    attributes:
      - key: service.namespace
        value: acme
        action: insert

exporters:
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers: { x-honeycomb-team: "${env:HC_KEY}" }  # read the API key from the HC_KEY environment variable
  
  prometheus:
    endpoint: 0.0.0.0:8889

  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, attributes, batch]  # memory_limiter first, batch last
      exporters: [otlphttp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]

Tail sampling (decide at the tail, not the head)

processors:
  tail_sampling:
    decision_wait: 30s   # wait this long after a trace's first span before deciding
    num_traces: 100000
    expected_new_traces_per_sec: 10
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      
      - name: 1-percent-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

      - name: high-value-customer
        type: string_attribute
        string_attribute:
          key: user.plan
          values: [enterprise]

→ Error, slow, and VIP traces are always kept; the rest are sampled at 1%.
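
The policies above only take effect once the processor is listed in a pipeline. A minimal sketch, assuming the `otlp` receiver, `memory_limiter`, `batch`, and `otlphttp/honeycomb` exporter from the basic config are defined in the same file:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter guards the collector first; tail_sampling then
      # buffers whole traces; batch groups whatever survives sampling.
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/honeycomb]
```

Tail sampling can only judge a trace if all of its spans reach the same Collector instance, so this arrangement fits a single gateway better than a set of independent sidecars.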

Attribute scrubbing (PII)

processors:
  redaction:
    allow_all_keys: true
    blocked_values:
      - '\d{3}-\d{2}-\d{4}'   # SSN
      - '4[0-9]{12}(?:[0-9]{3})?'  # credit card

  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash  # replace the value with its hash (SHA-1 or SHA-256, depending on Collector version)
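
The basic config already defines a processor named `attributes`, so combining both snippets in one file would produce a duplicate key. One way around this is the `type/name` form for component instances. The sketch below assumes the instance names `attributes/env` and `attributes/scrub`, which are illustrative and not part of the original configs:

```yaml
processors:
  attributes/env:            # hypothetical instance name
    actions:
      - key: deployment.environment
        value: prod
        action: insert
  attributes/scrub:          # hypothetical instance name
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Scrub before batching so unredacted data never reaches an exporter.
      processors: [memory_limiter, redaction, attributes/scrub, attributes/env, batch]
      exporters: [otlphttp/honeycomb]
```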

Multi-export (fan-out to several backends)

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/datadog, otlphttp/honeycomb]
      # every exporter listed receives a full copy of the data
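
When the backends need different processing rather than identical copies, the same receiver can feed several pipelines. A sketch, assuming the `tail_sampling` processor and the `debug` exporter shown elsewhere on this page are defined; the pipeline names `traces/sampled` and `traces/full` are illustrative:

```yaml
service:
  pipelines:
    traces/sampled:          # sampled traffic goes to the paid backend
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/honeycomb]
    traces/full:             # unsampled copy, e.g. for local debugging
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug]
```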

Filter (drop)

processors:
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'

→ Drops health-check spans, which cuts both noise and cost.
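
Newer HTTP semantic conventions record the path under `url.path` rather than `http.target`, so which key matches depends on the instrumentation version in use. A hedged variant, assuming the app emits `url.path`; the instance name `filter/noise` and the `/ready` path are illustrative:

```yaml
processors:
  filter/noise:              # hypothetical instance name
    error_mode: ignore       # log condition errors instead of failing the batch
    traces:
      span:
        - 'attributes["url.path"] == "/health"'
        - 'IsMatch(attributes["url.path"], "^/(metrics|ready)$")'
```

A span is dropped when any one of the listed conditions evaluates to true.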

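Routing (a different backend per condition)

# Note: recent contrib releases deprecate the routing processor
# in favor of the routing connector.
processors:
  routing:
    attribute_source: resource   # read the key from resource attributes (default is request context)
    from_attribute: deployment.environment
    table:
      - value: prod
        exporters: [otlphttp/honeycomb]
      - value: dev
        exporters: [debug]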
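
The same prod/dev split can be sketched with the routing connector from the contrib distribution, which sits between pipelines instead of inside one. This assumes `deployment.environment` is set as a resource attribute; the pipeline names `traces/in`, `traces/prod`, and `traces/dev` are illustrative:

```yaml
connectors:
  routing:
    default_pipelines: [traces/dev]      # anything unmatched goes here
    table:
      - statement: route() where attributes["deployment.environment"] == "prod"
        pipelines: [traces/prod]

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [routing]               # the connector acts as an exporter here
    traces/prod:
      receivers: [routing]               # and as a receiver here
      exporters: [otlphttp/honeycomb]
    traces/dev:
      receivers: [routing]
      exporters: [debug]
```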

Sidecar pattern (Kubernetes)

# one collector container next to the app in every pod
spec:
  containers:
    - name: app
      image: myapp
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317   # gRPC port; use 4318 if the SDK exports over http/protobuf
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib
      args: [--config=/etc/otel/config.yaml]
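
The `--config` path only resolves if the file is mounted into the collector container. A minimal sketch of the missing pieces, assuming the config lives in a ConfigMap; the names `otel-config` and `otel-collector-config` are hypothetical:

```yaml
spec:
  containers:
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib
      args: [--config=/etc/otel/config.yaml]
      volumeMounts:
        - name: otel-config              # hypothetical volume name
          mountPath: /etc/otel
  volumes:
    - name: otel-config
      configMap:
        name: otel-collector-config      # hypothetical ConfigMap name
        items:
          - key: config.yaml             # key inside the ConfigMap
            path: config.yaml            # file created under /etc/otel
```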

Gateway pattern (cluster-level)

App → sidecar → gateway collector → backend

→ Central aggregation plus policy enforcement in one place.
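
On the sidecar side, this chain amounts to a thin config that forwards everything to the gateway over OTLP. A sketch under stated assumptions: the gateway Service address is hypothetical, and TLS is disabled only because the traffic is assumed to stay inside the cluster:

```yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: localhost:4317 }   # only the app in the same pod connects

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 128
    spike_limit_mib: 32
  batch: {}

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317  # hypothetical Service
    tls:
      insecure: true                       # assumption: plaintext in-cluster traffic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]
```

Sampling, redaction, and routing then live in the gateway config only, so a policy change does not require redeploying every pod.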

Self metrics (the Collector's own telemetry)

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888   # newer releases deprecate this in favor of metrics.readers
      level: detailed

→ Prometheus scrapes this endpoint to monitor the Collector itself.
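
If no separate Prometheus server is available, the Collector can scrape its own endpoint with the `prometheus` receiver and ship the result like any other metric. A sketch, assuming the self-metrics endpoint is on port 8888 as above; the job name and the `metrics/self` pipeline name are illustrative:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector         # hypothetical job name
          scrape_interval: 30s
          static_configs:
            - targets: ['localhost:8888']

service:
  pipelines:
    metrics/self:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/honeycomb]
```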

Metrics (host / process)

receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
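
The receiver above produces nothing until a metrics pipeline references it, and the heading mentions process metrics that the scraper list does not include. A sketch that adds the `process` scraper and wires the receiver in, assuming the `prometheus` exporter from the basic config; the pipeline name `metrics/host` is illustrative:

```yaml
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
      process: {}    # per-process metrics; may need elevated permissions on the host

service:
  pipelines:
    metrics/host:
      receivers: [hostmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```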

Logs (Fluent Bit / Loki)

receivers:
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser

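🤔 Decision criteria

| Environment | Pattern |
|---|---|
| Single service | App → Collector → Backend |
| K8s, multiple services | Sidecar + Gateway |
| High traffic | Gateway only |
| Multi-cloud | Gateway handles routing |
| Cost reduction | Tail sampling + filter |
| Strict privacy requirements | redaction processor |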

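Anti-patterns

  • App exporting directly to Datadog / Honeycomb: vendor lock-in. Use OTLP + Collector instead.
  • Tail sampling with a small buffer: meaningful traces are evicted before a decision is made. Size num_traces generously.
  • Keeping 100% of traces: costs explode. Combine probabilistic and tail sampling.
  • No PII redaction: risk of GDPR violations.
  • Sampling without a Collector: only the SDK's head sampling is available, so error traces get lost.
  • No memory_limiter: risk of OOM.
  • Oversized batches (10K): added export latency.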
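
The buffer-size and batch-size points above can be made concrete with some rough arithmetic. The sketch below assumes a load of about 2000 new traces per second, a made-up figure for illustration; the number of traces held in memory is roughly that rate multiplied by `decision_wait`:

```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    # Assumed load (hypothetical): ~2000 new traces/s.
    # In-flight traces ≈ 2000 traces/s × 30 s = 60000.
    expected_new_traces_per_sec: 2000
    num_traces: 100000          # headroom above the ~60000 estimate
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }

  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # percentage of available memory instead of a fixed MiB value
    spike_limit_percentage: 25

  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048   # hard cap so a burst cannot build a 10K batch
```

If `num_traces` falls below the in-flight estimate, the oldest traces are dropped from the buffer before their sampling decision, which is the failure mode the list warns about.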

🤖 LLM usage hints

  • Apps speak only OTLP; the Collector handles routing.
  • Tail sampling: prioritize error / slow / VIP traces.
  • Always apply PII redaction and a health-check filter.

🔗 Related documents