| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| devops-otel-collector | OTel Collector — Pipeline / Sampling / Routing | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 |
OTel Collector
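Telemetry router: receive → process → export. Applications emit only OTLP, and the Collector decides which backend receives the data, so backends can be swapped without touching app code. Tail sampling, attribute scrubbing, and multi-export are all handled in one place.
📖 Core Concepts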
- Receivers: OTLP / Prometheus / Jaeger / Zipkin / Fluent.
- Processors: batch / filter / sample / attribute.
- Exporters: where the data is sent (Datadog / Honeycomb / Tempo).
- Pipeline: receivers → processors → exporters.
💻 Code Patterns
Basic config
# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 100
  attributes:
    actions:
      - key: deployment.environment
        value: prod
        action: insert
      - key: user.email
        action: delete # PII
  resource:
    attributes:
      - key: service.namespace
        value: acme
        action: insert
exporters:
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HC_KEY} # read from the environment at startup
  prometheus:
    endpoint: 0.0.0.0:8889
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter goes first so it can apply back-pressure early
      processors: [memory_limiter, resource, attributes, batch]
      exporters: [otlphttp/honeycomb]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheus]
Tail sampling (decide at the tail, not the head)
processors:
  tail_sampling:
    decision_wait: 30s # wait for the trace's spans to arrive before deciding
    num_traces: 100000 # traces held in memory while waiting
    expected_new_traces_per_sec: 10
    policies:
      - name: error-traces
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: 1-percent-baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
      - name: high-value-customer
        type: string_attribute
        string_attribute:
          key: user.plan
          values: [enterprise]
→ Error, slow, and VIP traces are always kept; the rest is sampled at 1%.
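A processor only takes effect once it is listed in a pipeline. The fragment below is a minimal sketch of that wiring, assuming the `otlp` receiver, `memory_limiter`, `batch`, and `otlphttp/honeycomb` exporter are defined as in the basic config. The processor order is a common recommendation rather than a requirement, and the sizing comment is back-of-envelope arithmetic from the values above.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first; batch last so that only the traces that
      # survive sampling are batched for export.
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/honeycomb]

# Rough sizing check (assumes traces arrive at a steady rate):
#   num_traces / decision_wait = 100000 / 30s ≈ 3333 new traces per second
# Above that rate, traces can be evicted before a decision is made.
```

If the expected trace rate is higher than this estimate, raising `num_traces` (and the memory limit with it) or shortening `decision_wait` are the two obvious levers.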
Attribute scrubbing (PII)
processors:
  redaction:
    allow_all_keys: true
    blocked_values:
      - '\d{3}-\d{2}-\d{4}' # SSN
      - '4[0-9]{12}(?:[0-9]{3})?' # credit card (Visa-style numbers)
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: hash # replaces the raw value with its hash (SHA-256 in current releases)
Multi-export (fan-out to several backends)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/datadog, otlphttp/honeycomb]
      # every span is sent to both exporters
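Both exporter names in the pipeline must also be declared under `exporters`. The following is a sketch under stated assumptions: the Honeycomb entry reuses the endpoint and header from the basic config, while the Datadog endpoint is read from a hypothetical `DATADOG_OTLP_ENDPOINT` environment variable because the source does not give a URL or auth header for it. The queue and retry values are illustrative, not tuned.

```yaml
exporters:
  otlphttp/honeycomb:
    endpoint: https://api.honeycomb.io
    headers:
      x-honeycomb-team: ${env:HC_KEY}
    sending_queue:
      enabled: true
      queue_size: 1000          # illustrative value
    retry_on_failure:
      enabled: true
      max_elapsed_time: 300s    # give up on a batch after 5 minutes
  otlphttp/datadog:
    # Hypothetical: the intake URL is not specified in this note,
    # so it is injected from the environment.
    endpoint: ${env:DATADOG_OTLP_ENDPOINT}
    sending_queue:
      enabled: true
    retry_on_failure:
      enabled: true
```

With the sending queue enabled, each exporter buffers and retries on its own, which should keep a slow or failing backend from stalling delivery to the other.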
Filter (drop)
processors:
  filter:
    traces:
      span:
        # spans matching any condition are dropped
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'
→ Drops health-check spans, which only add noise and cost.
Routing (a different backend per condition)
processors:
  routing:
    from_attribute: deployment.environment
    attribute_source: resource # default is context (request metadata)
    table:
      - value: prod
        exporters: [otlphttp/honeycomb]
      - value: dev
        exporters: [debug]
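Recent contrib releases steer this use case toward the routing connector, which sends data to pipelines instead of exporters. The sketch below expresses the same prod/dev split in that style; the pipeline names (`traces/in`, `traces/prod`, `traces/dev`) are made up for illustration, and sending unmatched data to the dev pipeline is an assumption, not something the original table specifies.

```yaml
connectors:
  routing:
    # Anything that matches no statement falls through to these pipelines.
    default_pipelines: [traces/dev]
    table:
      - statement: route() where attributes["deployment.environment"] == "prod"
        pipelines: [traces/prod]

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [routing]       # the connector acts as this pipeline's exporter
    traces/prod:
      receivers: [routing]       # and as the receiver of the downstream pipelines
      exporters: [otlphttp/honeycomb]
    traces/dev:
      receivers: [routing]
      exporters: [debug]
```

Because each downstream pipeline has its own processor list, prod and dev can also be given different sampling or scrubbing rules.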
Sidecar pattern (Kubernetes)
# one collector container next to the app in every pod
spec:
  containers:
    - name: app
      image: myapp
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317 # OTLP gRPC port of the sidecar
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib
      args: [--config=/etc/otel/config.yaml]
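The sidecar reads `/etc/otel/config.yaml`, so that file has to be mounted into the container. One way to do this, sketched here, is a ConfigMap volume; the ConfigMap name `otel-collector-config` and the volume name `otel-config` are hypothetical, and the ConfigMap is assumed to hold a key named `config.yaml`.

```yaml
spec:
  containers:
    - name: otel-collector
      image: otel/opentelemetry-collector-contrib
      args: [--config=/etc/otel/config.yaml]
      volumeMounts:
        - name: otel-config
          mountPath: /etc/otel   # config.yaml appears at /etc/otel/config.yaml
  volumes:
    - name: otel-config
      configMap:
        name: otel-collector-config   # hypothetical; must contain a config.yaml key
```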
Gateway pattern (cluster-level)
App → sidecar → gateway collector → backend
→ Central aggregation, with policies applied in one place.
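On the sidecar side, the gateway hop is just another OTLP exporter. A minimal fragment, assuming a gateway Service reachable at the hypothetical address `otel-gateway.observability.svc.cluster.local:4317` and plaintext traffic inside the cluster:

```yaml
# Sidecar config fragment: forward everything to the gateway collector.
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317  # hypothetical Service name
    tls:
      insecure: true   # assumes in-cluster plaintext; enable TLS otherwise

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/gateway]
```

Keeping the sidecar this thin leaves sampling, scrubbing, and routing decisions to the gateway, which matches the central-policy idea above.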
Self metrics (the Collector's own telemetry)
service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888
      level: detailed
→ Prometheus scrapes this endpoint to monitor the Collector itself.
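On the Prometheus side this needs a scrape job pointed at port 8888. A sketch of the relevant `prometheus.yml` section, where the job name and the `otel-collector` hostname are placeholders for wherever the Collector is reachable:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 15s
    static_configs:
      - targets: ["otel-collector:8888"]   # hypothetical host; port matches service.telemetry
```

Note that 8888 serves the Collector's internal metrics, while the `prometheus` exporter on 8889 serves application metrics, so the two usually get separate scrape jobs.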
Metric (host / process)
receivers:
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
Log (Fluentbit / Loki)
receivers:
  filelog:
    include: [/var/log/app/*.log]
    operators:
      - type: json_parser
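Like processors, the `hostmetrics` and `filelog` receivers do nothing until a pipeline references them. A minimal sketch, assuming the `memory_limiter`, `batch`, `prometheus`, and `debug` components from the basic config; the `debug` exporter stands in for a real log backend, which this note does not configure.

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics]   # app metrics plus host CPU/memory/disk/network
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [filelog]
      processors: [memory_limiter, batch]
      exporters: [debug]               # placeholder until a log backend is chosen
```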
🤔 Decision Criteria
| Environment | Pattern |
|---|---|
| Single service | App → Collector → Backend |
| Multiple services on K8s | Sidecar + Gateway |
| High traffic | Gateway only |
| Multi-cloud | Gateway handles routing |
| Cost reduction | Tail sampling + filter |
| Strict privacy requirements | redaction processor |
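❌ Anti-patterns
- App exports directly to Datadog / Honeycomb: vendor lock-in. Use OTLP + Collector instead.
- Tail sampling with a small buffer: meaningful traces are evicted before a decision is made. Size num_traces generously.
- Keeping 100% of traces: cost explodes. Combine probabilistic and tail sampling.
- No PII redaction: risks a GDPR violation.
- Sampling without a Collector: only the SDK's head sampling is available, so error traces get lost.
- No memory_limiter: OOM.
- Batch too large (10K): added latency.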
🤖 LLM Usage Hints
- App = OTLP only; the Collector does the routing.
- Tail sampling = prioritize error / slow / VIP traces.
- Always apply PII redaction and a filter for health checks.