---
id: backend-dlq-deep
title: Dead Letter Queue Deep - handling failed messages
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [backend, queue, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [DLQ, dead letter queue, retry, poison pill, message replay, error handling]
---

# Dead Letter Queue (Deep)

> A message that keeps failing and is retried forever is a poison pill. **A DLQ is where unprocessable messages go, paired with alerting and manual intervention.** Applies to SQS, Kafka, and RabbitMQ.

## 📖 Core concepts

- A message that still fails after its retries goes to the DLQ.
- Alert, analyze, then replay.
- Keeps poison pills from clogging the main queue.
- Leaves a trail for tracking down bugs.

## 💻 Code patterns

### SQS (automatic DLQ)

```hcl
# Terraform
resource "aws_sqs_queue" "main" {
  name = "main-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "dlq" {
  name                      = "main-queue-dlq"
  message_retention_seconds = 1209600 # 14 days
}
```

→ Once a message has been received 5 times without being deleted (processing failed each time), SQS moves it to the DLQ automatically.

### RabbitMQ

```ts
// Assumes `channel` is an open channel (e.g. from amqplib).
// Declare the dead-letter exchange and queue first so dead-lettered messages have a route.
await channel.assertExchange('dlx', 'direct');
await channel.assertQueue('main-dlq');
await channel.bindQueue('main-dlq', 'dlx', 'main.dead');

await channel.assertQueue('main-queue', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'main.dead',
    'x-message-ttl': 60000, // messages not consumed within 60s expire and are dead-lettered
  },
});
```

→ Rejecting a message with `requeue=false` (or letting its TTL expire) routes it to the DLQ.

### Kafka (manual)

```ts
// `processMessage`, `commit`, `sendToDLQ`, and `sendToRetry` are application-level helpers.
async function consume(message) {
  try {
    await processMessage(message);
    await commit();
  } catch (e) {
    const attempts = message.attempts ?? 0;
    if (attempts >= 5) {
      await sendToDLQ(message, e);
      await commit(); // skip the poison message
      return;
    }
    // Re-publish to the retry topic with an incremented attempt count
    await sendToRetry(message, attempts + 1);
    await commit();
  }
}
```

→ Kafka does not provide a DLQ itself; the consumer implements it manually.

### Retry queue + DLQ

```
Main queue → fail → Retry queue (delay 1s) → fail → Retry queue (delay 10s) → fail → ... → DLQ
Escalating backoff:
- 1st retry: 1s
- 2nd retry: 10s
- 3rd retry: 60s
- 4th retry: 600s
- 5th failure: DLQ
```

```ts
// Delay before each retry, matching the schedule above
const RETRY_DELAYS_MS = [1_000, 10_000, 60_000, 600_000];

async function consume(msg) {
  try {
    await processMessage(msg);
  } catch (e) {
    const attempts = msg.attempts ?? 0; // retries already made
    if (attempts >= RETRY_DELAYS_MS.length) return sendToDLQ(msg, e);
    await sendToRetryWithDelay(msg, attempts + 1, RETRY_DELAYS_MS[attempts]);
  }
}
```

### BullMQ (Node)

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 }; // example Redis connection

// Retry settings are job options, so set them on the queue (or per job), not on the Worker.
const queue = new Queue('main', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
});

const worker = new Worker('main', async (job) => {
  await processMessage(job.data);
}, { connection });

worker.on('failed', (job, err) => {
  // The handler can fire for attempts that will still be retried; act only once retries are exhausted.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    console.error('retries exhausted', { jobId: job.id, error: err });
    // BullMQ keeps the job in its 'failed' set; move it to a dedicated DLQ manually if needed.
  }
});
```

### Failure analysis

```
A DLQ message should carry:
- Original payload
- Error message
- Stack trace
- Retry attempts
- First / last failure time
```

```ts
interface DLQMessage {
  originalPayload: unknown;
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}
```

### Alert

```
Messages in the DLQ should trigger an alert (e.g. PagerDuty).
- 1 message: ignore (likely transient).
- 10+ messages / hour: alert.
- 100+ messages: P0.
→ Tune the thresholds per system.
```

### Manual replay

```ts
// Review one message → fix → replay
async function replayFromDLQ() {
  const msg = await dlq.receive();
  // Inspect the failed message
  console.log(msg);
  // Fix the root cause (deploy) before continuing.
  // Replay: re-send the original payload, then remove it from the DLQ
  await mainQueue.send(msg.originalPayload);
  await dlq.delete(msg);
}
```

### Replay tool

```ts
// Move every DLQ message back to the main queue
async function replayAll() {
  while (true) {
    const msgs = await dlq.receiveBatch(10);
    if (msgs.length === 0) break;
    for (const msg of msgs) {
      await mainQueue.send(msg.originalPayload);
      await dlq.delete(msg);
    }
  }
}
```

→ Use with care: confirm the bug is actually fixed before replaying.
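Because an unbounded replay can flood the main queue if the fix turns out to be incomplete, a capped variant can be a safer starting point. The following is a minimal sketch, not a definitive implementation: `DlqEntry`, `DeadLetterSource`, `MessageSink`, and `replayWithLimit` are hypothetical names introduced for illustration, and real SQS, RabbitMQ, or Kafka clients expose different APIs.

```typescript
// Hypothetical minimal interfaces; adapt them to your actual queue client.
interface DlqEntry {
  id: string;
  originalPayload: unknown;
}

interface DeadLetterSource {
  receiveBatch(max: number): Promise<DlqEntry[]>;
  delete(entry: DlqEntry): Promise<void>;
}

interface MessageSink {
  send(payload: unknown): Promise<void>;
}

interface ReplayOptions {
  maxMessages: number; // hard cap so one run cannot flood the main queue
  batchSize: number;
}

// Replays at most `maxMessages` entries and returns how many were replayed.
async function replayWithLimit(
  dlq: DeadLetterSource,
  main: MessageSink,
  opts: ReplayOptions,
): Promise<number> {
  let replayed = 0;
  while (replayed < opts.maxMessages) {
    const remaining = opts.maxMessages - replayed;
    const batch = await dlq.receiveBatch(Math.min(opts.batchSize, remaining));
    if (batch.length === 0) break; // DLQ is empty
    for (const entry of batch) {
      await main.send(entry.originalPayload);
      await dlq.delete(entry); // delete only after the re-send succeeded
      replayed++;
    }
  }
  return replayed;
}
```

A small cap (for example 10 messages) lets you watch whether the replayed messages succeed before replaying the rest.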
### Selective replay

```ts
// Replay only a specific failure type
const messages = await dlq.peek(100);
const matching = messages.filter(m => m.failureReason.includes('TimeoutError'));
for (const m of matching) {
  await mainQueue.send(m.originalPayload);
  await dlq.delete(m); // remove it so a later replay does not send it twice
}
```

### Idempotent processing

```ts
async function processMessage(msg) {
  // Skip messages that were already handled
  if (await db.processed.exists(msg.id)) return;
  // ... do the actual work ...
  await db.processed.insert({ id: msg.id }); // a unique constraint on id guards against concurrent duplicates
}
```

→ Makes replay safe.

### Failure category

```
Transient:
- Network timeout
- 503 from an external service
- Rate limit
- Lock timeout
→ Retryable (exponential backoff).

Permanent:
- Bad data (validation failure)
- Auth failure
- Resource not found
→ Send to the DLQ immediately (retrying is pointless).
```

### Smart routing

```ts
async function consume(msg) {
  try {
    await processMessage(msg);
  } catch (e) {
    if (isTransient(e)) {
      await retry(msg);
    } else {
      await sendToDLQ(msg, e); // permanent: skip retry
    }
  }
}
```

### Schema validation first

```ts
const result = schema.safeParse(msg.payload);
if (!result.success) {
  // Bad data: send to the DLQ immediately
  await sendToDLQ(msg, result.error);
  return;
}
await processMessage(result.data);
```

→ Bad data is never retried on the main queue.

### Versioning + DLQ

```
Old message format (v1) + new code (v2) = parse failure.

DLQ:
- Old messages collect there temporarily.
- A migration tool transforms v1 → v2.
- Replay.
```

### Monitoring

```
DLQ depth: query every 5 minutes.
Alert if:
- depth > 0 for 30+ min (some failures)
- depth > 100 (large failure)
- depth is growing quickly (incident)
```

```promql
# Prometheus (the metric name depends on your exporter)
sqs_queue_messages_visible{queue="main-queue-dlq"} > 100
```

### Retention

```
SQS deletes DLQ messages once the retention period ends (14 days at most).
- After that they are lost.
- If every message is critical, keep them longer (SQS caps at 14 days, so archive elsewhere beyond that).
→ Process or replay within 14 days.
```

### When the DLQ itself fails

```
Sending to the DLQ can fail too (e.g. SQS is down).
- The main queue retries forever.
- Workers get stuck.
→ Use a DLQ-of-DLQ, or simply log + alert.
```

→ Treat `sqs:DLQ-fail` as a critical alert.
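The "log + alert" fallback described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `FailedMessage`, `DlqClient`, `Alerter`, and `sendToDlqWithFallback` are hypothetical names, only the `sqs:DLQ-fail` event name comes from this note, and there is no delay between attempts, which a production version would likely add.

```typescript
// Hypothetical interfaces for illustration; swap in your real queue client and alerting.
interface FailedMessage {
  id: string;
  payload: unknown;
}

interface DlqClient {
  send(body: unknown): Promise<void>;
}

interface Alerter {
  critical(event: string, detail: Record<string, unknown>): void;
}

// Returns true if the message reached the DLQ, false if the fallback path was used.
async function sendToDlqWithFallback(
  dlq: DlqClient,
  alerter: Alerter,
  msg: FailedMessage,
  error: Error,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await dlq.send({ ...msg, error: { message: error.message, stack: error.stack } });
      return true;
    } catch (sendError) {
      if (attempt < maxAttempts) continue; // bounded retry, never an infinite loop
      // Last resort: keep the payload in the logs so it can be recovered by hand.
      console.error('DLQ send failed', JSON.stringify({ msg, reason: String(sendError) }));
      alerter.critical('sqs:DLQ-fail', { messageId: msg.id });
    }
  }
  return false;
}
```

The boolean return value lets the caller decide what to do with the original message: acknowledge it because the payload is preserved in the logs, or leave it unacknowledged so the broker redelivers it later.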
### Error context

```ts
async function sendToDLQ(msg, error) {
  await dlq.send({
    ...msg,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
      timestamp: new Date(),
    },
    consumerVersion: process.env.GIT_SHA, // which build failed to process the message
  });
}
```

→ Makes debugging easier.

### Per-tenant DLQ

```ts
// Multi-tenant
const queueName = `dlq-${tenantId}`;
// Each tenant gets its own DLQ.
// One tenant's failures do not affect the other tenants.
```

→ Prevents noisy-neighbor problems.

### DLQ for LLM calls

```
LLM API call failures:
- Rate limit → retry.
- Invalid prompt → DLQ + manual review.
- Model down → retry.
→ Smart routing.
```

### Replay after deploy

```
Replaying the DLQ after deploying a new version:
- The new code may contain the fix for the bug.
- The new version can then process the DLQ messages.
→ Manual replay after a deploy is a common workflow.
```

### Cost

```
DLQ messages cost money too.
SQS: about $0.40 per 1M requests (standard queue).
Kafka: broker storage cost.
→ Small, but a large system can accumulate GBs.
```

### Pitfalls

```
- No alert on the DLQ: silent failure.
- Infinite retry: poison pill.
- No replay plan: the DLQ becomes a graveyard.
- No idempotency: replay causes duplicate effects.
- Retrying bad payloads: send them to the DLQ immediately instead.
- No per-message error context: hard to debug.
- DLQ sharing the main queue's access: avoid it (use a separate role).
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| AWS | SQS DLQ |
| RabbitMQ | DLX |
| Kafka | Manual DLQ topic |
| Node | BullMQ |
| Every message is critical | Long retention + alert |
| Multi-tenant | Per-tenant DLQ |
| LLM API | Smart routing |
| Idempotent processing | Always guarantee it |

## ❌ Anti-patterns

- **No DLQ**: a poison pill blocks the main queue.
- **No alert**: silent failures pile up.
- **Infinite retry**: the queue gets blocked.
- **No replay plan**: the DLQ becomes a graveyard.
- **No idempotency**: replay causes duplicates.
- **No error context**: debugging is impossible.
- **No retention**: data is lost.

## 🤖 LLM usage hints

- DLQ = unprocessable messages + alert + replay.
- Smart routing (transient vs permanent).
- Idempotent processing is required.
- Per-tenant DLQs prevent noisy-neighbor problems.

## 🔗 Related documents

- [[Messaging_DLQ_Patterns]]
- [[Backend_Idempotent_Consumer]]
- [[Backend_Idempotency_Deep]]