---
id: backend-dlq-deep
title: Dead Letter Queue Deep - handling failed messages
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [backend, queue, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [DLQ, dead letter queue, retry, poison pill, message replay, error handling]
---

# Dead Letter Queue (Deep)

> A message that keeps failing and is retried forever is a poison pill. **A DLQ is where unprocessable messages land, combined with alerting and manual intervention.** Covers SQS, Kafka, and RabbitMQ.

## 📖 Core concepts
- A message that still fails after its retries goes to the DLQ.
- Alert, analyze, then replay.
- Keeps poison pills from stalling the main queue.
- Leaves a trail for tracking down bugs.

## 💻 Code patterns

### SQS (automatic DLQ)
```hcl
# Terraform
resource "aws_sqs_queue" "main" {
  name = "main-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "dlq" {
  name                      = "main-queue-dlq"
  message_retention_seconds = 1209600 # 14 days
}
```

→ Once a message has been received 5 times without being deleted (each attempt failed), SQS moves it to the DLQ automatically.

### RabbitMQ
```ts
await channel.assertQueue('main-queue', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'main.dead',
    'x-message-ttl': 60000, // messages that expire after 60s are dead-lettered
  },
});

await channel.assertExchange('dlx', 'direct');
await channel.assertQueue('main-dlq');
await channel.bindQueue('main-dlq', 'dlx', 'main.dead');
```

→ Rejecting (or nacking) a message with requeue=false dead-letters it, as does TTL expiry.

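### Kafka (manual)
```ts
// `process`, `commit`, `sendToDLQ`, and `sendToRetry` are application helpers.
async function consume(message) {
  // Kafka does not track delivery attempts, so the count travels with the message.
  const attempts = message.attempts ?? 0;
  try {
    await process(message);
    await commit();
  } catch (e) {
    if (attempts >= 5) {
      await sendToDLQ(message, e);
      await commit(); // skip the poison message
      return;
    }

    // Retry: republish with an incremented attempt count, then move on.
    await sendToRetry(message, attempts + 1);
    await commit();
  }
}
```

→ Kafka has no built-in DLQ for consumers, so it is implemented manually.
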
### Retry queue + DLQ
```
Main queue → fail → Retry queue (delay 1s) → fail → Retry (delay 10s) → fail → DLQ.

Backoff schedule (roughly exponential):
- 1st retry: 1s
- 2nd: 10s
- 3rd: 60s
- 4th: 600s
- 5th failure: DLQ
```

```ts
// Delay before each retry, matching the schedule above.
const RETRY_DELAYS_MS = [1_000, 10_000, 60_000, 600_000];

async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    const attempts = msg.attempts ?? 0; // retries already made
    if (attempts >= RETRY_DELAYS_MS.length) return sendToDLQ(msg, e);

    await sendToRetryWithDelay(msg, attempts + 1, RETRY_DELAYS_MS[attempts]);
  }
}
```

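### BullMQ (Node)
```ts
import { Queue, Worker } from 'bullmq';

// Retry settings are job options, so they belong on the queue (or on each add()), not on the Worker.
const queue = new Queue('main', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
});

const worker = new Worker('main', async (job) => {
  await process(job.data);
}, { connection });

worker.on('failed', (job, err) => {
  // Fires on every failed attempt; act only once the retries are exhausted.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    log.error('retries exhausted', { jobId: job.id, error: err });
    // BullMQ only marks the job as 'failed'. Moving it to a separate DLQ queue is manual.
  }
});
```
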
### Failure analysis
```
A DLQ message carries:
- Original payload
- Error message
- Stack trace
- Retry attempts
- First / last failure time
```

```ts
interface DLQMessage {
  originalPayload: unknown;
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}
```

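For triage it often helps to see which failure reasons dominate the DLQ before reading individual messages. The following is a minimal sketch, assuming messages have already been read into memory; `FailedMessage`, `ReasonSummary`, and `summarizeFailures` are hypothetical names, not part of any queue SDK.

```ts
// Hypothetical minimal shape: only the fields this sketch needs.
interface FailedMessage {
  failureReason: string;
  attemptCount: number;
}

interface ReasonSummary {
  reason: string;
  count: number;
  maxAttempts: number;
}

// Group DLQ messages by failure reason, most frequent first.
function summarizeFailures(messages: FailedMessage[]): ReasonSummary[] {
  const byReason = new Map<string, ReasonSummary>();
  for (const m of messages) {
    const entry = byReason.get(m.failureReason)
      ?? { reason: m.failureReason, count: 0, maxAttempts: 0 };
    entry.count += 1;
    entry.maxAttempts = Math.max(entry.maxAttempts, m.attemptCount);
    byReason.set(m.failureReason, entry);
  }
  return Array.from(byReason.values()).sort((a, b) => b.count - a.count);
}

const summary = summarizeFailures([
  { failureReason: 'TimeoutError', attemptCount: 5 },
  { failureReason: 'ValidationError', attemptCount: 1 },
  { failureReason: 'TimeoutError', attemptCount: 3 },
]);
console.log(summary);
```

A single dominant reason usually points at one root cause worth fixing before any replay.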
### Alert
```
Messages in the DLQ = alert (PagerDuty).
- 1 message: ignore (likely transient).
- 10+ messages / hour: alert.
- 100+ messages: P0.

→ Thresholds vary by system.
```

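The thresholds above can be expressed as a small pure function. This is a sketch under stated assumptions: it reads "10+ messages / hour" as an arrival rate and "100+ messages" as total depth, and the names `dlqSeverity`, `ALERT_PER_HOUR`, and `P0_TOTAL` are illustrative only.

```ts
type Severity = 'none' | 'alert' | 'p0';

// Thresholds taken from the list above; tune them per system.
const ALERT_PER_HOUR = 10;
const P0_TOTAL = 100;

// messagesLastHour: messages that arrived in the DLQ during the last hour (assumed input).
// totalDepth: messages currently sitting in the DLQ (assumed input).
function dlqSeverity(messagesLastHour: number, totalDepth: number): Severity {
  if (totalDepth >= P0_TOTAL) return 'p0';
  if (messagesLastHour >= ALERT_PER_HOUR) return 'alert';
  return 'none'; // a stray message or two is treated as transient
}

console.log(dlqSeverity(1, 1), dlqSeverity(12, 12), dlqSeverity(5, 150));
```

Keeping the decision in one function makes it easy to unit test and to adjust when the system's normal failure rate changes.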
### Manual replay
```ts
// Review one message → fix → replay
async function replayFromDLQ() {
  const msg = await dlq.receive();
  if (!msg) return; // DLQ is empty

  // Inspect
  console.log(msg);

  // Fix the root cause (deploy) before replaying.

  // Replay
  await mainQueue.send(msg.originalPayload);
  await dlq.delete(msg);
}
```

### Replay tool
```ts
// Move every DLQ message back to the main queue
async function replayAll() {
  while (true) {
    const msgs = await dlq.receiveBatch(10);
    if (msgs.length === 0) break;

    for (const msg of msgs) {
      await mainQueue.send(msg.originalPayload);
      await dlq.delete(msg);
    }
  }
}
```

→ Use with care. Confirm the bug is actually fixed first, otherwise every message fails again and returns to the DLQ.

### Selective replay
```ts
// Replay only a specific failure type
const messages = await dlq.peek(100);
const matching = messages.filter(m => m.failureReason.includes('TimeoutError'));

for (const m of matching) {
  await mainQueue.send(m.originalPayload);
  await dlq.delete(m); // remove it so a later replay does not send it twice
}
```

### Idempotent processing
```ts
async function process(msg) {
  // Skip messages that were already handled.
  if (await db.processed.exists(msg.id)) return;

  // Process

  // Record completion so a replayed copy is skipped.
  await db.processed.insert({ id: msg.id });
}
```

→ Makes replay safe.

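The check-then-insert pattern above leaves a window in which two consumers can both pass the `exists` check for the same message. One possible variant claims the id first and releases it on failure. The sketch below uses an in-memory `Set` as a stand-in for a table with a unique key; `ProcessedStore` and `handleOnce` are hypothetical names, and a real consumer would back the claim with its database or cache.

```ts
// In-memory stand-in for a store with a unique key on message id (an assumption).
class ProcessedStore {
  private readonly ids = new Set<string>();

  // Returns true only for the first caller that claims this id.
  claim(id: string): boolean {
    if (this.ids.has(id)) return false;
    this.ids.add(id);
    return true;
  }

  // Undo a claim so a failed message can be retried or replayed later.
  release(id: string): void {
    this.ids.delete(id);
  }
}

async function handleOnce(
  store: ProcessedStore,
  msg: { id: string },
  handler: () => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (!store.claim(msg.id)) return 'duplicate';
  try {
    await handler();
    return 'processed';
  } catch (e) {
    store.release(msg.id); // failed work must stay eligible for retry
    throw e;
  }
}

const store = new ProcessedStore();
const firstClaim = store.claim('msg-1');  // first delivery wins the claim
const secondClaim = store.claim('msg-1'); // replayed copy is rejected
console.log(firstClaim, secondClaim);
```

The trade-off of claiming first is that a crash between the claim and the end of the handler leaves the id marked as processed, so a durable implementation would also need a way to expire or reconcile stale claims.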
### Failure category
```
Transient:
- Network timeout
- 503 from an external service
- Rate limit
- Lock timeout

→ Retryable (exponential backoff).

Permanent:
- Bad data (validation failure)
- Auth failure
- Resource not found

→ Send to the DLQ immediately (retrying is pointless).
```

### Smart routing
```ts
async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    if (isTransient(e)) {
      await retry(msg);
    } else {
      await sendToDLQ(msg, e); // permanent: skip retry
    }
  }
}
```

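The routing snippet above calls `isTransient` without defining it. A minimal sketch of such a classifier follows, assuming errors expose either a Node-style `code` or an HTTP `status`; the `QueueError` shape and the exact code and status lists are illustrative assumptions to be adapted per system.

```ts
// Hypothetical error shape: `code` for Node-style network errors, `status` for HTTP responses.
interface QueueError {
  code?: string;
  status?: number;
}

const TRANSIENT_CODES = new Set<string>(['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED']);
const TRANSIENT_STATUSES = new Set<number>([429, 502, 503, 504]);

function isTransient(e: QueueError): boolean {
  if (e.code !== undefined && TRANSIENT_CODES.has(e.code)) return true;
  if (e.status !== undefined && TRANSIENT_STATUSES.has(e.status)) return true;
  return false; // unknown errors are treated as permanent and go to the DLQ
}

console.log(isTransient({ status: 503 }), isTransient({ status: 400 }));
```

Treating unknown errors as permanent is a design choice. The opposite default (retry anything unrecognized) is also defensible as long as retries are bounded.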
### Schema validation first
```ts
const result = schema.safeParse(msg.payload);
if (!result.success) {
  // Bad data: send to the DLQ immediately
  await sendToDLQ(msg, result.error);
  return;
}

await process(result.data);
```

→ Bad data is never retried on the main queue.

### Versioning + DLQ
```
Old message format (v1) + new code (v2) = parse failure.

DLQ:
- Old messages collect here temporarily.
- A migration tool transforms v1 → v2.
- Replay.
```

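The v1 → v2 transform step might look like the sketch below. The payload shapes (`OrderV1`, `OrderV2`) and their fields are invented purely for illustration; the real transform depends entirely on the actual message schemas.

```ts
// Hypothetical payload shapes, purely for illustration.
interface OrderV1 {
  orderId: string;
  amount: number; // dollars, no explicit version field
}

interface OrderV2 {
  version: 2;
  orderId: string;
  amountCents: number;
}

function isV2(p: OrderV1 | OrderV2): p is OrderV2 {
  return (p as OrderV2).version === 2;
}

// Upgrade a DLQ payload to the current format before replaying it.
function migrateToV2(p: OrderV1 | OrderV2): OrderV2 {
  if (isV2(p)) return p; // already current
  return {
    version: 2,
    orderId: p.orderId,
    amountCents: Math.round(p.amount * 100),
  };
}

const migrated = migrateToV2({ orderId: 'o-1', amount: 12.5 });
console.log(migrated);
```

Making the transform a no-op for messages that are already v2 lets the same tool run over a DLQ that contains a mix of both formats.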
### Monitoring
```
DLQ depth: query every 5 min.
Alert if:
- depth > 0 for 30 min (something is failing)
- depth > 100 (large failure)
- depth is growing quickly (incident)
```

```promql
# Prometheus (the metric name depends on the exporter in use)
sqs_queue_messages_visible{queue="main-queue-dlq"} > 100
```
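Where the alert logic lives in application code rather than in a rule engine, the three depth rules above can be evaluated over a series of readings. This is a sketch under assumptions: `DepthSample`, `evaluateDepth`, and the growth threshold `GROWTH_PER_SAMPLE` are hypothetical, and the source gives no number for "growth rate".

```ts
// One DLQ depth reading; the rules above assume a reading every 5 minutes.
interface DepthSample {
  atMs: number;  // epoch milliseconds
  depth: number; // messages in the DLQ
}

type DepthAlert = 'nonEmptyFor30Min' | 'depthOver100' | 'growingFast';

const THIRTY_MIN_MS = 30 * 60 * 1000;
const GROWTH_PER_SAMPLE = 20; // assumed threshold, not from the source

// `samples` must be ordered oldest to newest.
function evaluateDepth(samples: DepthSample[]): DepthAlert[] {
  const alerts: DepthAlert[] = [];
  if (samples.length === 0) return alerts;
  const latest = samples[samples.length - 1];

  // Walk back to the start of the current non-empty streak.
  let nonEmptySince = latest.atMs;
  for (let i = samples.length - 1; i >= 0 && samples[i].depth > 0; i--) {
    nonEmptySince = samples[i].atMs;
  }
  if (latest.depth > 0 && latest.atMs - nonEmptySince >= THIRTY_MIN_MS) {
    alerts.push('nonEmptyFor30Min');
  }

  if (latest.depth > 100) alerts.push('depthOver100');

  if (samples.length >= 2) {
    const previous = samples[samples.length - 2];
    if (latest.depth - previous.depth >= GROWTH_PER_SAMPLE) alerts.push('growingFast');
  }
  return alerts;
}

const FIVE_MIN_MS = 5 * 60 * 1000;
const steady = [0, 1, 2, 3, 4, 5, 6].map((i) => ({ atMs: i * FIVE_MIN_MS, depth: 3 }));
const spike = [{ atMs: 0, depth: 0 }, { atMs: FIVE_MIN_MS, depth: 150 }];
console.log(evaluateDepth(steady), evaluateDepth(spike));
```
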
### Retention
```
SQS deletes DLQ messages after 14 days (its maximum retention).
- They are lost.
- If every message is critical, arrange longer retention (SQS caps at 14 days, so archive elsewhere).

→ Process or replay within 14 days.
```

### When the DLQ itself fails
```
Sending to the DLQ fails (SQS down).
- The main queue retries forever.
- The worker gets stuck.

→ Use a DLQ-of-DLQ, or simply log + alert.
```

→ Treat `sqs:DLQ-fail` as a critical alert.

### Error context
```ts
async function sendToDLQ(msg, error) {
  await dlq.send({
    ...msg,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
      timestamp: new Date().toISOString(),
    },
    consumerVersion: process.env.GIT_SHA, // which build failed
  });
}
```

→ Debug-friendly.

### Per-tenant DLQ
```ts
// Multi-tenant
const queueName = `dlq-${tenantId}`;

// Each tenant gets its own DLQ.
// One tenant's failures do not affect other tenants.
```

→ Prevents noisy neighbors.

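If the per-tenant DLQs live in SQS, the tenant id has to fit the queue naming rules (letters, digits, hyphens, and underscores, at most 80 characters). The helper below is a sketch with a hypothetical name, `tenantDlqName`; it assumes tenant ids are short enough that truncation and character replacement do not make two tenants collide.

```ts
// SQS queue names allow only letters, digits, hyphens, and underscores, up to 80 characters.
function tenantDlqName(tenantId: string): string {
  if (tenantId.length === 0) throw new Error('tenantId must not be empty');
  const safe = tenantId.replace(/[^A-Za-z0-9_-]/g, '-');
  return `dlq-${safe}`.slice(0, 80);
}

console.log(tenantDlqName('acme'), tenantDlqName('acme.eu/1'));
```

Because distinct ids such as `a.b` and `a-b` map to the same name, a stricter version would reject unsafe ids instead of rewriting them.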
### DLQ for LLM calls
```
LLM API call failures:
- Rate limit → retry.
- Invalid prompt → DLQ + manual review.
- Model is down → retry.

→ Smart routing.
```

### Replay during deploy
```
Replaying the DLQ after deploying a new version:
- The new code may contain the fix for the bug.
- The new version can then process the DLQ messages.

→ Manual replay after a deploy is a common workflow.
```

### Cost
```
DLQ messages are not free.
SQS: roughly $0.40 / 1M requests (billed per request rather than per GB stored; check current pricing).
Kafka: storage cost.

→ Small, but a large system can accumulate GBs.
```

### Pitfalls
```
- No alert on the DLQ: silent failure.
- Infinite retry: poison pill.
- No replay plan: the DLQ is just a graveyard.
- No idempotency: replay causes duplicate effects.
- Retrying bad payloads: send them to the DLQ immediately instead.
- No per-message error context: hard to debug.
- DLQ shares the main queue's access: avoid this (use a separate role).
```

## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| AWS | SQS DLQ |
| RabbitMQ | DLX |
| Kafka | Manual DLQ topic |
| Node | BullMQ |
| Every message is critical | Long retention + alert |
| Multi-tenant | Per-tenant DLQ |
| LLM API | Smart routing |
| Idempotent processing | Guarantee it every time |

## ❌ Anti-patterns
- **No DLQ**: a poison pill blocks the main queue.
- **No alert**: silent failures pile up.
- **Infinite retry**: the queue clogs.
- **No replay plan**: the DLQ becomes a graveyard.
- **No idempotency**: replay = duplicates.
- **No error context**: impossible to debug.
- **No retention**: data loss.

## 🤖 Hints for LLM use
- DLQ = unprocessable messages + alert + replay.
- Smart routing (transient vs permanent).
- Idempotent processing is mandatory.
- Per-tenant DLQs prevent noisy neighbors.

## 🔗 Related documents
- [[Messaging_DLQ_Patterns]]
- [[Backend_Idempotent_Consumer]]
- [[Backend_Idempotency_Deep]]