---
id: backend-dlq-deep
title: Dead Letter Queue Deep — handling failed messages
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [backend, queue, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [DLQ, dead letter queue, retry, poison pill, message replay, error handling]
---
# Dead Letter Queue (Deep)
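> A message that keeps failing and is retried forever is a poison pill. **A DLQ is the place for messages that could not be processed, paired with alerting and manual intervention.** Covers SQS, Kafka, and RabbitMQ.
## 📖 Core concepts
- A message that still fails after its retries goes to the DLQ.
- Alert + analyze + replay.
- Stops poison pills from blocking the main queue.
- Leaves a trail for tracking down bugs.
## 💻 Code patterns
### SQS (automatic DLQ)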
```hcl
# Terraform
resource "aws_sqs_queue" "main" {
  name = "main-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "dlq" {
  name                      = "main-queue-dlq"
  message_retention_seconds = 1209600 # 14 days
}
```
→ After a message has been received 5 times without being deleted (i.e., it kept failing), SQS moves it to the DLQ automatically.
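To make the `maxReceiveCount` rule concrete, here is a toy in-memory model of it. This is only a sketch: `RedriveSimulator` is a hypothetical class, not an AWS API, and it assumes SQS moves a message once its receive count would exceed `maxReceiveCount`.

```ts
// Toy model of the SQS redrive rule (hypothetical class, not part of the AWS SDK).
class RedriveSimulator {
  private receiveCounts = new Map<string, number>();
  private maxReceiveCount: number;
  readonly dlq: string[] = [];

  constructor(maxReceiveCount: number) {
    this.maxReceiveCount = maxReceiveCount;
  }

  // Returns true if the message is delivered to a consumer,
  // false if it is moved to the DLQ instead.
  receive(messageId: string): boolean {
    const next = (this.receiveCounts.get(messageId) ?? 0) + 1;
    if (next > this.maxReceiveCount) {
      this.dlq.push(messageId);
      this.receiveCounts.delete(messageId);
      return false;
    }
    this.receiveCounts.set(messageId, next);
    return true;
  }
}

// A consumer that never deletes the message (always fails):
const sim = new RedriveSimulator(5);
const delivered = Array.from({ length: 6 }, () => sim.receive('m1'));
console.log(delivered, sim.dlq);
```

The consumer's only job is to delete the message on success; every failed attempt simply leaves it in the queue so the receive count keeps growing.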
### RabbitMQ
```ts
await channel.assertQueue('main-queue', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'main.dead',
    'x-message-ttl': 60000, // messages not consumed within 60s expire and are dead-lettered
  },
});
await channel.assertExchange('dlx', 'direct');
await channel.assertQueue('main-dlq');
await channel.bindQueue('main-dlq', 'dlx', 'main.dead');
```
→ Rejecting a message with requeue=false moves it to the DLQ.
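The consumer side of the reject path might look like the sketch below. `ChannelLike` is a structural stand-in for an amqplib channel (its `ack` and `nack(msg, allUpTo, requeue)` mirror amqplib), and `handleDelivery` is a hypothetical helper name.

```ts
// Structural subset of an amqplib channel, so the sketch has no dependencies.
interface Delivery {
  content: { toString(): string };
}
interface ChannelLike {
  ack(msg: Delivery): void;
  nack(msg: Delivery, allUpTo?: boolean, requeue?: boolean): void;
}

// Hypothetical helper: ack on success, dead-letter on failure.
function handleDelivery(
  channel: ChannelLike,
  msg: Delivery,
  handler: (body: string) => void,
): void {
  try {
    handler(msg.content.toString());
    channel.ack(msg);
  } catch {
    // requeue=false: the broker routes the message through the queue's DLX
    channel.nack(msg, false, false);
  }
}

// Fake channel that records what the consumer did.
const calls: string[] = [];
const fakeChannel: ChannelLike = {
  ack: () => { calls.push('ack'); },
  nack: (_msg, _allUpTo, requeue) => { calls.push(`nack:requeue=${requeue}`); },
};

handleDelivery(fakeChannel, { content: 'ok' }, () => {});
handleDelivery(fakeChannel, { content: 'bad' }, () => { throw new Error('boom'); });
console.log(calls);
```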
### Kafka (manual)
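```ts
async function consume(message) {
  try {
    await process(message);
    await commit();
  } catch (e) {
    const attempts = message.attempts ?? 0;
    if (attempts >= 5) {
      await sendToDLQ(message, e);
      await commit(); // skip the poison message
      return;
    }
    // Retry: republish with an incremented attempt counter
    await sendToRetry(message, attempts + 1);
    await commit();
  }
}
```
→ Kafka brokers do not provide a DLQ themselves; the consumer implements it manually.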
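One way to implement `sendToDLQ` is to publish the failed record to a separate topic with the failure details in headers. The sketch below only builds the record; its shape (`topic`, `messages[].key/value/headers`) follows what a kafkajs-style `producer.send()` accepts, while the `.dlq` topic suffix, the `x-*` header names, and `toDlqRecord` itself are assumptions for illustration.

```ts
interface FailedRecord {
  topic: string;
  partition: number;
  offset: string;
  key: string | null;
  value: string;
}

interface DlqRecord {
  topic: string;
  messages: { key: string | null; value: string; headers: Record<string, string> }[];
}

// Hypothetical helper: wrap a failed record for a dead-letter topic.
function toDlqRecord(failed: FailedRecord, error: Error, attempts: number): DlqRecord {
  return {
    topic: `${failed.topic}.dlq`, // naming convention is an assumption
    messages: [
      {
        key: failed.key, // keep the key so per-key ordering is preserved in the DLQ topic
        value: failed.value,
        headers: {
          'x-original-topic': failed.topic,
          'x-original-partition': String(failed.partition),
          'x-original-offset': failed.offset,
          'x-error-message': error.message,
          'x-attempts': String(attempts),
        },
      },
    ],
  };
}

const dlqRecord = toDlqRecord(
  { topic: 'orders', partition: 3, offset: '1042', key: 'user-7', value: '{"id":1}' },
  new Error('validation failed'),
  5,
);
console.log(JSON.stringify(dlqRecord, null, 2));
```

Carrying the original topic, partition, and offset makes it possible to find the source record later and to replay to the right topic.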
### Retry queue + DLQ
```
Main queue → fail → Retry queue (delay 1s) → fail → Retry (delay 10s) → fail → DLQ.
Exponential backoff:
- 1st retry: 1s
- 2nd: 10s
- 3rd: 60s
- 4th: 600s
- 5th: DLQ
```
```ts
async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    const attempts = msg.attempts ?? 0;
    if (attempts >= 5) return sendToDLQ(msg, e);
    // Formula-based variant: doubles each attempt (1s, 2s, 4s, ...), capped at 60s
    const delay = Math.min(60 * 1000, 1000 * Math.pow(2, attempts));
    await sendToRetryWithDelay(msg, attempts + 1, delay);
  }
}
```
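The fixed schedule listed above (1s, 10s, 60s, 600s, then DLQ) can also be expressed as a lookup table instead of a formula. A minimal sketch; `nextStep` and `RETRY_DELAYS_MS` are hypothetical names:

```ts
type NextStep = { kind: 'retry'; delayMs: number } | { kind: 'dlq' };

// Delay before the 1st, 2nd, 3rd, and 4th retry.
const RETRY_DELAYS_MS = [1_000, 10_000, 60_000, 600_000];

// failedAttempts = number of failures so far, including the one just observed.
function nextStep(failedAttempts: number): NextStep {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError('failedAttempts must be a positive integer');
  }
  const delayMs = RETRY_DELAYS_MS[failedAttempts - 1];
  // Past the end of the table: retries are exhausted.
  return delayMs === undefined ? { kind: 'dlq' } : { kind: 'retry', delayMs };
}

for (let n = 1; n <= 5; n++) {
  console.log(n, nextStep(n));
}
```

A table is easier to tune per queue than a formula, at the cost of not extending automatically when the attempt limit changes.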
### BullMQ (Node)
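```ts
import { Queue, Worker } from 'bullmq';

// attempts/backoff are job options: set them on the queue (or per job), not on the Worker
const queue = new Queue('main', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
});

const worker = new Worker('main', async (job) => {
  await process(job.data);
}, { connection });

worker.on('failed', (job, err) => {
  // Fires on every failed attempt; act only once retries are exhausted
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    log.error('retries exhausted', { jobId: job.id, error: err });
    // BullMQ only marks the job as 'failed'; moving it to a DLQ queue is a manual step.
  }
});
```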
### Failure analysis
```
Each DLQ message should carry:
- Original payload
- Error message
- Stack trace
- Retry attempts
- First / last failure time
```
```ts
interface DLQMessage {
  originalPayload: unknown; // opaque here; validate before use
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}
```
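A sketch of how such a record could be built and updated across attempts, so that `firstFailureAt` survives retries while `lastFailureAt` and `attemptCount` advance. `recordFailure` is a hypothetical helper, and the interface is repeated only to keep the block self-contained.

```ts
interface DLQMessage {
  originalPayload: unknown;
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}

// Hypothetical helper: pass the previous record (if any) to accumulate history.
function recordFailure(
  payload: unknown,
  queue: string,
  error: Error,
  previous?: DLQMessage,
  now: Date = new Date(),
): DLQMessage {
  return {
    originalPayload: payload,
    originalQueue: queue,
    failureReason: error.message,
    stackTrace: error.stack ?? '',
    attemptCount: (previous?.attemptCount ?? 0) + 1,
    firstFailureAt: previous?.firstFailureAt ?? now, // fixed at the first failure
    lastFailureAt: now,
  };
}

const t0 = new Date('2026-01-01T00:00:00Z');
const t1 = new Date('2026-01-01T00:00:10Z');
const firstFailure = recordFailure({ id: 1 }, 'main-queue', new Error('timeout'), undefined, t0);
const secondFailure = recordFailure({ id: 1 }, 'main-queue', new Error('timeout again'), firstFailure, t1);
console.log(secondFailure.attemptCount, secondFailure.firstFailureAt, secondFailure.lastFailureAt);
```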
### Alert
```
Messages in the DLQ should trigger an alert (e.g., PagerDuty).
- 1 message: ignore (transient).
- 10+ messages per hour: alert.
- 100+ messages: P0.
→ Thresholds differ from system to system.
```
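The thresholds above could be encoded as a small classifier. This is a sketch using the example numbers from this note; the `watch` level for a handful of messages and the function name `dlqSeverity` are assumptions, and real thresholds should be tuned per system.

```ts
type Severity = 'none' | 'watch' | 'alert' | 'p0';

// depth: messages currently in the DLQ; arrivalsLastHour: messages added in the last hour.
function dlqSeverity(depth: number, arrivalsLastHour: number): Severity {
  if (depth >= 100) return 'p0';
  if (arrivalsLastHour >= 10) return 'alert';
  if (depth > 0) return 'watch'; // likely transient; no page
  return 'none';
}

console.log(dlqSeverity(1, 1), dlqSeverity(12, 12), dlqSeverity(150, 3));
```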
### Manual replay
```ts
// Inspect one message → fix → replay
async function replayFromDLQ() {
  const msg = await dlq.receive();
  if (!msg) return; // DLQ is empty
  // Inspect
  console.log(msg);
  // Fix the root cause (deploy) before going further.
  // Replay
  await mainQueue.send(msg.originalPayload);
  await dlq.delete(msg);
}
```
### Replay tool
```ts
// Move every DLQ message back to the main queue
async function replayAll() {
  while (true) {
    const msgs = await dlq.receiveBatch(10);
    if (msgs.length === 0) break;
    for (const msg of msgs) {
      await mainQueue.send(msg.originalPayload);
      await dlq.delete(msg); // delete only after the send succeeded
    }
  }
}
```
→ Use with care: confirm the bug is actually fixed before replaying everything.
### Selective replay
```ts
// Replay only a specific failure type
const messages = await dlq.peek(100);
const matching = messages.filter(m => m.failureReason.includes('TimeoutError'));
for (const m of matching) {
  await mainQueue.send(m.originalPayload);
  await dlq.delete(m); // peek does not remove; delete so the message is not replayed twice
}
```
### Idempotent processing
```ts
async function process(msg) {
  if (await db.processed.exists(msg.id)) return; // already handled: skip
  // Process
  await db.processed.insert({ id: msg.id });
}
```
→ Makes replay safe.
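A check followed by a later insert leaves a window in which two workers can both pass the check. One common variant claims the message ID first and releases it if processing fails. The sketch below uses an in-memory `Set` as a stand-in; in a real system the claim would be an atomic operation such as an insert on a unique key (an assumption, not something this note specifies). `ProcessedStore` and `processOnce` are hypothetical names.

```ts
// In-memory stand-in for a table with a unique constraint on the message ID.
class ProcessedStore {
  private ids = new Set<string>();

  // Returns false if the ID was already claimed.
  tryClaim(id: string): boolean {
    if (this.ids.has(id)) return false;
    this.ids.add(id);
    return true;
  }

  release(id: string): void {
    this.ids.delete(id);
  }
}

// Runs effect at most once per message ID; returns whether it ran.
function processOnce<T extends { id: string }>(
  store: ProcessedStore,
  msg: T,
  effect: (msg: T) => void,
): boolean {
  if (!store.tryClaim(msg.id)) return false; // duplicate or replay: skip
  try {
    effect(msg);
    return true;
  } catch (e) {
    store.release(msg.id); // allow a later retry or replay
    throw e;
  }
}

const store = new ProcessedStore();
let charges = 0;
const order = { id: 'order-1' };
const ranFirst = processOnce(store, order, () => { charges++; });
const ranReplay = processOnce(store, order, () => { charges++; });
console.log(ranFirst, ranReplay, charges);
```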
### Failure category
```
Transient:
- Network timeout
- 503 from external
- Rate limit
- Lock timeout
→ Retryable (with exponential backoff).
Permanent:
- Bad data (validation fail)
- Auth fail
- Resource not found
→ Send straight to the DLQ (retrying is pointless).
```
### Smart routing
```ts
async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    if (isTransient(e)) {
      await retry(msg);
    } else {
      await sendToDLQ(msg, e); // permanent: skip retry
    }
  }
}
```
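`isTransient` is left undefined above. A minimal sketch, assuming errors expose either an HTTP-style `status` number or a Node-style `code` string; the exact status and code lists are illustrative, not exhaustive:

```ts
// HTTP statuses that usually indicate a temporary condition.
const TRANSIENT_STATUS = new Set([408, 429, 502, 503, 504]);
// Node.js network error codes that usually indicate a temporary condition.
const TRANSIENT_CODES = new Set(['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED', 'EAI_AGAIN']);

function isTransient(e: unknown): boolean {
  if (typeof e !== 'object' || e === null) return false;
  const { status, code } = e as { status?: unknown; code?: unknown };
  if (typeof status === 'number' && TRANSIENT_STATUS.has(status)) return true;
  if (typeof code === 'string' && TRANSIENT_CODES.has(code)) return true;
  return false; // unknown errors are treated as permanent
}

const timeoutError = Object.assign(new Error('socket timed out'), { code: 'ETIMEDOUT' });
console.log(isTransient(timeoutError), isTransient({ status: 404 }));
```

Treating unknown errors as permanent sends them to the DLQ quickly; the opposite default (retry when unsure) is equally defensible when false DLQ entries are costly.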
### Schema validation first
```ts
const result = schema.safeParse(msg.payload);
if (!result.success) {
  // Bad data: send straight to the DLQ
  await sendToDLQ(msg, result.error);
  return;
}
await process(result.data);
```
→ Bad data is never retried on the main queue.
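For readers without a schema library at hand, here is a dependency-free stand-in that returns the same `success` / `data` / `error` shape as the `safeParse` call above. The `OrderPayload` fields and `parseOrder` are hypothetical, chosen only for illustration.

```ts
// Hypothetical payload shape for the example.
interface OrderPayload {
  orderId: string;
  amount: number;
}

type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

function parseOrder(input: unknown): ParseResult<OrderPayload> {
  if (typeof input !== 'object' || input === null) {
    return { success: false, error: 'payload must be an object' };
  }
  const { orderId, amount } = input as Record<string, unknown>;
  if (typeof orderId !== 'string' || orderId.length === 0) {
    return { success: false, error: 'orderId must be a non-empty string' };
  }
  if (typeof amount !== 'number' || !Number.isFinite(amount) || amount < 0) {
    return { success: false, error: 'amount must be a non-negative number' };
  }
  return { success: true, data: { orderId, amount } };
}

const good = parseOrder({ orderId: 'o-1', amount: 25 });
const bad = parseOrder({ orderId: 'o-2', amount: 'twenty' });
console.log(good, bad);
```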
### Versioning + DLQ
```
Old message format (v1) + new code (v2) = parse fail.
DLQ:
- Old messages accumulate there temporarily.
- A migration tool transforms v1 → v2.
- Replay.
```
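The migration step might look like the sketch below. Both message formats are invented for illustration (this note does not define them); the point is that the upgrade is a pure function that leaves v2 messages untouched, so it can run over the whole DLQ before replay.

```ts
// Hypothetical formats: v1 is flat, v2 is nested and carries a version tag.
interface MessageV1 {
  userId: number;
  name: string;
}
interface MessageV2 {
  version: 2;
  user: { id: string; displayName: string };
}

function isV2(m: unknown): m is MessageV2 {
  return typeof m === 'object' && m !== null && (m as { version?: unknown }).version === 2;
}

// Idempotent upgrade: v2 passes through, v1 is transformed.
function upgrade(m: MessageV1 | MessageV2): MessageV2 {
  if (isV2(m)) return m;
  return { version: 2, user: { id: String(m.userId), displayName: m.name } };
}

const upgraded = upgrade({ userId: 42, name: 'Ada' });
const unchanged = upgrade(upgraded);
console.log(upgraded, unchanged === upgraded);
```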
### Monitoring
```
DLQ depth: query every 5 min.
Alert if:
- depth > 0 for 30 min (some failure)
- depth > 100 (large failure)
- depth is growing fast (incident)
```
```promql
# Prometheus (the metric name depends on the SQS exporter in use)
sqs_queue_messages_visible{queue="main-dlq"} > 100
```
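The three conditions listed above can be evaluated from periodic depth samples. A sketch under assumptions: `dlqAlerts` and the alert labels are hypothetical, and the growth threshold (5 messages per minute by default) is an arbitrary example value.

```ts
interface DepthSample {
  at: number; // epoch milliseconds
  depth: number;
}

function dlqAlerts(samples: DepthSample[], maxGrowthPerMin = 5): string[] {
  if (samples.length === 0) return [];
  const sorted = [...samples].sort((a, b) => a.at - b.at);
  const first = sorted[0];
  const last = sorted[sorted.length - 1];
  const alerts: string[] = [];

  // Walk back from the latest sample to find where the current non-empty streak began.
  let streakStart = last.at;
  for (let i = sorted.length - 1; i >= 0 && sorted[i].depth > 0; i--) {
    streakStart = sorted[i].at;
  }
  if (last.depth > 0 && last.at - streakStart >= 30 * 60_000) alerts.push('non-empty-30m');

  if (last.depth > 100) alerts.push('depth-over-100');

  // Average growth over the sampled window.
  const minutes = (last.at - first.at) / 60_000;
  if (minutes > 0 && (last.depth - first.depth) / minutes > maxGrowthPerMin) {
    alerts.push('fast-growth');
  }
  return alerts;
}

const MIN = 60_000;
const lingering = [0, 5, 10, 15, 20, 25, 30].map((m) => ({ at: m * MIN, depth: 1 }));
const spike = [{ at: 0, depth: 0 }, { at: 5 * MIN, depth: 200 }];
console.log(dlqAlerts(lingering), dlqAlerts(spike));
```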
### Retention
```
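DLQ messages are deleted after 14 days (the SQS maximum retention).
- After that they are lost.
- If every message is critical, you need longer retention (archive elsewhere, since SQS caps at 14 days).
→ Process or replay within 14 days.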
```
### When the DLQ itself fails
```
Sending to the DLQ fails (e.g., SQS is down).
- The main queue retries forever.
- Workers get stuck.
→ Use a DLQ-of-DLQ, or simply log + alert.
```
Treat `sqs:DLQ-fail` as a critical alert.
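The "log + alert" fallback could be sketched as follows. `sendToDLQSafely` is a hypothetical helper; the `send` and `fallbackLog` callbacks stand in for the real DLQ client and logger, and the retry count of 3 is an arbitrary example.

```ts
// Tries the DLQ a few times, then falls back to a structured log line
// that an alert rule can match on the 'sqs:DLQ-fail' event name.
async function sendToDLQSafely(
  send: () => Promise<void>,
  fallbackLog: (line: string) => void,
  msg: unknown,
  maxTries = 3,
): Promise<'dlq' | 'logged'> {
  for (let i = 0; i < maxTries; i++) {
    try {
      await send();
      return 'dlq';
    } catch {
      // DLQ unavailable: try again
    }
  }
  // Last resort: keep the payload in the log so it can be recovered by hand.
  fallbackLog(JSON.stringify({ event: 'sqs:DLQ-fail', msg }));
  return 'logged';
}

// Usage with a DLQ that is always down:
const logLines: string[] = [];
let sendCalls = 0;
const outcome = sendToDLQSafely(
  async () => { sendCalls++; throw new Error('SQS down'); },
  (line) => { logLines.push(line); },
  { id: 'm1' },
);
outcome.then((result) => console.log(result, sendCalls, logLines));
```

Whether the worker then acknowledges the original message or leaves it on the main queue is a design choice: acknowledging unblocks the worker but makes the log the only copy.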
### Error context
```ts
async function sendToDLQ(msg, error) {
  await dlq.send({
    ...msg,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
      timestamp: new Date(),
    },
    consumerVersion: process.env.GIT_SHA, // which build failed on this message
  });
}
```
→ Debug-friendly.
### Per-tenant DLQ
```ts
// Multi-tenant
const queueName = `dlq-${tenantId}`;
// Each tenant gets its own DLQ.
// One tenant's failures do not affect other tenants.
```
→ Prevents noisy neighbors.
### DLQ for LLM calls
```
When an LLM API call fails:
- Rate limit → retry.
- Invalid prompt → DLQ + manual.
- Model is down → retry.
→ Smart routing.
```
### Replay during deploy
```
Replaying the DLQ after deploying a new version:
- The new code may have fixed the bug.
- The new version can then process the DLQ messages successfully.
→ Manual replay after a deploy is a common workflow.
```
### Cost
```
DLQ messages also cost money.
SQS: about $0.40 per 1M requests.
Kafka: storage cost.
→ Small, but large systems can accumulate GBs.
```
### Pitfalls
```
- No alert on the DLQ: silent failure.
- Infinite retry: poison pill.
- No replay plan: the DLQ is just a graveyard.
- No idempotency: replay causes duplicate effects.
- Retrying bad payloads: send them straight to the DLQ instead.
- No per-message error context: hard to debug.
- DLQ shares the main queue's access permissions: don't; use a separate role.
```
## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| AWS | SQS DLQ |
| RabbitMQ | DLX |
| Kafka | Manual DLQ topic |
| Node | BullMQ |
| Every message is critical | Long retention + alert |
| Multi-tenant | Per-tenant DLQ |
| LLM API | Smart routing |
| Idempotent processing | Guarantee it every time |
## ❌ Anti-patterns
- **No DLQ**: a poison pill blocks the main queue.
- **No alert**: silent failures pile up.
- **Infinite retry**: the queue gets blocked.
- **No replay plan**: the DLQ becomes a graveyard.
- **No idempotency**: replay causes duplicates.
- **No error context**: debugging is impossible.
- **No retention**: data is lost.
## 🤖 Hints for LLM use
- DLQ = unprocessable messages + alert + replay.
- Smart routing (transient vs permanent).
- Idempotent processing is mandatory.
- Per-tenant DLQs guard against noisy neighbors.
## 🔗 Related documents
- [[Messaging_DLQ_Patterns]]
- [[Backend_Idempotent_Consumer]]
- [[Backend_Idempotency_Deep]]