| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| backend-dlq-deep | Dead Letter Queue Deep - handling failed messages | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Dead Letter Queue (Deep)
A message that fails processing and is retried forever is a poison pill. A DLQ is where such unprocessable messages are parked, combined with alerting and manual intervention. Covered here: SQS, Kafka, RabbitMQ.
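📖 Core Concepts
- A message that still fails after its retries goes to the DLQ.
- Alert, analyze, then replay.
- Keeps poison pills from stalling the main queue.
- Leaves a trail for tracking down bugs.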
💻 Code Patterns
SQS (automatic DLQ)
# Terraform
resource "aws_sqs_queue" "main" {
  name = "main-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}
resource "aws_sqs_queue" "dlq" {
  name                      = "main-queue-dlq"
  message_retention_seconds = 1209600 # 14 days
}
→ Once a message has been received 5 times without being deleted (i.e. processing failed), SQS moves it to the DLQ automatically.
RabbitMQ
await channel.assertQueue('main-queue', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'main.dead',
    'x-message-ttl': 60000, // messages not consumed within 60s expire and are dead-lettered
  },
});
await channel.assertExchange('dlx', 'direct');
await channel.assertQueue('main-dlq');
await channel.bindQueue('main-dlq', 'dlx', 'main.dead');
→ Rejecting a message with requeue=false also moves it to the DLQ.
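The rejection path can be sketched as a consumer callback. This assumes the amqplib client, whose channel exposes `ack(message)` and `nack(message, allUpTo, requeue)`; `makeConsumer` and `handler` are hypothetical names used only for illustration, and the handler is kept synchronous to keep the sketch short.

```javascript
// Sketch: wrap a handler so that a failure dead-letters the message
// instead of requeueing it. `makeConsumer` is an illustrative helper,
// not part of amqplib.
function makeConsumer(channel, handler) {
  return function onMessage(msg) {
    if (msg === null) return; // amqplib passes null when the consumer is cancelled
    try {
      handler(msg);
      channel.ack(msg);
    } catch (err) {
      // allUpTo=false: only this message; requeue=false: route it to the DLX
      channel.nack(msg, false, false);
    }
  };
}
```

It would be registered with something like `channel.consume('main-queue', makeConsumer(channel, handleOrder))`, where `handleOrder` is a placeholder. An async handler would need an async callback that awaits it before acking.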
Kafka (manual)
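async function consume(message) {
  const attempts = message.attempts ?? 0;
  try {
    await process(message);
    await commit();
  } catch (e) {
    if (attempts >= 5) {
      await sendToDLQ(message, e);
      await commit(); // skip past the poison message
      return;
    }
    // Republish to the retry topic with an incremented attempt count
    await sendToRetry(message, attempts + 1);
    await commit();
  }
}
→ Kafka brokers provide no built-in DLQ; the consumer has to implement it manually.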
Retry queue + DLQ
Main queue → fail → Retry queue (delay 1s) → fail → Retry (delay 10s) → fail → DLQ.
Exponential backoff:
- 1st retry: 1s
- 2nd: 10s
- 3rd: 60s
- 4th: 600s
- 5th: DLQ
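// Delay before each retry, matching the schedule listed above.
const RETRY_DELAYS_MS = [1000, 10000, 60000, 600000];
async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    const attempts = msg.attempts ?? 0; // retries already made
    // All retries used up: this failure goes to the DLQ.
    if (attempts >= RETRY_DELAYS_MS.length) return sendToDLQ(msg, e);
    await sendToRetryWithDelay(msg, attempts + 1, RETRY_DELAYS_MS[attempts]);
  }
}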
BullMQ (Node)
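import { Queue, Worker } from 'bullmq';
// attempts and backoff are job options, so they are set on the queue
// (or per job in queue.add), not on the Worker.
const queue = new Queue('main', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
});
const worker = new Worker('main', async (job) => {
  await process(job.data);
}, { connection });
worker.on('failed', (job, err) => {
  // Act only once the configured attempts are exhausted.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    log.error('retries exhausted', { jobId: job.id, error: err });
    // BullMQ leaves the job in its 'failed' state; moving it to a separate DLQ queue is a manual step.
  }
});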
Failure analysis
Each DLQ message should carry:
- Original payload
- Error message
- Stack trace
- Retry attempts
- First / last failure time
interface DLQMessage {
  originalPayload: any;
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}
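As a rough illustration of how these fields might be filled in, the sketch below builds an envelope with the same field names. `buildDlqMessage` and its argument object are hypothetical; the only logic it shows is that `firstFailureAt` is carried over from an earlier envelope while `lastFailureAt` is always refreshed.

```javascript
// Sketch: build a DLQ envelope with the fields of DLQMessage.
// `previous` is the envelope from an earlier failure of the same message, if any.
function buildDlqMessage({ payload, queue, error, attemptCount, previous, now = new Date() }) {
  return {
    originalPayload: payload,
    originalQueue: queue,
    failureReason: `${error.name}: ${error.message}`,
    stackTrace: error.stack ?? '',
    attemptCount,
    // Keep the first failure time across repeated trips through the DLQ.
    firstFailureAt: previous ? previous.firstFailureAt : now,
    lastFailureAt: now,
  };
}
```

Keeping both timestamps makes it possible to tell a message that failed once from one that has bounced between replay and the DLQ several times.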
Alert
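Messages in the DLQ should raise an alert (e.g. PagerDuty), graded by volume:
- 1 message: usually transient; record it but do not page.
- 10+ messages per hour: alert.
- 100+ messages: P0.
→ The right thresholds differ from system to system.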
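The grading above could be expressed as a small function. This is a minimal sketch that assumes both thresholds are measured over the trailing hour; `alertSeverity` and the threshold names are hypothetical, and the defaults simply mirror the example numbers.

```javascript
// Sketch: map DLQ arrivals in the last hour to a severity level.
// Thresholds are configurable because they vary per system.
const DEFAULT_THRESHOLDS = { alertPerHour: 10, p0PerHour: 100 };

function alertSeverity(messagesLastHour, thresholds = DEFAULT_THRESHOLDS) {
  if (messagesLastHour >= thresholds.p0PerHour) return 'P0';
  if (messagesLastHour >= thresholds.alertPerHour) return 'alert';
  return 'none'; // isolated failures are treated as transient
}
```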
Manual replay
// Review one message → fix → replay
async function replayFromDLQ() {
  const msg = await dlq.receive();
  // Inspect
  console.log(msg);
  // Fix the root cause and deploy before continuing.
  // Replay
  await mainQueue.send(msg.originalPayload);
  await dlq.delete(msg);
}
Replay tool
// Move every DLQ message back to the main queue
async function replayAll() {
  while (true) {
    const msgs = await dlq.receiveBatch(10);
    if (msgs.length === 0) break;
    for (const msg of msgs) {
      await mainQueue.send(msg.originalPayload);
      await dlq.delete(msg);
    }
  }
}
→ Use with care: confirm the bug is actually fixed before replaying everything.
Selective replay
// Replay only one specific failure type
const messages = await dlq.peek(100);
const matching = messages.filter(m => m.failureReason.includes('TimeoutError'));
for (const m of matching) {
  await mainQueue.send(m.originalPayload);
  await dlq.delete(m); // remove it so a later replay does not resend it
}
Idempotent processing
async function process(msg) {
  // Already handled (replayed or redelivered): skip
  if (await db.processed.exists(msg.id)) return;
  // ... do the actual work here ...
  await db.processed.insert({ id: msg.id });
}
→ Replay becomes safe.
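One caveat with a separate check and insert is that two consumers can both pass the check before either inserts. A common variant claims the ID first and releases it on failure. The sketch below uses an in-memory `Set` as a stand-in for a table with a unique constraint on the message ID (where the claim would be a single atomic insert); `makeIdempotentHandler` is a hypothetical name.

```javascript
// Sketch: claim the message ID before doing the work, so a replayed or
// redelivered message is skipped. The Set stands in for a database table
// with a unique constraint on the ID.
function makeIdempotentHandler(processedIds, handler) {
  return function handle(msg) {
    if (processedIds.has(msg.id)) return 'skipped';
    processedIds.add(msg.id); // claim
    try {
      handler(msg);
      return 'processed';
    } catch (err) {
      processedIds.delete(msg.id); // release the claim so a retry can run
      throw err;
    }
  };
}
```

With this shape, replaying the same DLQ message twice runs the handler once, while a message whose handler threw can still be retried.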
Failure category
Transient:
- Network timeout
- 503 from external
- Rate limit
- Lock timeout
→ Worth retrying (with exponential backoff).
Permanent:
- Bad data (validation fail)
- Auth fail
- Resource not found
→ Send to the DLQ immediately (retrying is pointless).
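A classifier for these two categories might look like the sketch below. It assumes errors expose a Node-style `code` (such as `ETIMEDOUT`) or an HTTP `status` property, which depends on the client library in use; lock timeouts are left out because their error codes are database-specific. Anything unrecognized is treated as permanent so that it lands in the DLQ where it is visible.

```javascript
// Sketch: decide whether a failure is worth retrying.
// `error.status` is an assumed field; adapt it to the HTTP client in use.
const TRANSIENT_CODES = new Set(['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED']);
const TRANSIENT_STATUSES = new Set([429, 503]); // rate limit, service unavailable

function isTransient(error) {
  if (TRANSIENT_CODES.has(error.code)) return true;
  if (TRANSIENT_STATUSES.has(error.status)) return true;
  return false; // validation, auth, not-found, and unknown errors
}
```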
Smart routing
async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    if (isTransient(e)) {
      await retry(msg);
    } else {
      await sendToDLQ(msg, e); // permanent: skip retry
    }
  }
}
Validate the schema first
const result = schema.safeParse(msg.payload);
if (!result.success) {
  // Bad data: send to the DLQ immediately
  await sendToDLQ(msg, result.error);
  return;
}
await process(result.data);
→ Bad data is never retried on the main queue.
Versioning + DLQ
An old message format (v1) meeting new code (v2) fails to parse.
How the DLQ helps:
- Old messages collect there temporarily.
- A migration tool transforms v1 → v2.
- Replay.
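The transform step could be as small as the sketch below. The payload shapes are invented for illustration: v1 is assumed to use a `user_name` field and no version marker, and v2 is assumed to use `userName` plus `version: 2`. The real formats are not given in this note.

```javascript
// Sketch: upgrade a hypothetical v1 payload to a hypothetical v2 shape.
function migrateV1ToV2(payload) {
  if (payload.version === 2) return payload; // already current
  const { user_name: userName, ...rest } = payload;
  return { ...rest, userName, version: 2 };
}

// Transform the payloads of a batch of DLQ messages before replaying them.
function migrateDlqBatch(dlqMessages) {
  return dlqMessages.map((m) => migrateV1ToV2(m.originalPayload));
}
```

The migrated payloads would then go through the normal replay path, so idempotent processing still applies.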
Monitoring
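DLQ depth: query every 5 minutes.
Alert if:
- depth > 0 for more than 30 minutes (failures are going unhandled)
- depth > 100 (a large failure)
- depth is growing quickly (an incident in progress)
# Prometheus (metric name and labels depend on the exporter in use)
sqs_queue_messages_visible{queue="main-queue-dlq"} > 100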
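For setups without a rule engine, the three conditions can be evaluated over the polled samples directly. This is a sketch under assumptions: `samples` is an array of `{ at, depth }` objects ordered by time with `at` in milliseconds, and the growth limit of 10 messages per minute is an arbitrary placeholder, not a value from this note.

```javascript
// Sketch: evaluate DLQ depth samples against the three alert conditions.
function evaluateDepth(samples, opts = {}) {
  const { stuckAfterMs = 30 * 60 * 1000, maxDepth = 100, maxGrowthPerMin = 10 } = opts;
  const alerts = [];
  if (samples.length === 0) return alerts;
  const latest = samples[samples.length - 1];

  if (latest.depth > maxDepth) alerts.push('depth-high');

  // Walk backwards to find when the current non-empty streak started.
  let nonZeroSince = null;
  for (let i = samples.length - 1; i >= 0 && samples[i].depth > 0; i--) {
    nonZeroSince = samples[i].at;
  }
  if (nonZeroSince !== null && latest.at - nonZeroSince >= stuckAfterMs) {
    alerts.push('non-empty-too-long');
  }

  // Growth rate between the two most recent samples.
  if (samples.length >= 2) {
    const prev = samples[samples.length - 2];
    const minutes = (latest.at - prev.at) / 60000;
    if (minutes > 0 && (latest.depth - prev.depth) / minutes > maxGrowthPerMin) {
      alerts.push('growing-fast');
    }
  }
  return alerts;
}
```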
Retention
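SQS deletes DLQ messages once the retention period ends (14 days here, which is also the SQS maximum).
- After that, the message is lost.
- If every message is critical, 14 days may not be enough: archive DLQ messages to longer-term storage.
→ Process or replay within 14 days.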
When the DLQ itself fails
Sending to the DLQ can fail too (for example, SQS is down).
- The main queue then retries forever.
- The worker gets stuck.
→ Use a DLQ-of-DLQ, or simply log and alert.
→ Treat a failed DLQ send as a critical alert.
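The log-and-alert fallback might be wired up as in the sketch below. All three dependencies (`sendToDlq`, `logError`, `raiseAlert`) are hypothetical injected functions, kept synchronous for brevity; the return value tells the caller whether the message actually reached the DLQ.

```javascript
// Sketch: try the DLQ first; if that fails, log the message and raise a
// critical alert instead of letting the worker loop forever.
function deadLetterWithFallback(msg, error, { sendToDlq, logError, raiseAlert }) {
  try {
    sendToDlq(msg, error);
    return { deadLettered: true };
  } catch (dlqError) {
    // The DLQ itself is unavailable: keep a record and page someone.
    logError({ msg, error: error.message, dlqError: dlqError.message });
    raiseAlert('dlq-send-failed');
    return { deadLettered: false };
  }
}
```

Whether the caller should still acknowledge the message when `deadLettered` is false depends on whether the log is durable enough to recover the payload from; that decision is deliberately left outside this sketch.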
Error context
async function sendToDLQ(msg, error) {
  await dlq.send({
    ...msg,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
      timestamp: new Date(),
    },
    consumerVersion: process.env.GIT_SHA,
  });
}
→ Makes debugging easier.
Per-tenant DLQ
// Multi-tenant: each tenant gets its own DLQ,
// so one tenant's failures do not affect the others.
const queueName = `dlq-${tenantId}`;
→ Prevents noisy neighbors.
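Because the tenant ID ends up inside a queue name, it is worth validating before use. The sketch below assumes SQS naming rules (at most 80 characters; letters, digits, hyphens, and underscores); `tenantDlqName` is a hypothetical helper. It rejects invalid IDs rather than rewriting them, so two different tenants can never map to the same queue.

```javascript
// Sketch: derive a per-tenant DLQ name, assuming SQS naming constraints.
// 'dlq-' takes 4 of the 80 allowed characters, leaving 76 for the tenant ID.
function tenantDlqName(tenantId) {
  const id = String(tenantId);
  if (!/^[A-Za-z0-9_-]{1,76}$/.test(id)) {
    throw new Error(`tenant id is not usable in a queue name: ${id}`);
  }
  return `dlq-${id}`;
}
```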
DLQ for LLM calls
When an LLM API call fails:
- Rate limit → retry.
- Invalid prompt → DLQ + manual review.
- Model down → retry.
→ Apply smart routing.
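If the provider reports failures through HTTP status codes, the routing above reduces to a short function. The mapping is an assumption for illustration (429 for rate limits, 5xx for provider outages, other 4xx for requests that will never succeed as sent); actual codes vary by provider.

```javascript
// Sketch: route a failed LLM API call by HTTP status (assumed mapping).
function routeLlmFailure(status) {
  if (status === 429) return 'retry'; // rate limited
  if (status >= 500 && status <= 599) return 'retry'; // model or provider down
  return 'dlq'; // e.g. invalid prompt: needs manual review
}
```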
Replay during deploy
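Replaying the DLQ after deploying a new version:
- The new code may contain the fix for the bug that caused the failures.
- Messages in the DLQ can then be processed successfully by the new version.
→ Manual replay after a deploy is a common workflow.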
Cost
DLQ messages cost money too.
SQS: billed per request, roughly $0.40 per 1M requests (standard queue).
Kafka: broker storage.
→ Usually small, but a large system can accumulate gigabytes.
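Pitfalls
- No alert on the DLQ: failures stay silent.
- Unbounded retries: a poison pill loops forever.
- No replay plan: the DLQ becomes a graveyard.
- No idempotency: replay causes duplicate side effects.
- Retrying bad payloads: send them to the DLQ immediately instead.
- No per-message error context: debugging is hard.
- DLQ shares the main queue's access permissions: give it a separate role.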
🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| AWS | SQS DLQ |
| RabbitMQ | DLX |
| Kafka | Manual DLQ topic |
| Node | BullMQ |
| Every message is critical | Longer retention + alerts |
| Multi-tenant | Per-tenant DLQ |
| LLM API | Smart routing |
| Idempotent processing | Guarantee it always |
❌ Anti-patterns
- No DLQ: a poison pill blocks the main queue.
- No alert: silent failures pile up.
- Infinite retry: the queue clogs.
- No replay plan: the DLQ becomes a graveyard.
- No idempotency: replay produces duplicates.
- No error context: debugging is impossible.
- No retention policy: data is lost.
🤖 Hints for LLM use
- DLQ = unprocessable messages + alerts + replay.
- Smart routing (transient vs permanent).
- Idempotent processing is required.
- Per-tenant DLQs prevent noisy neighbors.