2nd/10_Wiki/Topics/Coding/Backend_DLQ_Deep.md
2026-05-10 22:08:15 +09:00


id: backend-dlq-deep
title: Dead Letter Queue Deep - handling failed messages
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: backend, queue, vibe-coding
tech_stack: language TS, applicable_to Backend
applied_in:
aliases: DLQ, dead letter queue, retry, poison pill, message replay, error handling

Dead Letter Queue (Deep)

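A message that fails processing and is retried forever is a poison pill. A dead letter queue (DLQ) is where messages that could not be processed are parked, combined with alerting and manual intervention. Covered here: SQS, Kafka, and RabbitMQ.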

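📖 Core concepts

  • A message that still fails after its retries goes to the DLQ.
  • Alert, analyze, then replay.
  • Stops poison pills from blocking the queue.
  • Leaves a trail for tracking down bugs.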

💻 Code patterns

SQS (automatic DLQ)

# Terraform
resource "aws_sqs_queue" "main" {
  name = "main-queue"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq.arn
    maxReceiveCount     = 5
  })
}

resource "aws_sqs_queue" "dlq" {
  name = "main-queue-dlq"
  message_retention_seconds = 1209600  # 14 days
}

→ After 5 failed receives, SQS moves the message to the DLQ automatically.

RabbitMQ

await channel.assertQueue('main-queue', {
  arguments: {
    'x-dead-letter-exchange': 'dlx',
    'x-dead-letter-routing-key': 'main.dead',
    'x-message-ttl': 60000,         // messages not consumed within 60s expire and are dead-lettered
  },
});

await channel.assertExchange('dlx', 'direct');
await channel.assertQueue('main-dlq');
await channel.bindQueue('main-dlq', 'dlx', 'main.dead');

→ Rejecting or nacking a message with requeue=false also moves it to the DLQ.
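The reject path can be sketched as follows. This is a minimal sketch, not amqplib itself: `ConsumerChannel` is a hypothetical interface that only mirrors the shape of amqplib's `ack(message)` and `nack(message, allUpTo, requeue)` so the example runs without a broker, and `handleDelivery` and `work` are illustrative names.

```typescript
// Hypothetical minimal channel shape, modelled on amqplib's ack/nack signatures.
interface Delivery { content: string }
interface ConsumerChannel {
  ack(msg: Delivery): void;
  nack(msg: Delivery, allUpTo?: boolean, requeue?: boolean): void;
}

// Ack on success. On failure, nack with requeue=false so the broker
// dead-letters the message through the queue's x-dead-letter-exchange.
async function handleDelivery(
  channel: ConsumerChannel,
  msg: Delivery,
  work: (body: string) => Promise<void>,
): Promise<"acked" | "dead-lettered"> {
  try {
    await work(msg.content);
    channel.ack(msg);
    return "acked";
  } catch {
    channel.nack(msg, false, false); // allUpTo=false, requeue=false
    return "dead-lettered";
  }
}
```

With requeue=true the same message would go straight back onto the main queue, which is how a poison pill ends up looping.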

Kafka (manual)

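async function consume(message) {
  try {
    await process(message);
    await commit();
  } catch (e) {
    const attempts = message.attempts ?? 0;

    if (attempts >= 5) {
      // Retries exhausted: park the message and commit so the partition moves on.
      await sendToDLQ(message, e);
      await commit();
      return;
    }

    // Retry: re-publish with an incremented attempt count, then commit the original.
    await sendToRetry(message, attempts + 1);
    await commit();
  }
}

→ Kafka has no built-in DLQ for plain consumers; you build it yourself, typically as a dedicated dead-letter topic.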
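Kafka records carry no attempt counter of their own, so the `attempts` value used in the consumer has to be stored somewhere. One common option is a record header. The sketch below assumes hypothetical header names (`x-attempts`, `x-original-topic`, `x-last-error`) and topic suffixes (`.retry`, `.dlq`), and uses plain string headers rather than any particular client library's types.

```typescript
// Hypothetical record shape: headers are plain strings here for simplicity.
interface KafkaRecord {
  topic: string;
  value: string;
  headers: { [key: string]: string | undefined };
}

const MAX_ATTEMPTS = 5;

// Read the attempt counter from the assumed "x-attempts" header (0 if absent or invalid).
function readAttempts(record: KafkaRecord): number {
  const n = Number.parseInt(record.headers["x-attempts"] ?? "0", 10);
  return Number.isNaN(n) ? 0 : n;
}

// Decide where a failed record goes next and build the record to produce.
function routeFailure(record: KafkaRecord, error: Error): KafkaRecord {
  const attempts = readAttempts(record);
  const exhausted = attempts >= MAX_ATTEMPTS;
  const baseTopic = record.headers["x-original-topic"] ?? record.topic;
  return {
    topic: `${baseTopic}${exhausted ? ".dlq" : ".retry"}`,
    value: record.value,
    headers: {
      ...record.headers,
      "x-original-topic": baseTopic,
      "x-attempts": String(exhausted ? attempts : attempts + 1),
      "x-last-error": error.message,
    },
  };
}
```

Keeping the original topic in a header lets a later replay tool send the record back to where it came from.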

Retry queue + DLQ

Main queue → fail → Retry queue (delay 1s) → fail → Retry (delay 10s) → fail → DLQ.

Backoff schedule (the delay grows with each attempt):
- 1st retry: 1s
- 2nd: 10s
- 3rd: 60s
- 4th: 600s
- 5th failure: DLQ

// Delay before each retry, matching the schedule above.
const RETRY_DELAYS_MS = [1_000, 10_000, 60_000, 600_000];

async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    const attempts = msg.attempts ?? 0;   // retries already made
    if (attempts >= RETRY_DELAYS_MS.length) return sendToDLQ(msg, e);

    await sendToRetryWithDelay(msg, attempts + 1, RETRY_DELAYS_MS[attempts]);
  }
}

BullMQ (Node)

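import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };  // assumption: local Redis

// attempts and backoff are job options, so set them on the queue
// (or per job in queue.add), not on the Worker.
const queue = new Queue('main', {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
  },
});

const worker = new Worker('main', async (job) => {
  await process(job.data);
}, { connection });

worker.on('failed', (job, err) => {
  // Act only once the configured attempts are used up.
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    console.error('retries exhausted', { jobId: job.id, error: err.message });
    // BullMQ leaves the job in its 'failed' state; it has no separate DLQ,
    // so copy the job to a dead-letter queue here if you need one.
  }
});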

Failure analysis

Each DLQ message should carry:
- Original payload
- Error message
- Stack trace
- Retry attempts
- First / last failure time

interface DLQMessage {
  originalPayload: unknown;   // validate against a schema before use
  originalQueue: string;
  failureReason: string;
  stackTrace: string;
  attemptCount: number;
  firstFailureAt: Date;
  lastFailureAt: Date;
}

Alert

Messages in the DLQ should raise an alert (for example through PagerDuty).
- 1 message: ignore (likely transient).
- 10+ messages per hour: alert.
- 100+ messages: P0.

→ The right thresholds differ from system to system.
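The thresholds above can be expressed as a small classifier. This is a sketch only: `classifyDlqAlert`, the `Severity` labels, and the two constants are illustrative names, and the numbers are the example values from the list, not recommendations.

```typescript
type Severity = "none" | "alert" | "p0";

// Assumed thresholds taken from the list above; tune them per system.
const ALERT_PER_HOUR = 10;
const P0_TOTAL = 100;

function classifyDlqAlert(messagesLastHour: number, totalDepth: number): Severity {
  if (totalDepth >= P0_TOTAL) return "p0";
  if (messagesLastHour >= ALERT_PER_HOUR) return "alert";
  return "none"; // a stray message is treated as transient noise
}
```

For example, `classifyDlqAlert(12, 12)` yields `"alert"`, while a backlog of 150 messages yields `"p0"` regardless of the hourly rate.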

Manual replay

// Review one message → fix the cause → replay
async function replayFromDLQ() {
  const msg = await dlq.receive();
  
  // Inspect
  console.log(msg);
  
  // Fix root cause (deploy).
  
  // Replay
  await mainQueue.send(msg.originalPayload);
  await dlq.delete(msg);
}

Replay tool

// Move every DLQ message back to the main queue
async function replayAll() {
  while (true) {
    const msgs = await dlq.receiveBatch(10);
    if (msgs.length === 0) break;
    
    for (const msg of msgs) {
      await mainQueue.send(msg.originalPayload);
      await dlq.delete(msg);
    }
  }
}

→ Use with care: confirm the bug is actually fixed before replaying everything.
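One way to make a bulk replay less risky is to cap how many messages it moves and to offer a dry run. The sketch below assumes hypothetical `DlqClient` and `MainQueue` interfaces (stand-ins for whatever queue wrapper is in use) and an illustrative `replayGuarded` helper.

```typescript
// Hypothetical queue clients; swap in your own SQS/RabbitMQ/Kafka wrappers.
interface DlqEntry { id: string; originalPayload: unknown }
interface DlqClient {
  receiveBatch(max: number): Promise<DlqEntry[]>;
  delete(entry: DlqEntry): Promise<void>;
}
interface MainQueue { send(payload: unknown): Promise<void> }

interface ReplayOptions { limit: number; dryRun: boolean }

// Replay at most `limit` messages. In dry-run mode nothing is sent or deleted,
// and only a single batch is inspected.
async function replayGuarded(dlq: DlqClient, main: MainQueue, opts: ReplayOptions): Promise<number> {
  let replayed = 0;
  while (replayed < opts.limit) {
    const batch = await dlq.receiveBatch(Math.min(10, opts.limit - replayed));
    if (batch.length === 0) break;
    for (const entry of batch) {
      if (!opts.dryRun) {
        await main.send(entry.originalPayload); // send first, delete only after success
        await dlq.delete(entry);
      }
      replayed++;
    }
    if (opts.dryRun) break; // avoid re-reading the same undeleted messages
  }
  return replayed;
}
```

A typical flow would be a dry run, then a small limit such as 10 to confirm the fix holds, then the rest.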

Selective replay

// Replay only messages with a specific failure type
const messages = await dlq.peek(100);
const matching = messages.filter(m => m.failureReason.includes('TimeoutError'));

for (const m of matching) {
  await mainQueue.send(m.originalPayload);
  await dlq.delete(m);   // remove it so the same message is not replayed twice
}

Idempotent processing

async function process(msg) {
  if (await db.processed.exists(msg.id)) return;
  
  // Process
  
  await db.processed.insert({ id: msg.id });
}

→ Replaying becomes safe.
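A slightly more defensive variant claims the message id before running the side effect and releases it if the effect throws, so a failed attempt can still be retried. This is a sketch under assumptions: `ProcessedStore` is an in-memory stand-in for the `db.processed` table, and `processOnce` is an illustrative name. Across several workers the claim would need an atomic operation such as an insert guarded by a unique constraint.

```typescript
// In-memory stand-in for a "processed ids" table.
class ProcessedStore {
  private readonly seen = new Set<string>();

  // Returns true only for the first caller that claims this id.
  claim(id: string): boolean {
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    return true;
  }

  release(id: string): void {
    this.seen.delete(id);
  }
}

interface Msg { id: string; payload: unknown }

// Runs `effect` at most once per message id, even if the message is replayed.
async function processOnce(
  store: ProcessedStore,
  msg: Msg,
  effect: (m: Msg) => Promise<void>,
): Promise<boolean> {
  if (!store.claim(msg.id)) return false; // duplicate: skip
  try {
    await effect(msg);
    return true;
  } catch (e) {
    store.release(msg.id); // let a later retry or replay run it again
    throw e;
  }
}
```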

Failure category

Transient:
- Network timeout
- 503 from external
- Rate limit
- Lock timeout

→ Retryable (use exponential backoff).

Permanent:
- Bad data (validation fail)
- Auth fail
- Resource not found

→ Send straight to the DLQ; retrying is pointless.
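The two categories can be turned into a classifier along these lines. This is a sketch, not a complete taxonomy: `FailureInfo` is a hypothetical error shape, the HTTP statuses are the usual timeout, rate-limit, and gateway codes, and `LOCK_TIMEOUT` is an invented application-level code used only for illustration.

```typescript
// Hypothetical error shape: an optional HTTP status and an optional code string.
interface FailureInfo { status?: number; code?: string }

const TRANSIENT_STATUS = new Set([408, 429, 502, 503, 504]);
const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "LOCK_TIMEOUT"]);

function isTransient(f: FailureInfo): boolean {
  if (f.status !== undefined && TRANSIENT_STATUS.has(f.status)) return true;
  if (f.code !== undefined && TRANSIENT_CODES.has(f.code)) return true;
  // Validation errors, auth failures (401/403), not found (404),
  // and anything unrecognised are treated as permanent.
  return false;
}
```

Treating unknown errors as permanent is a design choice: it favors a visible DLQ entry over silent retry loops, at the cost of occasionally parking a message that would have succeeded on retry.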

Smart routing

async function consume(msg) {
  try {
    await process(msg);
  } catch (e) {
    if (isTransient(e)) {
      await retry(msg);
    } else {
      await sendToDLQ(msg, e);  // permanent — skip retry
    }
  }
}

Schema validation first

const result = schema.safeParse(msg.payload);
if (!result.success) {
  // Bad data: send to the DLQ immediately
  await sendToDLQ(msg, result.error);
  return;
}

await process(result.data);

→ Bad data is never retried on the main queue.

Versioning + DLQ

Old message format (v1) + new code (v2) = parse fail.

With a DLQ:
- Old messages pile up there temporarily.
- A migration tool transforms v1 → v2.
- Replay.
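The migration step might look like the sketch below. The `OrderV1` and `OrderV2` shapes are invented for illustration (the source does not define the message formats); the point is that the transform is a pure function that can be run over DLQ payloads before replay and leaves already-migrated payloads alone.

```typescript
// Hypothetical schemas: v1 had a single `name`; v2 splits it and adds `version`.
interface OrderV1 { name: string; amount: number }
interface OrderV2 { version: 2; firstName: string; lastName: string; amount: number }

function isV2(p: OrderV1 | OrderV2): p is OrderV2 {
  return (p as OrderV2).version === 2;
}

// Upgrade a payload to v2; payloads that are already v2 pass through unchanged.
function migrate(p: OrderV1 | OrderV2): OrderV2 {
  if (isV2(p)) return p;
  const [firstName, ...rest] = p.name.trim().split(/\s+/);
  return {
    version: 2,
    firstName: firstName ?? "",
    lastName: rest.join(" "),
    amount: p.amount,
  };
}
```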

Monitoring

DLQ depth: query it every 5 minutes.
Alert if:
- depth > 0 for 30 minutes (some failures are happening)
- depth > 100 (large failure)
- depth is growing quickly (likely incident)

# Prometheus (the metric name depends on your exporter)
sqs_queue_messages_visible{queue="main-queue-dlq"} > 100
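The three alert conditions can be evaluated from a series of depth samples. The sketch below is illustrative: `DepthSample`, `evaluateDepth`, and the reason strings are assumed names, and "growth" is simplified to three consecutive increases.

```typescript
interface DepthSample { atMs: number; depth: number }

// Assumed thresholds mirroring the rules above.
const NONEMPTY_WINDOW_MS = 30 * 60 * 1000;
const LARGE_DEPTH = 100;

// `samples` must be ordered oldest to newest (for example one every 5 minutes).
function evaluateDepth(samples: DepthSample[]): string[] {
  const reasons: string[] = [];
  const latest = samples[samples.length - 1];
  if (!latest) return reasons;

  if (latest.depth > LARGE_DEPTH) reasons.push("large-backlog");

  // Walk back from the newest sample while the queue stayed non-empty.
  let since = latest.atMs;
  for (const s of [...samples].reverse()) {
    if (s.depth <= 0) break;
    since = s.atMs;
  }
  if (latest.depth > 0 && latest.atMs - since >= NONEMPTY_WINDOW_MS) reasons.push("non-empty-30m");

  // Growth: depth strictly increased across the last three samples.
  const [a, b, c] = samples.slice(-3);
  if (a && b && c && a.depth < b.depth && b.depth < c.depth) reasons.push("growing");

  return reasons;
}
```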

Retention

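SQS deletes DLQ messages once the retention period ends (14 days here, which is also the SQS maximum).
- After that they are lost.
- If every message is critical, you need longer retention, for example by archiving DLQ messages elsewhere.

→ Process or replay within 14 days.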

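When the DLQ itself fails

Sending to the DLQ can fail too (for example, SQS is down).
- The main queue retries forever.
- The worker gets stuck.

→ Use a DLQ for the DLQ, or simply log and alert.

A failed DLQ send (sqs:DLQ-fail) should be a critical alert.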
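The "log and alert" fallback could be sketched like this. `Sinks` and `deadLetterOrLog` are hypothetical names; the fallback log is assumed to be something durable (a file or a logging pipeline), and the retries here have no delay between them for brevity.

```typescript
// Hypothetical sinks: a DLQ sender, a durable fallback log, and an alert hook.
interface Sinks {
  sendToDlq(payload: unknown): Promise<void>;
  writeFallbackLog(line: string): void;
  raiseCriticalAlert(name: string): void;
}

// Try the DLQ a few times. If it stays down, log the payload and alert
// instead of throwing, so the worker can move on rather than retry forever.
async function deadLetterOrLog(
  sinks: Sinks,
  payload: unknown,
  maxTries = 3,
): Promise<"dlq" | "fallback"> {
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      await sinks.sendToDlq(payload);
      return "dlq";
    } catch {
      // fall through and try again
    }
  }
  sinks.writeFallbackLog(JSON.stringify({ payload, failedAt: new Date().toISOString() }));
  sinks.raiseCriticalAlert("sqs:DLQ-fail");
  return "fallback";
}
```

The payload written to the fallback log is the only remaining copy of the message, so that log needs its own retention and access controls.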

Error context

async function sendToDLQ(msg, error) {
  await dlq.send({
    ...msg,
    error: {
      message: error.message,
      stack: error.stack,
      code: error.code,
      timestamp: new Date(),
    },
    consumerVersion: process.env.GIT_SHA,
  });
}

→ Debug-friendly.

Per-tenant DLQ

// Multi-tenant
const queueName = `dlq-${tenantId}`;

// Each tenant gets its own DLQ.
// One tenant's failures do not affect other tenants.

→ Prevents noisy-neighbor problems.
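Tenant ids often contain characters a queue name cannot, so the name is usually derived rather than interpolated directly. The helper below is a sketch that assumes SQS-style naming rules (letters, digits, hyphens, underscores, at most 80 characters); `tenantDlqName` is an illustrative name.

```typescript
// Build a per-tenant DLQ name under assumed SQS-style naming rules:
// letters, digits, hyphens, underscores, at most 80 characters.
function tenantDlqName(tenantId: string): string {
  const safe = tenantId.replace(/[^A-Za-z0-9_-]/g, "-");
  if (safe.length === 0) throw new Error("tenantId must not be empty");
  return `dlq-${safe}`.slice(0, 80);
}
```

Replacing characters and truncating can map two different tenant ids to the same name, so a real implementation would want a collision check or a hashed suffix.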

DLQ for LLM calls

When an LLM API call fails:
- Rate limit → retry.
- Invalid prompt → DLQ + manual review.
- Model is down → retry.

→ Smart routing.

Replay during deploy

Replaying the DLQ after deploying a new version:
- The new code may contain the fix for the bug.
- The new version can then process the DLQ messages successfully.

→ Manual replay after a deploy is a common workflow.

Cost

DLQ messages also cost money to keep.
SQS: about $0.40 per 1M requests (standard queue).
Kafka: storage cost.

→ Small, but a large system can accumulate gigabytes.

Pitfalls

- No alert on the DLQ: silent failure.
- Infinite retry: poison pill.
- No replay plan: the DLQ becomes a graveyard.
- No idempotency: replay causes duplicate effects.
- Retrying bad payloads: send them to the DLQ immediately instead.
- No per-message error context: hard to debug.
- Same access for the DLQ and the main queue: avoid this (use a separate role).

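🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| AWS | SQS DLQ |
| RabbitMQ | DLX |
| Kafka | Manual DLQ topic |
| Node | BullMQ |
| Every message is critical | Long retention + alert |
| Multi-tenant | Per-tenant DLQ |
| LLM API | Smart routing |
| Idempotent processing | Guarantee it every time |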

Anti-patterns

  • No DLQ: a poison pill blocks the main queue.
  • No alert: silent failures accumulate.
  • Infinite retry: the queue gets blocked.
  • No replay plan: the DLQ becomes a graveyard.
  • No idempotency: replay causes duplicates.
  • No error context: impossible to debug.
  • No retention: data loss.

🤖 Hints for LLM use

  • DLQ = unprocessable messages + alert + replay.
  • Smart routing (transient vs permanent).
  • Idempotent processing is mandatory.
  • Per-tenant DLQs guard against noisy neighbors.

🔗 Related documents