---
id: wiki-2026-0508-batch-inference
title: Batch Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [batch inference, async inference, dynamic batching, continuous batching, throughput optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [inference, throughput, gpu, optimization, llm-serving, vllm, triton, ray]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: vLLM / Triton / Ray Serve / Modal
---

# Batch Inference

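## 📌 One-Line Insight
> **"Group buying for the GPU."** Instead of answering each request the moment it arrives, batching maximizes throughput per batch. For LLMs, dynamic and continuous batching can yield roughly 5-20× throughput depending on workload, which makes batching the biggest lever in the cost/latency trade-off.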

## 📖 Core

### Inference types
| Type | Latency | Throughput | Cost (per request) | Examples |
|---|---|---|---|---|
| Online (sync) | <100 ms | Low | High | Chat, search |
| Batch (offline) | Minutes to hours | Maximum | Lowest | Daily summaries, fraud scans |
| Async / queue | Seconds to minutes | Medium | Medium | Image generation, transcription |

### Types of batching

#### Static batching (traditional)
- Fixed batch size.
- Waits for the batch to fill, so latency varies with the arrival rate.

#### Dynamic batching (Triton)
- Caps the maximum wait time.
- Groups incoming requests into a batch until the wait cap or the size limit is reached.
- ✅ Balances latency and throughput.

#### Continuous batching (vLLM, TensorRT-LLM)
- Specific to LLMs (autoregressive decoding).
- When one sequence finishes, another sequence fills its slot immediately.
- Minimizes GPU idle time.
- Roughly 5-20× throughput, depending on workload.
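
To make the slot-refill idea concrete, here is a toy simulation that only counts decode steps. It is a sketch under simplifying assumptions (one token per sequence per step, no prefill cost, hypothetical function names), not how vLLM or TensorRT-LLM schedule internally.

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    pending = list(lengths)   # remaining tokens for sequences not yet started
    running = []              # remaining tokens for sequences in the batch
    steps = 0
    while pending or running:
        # Refill free slots before each step.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1
        # Every running sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # output tokens per request
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(static, continuous)  # 200 110
```

In this trace the short sequences stop holding slots hostage to the long ones, so the step count drops from 200 to 110. Real gains depend on the length distribution and on memory limits that the toy ignores.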

#### PagedAttention (vLLM)
- Manages the KV cache through a page table of fixed-size blocks.
- Minimizes memory fragmentation.
- Enables long contexts combined with large batches.
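
The page-table idea can be illustrated with a toy block allocator. This is a minimal sketch, assuming a block size of 16 tokens and a hypothetical `PagedKVCache` class; it tracks only which blocks belong to which sequence and does not store tensors or mirror vLLM's actual classes.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence to a list of physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:   # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller should queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens occupy 1 block
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, which is what keeps fragmentation low.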

### Effect of batch size
- **Throughput**: larger batch → higher GPU utilization.
- **Latency** (per request): queueing wait increases.
- **Memory**: larger batch → higher OOM risk.
- **Sweet spot**: the largest batch that fits both GPU memory and the latency SLA.
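
For LLMs, the memory side of the sweet spot can be estimated from KV cache size. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The model shape and the 6 GiB budget are assumptions chosen for illustration, not measurements.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(free_gpu_bytes, tokens_per_seq, bytes_per_token):
    # How many full-length sequences fit in the KV cache budget.
    return free_gpu_bytes // (tokens_per_seq * bytes_per_token)


# Assumed shape: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed budget: 6 GiB left for KV cache after weights and activations.
budget = 6 * 1024**3
max_seqs = max_concurrent_sequences(budget, tokens_per_seq=4096,
                                    bytes_per_token=per_token)
print(per_token, max_seqs)  # 131072 12
```

Under these assumptions each token costs 128 KiB of cache, so only 12 sequences of 4096 tokens fit. Shorter sequences or paged allocation raise the usable batch size well above this worst case.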

### Where batch inference applies
1. **Embedding generation**: batch over a large corpus, e.g. 100M documents.
2. **Summarization**: daily news.
3. **Fraud detection**: nightly scans of transactions.
4. **Recommendation**: precomputing user-item scores.
5. **Image classification** (archive): medical images.
6. **Translation** (corpus): bulk documents.

### Hybrid (modern LLM serving)
- Mixes online traffic (chat) with batch work (precompute).
- A priority queue serves latency-sensitive requests first.
- Streaming returns output progressively.
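
The priority-queue idea above can be sketched with `heapq`. The `PriorityScheduler` class and the two priority levels are hypothetical names for this example; a production scheduler would also need aging or quotas so that batch work is never starved.

```python
import heapq
import itertools

ONLINE, BATCH = 0, 1  # lower value is served first


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so requests stay FIFO within a class.
        self._counter = itertools.count()

    def submit(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch):
        """Pop up to max_batch requests, latency-sensitive ones first."""
        batch = []
        while self._heap and len(batch) < max_batch:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch


sched = PriorityScheduler()
sched.submit("nightly-1", BATCH)
sched.submit("chat-1", ONLINE)
sched.submit("nightly-2", BATCH)
sched.submit("chat-2", ONLINE)
first = sched.next_batch(max_batch=3)
print(first)  # ['chat-1', 'chat-2', 'nightly-1']
```

Both chat requests jump ahead of the earlier nightly job, and the leftover slot is filled with batch work so the GPU does not run a partially empty batch.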

### Monitoring
- **Throughput**: tokens/s, requests/s.
- **Latency**: p50, p95, p99, TTFT (time to first token).
- **GPU utilization**: target 70-90%.
- **Batch size distribution**.
- **Queue depth**.
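
As a rough illustration of how the latency and throughput numbers above are derived from raw samples, the sketch below uses nearest-rank percentiles. The function names and the sample data are made up for the example; a real deployment would usually read these from its metrics system.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, total_tokens, wall_seconds):
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "tokens_per_s": total_tokens / wall_seconds,
        "requests_per_s": len(latencies_ms) / wall_seconds,
    }


latencies = list(range(1, 101))  # hypothetical sample: 1..100 ms
stats = summarize(latencies, total_tokens=50_000, wall_seconds=10.0)
```

Tracking p95 and p99 next to p50 matters for batching because queueing delay shows up in the tail long before it moves the median.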

## 💻 Patterns

### vLLM offline batch
```python
from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [...]  # placeholder: replace with a list of ~10K prompt strings
outputs = llm.generate(prompts, sampling)
# vLLM handles continuous batching internally; no manual chunking needed
```

### vLLM online server (continuous batching)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9
```

### Ray Batch inference
```python
import ray
import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')

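class Predictor:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code;
        # it runs once per worker, not once per batch.
        self.model = load_model()
    def __call__(self, batch):
        # By default, batch is a dict of column name -> NumPy array.
        return {'pred': self.model(batch['features'])}

# concurrency=4 runs four Predictor workers, each reserving one GPU
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)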
predictions.write_parquet('s3://my-bucket/predictions/')
```

### Triton dynamic batching
```protobuf
# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # 5ms
  preferred_batch_size: [16, 32, 64]
}
```

### Custom batching (asyncio queue)
```python
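import asyncio
from collections import deque

class BatchQueue:
    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        # Must be constructed inside a running event loop. Keep a reference
        # so the background task is not garbage-collected.
        self._task = asyncio.create_task(self._loop())

    async def predict(self, x):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((x, future))
        return await future

    async def _loop(self):
        while True:
            if not self.queue:
                await asyncio.sleep(0.001)  # idle poll
                continue
            # Wait for more requests unless the batch is already full.
            if len(self.queue) < self.max_batch:
                await asyncio.sleep(self.max_wait_ms / 1000)
            batch = []
            while self.queue and len(batch) < self.max_batch:
                batch.append(self.queue.popleft())
            xs = [b[0] for b in batch]
            try:
                # Run the blocking model call off the event loop thread.
                preds = await asyncio.to_thread(self.model, xs)
            except Exception as exc:
                # Fail the whole batch instead of leaving callers hanging.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), p in zip(batch, preds):
                if not future.done():
                    future.set_result(p)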

### Cost optimization (spot + batch)
```python
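# Batch jobs tolerate spot instances:
# with a 1-hour SLA, an interruption plus retry still fits the deadline.
# Illustrative keys only, not a specific provider's API.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # often ~70% cheaper than on-demand
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}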
```
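
Retrying on interruption only pays off if the job can resume instead of starting over. A minimal checkpointing sketch, assuming a local JSON file as the checkpoint store and a hypothetical `run_with_checkpoint` helper (a real job would more likely checkpoint to object storage):

```python
import json
import os
import tempfile


def run_with_checkpoint(items, process, checkpoint_path, chunk_size=100):
    """Process items in chunks, persisting the next offset after each chunk."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items), chunk_size):
        process(items[i:i + chunk_size])
        # Write then rename, so an interrupt never leaves a half-written file.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": min(i + chunk_size, len(items))}, f)
        os.replace(tmp, checkpoint_path)


seen = []
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
run_with_checkpoint(list(range(250)), seen.extend, path, chunk_size=100)
```

A chunk can run twice if the interruption lands after `process` returns but before the checkpoint is written, so `process` should be idempotent (for example, writing each chunk to a deterministic output key).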

## 🤔 Decision Criteria
| Situation | Strategy |
|---|---|
| Chat / search | Continuous batching (vLLM) |
| Daily summary | Offline batch + spot |
| Embedding 100M docs | Ray + GPU batch |
| Image generation | Async queue + webhook |
| Nightly fraud scan | Batch + cheap GPU |
| Real-time API + bulk | Hybrid (priority queue) |

**Defaults**: vLLM (LLM) / Triton (general) / Ray (distributed).

## 🔗 Graph
- Variants: [[Continuous-Batching]] · [[Dynamic-Batching]] · [[Static-Batching]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[LLM_Optimization_and_Deployment_Strategies|PagedAttention]]
- Adjacent: [[KV-Cache]] · [[LLM_Optimization_and_Deployment_Strategies|Inference-Optimization]]

## 🤖 LLM Usage
**When**: cost optimization, throughput-first tasks, LLM serving infrastructure design.
**When not**: strict <100 ms latency; online interactive use with single requests.

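## ❌ Anti-patterns
- **Forcing batch onto online traffic**: violates latency targets.
- **Static batching for LLMs**: leaves the GPU idle while short sequences wait for the longest one.
- **Pushing batch size to the max until OOM**: triggers retry storms.
- **No max wait**: requests can be delayed indefinitely.
- **No monitoring**: GPU utilization goes unseen.
- **Stateful jobs on spot**: an interruption loses progress.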
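
One way to avoid the OOM retry storm is to shrink the batch on failure rather than retry at the same size. The sketch below is an assumption-laden illustration: `run_with_backoff` and `fake_infer` are hypothetical, and the built-in `MemoryError` stands in for whatever out-of-memory exception the framework in use raises.

```python
def run_with_backoff(items, infer, batch_size, min_batch_size=1):
    """Halve the batch size on MemoryError instead of retrying the same size."""
    results = []
    i = 0
    while i < len(items):
        chunk = items[i:i + batch_size]
        try:
            results.extend(infer(chunk))
            i += len(chunk)
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the error
            batch_size = max(min_batch_size, batch_size // 2)
    return results, batch_size


def fake_infer(chunk):
    # Hypothetical model that runs out of memory above 8 items per batch.
    if len(chunk) > 8:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in chunk]


results, final_size = run_with_backoff(list(range(20)), fake_infer, batch_size=32)
print(final_size)  # 8
```

The job settles on the largest size that works and keeps going, and the final size is a useful starting point for the next run's configuration.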

## 🧪 Verification / Duplicates
- Verified (vLLM paper, NVIDIA Triton, Ray).
- Trust level A.
- Related: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Continuous-Batching]] · [[GPU-Utilization]] · [[ML-Inference]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention |