---
id: wiki-2026-0508-batch-inference
title: Batch Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [batch inference, async inference, dynamic batching, continuous batching, throughput optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [inference, throughput, gpu, optimization, llm-serving, vllm, triton, ray]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: vLLM / Triton / Ray Serve / Modal
---

# Batch Inference

## 📌 One-line insight

> **"Group buying for the GPU."** Instead of answering every single request immediately, maximize the throughput of the whole batch. For LLMs, dynamic and continuous batching can yield roughly 5-20× throughput, depending on the workload. Batching is the biggest lever in the cost/latency trade-off.

## 📖 Core

### Inference types

| Type | Latency | Throughput | Cost | Examples |
|---|---|---|---|---|
| Online (sync) | <100 ms | Low | High | Chat, search |
| Batch (offline) | Minutes to hours | Maximum | Lowest | Daily summaries, fraud scans |
| Async / queue | Seconds to minutes | Medium | Medium | Image generation, transcription |

### Types of batching

#### Static batching (traditional)
- The batch size is fixed.
- Requests wait until the batch fills, so latency is variable.

#### Dynamic batching (Triton)
- Caps the maximum wait time.
- Groups incoming requests into a batch.
- ✅ Balances latency and throughput.

#### Continuous batching (vLLM, TensorRT-LLM)
- Specific to LLMs.
- As soon as one sequence finishes, another sequence takes its slot.
- Minimizes GPU idle time.
- Roughly 5-20× throughput, depending on the workload.

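The gap between static and continuous batching can be illustrated with a toy scheduler simulation. This is a minimal sketch, not vLLM's or TensorRT-LLM's scheduler: it assumes each sequence needs a known number of decode steps and that one step costs the same regardless of how many slots are occupied. The helper names `static_steps` and `continuous_steps` are hypothetical.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest sequence ends."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active sequence
    total = 0
    while waiting or running:
        # Fill free slots from the waiting queue before each step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        total += 1
        # Drop sequences that just produced their last token.
        running = [r - 1 for r in running if r > 1]
    return total


lengths = [2, 8, 2, 8, 2, 8, 2, 8]  # toy output lengths in tokens
print(static_steps(lengths, batch_size=2))      # 32
print(continuous_steps(lengths, batch_size=2))  # 22
```

In this toy workload the static scheduler leaves a slot idle while each short sequence waits for its long batch-mate, which is the idle time continuous batching removes. Real speedups depend on the length distribution and on the serving engine.
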
#### PagedAttention (vLLM)
- Manages the KV cache through a page table.
- Minimizes memory fragmentation.
- Makes long contexts combined with batching feasible.

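The paging idea can be sketched with a toy allocator that hands out fixed-size KV blocks on demand and tracks them in a per-sequence block table. This is an illustration under stated assumptions, not vLLM's internal implementation: the class `BlockAllocator`, its methods, and the block size of 16 tokens are all hypothetical.

```python
class BlockAllocator:
    """Toy KV-cache allocator: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full, or none exists yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused token slots (at most block_size - 1 per sequence)."""
        return sum(len(table) * self.block_size - self.lengths[seq]
                   for seq, table in self.tables.items())


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")  # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens occupy 1 block
print(len(alloc.free), alloc.wasted_slots())  # 5 23
alloc.release("a")
print(len(alloc.free))  # 7
```

Because blocks need not be contiguous, waste is bounded by one partially filled block per sequence instead of a whole preallocated maximum-length buffer, and freed blocks are reusable by any other sequence right away.
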
### Effect of batch size
- **Throughput**: a larger batch raises GPU utilization.
- **Latency** (per request): waiting time increases.
- **Memory**: a larger batch raises the OOM risk.
- **Sweet spot**: the largest batch that fits both GPU memory and the latency SLA.

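The memory side of the sweet spot can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer per token) and uses an illustrative model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and an 8 GiB KV budget; all of these numbers and both function names are assumptions, and weights and activations are ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_batch_size(kv_budget_bytes, seq_len, per_token_bytes):
    """Largest number of concurrent sequences whose full KV cache fits the budget."""
    return kv_budget_bytes // (seq_len * per_token_bytes)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes = 128 KiB per token
print(max_batch_size(8 * 1024**3, seq_len=4096, per_token_bytes=per_token))  # 16
```

This gives only the memory ceiling. The latency SLA usually has to be checked empirically by sweeping batch sizes below that ceiling and measuring p95/p99 latency.
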
### Applications of batch inference
1. **Embedding generation**: batch over 100M documents.
2. **Summarization**: daily news.
3. **Fraud detection**: nightly runs over transactions.
4. **Recommendation**: precompute user-item scores.
5. **Image classification** (archive): medical images.
6. **Translation** (corpus): bulk documents.

### Hybrid (modern LLM serving)
- Mixes online (chat) and batch (precompute) traffic.
- A priority queue serves latency-sensitive requests first.
- Streaming delivers output progressively.

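The priority-queue idea above can be sketched with the standard-library `heapq`. This is a minimal illustration, not how any particular serving framework schedules requests; `PriorityBatcher`, `INTERACTIVE`, and `BULK` are hypothetical names.

```python
import heapq
import itertools

INTERACTIVE, BULK = 0, 1  # lower value is served first


class PriorityBatcher:
    def __init__(self):
        self._heap = []
        # Monotonic counter as tie-breaker keeps FIFO order within a priority.
        self._counter = itertools.count()

    def submit(self, request, priority=BULK):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch):
        """Pop up to max_batch requests, latency-sensitive ones first."""
        batch = []
        while self._heap and len(batch) < max_batch:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch


q = PriorityBatcher()
q.submit("nightly-embed-1")
q.submit("chat-1", priority=INTERACTIVE)
q.submit("nightly-embed-2")
q.submit("chat-2", priority=INTERACTIVE)
print(q.next_batch(max_batch=3))  # ['chat-1', 'chat-2', 'nightly-embed-1']
```

Bulk work fills whatever capacity interactive traffic leaves free. Under sustained interactive load a strict priority queue can starve bulk requests, so a production design would likely need aging or a reserved share for the lower tier.
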
### Monitoring
- **Throughput**: tokens/s, requests/s.
- **Latency**: p50, p95, p99, TTFT (time to first token).
- **GPU utilization**: target 70-90%.
- **Batch size distribution**.
- **Queue depth**.

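The latency and throughput metrics above can be computed from per-request timestamps. The sketch below assumes each request is logged as a dict with `arrival`, `first_token`, and `finish` times in seconds plus a `tokens` count; that record layout and the helper names `percentile` and `summarize` are assumptions, and the percentile uses the simple nearest-rank definition.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Aggregate latency, TTFT, and throughput over a list of request records."""
    latencies = [r["finish"] - r["arrival"] for r in requests]
    ttfts = [r["first_token"] - r["arrival"] for r in requests]
    window = max(r["finish"] for r in requests) - min(r["arrival"] for r in requests)
    return {
        "p50_latency": percentile(latencies, 50),
        "p95_latency": percentile(latencies, 95),
        "p99_latency": percentile(latencies, 99),
        "p50_ttft": percentile(ttfts, 50),
        "tokens_per_s": sum(r["tokens"] for r in requests) / window,
        "requests_per_s": len(requests) / window,
    }


log = [
    {"arrival": 0, "first_token": 1, "finish": 4, "tokens": 30},
    {"arrival": 1, "first_token": 2, "finish": 6, "tokens": 50},
    {"arrival": 2, "first_token": 4, "finish": 10, "tokens": 20},
]
print(summarize(log))
```

GPU utilization, batch size distribution, and queue depth come from the serving engine or the device driver rather than from request logs, so they are left out of this sketch.
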
## 💻 Patterns

### vLLM offline batch

```python
from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [...]  # e.g. 10K prompts
outputs = llm.generate(prompts, sampling)
# vLLM applies continuous batching internally; no manual batching is needed.

for out in outputs:
    print(out.outputs[0].text)
```

### vLLM online server (continuous batching)

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9
```

### Ray batch inference

```python
import ray
import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')


class Predictor:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code;
        # it runs once per actor, not once per batch.
        self.model = load_model()

    def __call__(self, batch):
        return {'pred': self.model(batch['features'])}


# Four GPU actors, each consuming batches of 64 rows.
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)
predictions.write_parquet('s3://my-bucket/predictions/')
```

### Triton dynamic batching

```protobuf
# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # 5ms
  preferred_batch_size: [16, 32, 64]
}
```

### Custom batching (asyncio queue)

```python
import asyncio
from collections import deque


class BatchQueue:
    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        # Must be constructed inside a running event loop. Keep a reference
        # so the background task is not garbage-collected.
        self._task = asyncio.create_task(self._loop())

    async def predict(self, x):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((x, future))
        return await future

    async def _loop(self):
        while True:
            if not self.queue:
                await asyncio.sleep(0.001)
                continue
            # Give other requests up to max_wait_ms to join the batch.
            await asyncio.sleep(self.max_wait_ms / 1000)
            batch = []
            while self.queue and len(batch) < self.max_batch:
                batch.append(self.queue.popleft())
            xs = [b[0] for b in batch]
            try:
                # Run the blocking model call off the event loop.
                preds = await asyncio.to_thread(self.model, xs)
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), p in zip(batch, preds):
                if not future.done():
                    future.set_result(p)
```

### Cost optimization (spot + batch)

```python
# Batch jobs are a good fit for spot instances:
# with a 1-hour SLA, a spot interruption is acceptable.
# Illustrative job config; the keys are not tied to a specific provider API.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # ~70% cheaper
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}
```

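Retrying after a spot interruption only pays off if the job can resume where it stopped. The following is a minimal sketch of chunk-level checkpointing, assuming results are written durably inside `process_chunk`; the function `run_resumable`, the checkpoint file layout, and the chunk size are hypothetical and not part of any provider SDK.

```python
import json
import os
import tempfile


def run_resumable(items, process_chunk, checkpoint_path, chunk_size=1000):
    """Process items chunk by chunk, persisting progress after each chunk."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items), chunk_size):
        process_chunk(items[i:i + chunk_size])
        # Write to a temp file and rename, so an interrupt never leaves
        # a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": i + chunk_size}, f)
        os.replace(tmp, checkpoint_path)


seen = []
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
run_resumable(list(range(5)), seen.extend, path, chunk_size=2)
run_resumable(list(range(5)), seen.extend, path, chunk_size=2)  # nothing left to do
print(seen)  # [0, 1, 2, 3, 4]
```

An interrupt between `process_chunk` and the checkpoint write causes that chunk to be reprocessed on resume, so this gives at-least-once semantics; the output writes should be idempotent.
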
## 🤔 Decision criteria

| Situation | Strategy |
|---|---|
| Chat / search | Continuous batching (vLLM) |
| Daily summary | Offline batch + spot |
| Embedding 100M documents | Ray + GPU batch |
| Image generation | Async queue + webhook |
| Nightly fraud scan | Batch + cheap GPU |
| Real-time API + bulk | Hybrid (priority queue) |

**Default**: vLLM (LLM) / Triton (general) / Ray (distributed).

## 🔗 Graph
- Variants: [[Continuous-Batching]] · [[Dynamic-Batching]] · [[Static-Batching]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[LLM_Optimization_and_Deployment_Strategies|PagedAttention]]
- Adjacent: [[KV-Cache]] · [[LLM_Optimization_and_Deployment_Strategies|Inference-Optimization]]

## 🤖 LLM usage
**When**: cost optimization; throughput-first tasks; LLM serving infrastructure design.
**When not**: strict <100 ms latency; online interactive use (single request).

## ❌ Anti-patterns
- **Forcing batching onto online traffic**: violates latency targets.
- **Static batching for LLMs**: leaves the GPU idle.
- **Pushing batch size to the maximum until OOM**: triggers retry storms.
- **No max wait**: requests can be delayed indefinitely.
- **No monitoring**: GPU utilization stays unknown.
- **Stateful jobs on spot**: an interruption loses the state.

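One way to avoid the OOM retry storm is to shrink the batch instead of retrying at the same size. The sketch below is an assumption-laden illustration: `run_with_backoff` and `fake_model` are hypothetical, and the OOM signal is modeled as Python's built-in `MemoryError` (a GPU framework would raise its own exception type, which can be passed via `oom_errors`).

```python
def run_with_backoff(model, items, batch_size, min_batch_size=1,
                     oom_errors=(MemoryError,)):
    """Run model over items, halving the batch size whenever a batch hits OOM."""
    results = []
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(model(batch))
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the error
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with a smaller batch
        i += len(batch)
    return results, batch_size


def fake_model(batch):
    # Stand-in model that "runs out of memory" above 4 items per batch.
    if len(batch) > 4:
        raise MemoryError("simulated OOM")
    return [x * 2 for x in batch]


print(run_with_backoff(fake_model, list(range(10)), batch_size=16))
# ([0, 2, 4, 6, 8, 10, 12, 14, 16, 18], 4)
```

Returning the final batch size lets the caller persist it, so the next run starts from a size that is known to fit.
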
## 🧪 Verification / duplicates
- Verified (vLLM paper, NVIDIA Triton, Ray).
- Trust level A.
- Related: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Continuous-Batching]] · [[GPU-Utilization]] · [[ML-Inference]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention |