---
id: wiki-2026-0508-batch-inference
title: Batch Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [batch inference, async inference, dynamic batching, continuous batching, throughput optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [inference, throughput, gpu, optimization, llm-serving, vllm, triton, ray]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: vLLM / Triton / Ray Serve / Modal
---
# Batch Inference
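## 📌 One-Line Insight
> **"Group buying for the GPU."** Rather than answering each request the moment it arrives, batch inference maximizes throughput per batch. For LLMs, dynamic and continuous batching can deliver roughly 5-20× higher throughput. It is the biggest lever in the cost/latency trade-off.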
## 📖 Core Concepts
### Inference types
| Type | Latency | Throughput | Cost | Examples |
|---|---|---|---|---|
| Online (sync) | <100 ms | Low | High | Chat, search |
| Batch (offline) | Minutes to hours | Highest | Lowest | Daily summaries, fraud scans |
| Async / queue | Seconds to minutes | Medium | Medium | Image generation, transcription |
### Types of batching
#### Static batching (traditional)
- The batch size is fixed.
- Requests wait until the batch fills, so latency varies.
#### Dynamic batching (Triton)
- A maximum wait time bounds the queueing delay.
- Incoming requests are grouped into a batch as they arrive.
- ✅ Balances latency against throughput.
#### Continuous batching (vLLM, TensorRT-LLM)
- Specific to LLMs.
- When one sequence finishes, another sequence immediately takes its slot.
- Minimizes GPU idle time.
- 5-20× throughput.
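The slot-refill idea can be illustrated with a toy step counter. This is a minimal sketch, not how vLLM or TensorRT-LLM schedule work internally: it assumes each decode step produces one token for every running sequence, and the function names and sequence lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens): two long requests among short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

With these lengths the static schedule needs 200 steps and the continuous one 110, because the short sequences no longer hold slots idle while the long ones finish. The size of the gain depends entirely on how uneven the output lengths are.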
#### PagedAttention (vLLM)
- Manages the KV cache through a page table.
- Minimizes memory fragmentation.
- Makes long contexts combined with large batches feasible.
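The page-table idea can be sketched with a toy allocator. This is an illustration of the concept only, not vLLM's implementation: the class name `PagedKVAllocator`, the block size of 16 tokens, and the method names are all assumptions.

```python
class PagedKVAllocator:
    """Toy page table: maps each sequence to fixed-size KV cache blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.num_tokens = {}    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        n = self.num_tokens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:  # all allocated blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("seq-a")
print(len(alloc.block_tables["seq-a"]))  # blocks in use for 17 tokens
alloc.free("seq-a")
print(len(alloc.free_blocks))
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block, which is what keeps fragmentation low in this model.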
### Effect of batch size
- **Throughput**: larger batch → higher GPU utilization.
- **Latency** (per request): queueing wait increases.
- **Memory**: larger batch → higher OOM risk.
- **Sweet spot**: the largest batch that fits both GPU memory and the latency SLA.
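The memory side of the sweet spot can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer, per KV head, per token). The model dimensions and the 8 GiB budget are illustrative assumptions, not measurements, and the latency SLA still has to be checked empirically.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: keys and values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(kv_budget_bytes, max_seq_len, per_token_bytes):
    """Largest batch whose worst-case KV cache fits in the memory budget."""
    return kv_budget_bytes // (max_seq_len * per_token_bytes)


# Illustrative dimensions for an 8B-class model with grouped-query attention (fp16).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # assume 8 GiB left for the KV cache after loading weights
print(per_token)                                        # bytes per token
print(max_batch_size(budget, max_seq_len=4096, per_token_bytes=per_token))
```

Under these assumptions each token costs 128 KiB of KV cache, so 16 sequences of 4096 tokens fill the budget. Paged allocation usually allows more, since most sequences never reach the maximum length.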
### Where batch inference applies
1. **Embedding generation**: batch over 100M documents.
2. **Summarization**: daily news.
3. **Fraud detection**: nightly scans of transactions.
4. **Recommendation**: precompute user-item scores.
5. **Image classification** (archive): medical images.
6. **Translation** (corpus): bulk documents.
### Hybrid (modern LLM serving)
- Mix online (chat) and batch (precompute) workloads.
- A priority queue serves latency-sensitive requests first.
- Streaming delivers output progressively.
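The priority-queue point can be sketched with the standard library's `heapq`. The class name `PriorityRequestQueue` and the two priority levels are hypothetical; a production scheduler would also need aging so that bulk work is not starved.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Serve latency-sensitive requests before bulk ones, FIFO within a class."""

    ONLINE, BATCH = 0, 1  # lower value is served first

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps arrival order

    def put(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch):
        """Pop up to max_batch requests, highest priority first."""
        batch = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap)[2])
        return batch


q = PriorityRequestQueue()
q.put("bulk-1", PriorityRequestQueue.BATCH)
q.put("chat-1", PriorityRequestQueue.ONLINE)
q.put("bulk-2", PriorityRequestQueue.BATCH)
q.put("chat-2", PriorityRequestQueue.ONLINE)
print(q.next_batch(max_batch=3))
```

Bulk requests fill whatever capacity the online traffic leaves free, which is how one GPU pool can serve both workloads.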
### Monitoring
- **Throughput**: tokens/s, requests/s.
- **Latency**: p50, p95, p99, TTFT (time to first token).
- **GPU utilization**: target 70-90%.
- **Batch size distribution**.
- **Queue depth**.
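The latency and throughput metrics above can be computed from raw per-request measurements. This is a minimal sketch using the nearest-rank percentile definition; the function names and the sample data are assumptions, and a real deployment would more likely export histograms to a metrics system.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, total_tokens, wall_seconds):
    """Aggregate request latencies and token throughput for one time window."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "tokens_per_s": total_tokens / wall_seconds,
        "requests_per_s": len(latencies_ms) / wall_seconds,
    }


# Hypothetical window: 100 requests, 5000 generated tokens, 10 seconds.
latencies = list(range(1, 101))
print(summarize(latencies, total_tokens=5000, wall_seconds=10))
```

Tracking p95 and p99 alongside p50 matters for batching in particular, because queueing delay mostly shows up in the tail.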
## 💻 Patterns
### vLLM offline batch
```python
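from vllm import LLM, SamplingParams

# Load the model once; vLLM manages GPU memory and scheduling internally.
llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [...]  # placeholder: replace with a list of ~10K prompt strings
outputs = llm.generate(prompts, sampling)
# Continuous batching is handled by vLLM itself; no manual batch loop is needed.
for out in outputs:
    print(out.outputs[0].text)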
```
### vLLM online server (continuous batching)
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9
```
### Ray Batch inference
```python
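import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')


class Predictor:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code.
        # It runs once per actor, so weights are loaded once per GPU worker.
        self.model = load_model()

    def __call__(self, batch):
        # batch is a dict mapping column names to NumPy arrays.
        return {'pred': self.model(batch['features'])}


# Four GPU actors, each receiving batches of 64 rows.
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)
predictions.write_parquet('s3://my-bucket/predictions/')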
```
### Triton dynamic batching
```protobuf
# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # wait at most 5 ms to form a batch
  preferred_batch_size: [16, 32, 64]
}
```
### Custom batching (asyncio queue)
```python
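import asyncio
from collections import deque


class BatchQueue:
    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        # Must be constructed inside a running event loop. Keep a reference
        # so the background task is not garbage-collected.
        self._task = asyncio.create_task(self._loop())

    async def predict(self, x):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((x, future))
        return await future

    async def _loop(self):
        while True:
            if not self.queue:
                await asyncio.sleep(0.001)  # idle poll
                continue
            # Wait up to max_wait_ms so more requests can join the batch.
            await asyncio.sleep(self.max_wait_ms / 1000)
            batch = []
            while self.queue and len(batch) < self.max_batch:
                batch.append(self.queue.popleft())
            xs = [b[0] for b in batch]
            try:
                preds = self.model(xs)  # a blocking call here stalls the event loop
            except Exception as exc:
                # Fail the whole batch instead of leaving callers waiting forever.
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), p in zip(batch, preds):
                future.set_result(p)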
```
### Cost optimization (spot + batch)
```python
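# Batch jobs tolerate spot instances: with a 1-hour SLA, an interruption
# followed by a retry is acceptable.
# Illustrative job config; the keys are not tied to a specific scheduler API.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # ~70% cheaper
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}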
```
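Retrying after a spot interruption only pays off if the job can resume where it stopped. The following is a minimal checkpointing sketch under stated assumptions: `run_with_checkpoint`, the JSON checkpoint layout, and the local file path are hypothetical. A real job would write each chunk's output to durable storage before advancing the checkpoint, since this sketch only returns the results produced in the current run.

```python
import json
import os


def run_with_checkpoint(items, process, checkpoint_path, chunk_size=100):
    """Process items in chunks, recording progress so a restart can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    results = []
    for i in range(start, len(items), chunk_size):
        results.extend(process(items[i:i + chunk_size]))
        # Write to a temp file, then rename, so a crash never leaves
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": min(i + chunk_size, len(items))}, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

Smaller chunks lose less work per interruption but write checkpoints more often, so the chunk size is a trade-off between the two.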
## 🤔 Decision Criteria
| Situation | Strategy |
|---|---|
| Chat / search | Continuous batching (vLLM) |
| Daily summary | Offline batch + spot |
| Embedding 100M docs | Ray + GPU batch |
| Image generation | Async queue + webhook |
| Nightly fraud scan | Batch + cheap GPU |
| Real-time API + bulk | Hybrid (priority queue) |
**Defaults**: vLLM (LLMs) / Triton (general models) / Ray (distributed).
## 🔗 Graph
- Variants: [[Continuous-Batching]] · [[Dynamic-Batching]] · [[Static-Batching]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[LLM_Optimization_and_Deployment_Strategies|PagedAttention]]
- Adjacent: [[KV-Cache]] · [[LLM_Optimization_and_Deployment_Strategies|Inference-Optimization]]
## 🤖 LLM Usage
**When to use**: cost optimization; throughput-first tasks; LLM serving infrastructure design.
**When not to use**: strict <100 ms latency requirements; online interactive serving of single requests.
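## ❌ Anti-patterns
- **Forcing batching onto online traffic**: violates latency targets.
- **Static batching for LLMs**: leaves the GPU idle.
- **Pushing batch size to the maximum until OOM**: triggers retry storms.
- **No max wait time**: requests can be delayed indefinitely.
- **No monitoring**: GPU utilization stays unknown.
- **Stateful jobs on spot instances**: an interruption loses progress.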
## 🧪 Verification / Duplicates
- Verified (vLLM paper, NVIDIA Triton, Ray).
- Trust level: A.
- Related: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Continuous-Batching]] · [[GPU-Utilization]] · [[ML-Inference]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention |