---
id: wiki-2026-0508-batch-inference
title: Batch Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [batch inference, async inference, dynamic batching, continuous batching, throughput optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [inference, throughput, gpu, optimization, llm-serving, vllm, triton, ray]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: vLLM / Triton / Ray Serve / Modal
---

# Batch Inference

## 📌 One-Line Insight

> **"Group buying for the GPU."** Instead of answering each request the moment it arrives, batch inference groups requests to maximize throughput. For LLMs, dynamic and continuous batching can raise throughput by roughly 5-20×, depending on the workload, which makes batching the biggest lever in the cost/latency trade-off.

## 📖 Core Concepts

### Inference types

| Type | Latency | Throughput | Cost (per request) | Examples |
|---|---|---|---|---|
| Online (sync) | <100 ms | Low | High | Chat, search |
| Batch (offline) | Minutes to hours | Highest | Lowest | Daily summaries, fraud scans |
| Async / queue | Seconds to minutes | Medium | Medium | Image generation, transcription |

### Types of batching

#### Static batching (traditional)
- The batch size is fixed.
- Requests wait until the batch fills, so latency is variable.

#### Dynamic batching (Triton)
- A maximum wait time bounds the queueing delay.
- Incoming requests are grouped into a batch as they arrive.
- ✅ Balances latency and throughput.

#### Continuous batching (vLLM, TensorRT-LLM)
- Specific to LLM (autoregressive) serving.
- When one sequence finishes, a waiting sequence immediately takes its slot.
- Minimizes GPU idle time.
- Roughly 5-20× throughput.

#### PagedAttention (vLLM)
- Manages the KV cache through a page table.
- Minimizes memory fragmentation.
- Enables long contexts combined with large batches.

### Effect of batch size
- **Throughput**: a larger batch raises GPU utilization.
- **Latency** (per request): a larger batch means a longer wait.
- **Memory**: a larger batch raises the risk of OOM.
- **Sweet spot**: the largest batch that fits both GPU memory and the latency SLA.

### Where batch inference applies
1. **Embedding generation**: batch over a corpus of around 100M documents.
2. **Summarization**: daily news.
3. **Fraud detection**: nightly scans over transactions.
4. **Recommendation**: precompute user-item scores.
5. **Image classification** (archive): medical images.
6. **Translation** (corpus): bulk documents.

### Hybrid (modern LLM serving)
- Mix online traffic (chat) with batch traffic (precompute) on the same deployment.
- Use a priority queue so latency-sensitive requests are served first.
- Stream output progressively.

### Monitoring
- **Throughput**: tokens/s, requests/s.
- **Latency**: p50, p95, p99, TTFT (time to first token).
- **GPU utilization**: target 70-90%.
- **Batch size distribution**.
- **Queue depth**.

## 💻 Patterns

### vLLM offline batch
```python
from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [...]  # e.g. ~10K prompts
outputs = llm.generate(prompts, sampling)
# vLLM schedules the prompts itself using continuous batching
```

### vLLM online server (continuous batching)
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9
```

### Ray batch inference
```python
import ray
import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')

class Predictor:
    def __init__(self):
        # load_model() is a placeholder for your own model-loading code;
        # it runs once per actor, not once per batch.
        self.model = load_model()

    def __call__(self, batch):
        return {'pred': self.model(batch['features'])}

# 4 actors, each pinned to 1 GPU, each receiving batches of 64 rows
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)
predictions.write_parquet('s3://my-bucket/predictions/')
```

### Triton dynamic batching
```protobuf
# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # 5 ms
  preferred_batch_size: [16, 32, 64]
}
```

### Custom batching (asyncio queue)
```python
import asyncio
from collections import deque


class BatchQueue:
    """Micro-batcher. Construct it inside a running event loop (Python 3.9+)."""

    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms
        self.queue: deque = deque()
        # Keep a reference so the background task is not garbage-collected.
        self._task = asyncio.create_task(self._loop())

    async def predict(self, x):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((x, future))
        return await future

    async def _loop(self):
        while True:
            if not self.queue:
                await asyncio.sleep(0.001)  # idle poll
                continue
            # Give more requests up to max_wait_ms to arrive.
            await asyncio.sleep(self.max_wait_ms / 1000)
            batch = []
            while self.queue and len(batch) < self.max_batch:
                batch.append(self.queue.popleft())
            xs = [x for x, _ in batch]
            try:
                # Run the blocking model call off the event loop.
                preds = await asyncio.to_thread(self.model, xs)
            except Exception as exc:
                # Propagate the failure so callers do not hang forever.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), p in zip(batch, preds):
                if not future.done():
                    future.set_result(p)
```

### Cost optimization (spot + batch)
```python
# Illustrative job config (not a specific cloud API).
# Batch jobs tolerate spot instances: with a 1-hour SLA,
# an occasional spot interruption is acceptable.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # ~70% cheaper
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}
```

## 🤔 Decision Criteria

| Situation | Strategy |
|---|---|
| Chat / search | Continuous batching (vLLM) |
| Daily summary | Offline batch + spot |
| Embedding 100M docs | Ray + GPU batch |
| Image generation | Async queue + webhook |
| Nightly fraud scan | Batch + cheap GPU |
| Real-time API + bulk jobs | Hybrid (priority queue) |

**Defaults**: vLLM (LLM) / Triton (general) / Ray (distributed).

## 🔗 Graph
- Variants: [[Continuous-Batching]] · [[Dynamic-Batching]] · [[Static-Batching]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[LLM_Optimization_and_Deployment_Strategies|PagedAttention]]
- Adjacent: [[KV-Cache]] · [[LLM_Optimization_and_Deployment_Strategies|Inference-Optimization]]

## 🤖 LLM Usage

**When to use**: cost optimization, throughput-first tasks, and LLM serving infrastructure design.

**When not to use**: strict <100 ms latency requirements, or online interactive workloads that serve a single request at a time.

## ❌ Anti-patterns
- **Forcing batching onto online traffic**: violates latency targets.
- **Static batching for LLMs**: leaves the GPU idle.
- **Pushing batch size to the maximum until OOM**: triggers retry storms.
- **No max wait**: requests can be delayed indefinitely.
- **No monitoring**: GPU utilization is unknown.
- **Stateful jobs on spot instances**: progress is lost on interruption.

## 🧪 Verification / Duplicates
- Verified (vLLM paper, NVIDIA Triton, Ray).
- Trust level A.
- Related: [[LLM_Optimization_and_Deployment_Strategies|vLLM]] · [[Continuous-Batching]] · [[GPU-Utilization]] · [[ML-Inference]].

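
## 🧮 Sketch: static vs. continuous batching

The claim that continuous batching minimizes GPU idle time can be illustrated with a toy step counter. This is a minimal sketch under loud assumptions: every decode step costs the same, prefill is ignored, and the output lengths in `lengths` are invented for illustration. The function names are hypothetical and do not mirror how vLLM or TensorRT-LLM schedule internally.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Fill free slots from the waiting queue before each step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; drop the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # hypothetical output lengths, in tokens
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print(f"static={static} steps, continuous={continuous} steps")
# prints: static=16 steps, continuous=10 steps
```

In the static case, the short sequences in each batch hold their slots idle until the 8-token sequence finishes. Refilling slots as soon as they free up lets the second long sequence start at step 3 instead of step 9, which is where the step savings come from in this toy setup.
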
## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention |
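
## 🧩 Sketch: priority queue for hybrid serving

The hybrid pattern described earlier (latency-sensitive requests served first, bulk work filling the remaining slots) might look roughly like the following. It is a minimal sketch, not a production scheduler: `PriorityBatcher`, `ONLINE`, and `BATCH` are hypothetical names introduced here, and the sketch deliberately omits starvation protection for bulk requests.

```python
import heapq
import itertools

ONLINE, BATCH = 0, 1  # lower value = higher priority (hypothetical classes)


class PriorityBatcher:
    """Forms batches that serve latency-sensitive requests first."""

    def __init__(self, max_batch=4):
        self.max_batch = max_batch
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-breaker within a class

    def submit(self, payload, priority=BATCH):
        heapq.heappush(self._heap, (priority, next(self._counter), payload))

    def next_batch(self):
        """Pop up to max_batch payloads, highest priority first."""
        batch = []
        while self._heap and len(batch) < self.max_batch:
            _, _, payload = heapq.heappop(self._heap)
            batch.append(payload)
        return batch


batcher = PriorityBatcher(max_batch=3)
for doc in ["doc-1", "doc-2", "doc-3"]:
    batcher.submit(doc, priority=BATCH)
batcher.submit("chat-1", priority=ONLINE)  # arrives last, served first

first = batcher.next_batch()
second = batcher.next_batch()
print(first, second)
# prints: ['chat-1', 'doc-1', 'doc-2'] ['doc-3']
```

The monotonically increasing counter keeps ordering FIFO within each priority class and avoids comparing payloads directly. A real deployment would likely also age bulk requests so that sustained online traffic cannot starve them.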