2nd/10_Wiki/Topics/AI_and_ML/Batch-Inference.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
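10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, applied concept mapping (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections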

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-batch-inference
title: Batch Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: batch inference, async inference, dynamic batching, continuous batching, throughput optimization
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: inference, throughput, gpu, optimization, llm-serving, vllm, triton, ray
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: vLLM / Triton / Ray Serve / Modal

Batch Inference

📌 One-line insight

"Group buying for the GPU." The goal is not an immediate response to each single request but maximum throughput per batch. For LLMs, dynamic / continuous batching can deliver roughly 5-20× throughput. It is the biggest lever in the cost / latency trade-off.

📖 Core concepts

Inference types

| Type | Latency | Throughput | Cost | Examples |
| --- | --- | --- | --- | --- |
| Online (sync) | <100 ms | Low | High | Chat, search |
| Batch (offline) | Minutes to hours | Maximum | Lowest | Daily summaries, fraud scans |
| Async / queue | Seconds to minutes | Medium | Medium | Image generation, transcription |

Types of batching

Static batching (traditional)

  • The batch size is fixed.
  • Requests wait until the batch fills, so latency is variable.

Dynamic batching (Triton)

  • A maximum wait time bounds how long a request can sit in the queue.
  • Incoming requests are grouped into a batch as they arrive.
  • Balances latency against throughput.

Continuous batching (vLLM, TensorRT-LLM)

  • Specific to LLMs.
  • When one sequence finishes, a waiting sequence immediately takes its slot.
  • Minimizes GPU idle time.
  • Roughly 5-20× throughput.
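
To make the difference between static and continuous batching concrete, here is a minimal simulation sketch. It is a toy model, not how vLLM or TensorRT-LLM schedule work: it assumes every decode step costs the same, ignores prefill, and counts only how many steps are needed to finish a set of sequences with a fixed number of batch slots. The function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each fixed batch runs until its longest sequence has finished."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A finished sequence frees its slot, which is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active sequence
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence produces one token; drop the ones that are done.
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths in tokens: two long sequences mixed with short ones.
lengths = [8, 1, 1, 1, 8, 1, 1, 1]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)  # 16 9
```

In the static case the short sequences finish after one step, but their slots stay empty until the 8-token sequence completes. In the continuous case those slots are reused right away, which is where the throughput gain comes from.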

PagedAttention (vLLM)

  • Manages the KV cache through a page table of fixed-size blocks.
  • Minimizes memory fragmentation.
  • Makes long contexts combined with large batches feasible.
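
The page-table idea can be sketched with a toy block allocator. This is an illustration of the concept only, not vLLM's implementation: the class name, method names, and the block size of 16 tokens are assumptions chosen for the example, and no tensors are stored.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps logical blocks to physical ones."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        used = self.lengths.get(seq_id, 0)
        if used == len(table) * self.block_size:
            # Last block is full (or there is none yet): take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = used + 1

    def release(self, seq_id):
        # A finished sequence returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("seq-a")
print(cache.block_tables["seq-a"], len(cache.free_blocks))
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partially filled block instead of a full preallocated maximum-length buffer.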

Effect of batch size

  • Throughput: a larger batch raises GPU utilization.
  • Latency (per request): waiting time increases.
  • Memory: a larger batch raises the risk of OOM.
  • Sweet spot: the largest batch that fits both GPU memory and the latency SLA.
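
A rough back-of-the-envelope estimate shows how GPU memory caps the batch size for an LLM. The sketch below assumes an 8B-class model with 32 layers, 8 KV heads, and head dimension 128 in fp16, about 16 GiB of weights, and a 24 GiB GPU with 90% of memory usable. These figures are assumptions for illustration; the estimate ignores activations and runtime overhead, so treat the result as an upper bound.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(gpu_bytes, utilization, weight_bytes,
                             tokens_per_seq, per_token_bytes):
    """How many sequences of a given length fit in the memory left after weights."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // (tokens_per_seq * per_token_bytes)))


GIB = 2 ** 30
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
limit = max_concurrent_sequences(
    gpu_bytes=24 * GIB,
    utilization=0.9,
    weight_bytes=16 * GIB,   # assumed fp16 weight footprint
    tokens_per_seq=2048,     # assumed prompt + output budget per sequence
    per_token_bytes=per_token,
)
print(per_token, limit)  # 131072 22
```

With these assumptions each token costs 128 KiB of KV cache, so a 2048-token sequence needs 256 MiB and about 22 such sequences fit. Halving the per-sequence token budget roughly doubles the feasible batch.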

Where batch inference applies

  1. Embedding generation: batch over a corpus of, say, 100M documents.
  2. Summarization: daily news digests.
  3. Fraud detection: nightly scans of transactions.
  4. Recommendation: precomputing user-item scores.
  5. Image classification (archive): medical image archives.
  6. Translation (corpus): bulk document translation.

Hybrid (modern LLM serving)

  • Mixes online traffic (chat) with batch work (precompute) on the same fleet.
  • A priority queue serves latency-sensitive requests first.
  • Streaming returns output progressively.
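
The priority-queue idea can be sketched with Python's `heapq`. This is a minimal illustration under simple assumptions: two priority levels, FIFO order within a level, and no starvation protection for bulk work. The class and constant names are hypothetical.

```python
import heapq
import itertools

ONLINE, BATCH = 0, 1  # lower number is served first


class PriorityScheduler:
    """Pops latency-sensitive requests before bulk ones, FIFO within a level."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, request, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch):
        batch = []
        while self._heap and len(batch) < max_batch:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch


scheduler = PriorityScheduler()
scheduler.submit("bulk-1", BATCH)
scheduler.submit("chat-1", ONLINE)
scheduler.submit("bulk-2", BATCH)
scheduler.submit("chat-2", ONLINE)
first = scheduler.next_batch(max_batch=3)
print(first)  # ['chat-1', 'chat-2', 'bulk-1']
```

A production scheduler would usually add aging or a reserved share for bulk requests so that sustained online load cannot starve the batch queue.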

Monitoring

  • Throughput: token/s, request/s.
  • Latency: p50, p95, p99, TTFT (time to first token).
  • GPU util: target 70-90%.
  • Batch size distribution.
  • Queue depth.
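
As a sketch of how the latency and throughput metrics above could be derived from raw per-request samples, the helper below uses a nearest-rank percentile. The function names and the shape of the input data are assumptions; a real deployment would normally pull these numbers from the serving framework's metrics endpoint instead.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, ttft_ms, total_tokens, wall_seconds):
    """Aggregate per-request samples into throughput and latency metrics."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "ttft_p95_ms": percentile(ttft_ms, 95),
        "tokens_per_s": total_tokens / wall_seconds,
        "requests_per_s": len(latencies_ms) / wall_seconds,
    }


stats = summarize(
    latencies_ms=list(range(1, 101)),  # synthetic samples: 1 ms .. 100 ms
    ttft_ms=[20] * 100,
    total_tokens=50_000,
    wall_seconds=10.0,
)
print(stats)
```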

💻 Patterns

vLLM offline batch

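from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [...]  # list of prompt strings, e.g. around 10K of them
outputs = llm.generate(prompts, sampling)
# vLLM schedules the prompts with continuous batching internally,
# so no manual chunking is needed.
for out in outputs:
    print(out.outputs[0].text)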

vLLM online server (continuous batching)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9

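Ray batch inference

import ray
import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')

class Predictor:
    def __init__(self):
        # Runs once per worker, so the model is loaded once and reused.
        # load_model() is a placeholder for your own loading code.
        self.model = load_model()

    def __call__(self, batch):
        # batch is a dict of column name -> NumPy array (Ray's default batch format).
        return {'pred': self.model(batch['features'])}

# A pool of 4 workers with one GPU each, scoring 64 rows per call.
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)
predictions.write_parquet('s3://my-bucket/predictions/')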

Triton dynamic batching

# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # 5ms
  preferred_batch_size: [16, 32, 64]
}

Custom batching (asyncio queue)

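import asyncio


class BatchQueue:
    """Micro-batcher: groups concurrent predict() calls into one model call."""

    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()
        # Must be constructed inside a running event loop. Keep a reference
        # so the background task is not garbage-collected.
        self._task = asyncio.create_task(self._loop())

    async def predict(self, x):
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((x, future))
        return await future

    async def _loop(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives (no busy polling).
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            # Collect more requests until the batch is full or the wait expires.
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            xs = [x for x, _ in batch]
            try:
                # Run the blocking model call off the event loop thread.
                preds = await asyncio.to_thread(self.model, xs)
            except Exception as exc:
                # Propagate the failure so callers do not wait forever.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), p in zip(batch, preds):
                if not future.done():
                    future.set_result(p)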

Cost optimization (spot + batch)

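# Batch jobs tolerate interruption, so spot instances are acceptable.
# With a 1-hour SLA, an occasional spot interruption plus retry still fits.
# Illustrative job settings; the keys are not tied to a specific scheduler API.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # often ~70% cheaper than on-demand
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}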
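
Retrying after a spot interruption only saves money if the job can resume where it stopped. Below is a minimal sketch of a resumable batch loop, assuming progress can be tracked as a single index into an ordered list of items. `run_resumable`, `process_chunk`, and the state-file layout are hypothetical names for illustration.

```python
import json
import os


def run_resumable(items, process_chunk, state_path, chunk_size=1000):
    """Process items in chunks, recording progress so a restart can resume."""
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items), chunk_size):
        process_chunk(items[i:i + chunk_size])
        # Write to a temp file and rename, so the state file is never half-written.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": min(i + chunk_size, len(items))}, f)
        os.replace(tmp_path, state_path)
```

A chunk that was interrupted mid-way is processed again on restart, so `process_chunk` should be idempotent (for example, writing each chunk's output to a deterministic path).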

🤔 Decision criteria

| Situation | Strategy |
| --- | --- |
| Chat / search | Continuous batching (vLLM) |
| Daily summary | Offline batch + spot |
| Embedding 100M docs | Ray + GPU batch |
| Image generation | Async queue + webhook |
| Nightly fraud detection | Batch + cheap GPU |
| Real-time API + bulk | Hybrid (priority queue) |

Defaults: vLLM (LLM) / Triton (general) / Ray (distributed).

🔗 Graph

🤖 LLM usage

When to use: cost optimization, throughput-first tasks, LLM serving infrastructure design.
When not to use: strict <100 ms latency, online interactive workloads (single request).

Anti-patterns

  • Forcing batching onto online traffic: violates latency targets.
  • Static batching for LLMs: leaves the GPU idle.
  • Pushing batch size to the maximum until OOM: triggers a retry storm.
  • No maximum wait time: requests can be delayed indefinitely.
  • No monitoring: GPU utilization is unknown.
  • Stateful jobs on spot instances: progress is lost on interruption.
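
One way to avoid the OOM retry storm is to shrink the batch instead of retrying at the same size. The sketch below halves the batch size each time the model call fails with an out-of-memory error. It uses Python's built-in `MemoryError` as a stand-in; with a GPU framework you would catch that framework's own OOM exception. The function name and parameters are hypothetical.

```python
def predict_with_backoff(model, inputs, batch_size, min_batch=1):
    """Run model over inputs, halving the batch size whenever it runs out of memory."""
    results = []
    i = 0
    while i < len(inputs):
        try:
            results.extend(model(inputs[i:i + batch_size]))
            i += batch_size
        except MemoryError:
            if batch_size <= min_batch:
                raise  # cannot shrink further; surface the error
            batch_size = max(min_batch, batch_size // 2)
    return results


def toy_model(batch):
    # Pretend anything above 4 items does not fit in memory.
    if len(batch) > 4:
        raise MemoryError("batch too large")
    return [x * 2 for x in batch]


doubled = predict_with_backoff(toy_model, list(range(10)), batch_size=16)
print(doubled)
```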

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention |