Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

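Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections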

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

6.6 KiB


id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Batch Inference

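📌 One-line insight

"Group buying for the GPU." Batch inference gives up an immediate response to each single request in order to maximize throughput per batch. For LLMs, dynamic / continuous batching is reported to deliver roughly 5-20× the throughput of static batching, depending on the workload. It is the biggest lever in the cost / latency trade-off.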

📖 Core

Inference types

Type	Latency	Throughput	Cost	Examples
Online (sync)	<100ms	Low	High	Chat, search
Batch (offline)	Minutes to hours	Maximum	Lowest	Daily summaries, fraud scans
Async / queue	Seconds to minutes	Medium	Medium	Image generation, transcription

Batching types

Static batching (traditional)

The batch size is fixed.
Requests wait until the batch fills, so latency varies with the arrival rate.
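The waiting behaviour of static batching can be sketched with a small simulation. This is a minimal sketch, assuming a batch launches only when it is completely full; `static_batch_waits` and the arrival times are hypothetical and chosen only for illustration.

```python
from typing import List

def static_batch_waits(arrivals: List[float], batch_size: int) -> List[float]:
    """Per-request wait when a batch launches only once it is full.

    arrivals: sorted arrival times in seconds. Requests in an incomplete
    trailing batch never launch in this model, so their wait is infinite.
    """
    waits: List[float] = []
    for start in range(0, len(arrivals), batch_size):
        group = arrivals[start:start + batch_size]
        if len(group) < batch_size:
            waits.extend([float("inf")] * len(group))
        else:
            launch = group[-1]  # the batch starts when its last member arrives
            waits.extend(launch - t for t in group)
    return waits

waits = static_batch_waits([0.0, 0.1, 0.5, 2.0, 2.1], batch_size=2)
print(waits)  # [0.1, 0.0, 1.5, 0.0, inf]
```

The third request waits 1.5 s only because its batch partner arrived late, and the last request never runs. Bounding that wait is what dynamic batching adds.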

Dynamic batching (Triton)

A maximum wait time bounds the queueing delay.
Incoming requests are grouped into a batch as they arrive.
✅ Balances latency and throughput.
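The grouping rule can be written as a deterministic function over arrival timestamps. This is a sketch of the policy only, not Triton's scheduler; `dynamic_batches` and its parameters are hypothetical names. A batch closes when it is full or when the maximum wait since its first request has elapsed.

```python
from typing import List

def dynamic_batches(arrivals: List[float], max_batch: int, max_wait: float) -> List[List[float]]:
    """Group sorted arrival times into batches.

    A batch closes when it holds max_batch requests or when more than
    max_wait seconds separate a new arrival from the batch's first request.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrivals:
        if current and (len(current) == max_batch or t - current[0] > max_wait):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)  # a real server flushes this on the timer
    return batches

grouped = dynamic_batches([0.000, 0.001, 0.002, 0.020, 0.021], max_batch=4, max_wait=0.005)
print(grouped)  # [[0.0, 0.001, 0.002], [0.02, 0.021]]
```

With a 5 ms window, the three requests that arrive close together share one batch, and the two later ones form a second batch instead of stalling the first.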

Continuous batching (vLLM, TensorRT-LLM)

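Specific to LLMs (autoregressive generation).
When one sequence finishes, a waiting sequence immediately takes its slot.
Minimizes GPU idle time.
Reported gains of roughly 5-20× throughput over static batching, depending on the workload.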
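A toy step counter shows why refilling freed slots helps. This is a minimal sketch under strong assumptions: one decode step produces one token for every active sequence, prefill cost and memory limits are ignored, and every sequence generates at least one token. The function names and sequence lengths are hypothetical, and the resulting ratio is not a benchmark.

```python
from typing import List

def static_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_steps(lengths: List[int], slots: int) -> int:
    """Decode steps when a finished sequence's slot is refilled on the next step."""
    pending = list(lengths)      # sequences not yet admitted, FIFO order
    running: List[int] = []      # remaining tokens per active sequence
    steps = 0
    while pending or running:
        while pending and len(running) < slots:
            running.append(pending.pop(0))   # fill free slots immediately
        steps += 1                            # one decode step for all active sequences
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_steps(lengths, slots=4), continuous_steps(lengths, slots=4))  # 200 110
```

In the static case, three slots sit idle for 90 steps in each batch while the long sequence finishes. In the continuous case the second long sequence starts at step 10 instead of step 100.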

PagedAttention (vLLM)

Manages the KV cache in fixed-size blocks tracked by a page table.
Minimizes memory fragmentation.
Enables long contexts combined with large batches.
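The page-table idea can be illustrated with a toy allocator. This is not vLLM's implementation: `BlockAllocator`, its methods, and the block size of 16 tokens are assumptions made for this sketch, and it models only the bookkeeping (no attention kernel, no block sharing).

```python
from typing import Dict, List

class BlockAllocator:
    """Toy KV-cache allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}   # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}        # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:             # existing blocks are full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")      # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")      # 5 tokens need 1 block
print(len(alloc.tables["a"]), len(alloc.tables["b"]), len(alloc.free))  # 2 1 5
alloc.release("a")
print(len(alloc.free))  # 7
```

Because blocks are allocated on demand and need not be contiguous, waste is limited to the unfilled tail of each sequence's last block, rather than a full maximum-length reservation per sequence.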

Effect of batch size

Throughput: a larger batch raises GPU utilization.
Latency (per request): waiting time grows with batch size.
Memory: a larger batch raises the risk of OOM.
Sweet spot: the largest batch that fits both GPU memory and the latency SLA.
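The sweet-spot rule can be sketched as a search over candidate batch sizes. The linear latency and memory models below are assumptions, and every number is a made-up placeholder. In practice both curves should be measured on the target GPU.

```python
def pick_batch_size(max_candidate: int, gpu_mem_gb: float, weights_gb: float,
                    kv_gb_per_req: float, base_ms: float, ms_per_req: float,
                    sla_ms: float) -> int:
    """Largest batch size that fits in memory and meets the latency SLA (0 if none)."""
    best = 0
    for b in range(1, max_candidate + 1):
        fits_memory = weights_gb + kv_gb_per_req * b <= gpu_mem_gb   # assumed linear
        meets_sla = base_ms + ms_per_req * b <= sla_ms               # assumed linear
        if fits_memory and meets_sla:
            best = b
    return best

# Hypothetical 24 GB GPU, 16 GB of weights, 0.25 GB of KV cache per request.
latency_bound = pick_batch_size(64, 24, 16, 0.25, base_ms=20, ms_per_req=2, sla_ms=60)
memory_bound = pick_batch_size(64, 24, 16, 0.25, base_ms=20, ms_per_req=2, sla_ms=200)
print(latency_bound, memory_bound)  # 20 32
```

With a tight SLA the latency constraint binds first (20), and with a loose SLA memory becomes the limit (32).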

Applications of batch inference

Embedding generation: batches over a 100M-document corpus.
Summarization: daily news.
Fraud detection: nightly scans of transactions.
Recommendation: precomputed user-item scores.
Image classification (archive): medical images.
Translation (corpus): bulk documents.

Hybrid (modern LLM serving)

Mixes online (chat) and batch (precompute) workloads on the same serving stack.
A priority queue serves latency-sensitive requests first.
Streaming returns output progressively.
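The priority-queue part of a hybrid setup can be sketched with the standard library's `heapq`. The class name, the two priority levels, and the request strings are hypothetical. A production scheduler would also need a guard against starving batch work.

```python
import heapq
import itertools
from typing import Any, List, Tuple

ONLINE, BATCH = 0, 1   # lower value is served first

class PriorityScheduler:
    """Serve latency-sensitive requests before bulk requests, FIFO within each class."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()   # tie-breaker preserves arrival order

    def submit(self, request: Any, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_batch: int) -> List[Any]:
        batch: List[Any] = []
        while self._heap and len(batch) < max_batch:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

sched = PriorityScheduler()
sched.submit("bulk-1", BATCH)
sched.submit("bulk-2", BATCH)
sched.submit("chat-1", ONLINE)
sched.submit("bulk-3", BATCH)
sched.submit("chat-2", ONLINE)
first = sched.next_batch(3)
second = sched.next_batch(3)
print(first)   # ['chat-1', 'chat-2', 'bulk-1']
print(second)  # ['bulk-2', 'bulk-3']
```

Chat requests jump ahead of bulk work that arrived earlier, and leftover batch capacity is filled with bulk requests so the GPU stays busy.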

Monitoring

Throughput: token/s, request/s.
Latency: p50, p95, p99, TTFT (time to first token).
GPU util: target 70-90%.
Batch size distribution.
Queue depth.
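The latency and throughput metrics above can be computed from raw samples as follows. This is a minimal sketch using nearest-rank percentiles on synthetic data. The function names and the sample values are assumptions, and a real deployment would normally export these through its metrics system.

```python
import math
from typing import Dict, List

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(latencies_ms: List[float], tokens_out: int, window_s: float) -> Dict[str, float]:
    """Latency percentiles plus request and token throughput over one window."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "requests_per_s": len(latencies_ms) / window_s,
        "tokens_per_s": tokens_out / window_s,
    }

latencies = [float(i) for i in range(1, 101)]   # synthetic samples: 1..100 ms
stats = summarize(latencies, tokens_out=50_000, window_s=10.0)
print(stats)
```

The same `percentile` helper applies unchanged to TTFT samples. Tracking p95 and p99 alongside p50 matters because batching mostly affects the tail.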

💻 Patterns

vLLM offline batch

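from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
sampling = SamplingParams(temperature=0.7, max_tokens=512)

# Placeholder prompts (hypothetical); in practice, load your ~10K prompts here.
prompts = [f'Summarize document {i}.' for i in range(10_000)]
outputs = llm.generate(prompts, sampling)  # vLLM manages continuous batching internally
for out in outputs:
    print(out.outputs[0].text)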

vLLM online server (continuous batching)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.9

Ray Batch inference

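import ray
import ray.data

ds = ray.data.read_parquet('s3://my-bucket/data/')

class Predictor:
    def __init__(self):
        # load_model() is a placeholder: supply your own loader.
        # It runs once per actor, so each GPU worker loads the model once.
        self.model = load_model()

    def __call__(self, batch):
        # batch is a dict of column name -> NumPy array (Ray Data's default batch format)
        return {'pred': self.model(batch['features'])}

# 4 actors, 1 GPU each, fed batches of 64 rows
predictions = ds.map_batches(Predictor, batch_size=64, num_gpus=1, concurrency=4)
predictions.write_parquet('s3://my-bucket/predictions/')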

Triton dynamic batching

# config.pbtxt
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 64
dynamic_batching {
  max_queue_delay_microseconds: 5000  # 5ms
  preferred_batch_size: [16, 32, 64]
}

Custom batching (asyncio queue)

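import asyncio

class BatchQueue:
    """Micro-batcher: groups concurrent predict() calls into one model call."""

    def __init__(self, model, max_batch=32, max_wait_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()
        self._task = None  # started lazily, once an event loop is running

    async def predict(self, x):
        if self._task is None:
            # Keep a reference so the worker task is not garbage-collected.
            self._task = asyncio.create_task(self._loop())
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((x, future))
        return await future

    async def _loop(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives (no busy polling).
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            # Collect more requests until the batch is full or the wait expires.
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            xs = [x for x, _ in batch]
            try:
                # Run the blocking model call off the event loop thread.
                preds = await asyncio.to_thread(self.model, xs)
                for (_, future), p in zip(batch, preds):
                    if not future.done():
                        future.set_result(p)
            except Exception as exc:
                # Propagate failures so callers do not hang forever.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)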

Cost optimization (spot + batch)

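# Illustrative job settings (not a real API): batch jobs tolerate spot instances.
# With a 1-hour SLA, a spot interruption followed by a retry is acceptable.
config = {
    'instance_type': 'g5.2xlarge',
    'pricing': 'spot',  # often ~70% cheaper than on-demand; varies by region and instance type
    'max_runtime_min': 60,
    'retry_on_interrupt': True,
}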
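Retrying after a spot interruption is cheap only if the job can resume where it stopped. The following is a minimal sketch of a checkpointed batch loop, assuming progress can be expressed as a single item offset. `run_resumable`, the JSON checkpoint layout, and the simulated interruption are all hypothetical. On a real spot instance the checkpoint would have to live on durable storage, not local disk.

```python
import json
import os
import tempfile
from typing import Callable, List

def run_resumable(items: List[str], process: Callable[[List[str]], None],
                  checkpoint_path: str, batch_size: int = 2) -> int:
    """Process items in batches, recording progress after each batch.

    Returns the number of batches processed in this run. A rerun after an
    interruption skips everything already recorded in the checkpoint.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    batches_run = 0
    for start in range(done, len(items), batch_size):
        batch = items[start:start + batch_size]
        process(batch)
        # Write to a temp file and rename, so a kill mid-write cannot corrupt the checkpoint.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"done": start + len(batch)}, f)
        os.replace(tmp, checkpoint_path)
        batches_run += 1
    return batches_run

class Interrupted(Exception):
    """Stands in for a spot interruption in this demo."""

items = [f"doc-{i}" for i in range(6)]
seen: List[str] = []
calls = {"n": 0}

def flaky(batch: List[str]) -> None:
    calls["n"] += 1
    if calls["n"] == 3:
        raise Interrupted   # simulated interruption during the third batch
    seen.extend(batch)

path = os.path.join(tempfile.mkdtemp(), "progress.json")
try:
    run_resumable(items, flaky, path)
except Interrupted:
    pass
resumed = run_resumable(items, flaky, path)
print(seen, resumed)
```

The second run processes only the one batch that was in flight when the interruption hit. That batch may be processed twice in a real job, so the processing step should be idempotent.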

🤔 Decision criteria

Situation	Strategy
Chat / search	Continuous batching (vLLM)
Daily summary	Offline batch + spot
Embedding 100M docs	Ray + GPU batch
Image generation	Async queue + webhook
Nightly fraud scan	Batch + cheap GPU
Real-time API + bulk	Hybrid (priority queue)

Defaults: vLLM (LLM) / Triton (general) / Ray (distributed).

🔗 Graph

Variants: Continuous-Batching · Dynamic-Batching · Static-Batching
Applications: LLM_Optimization_and_Deployment_Strategies
Adjacent: KV-Cache · LLM_Optimization_and_Deployment_Strategies

🤖 LLM usage

When: cost optimization; throughput-first tasks; LLM serving infrastructure design.
When not: strict <100ms latency; online interactive use (single request).

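❌ Anti-patterns

Forcing online traffic into offline batches: violates the latency SLA.
Static batching for LLMs: leaves the GPU idle.
Pushing batch size to the maximum until OOM: triggers retry storms.
No max wait: requests can be delayed indefinitely.
No monitoring: GPU utilization is unknown.
Stateful jobs on spot instances: an interruption loses the state.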

🧪 Verification / duplicates

Verified (vLLM paper, NVIDIA Triton, Ray).
Confidence: A.
Related: LLM_Optimization_and_Deployment_Strategies · Continuous-Batching · GPU-Utilization · ML-Inference.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: batching types + vLLM + Triton + Ray + PagedAttention
