| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-production-deploy | AI Production Deploy: vLLM / TGI / LangServe | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
AI Production Deploy
Serving LLMs locally is simple. Options: vLLM (the fastest), TGI (Hugging Face), LangServe (LangChain), Modal. Ingredients: GPU + batching + cache.
📖 Core concepts
- Inference engine: determines the cost of every token.
- Batching = higher throughput.
- KV cache = context reuse.
- Quantization = memory ↓.
💻 Code patterns
vLLM (fastest)
from vllm import LLM, SamplingParams
# Load the model once; generate() batches all prompts internally
llm = LLM(model='meta-llama/Llama-3-8B-Instruct')
prompts = ['Hello, ', 'The capital of France is ']
params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)  # first completion for each prompt
vLLM API server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --host 0.0.0.0 --port 8000
# OpenAI-compatible
curl http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
→ OpenAI API compatible. Drop-in replacement.
vLLM strengths
- PagedAttention (efficient KV cache memory management).
- Continuous batching.
- Suited to 24/7 serving.
- Fastest among open-source engines.
→ Production default.
Text Generation Inference (TGI)
docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Llama-3-8B-Instruct
curl http://localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hi", "parameters": {"max_new_tokens": 100}}'
→ Hugging Face native. The backend behind Inference Endpoints.
Ollama (local dev)
ollama pull llama3
ollama run llama3 'Hello'
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
    -d '{"model": "llama3", "messages": [...]}'
→ Local dev and small use cases. Not for production.
LangServe (LangChain)
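from fastapi import FastAPI
from langserve import add_routes
app = FastAPI()
# my_chain: any LangChain Runnable defined elsewhere (placeholder from the original)
add_routes(app, my_chain, path='/chain')
# Shell: start the server (assumes the code above is saved as main.py)
uvicorn main:app --host 0.0.0.0
→ Exposes a LangChain chain as a REST endpoint.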
BentoML
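import bentoml
from transformers import pipeline

@bentoml.service(resources={'gpu': 1})
class LLMService:
    def __init__(self):
        # Load the model once per worker. A transformers pipeline is used here
        # as a stand-in for the original model-loading call.
        self.pipe = pipeline('text-generation', model='meta-llama/Llama-3-8B-Instruct')

    @bentoml.api
    def chat(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=100)[0]['generated_text']

# Shell: serve locally, then build a bento and containerize it
# (llm:latest assumes the bento is named "llm" in your build config)
bentoml serve service:LLMService
bentoml build
bentoml containerize llm:latest
→ The Docker image is generated automatically.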
Modal (managed, GPU)
import modal
app = modal.App('llm')
image = modal.Image.debian_slim().pip_install('vllm')

@app.cls(gpu='A100', image=image)
class LLM:
    @modal.enter()
    def load(self):
        # Runs once per container start; aliased to avoid shadowing this class name
        from vllm import LLM as VLLMEngine
        self.llm = VLLMEngine(model='meta-llama/Llama-3-8B-Instruct')

    @modal.method()
    def generate(self, prompt: str):
        return self.llm.generate([prompt])[0].outputs[0].text
→ Pay per GPU-second. Managed scaling.
Anyscale / Together / Replicate
Managed inference:
- Anyscale (Ray + vLLM).
- Together AI.
- Replicate.
- Hyperbolic.
→ Bring your own model or use their API.
→ An alternative to self-hosting.
Quantization
# 4-bit via bitsandbytes (GPTQ / AWQ are alternative 4-bit methods)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model_id = 'meta-llama/Llama-3-8B-Instruct'
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype='float16')
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)
→ ~4x less memory than FP16. Slight quality drop.
Llama-3-8B (weights only, approximate):
- FP16: 16 GB
- INT8: 8 GB
- INT4: 4 GB (fits a single consumer GPU)
- INT2 (extreme): 2 GB (large quality drop)
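The figures above follow from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming roughly 8 billion parameters and counting weights only (KV cache, activations, and quantization metadata come on top); the function name is illustrative:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

LLAMA_3_8B_PARAMS = 8e9  # rounded parameter count; assumption for illustration

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_memory_gb(LLAMA_3_8B_PARAMS, bits):.0f} GB")
```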
llama.cpp (CPU / Mac)
# GGUF format
./llama-cli -m model.gguf -p 'Hello'
# Or server
./llama-server -m model.gguf --port 8080
→ Works well on Apple Silicon Macs (M1/M2/M3). Low throughput.
vLLM tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70B \
    --tensor-parallel-size 4
→ A large model is sharded across 4 GPUs.
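Speculative decoding (faster)
from vllm import LLM
# Larger target model + smaller draft model.
# Argument names follow older vLLM releases; newer releases take a speculative_config dict instead.
llm = LLM(
    model='meta-llama/Llama-3-70B',
    speculative_model='meta-llama/Llama-3-8B',
    num_speculative_tokens=5,  # draft tokens proposed per step
)
→ The small model drafts tokens, the large model verifies them. Roughly 2-3x faster.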
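To make the draft/verify loop concrete, here is a toy sketch of greedy speculative decoding. It is not vLLM's implementation: the two "models" are hypothetical next-token functions, and the verification loop stands in for what a real engine does in a single forward pass of the large model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in for a model)

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < max_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position; counted as ONE pass here
        #    because a real engine scores all positions in a single forward pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(seq + proposal[:i])
            if expected != proposal[i]:
                accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            accepted.append(proposal[i])
        else:
            accepted.append(target(seq + proposal))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[:len(prompt) + max_new], target_passes

# Hypothetical toy models: the target counts up modulo 10; the draft is wrong after a 4.
def target_model(s: List[int]) -> int:
    return (s[-1] + 1) % 10

def draft_model(s: List[int]) -> int:
    return 0 if s[-1] == 4 else (s[-1] + 1) % 10

tokens, passes = speculative_decode(target_model, draft_model, [0], max_new=12, k=3)
print(tokens, passes)
```

The output is identical to decoding with the target alone, but in this toy run it takes 4 target passes instead of 12. Real speedups depend on how often the draft agrees with the target.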
Continuous batching
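Naive: request 1 → process → request 2 → process.
Continuous: every decode step batches tokens from different requests together; finished requests leave and queued ones join between steps.
→ Default in vLLM / TGI. GPU utilization ↑.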
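A small simulation can show why this matters. The sketch below counts decode steps for a fixed batch size under static batching (a batch runs until its longest request finishes) versus continuous batching (a freed slot is refilled immediately). The request lengths are hypothetical, and real schedulers also account for prefill and memory limits.

```python
from collections import deque
from typing import List

def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request frees its slot immediately for a queued one."""
    queue = deque(lengths)
    running: List[int] = []  # remaining decode steps per active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode step produces one token for every active request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10, 100, 10, 10, 10]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```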
KV cache
Every generated token requires an attention computation over all previous tokens.
The K and V of previous tokens are cached, so they are not recomputed.
Longer prompt = larger KV cache = more memory.
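The memory growth can be estimated directly. A rough sketch, assuming a Llama-3-8B-style shape (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and FP16 cache values; these shape numbers are assumptions taken from the published model configuration, not from this note:

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """K and V tensors for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128)
print(per_token // 1024, "KiB per token")                             # 128
print(kv_cache_bytes(8192, 32, 8, 128) / 2**30, "GiB at 8192 tokens")  # 1.0
```

Under these assumptions a single 8192-token sequence needs about 1 GiB of cache, which is why long prompts and large batches compete for the same GPU memory.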
Prompt caching
The same system prompt is sent repeatedly.
- vLLM prefix caching.
- Anthropic / OpenAI prompt caching (up to 90% lower cost on cached input tokens).
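The idea behind prefix caching can be sketched in a few lines: split the token sequence into fixed-size blocks, key each block by the whole prefix up to it, and reuse whatever leading blocks have been seen before. This is a toy model of the concept, not vLLM's actual data structure; the class name and block size are made up for illustration.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cache block (kept tiny here for readability)

class PrefixCache:
    def __init__(self) -> None:
        self.blocks: Set[Tuple[int, ...]] = set()

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        reused, still_matching = 0, True
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])  # a block is identified by the entire prefix up to it
            if still_matching and key in self.blocks:
                reused = end
            else:
                still_matching = False
                self.blocks.add(key)
        return reused

cache = PrefixCache()
system_prompt = list(range(8))  # hypothetical token ids shared by every request
print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103]))  # 0 (cold cache)
print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))  # 8 (system prompt reused)
```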
Multi-tenant
1 model + N users:
- Per-user permissions.
- Per-user rate limits.
- Per-user logging.
→ vLLM runs as a single shared instance.
Per-user isolation = gateway level.
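As one illustration of gateway-level isolation, a per-user rate limit can be implemented as a token bucket keyed by user id. This is a minimal in-memory sketch; the class name and parameters are hypothetical, and a real gateway would typically keep this state in a shared store.

```python
import time
from collections import defaultdict

class PerUserRateLimiter:
    """Token bucket per user: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = defaultdict(lambda: float(capacity))  # new users start with a full bucket
        self.last_seen = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        elapsed = now - self.last_seen.get(user_id, now)
        self.last_seen[user_id] = now
        # Refill for the elapsed time, capped at the bucket capacity.
        self.tokens[user_id] = min(self.capacity, self.tokens[user_id] + elapsed * self.rate)
        if self.tokens[user_id] >= 1:
            self.tokens[user_id] -= 1
            return True
        return False

limiter = PerUserRateLimiter(rate=1.0, capacity=2)
print([limiter.allow("user1") for _ in range(3)])  # the third call in a burst is rejected
```

The gateway would call `allow(user_id)` before forwarding a request to the shared vLLM instance and return HTTP 429 when it is rejected.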
Per-user model
Fine-tuned model per user:
- Only the LoRA adapter differs.
- The base model is shared.
- vLLM supports Multi-LoRA.
vllm serve meta-llama/Llama-3-8B-Instruct --enable-lora --lora-modules user1=path1 user2=path2
Cost
GPU rental:
- A100 80GB: $1-3 / hour.
- H100: $3-6 / hour.
Self-hosted:
- A100 server: $30k+, amortized monthly.
API:
- GPT-4o: $2.5 / MTok in, $10 / MTok out.
- Claude Opus: $15 / MTok in, $75 / MTok out.
- Llama-3-8B (Together): $0.20 / MTok.
→ High traffic = self-hosting becomes viable.
Low / variable traffic = API.
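The break-even point can be estimated from the prices above. A back-of-the-envelope sketch, assuming a $2/hour A100 (midpoint of the rental range) against the $0.20/MTok hosted Llama-3-8B price, and ignoring engineering and idle time; the function name is illustrative:

```python
def breakeven_tokens_per_second(gpu_usd_per_hour: float, api_usd_per_mtok: float) -> float:
    """Sustained throughput at which a rented GPU costs the same as the API."""
    mtok_per_hour = gpu_usd_per_hour / api_usd_per_mtok
    return mtok_per_hour * 1_000_000 / 3600

print(round(breakeven_tokens_per_second(2.0, 0.20)))  # 2778
```

Below roughly 2,800 tokens per second of sustained traffic, the API is cheaper under these assumptions; above it, the rented GPU starts to pay off if it can actually deliver that throughput.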
Latency target
Chat: < 2 s to first token, < 50 ms per token after that.
Completion: fast is good enough; no strict target.
Search: < 200 ms (low-latency model).
→ A trade-off between model size, GPU, and batching.
Streaming
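# OpenAI-compatible (client is an openai.AsyncOpenAI instance); runs inside an async generator
stream = await client.chat.completions.create(..., stream=True)
async for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no content (e.g. the final one)
        yield delta
→ User-perceived latency ↓.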
Failover
Primary: vLLM (fastest).
Fallback: API (managed).
→ Fall back when the primary is down.
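The primary/fallback pattern can be sketched as an ordered list of backends tried in turn. The backend functions below are hypothetical stand-ins for a self-hosted vLLM client and a managed API client; a production version would add timeouts and catch specific error types.

```python
from typing import Callable, Sequence

def generate_with_failover(prompt: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    errors = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # broad on purpose in this sketch
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")

def vllm_backend(prompt: str) -> str:
    raise ConnectionError("vLLM instance is down")  # simulate an outage

def api_backend(prompt: str) -> str:
    return f"[api] reply to: {prompt}"

print(generate_with_failover("Hi", [vllm_backend, api_backend]))  # [api] reply to: Hi
```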
Eval at deploy
Before deploying a new version:
- Latency benchmark.
- Quality eval (golden set).
- Compare vs current.
→ Prevents regressions.
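The checklist above can be turned into an automated gate. A minimal sketch, assuming each version's benchmark run produces a p99 latency and a golden-set score; the metric names and thresholds are hypothetical and should be tuned per service.

```python
def passes_deploy_gate(current: dict, candidate: dict,
                       max_latency_regression: float = 0.10,
                       max_quality_drop: float = 0.01) -> bool:
    """Allow deploy only if latency and quality stay within tolerance of the current version."""
    latency_ok = (candidate["p99_latency_ms"]
                  <= current["p99_latency_ms"] * (1 + max_latency_regression))
    quality_ok = (candidate["golden_set_score"]
                  >= current["golden_set_score"] - max_quality_drop)
    return latency_ok and quality_ok

current = {"p99_latency_ms": 800, "golden_set_score": 0.90}
candidate = {"p99_latency_ms": 850, "golden_set_score": 0.91}
print(passes_deploy_gate(current, candidate))  # True
```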
Monitoring
- Latency (p50, p99).
- Throughput (tokens / sec).
- GPU utilization.
- Memory.
- Error rate.
- Cost per 1M tokens.
→ Datadog / Grafana / Helicone.
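For reference, p50 and p99 are percentiles over the recorded request latencies. A small sketch using the nearest-rank method on hypothetical samples; monitoring tools compute the same thing over sliding windows, often with approximations.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile, with p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

latencies_ms = [120, 135, 110, 150, 900, 125, 130, 140, 115, 145]  # hypothetical samples
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))  # 130 900
```

The single 900 ms outlier does not move p50 but dominates p99, which is why both are tracked.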
Scaling
Horizontal: more instances.
Vertical: bigger GPUs.
Quantize: smaller memory footprint.
Cache: hit rate ↑.
→ Auto-scale (Modal / K8s + KEDA).
Production stack example
Cloudflare Workers (gateway)
↓
Anthropic / OpenAI (API) — 90% traffic
↓ (failover or cost-sensitive)
Self-host vLLM (GPU cluster) — 10%
→ Mix.
🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| High traffic | vLLM cluster |
| Low / variable traffic | API (Anthropic / OpenAI) |
| HuggingFace | TGI |
| Local dev | Ollama |
| Mac | llama.cpp |
| Managed self-host | Modal / Anyscale |
| LangChain | LangServe |
| Multi-LoRA | vLLM with LoRA |
❌ Anti-patterns
- Ollama in production: insufficient throughput.
- No batching: the GPU sits idle.
- No quantization (on a small GPU): OOM.
- No streaming: users wait for the full response.
- No prompt cache: costs explode.
- Single instance + no failover: an outage takes the service down.
- No eval at deploy: regressions.
🤖 Hints for LLM use
- vLLM is the fastest open-source option.
- TGI is Hugging Face native.
- Modal / Anyscale offer managed self-hosting.
- Mix API + self-host.