[G1-Sync] Manual knowledge update
---
id: ai-production-deploy
title: AI Production Deploy — vLLM / TGI / LangServe
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, deploy, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI"] }
applied_in: []
aliases: [vLLM, TGI, Text Generation Inference, LangServe, BentoML, GPU inference, model serving]
---

# AI Production Deploy

> Serving an LLM locally is simple; production serving comes down to GPU, batching, and caching. Main options: **vLLM (typically the fastest open-source engine), TGI (HuggingFace), LangServe (LangChain), Modal**.

## 📖 Core concepts

- Inference engine: determines the cost of every generated token.
- Batching: higher throughput.
- KV cache: reuses attention state across the context.
- Quantization: lower memory use.

## 💻 Code patterns

### vLLM (fastest)

```python
from vllm import LLM, SamplingParams

# Offline (in-process) batch inference. The Hugging Face repo ID for Llama 3 8B Instruct
# carries a "Meta-" prefix; the model is gated, so access must be granted first.
llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')

prompts = ['Hello, ', 'The capital of France is ']
params = SamplingParams(temperature=0.8, max_tokens=100)

# generate() batches all prompts internally and returns one result per prompt.
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)
```

### vLLM API server

```bash
# Start an OpenAI-compatible HTTP server.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000
```

```bash
# OpenAI-compatible
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
```

→ OpenAI API compatible, so it works as a drop-in replacement for existing OpenAI clients.

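Because the server speaks the OpenAI wire format, a client mostly needs a different base URL. The following is a minimal standard-library sketch that builds the same request as the curl call; `build_chat_request` is a hypothetical helper name, and the host, port, and model name are the ones assumed in this section.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The model name must match the one the server was started with (assumed here).
req = build_chat_request("http://localhost:8000", "meta-llama/Meta-Llama-3-8B-Instruct", "Hi")
print(req.get_method(), req.full_url)

# Sending requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Pointing the same helper at a different base URL is all it takes to switch between a self-hosted server and another OpenAI-compatible endpoint.
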
### vLLM strengths

```
- PagedAttention (efficient KV cache memory management).
- Continuous batching.
- Well suited to 24/7 serving.
- Among the fastest open-source engines.

→ Production default.
```

### Text Generation Inference (TGI)

```bash
# Gated models such as Llama 3 also need a Hugging Face token passed into the container.
docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```

```bash
curl http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hi", "parameters": {"max_new_tokens": 100}}'
```

→ HuggingFace native; the backend behind Inference Endpoints.

### Ollama (local dev)

```bash
ollama pull llama3
ollama run llama3 'Hello'
```

```bash
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3", "messages": [...]}'
```

→ For local dev and small use cases. Not for production.

### LangServe (LangChain)

```python
from fastapi import FastAPI
from langserve import add_routes

# my_chain is any LangChain Runnable (for example prompt | model | parser) defined elsewhere.
app = FastAPI()
add_routes(app, my_chain, path='/chain')
```

```bash
uvicorn main:app --host 0.0.0.0
```

→ Exposes a LangChain chain as a REST endpoint.

### BentoML

```python
import bentoml
from transformers import pipeline


@bentoml.service
class LLMService:
    def __init__(self) -> None:
        # Load the model once per worker via a transformers pipeline.
        self.pipe = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct')

    @bentoml.api
    def chat(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=100)[0]['generated_text']
```

```bash
bentoml serve service:LLMService
# Containerizing needs a built Bento; `llm` is assumed to be the Bento's name.
bentoml build
bentoml containerize llm:latest
```

→ The Docker image is generated automatically.

### Modal (managed, GPU)

```python
import modal

app = modal.App('llm')
image = modal.Image.debian_slim().pip_install('vllm')


@app.cls(gpu='A100', image=image)
class LLM:
    @modal.enter()
    def load(self):
        # Runs once per container start; vllm is imported here so it is only needed in the image.
        from vllm import LLM as VLLMEngine  # alias avoids shadowing this class's name
        self.llm = VLLMEngine(model='meta-llama/Meta-Llama-3-8B-Instruct')

    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.llm.generate([prompt])[0].outputs[0].text
```

→ Pay per GPU-second. Scaling is managed.

### Anyscale / Together / Replicate

```
Managed inference:
- Anyscale (Ray + vLLM).
- Together AI.
- Replicate.
- Hyperbolic.

→ Bring your own model, or call a hosted API.
```

→ An alternative to self-hosting.

### Quantization

```python
# 4-bit loading with bitsandbytes (GPTQ and AWQ are alternative 4-bit formats)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)
```

→ About 4x less weight memory than FP16, at a small cost in quality.

```
Llama-3-8B (approximate weight memory only; KV cache and activations are extra):
- FP16: 16 GB
- INT8: 8 GB
- INT4: 4 GB (fits a single consumer GPU)
- INT2 (extreme): 2 GB (large quality drop)
```

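The figures in the table follow from parameter count times bits per weight. A small sketch of that arithmetic, assuming a round 8 billion parameters and 1 GB = 1e9 bytes (`weight_memory_gb` is a hypothetical helper; real checkpoints carry some overhead for scales and non-quantized layers):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed round figure of 8e9 parameters for an 8B model.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(8e9, bits):.0f} GB")
```
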
### llama.cpp (CPU / Mac)

```bash
# GGUF format
./llama-cli -m model.gguf -p 'Hello'

# Or server
./llama-server -m model.gguf --port 8080
```

→ Works well on Apple Silicon Macs (M1/M2/M3). Low throughput.

### vLLM tensor parallelism

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4
```

→ Shards a large model across 4 GPUs.

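Why four GPUs: in FP16 the weights of a 70B model alone exceed a single 80 GB card, and tensor parallelism splits them roughly evenly. A rough sketch of the per-GPU share, assuming a round 70e9 parameters and ignoring KV cache and activations (`per_gpu_weight_gb` is a hypothetical helper):

```python
def per_gpu_weight_gb(n_params: float, bytes_per_weight: int, tensor_parallel_size: int) -> float:
    """Approximate weight memory per GPU in GB when weights are sharded evenly."""
    return n_params * bytes_per_weight / tensor_parallel_size / 1e9


print(per_gpu_weight_gb(70e9, 2, 1))  # whole model on one GPU
print(per_gpu_weight_gb(70e9, 2, 4))  # sharded across 4 GPUs
```

The remaining memory on each GPU is what is left for the KV cache, which is why a larger tensor-parallel size can also raise the number of concurrent requests.
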
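### Speculative decoding (faster)

```python
from vllm import LLM

# Larger target model + smaller draft model.
# Argument names vary by vLLM version: older releases take speculative_model and
# num_speculative_tokens; newer ones take a speculative_config dict instead.
llm = LLM(
    model='meta-llama/Meta-Llama-3-70B',
    speculative_model='meta-llama/Meta-Llama-3-8B',
    num_speculative_tokens=5,
)
```

→ The small model drafts tokens and the large model verifies them. Speedups of 2-3x are commonly cited, depending on how often drafts are accepted.
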
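The draft-then-verify loop can be illustrated with toy models. This is a sketch of the greedy variant only, not vLLM's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the large and small model, and the verification loop stands in for what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'large' model: next token is the context sum mod 10."""
    return sum(ctx) % 10


def draft_next(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target unless len(ctx) is a multiple of 4."""
    return (sum(ctx) + (1 if len(ctx) % 4 == 0 else 0)) % 10


def greedy_generate(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    prompt: List[int], n_new: int, k: int, draft: NextToken, target: NextToken
) -> Tuple[List[int], int]:
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2) Target checks every proposed position (one pass in a real engine).
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            accepted.append(target(out + accepted))  # all k accepted: bonus token
        out.extend(accepted)
    return out[: len(prompt) + n_new], target_passes


spec_out, passes = speculative_generate([1, 2, 3], 12, 4, draft_next, target_next)
print(spec_out == greedy_generate([1, 2, 3], 12, target_next), passes)
```

The output matches plain greedy decoding with the target model, but the target is invoked far fewer times than once per token; the gain shrinks as the draft disagrees more often.
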
### Continuous batching

```
Naive: request 1 → process → request 2 → process.
Continuous: every decoding step serves whichever requests are in flight; finished requests leave and new ones join mid-batch.

→ Default in vLLM / TGI. Raises GPU utilization.
```

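A toy simulation makes the difference concrete. It counts decoding steps for a queue of requests with different output lengths, comparing fixed batches (each batch waits for its longest request) against continuous batching (a freed slot is refilled on the next step). The request lengths and batch size are made-up numbers, and one step is assumed to produce one token per in-flight request.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i : i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active: List[int] = []  # remaining tokens per in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decoding step emits one token per active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [2, 10, 3, 9, 2, 2, 8, 1]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

With these numbers the continuous scheduler needs 22 steps where fixed batches need 29, because short requests no longer hold a slot idle while a long neighbor finishes.
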
### KV cache

```
Every generated token requires an attention computation over all previous tokens.
The K and V tensors of previous tokens are cached so they are not recomputed.

Longer prompt = larger KV cache = more memory.
```

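How much memory the cache takes can be estimated from the model shape. A sketch under stated assumptions: the layer count, KV-head count, and head dimension below are the commonly published Llama-3-8B values (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and should be checked against the model's config; FP16 storage (2 bytes per value) is assumed.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed Llama-3-8B shape: 32 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token // 1024, "KiB per token")
print(full_context / 2**30, "GiB for one 8192-token sequence")
```

Under these assumptions one full-length sequence costs about 1 GiB, so the number of concurrent long requests is bounded by whatever memory remains after the weights.
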
### Prompt caching

```
The same system prompt is sent repeatedly.
- vLLM prefix cache.
- Anthropic / OpenAI prompt cache (up to about 90% lower cost on cached input tokens).
```

→ See [[AI_Prompt_Caching]].

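The idea behind a prefix cache can be shown with a toy model: the prompt is cut into fixed-size blocks, each block is keyed by the entire prefix up to it, and a new request reuses the longest run of leading blocks already seen. This is an illustration only, not vLLM's implementation; `PrefixCache` and its string placeholders for KV blocks are hypothetical.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # prefix tokens -> placeholder standing in for a cached KV block
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def lookup_and_fill(self, tokens: List[int]) -> int:
        """Return how many leading tokens hit the cache; cache any new full blocks."""
        reused = 0
        still_hitting = True
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])  # a block is only valid for its exact prefix
            if still_hitting and key in self.blocks:
                reused = end
            else:
                still_hitting = False
                self.blocks[key] = f"kv[{end - self.block_size}:{end}]"
        return reused


cache = PrefixCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical token IDs shared by all requests
first = cache.lookup_and_fill(system_prompt + [9, 10, 11, 12])
second = cache.lookup_and_fill(system_prompt + [20, 21, 22, 23])
print(first, second)
```

The first request finds nothing cached; the second skips prefill for the eight shared system-prompt tokens and only computes its own suffix.
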
### Multi-tenant

```
1 model + N users:
- Per-user permissions.
- Per-user rate limits.
- Per-user logging.

→ vLLM runs as a single shared instance.
Per-user isolation is handled at the gateway level.
```

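One way to enforce per-user rate limits at the gateway is a token bucket per user ID, sketched below. The class names, rate, and capacity are hypothetical, and the injectable clock exists only to make the example deterministic; a real gateway would also persist state and authenticate the user first.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float, clock: Callable[[], float]) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Gateway:
    """Sits in front of the shared model server and keeps one bucket per user."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.capacity, self.clock = rate_per_sec, capacity, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def admit(self, user_id: str) -> bool:
        if user_id not in self.buckets:
            self.buckets[user_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[user_id].allow()


fake_now = [0.0]  # controllable clock for the demo
gw = Gateway(rate_per_sec=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [gw.admit("alice") for _ in range(3)]
other_user = gw.admit("bob")
fake_now[0] = 1.0
after_refill = gw.admit("alice")
print(burst, other_user, after_refill)
```

Alice's third request in the burst is rejected while Bob is unaffected, which is the isolation the shared vLLM instance does not provide on its own.
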
### Per-user model

```
Per-user fine-tuned models:
- Only the LoRA adapter differs.
- The base model is shared.
- vLLM supports Multi-LoRA.

python -m vllm.entrypoints.openai.api_server --model <base-model> --enable-lora --lora-modules user1=path1 user2=path2
```

### Cost

```
GPU rental:
- A100 80GB: $1-3 / hour.
- H100: $3-6 / hour.

Self-hosted hardware:
- A100 server: $30k+ up front, amortized monthly.

API:
- GPT-4o: $2.5 / MTok in, $10 / MTok out.
- Claude Opus: $15 / MTok in, $75 / MTok out.
- Llama-3-8B (Together): $0.20 / MTok.

→ High, steady traffic: self-hosting becomes viable.
Low or variable traffic: use an API.
```

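The self-host versus API decision reduces to a break-even throughput. A sketch of the arithmetic using figures from the list ($2/hour for a rented GPU, $0.20 per million tokens for a hosted Llama-3-8B); the helper names are hypothetical, the 1000 tokens/s throughput is an assumed example, and engineering and idle time are ignored.

```python
def self_host_cost_per_mtok(gpu_usd_per_hour: float, tokens_per_sec: float) -> float:
    """Effective $ per million tokens for a GPU kept busy at tokens_per_sec."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6


def break_even_tokens_per_sec(gpu_usd_per_hour: float, api_usd_per_mtok: float) -> float:
    """Sustained throughput at which the rented GPU costs the same as the API."""
    return gpu_usd_per_hour / api_usd_per_mtok * 1e6 / 3600


print(round(self_host_cost_per_mtok(2.0, 1000), 3))   # $/MTok at an assumed 1000 tok/s
print(round(break_even_tokens_per_sec(2.0, 0.20)))    # tok/s needed to match $0.20/MTok
```

Against a cheap hosted small model the GPU has to sustain thousands of tokens per second around the clock to break even, whereas against a frontier API priced in dollars per million tokens the threshold is far lower.
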
### Latency target

```
Chat: < 2 s to first token, < 50 ms per token after that.
Completion: just needs to be fast.
Search: < 200 ms (use a low-latency model).

→ Trade-off between model size, GPU, and batching.
```

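The chat targets translate into an end-to-end budget once an output length is fixed. A small sketch, assuming a 200-token reply as an example (`response_time_s` is a hypothetical helper):

```python
def response_time_s(ttft_s: float, per_token_s: float, n_output_tokens: int) -> float:
    """First token arrives after ttft_s; each later token adds per_token_s."""
    return ttft_s + per_token_s * max(0, n_output_tokens - 1)


worst_case = response_time_s(ttft_s=2.0, per_token_s=0.050, n_output_tokens=200)
print(f"{worst_case:.2f} s for 200 tokens at the target limits")
print(f"{1 / 0.050:.0f} tokens/s per stream")
```

At the stated limits a 200-token answer takes about 12 seconds in total, which is why streaming matters for perceived latency even when the targets are met.
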
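### Streaming

```python
# OpenAI-compatible; base_url points at the vLLM server, api_key is a placeholder.
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

async def stream_chat(model: str, messages: list):
    stream = await client.chat.completions.create(model=model, messages=messages, stream=True)
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

→ Lowers user-perceived latency.
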
### Failover

```
Primary: vLLM (fastest).
Fallback: API (managed).

→ Fall back to the API when the primary is down.
```

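A minimal sketch of the primary/fallback pattern: backends are tried in order and the first successful answer wins. The backend callables here are hypothetical stand-ins; a real setup would wrap HTTP clients with timeouts and probably add health checks so a dead primary is skipped quickly.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Tuple[str, Callable[[str], str]]


def generate_with_failover(prompt: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, text) from the first that works."""
    errors: List[str] = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def self_hosted_vllm(prompt: str) -> str:
    """Stand-in for the primary; simulates an outage."""
    raise ConnectionError("primary unreachable")


def managed_api(prompt: str) -> str:
    """Stand-in for the managed fallback."""
    return "echo: " + prompt


served_by, text = generate_with_failover(
    "hi", [("vllm", self_hosted_vllm), ("managed-api", managed_api)]
)
print(served_by, text)
```
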
### Eval at deploy

```
Before deploying a new version:
- Latency benchmark.
- Quality eval (golden set).
- Compare against the current version.

→ Prevents regressions.
```

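The comparison step can be turned into an automated gate. The sketch below assumes two metrics per version (p99 latency and golden-set pass rate) and hypothetical tolerances of 10% latency regression and one percentage point of accuracy; the names and thresholds are illustrative, not prescribed by any tool.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    p99_latency_ms: float
    golden_accuracy: float  # fraction of golden-set cases passed, 0.0 to 1.0


def can_deploy(
    current: EvalResult,
    candidate: EvalResult,
    max_latency_regression: float = 0.10,
    max_accuracy_drop: float = 0.01,
) -> bool:
    """Allow deploy only if the candidate is not meaningfully slower or worse."""
    latency_ok = candidate.p99_latency_ms <= current.p99_latency_ms * (1 + max_latency_regression)
    quality_ok = candidate.golden_accuracy >= current.golden_accuracy - max_accuracy_drop
    return latency_ok and quality_ok


current = EvalResult(p99_latency_ms=800, golden_accuracy=0.92)
print(can_deploy(current, EvalResult(850, 0.93)))   # slightly slower, better quality
print(can_deploy(current, EvalResult(700, 0.85)))   # faster, but quality regressed
```
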
### Monitoring

```
- Latency (p50, p99).
- Throughput (tokens / sec).
- GPU utilization.
- Memory.
- Error rate.
- Cost per 1M tokens.

→ Datadog / Grafana / Helicone.
```

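For reference, this is how p50/p99 and throughput are derived from raw request logs. The sketch uses the nearest-rank percentile definition (monitoring backends may interpolate differently), and the sample latencies are made up.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tokens_per_second(total_tokens: int, wall_seconds: float) -> float:
    return total_tokens / wall_seconds


latencies_ms = [110, 120, 125, 130, 900]  # hypothetical request latencies
print(percentile(latencies_ms, 50), percentile(latencies_ms, 99))
print(tokens_per_second(6000, 4.0))
```

The single 900 ms outlier leaves the median untouched but dominates p99, which is why both are tracked.
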
### Scaling

```
Horizontal: more instances.
Vertical: bigger GPU.
Quantize: smaller memory footprint.
Cache: higher hit rate.

→ Auto-scale (Modal / K8s + KEDA).
```

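Horizontal scaling comes down to a replica count derived from load. A sketch of the calculation an autoscaler effectively performs, assuming throughput is the scaling metric and a 70% utilization target to leave headroom; the numbers and the helper name are hypothetical.

```python
import math


def replicas_needed(
    target_tokens_per_sec: float,
    per_replica_tokens_per_sec: float,
    target_utilization: float = 0.7,
) -> int:
    """Replicas required so each one runs at or below the utilization target."""
    usable = per_replica_tokens_per_sec * target_utilization
    return max(1, math.ceil(target_tokens_per_sec / usable))


print(replicas_needed(5000, 1500))        # peak load
print(replicas_needed(100, 1500))         # quiet period, never below one replica
```
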
### Production stack example

```
Cloudflare Workers (gateway)
        ↓
Anthropic / OpenAI (API): 90% of traffic
        ↓ (failover or cost-sensitive)
Self-host vLLM (GPU cluster): 10%

→ Mix managed APIs with self-hosting.
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| High traffic | vLLM cluster |
| Low / variable traffic | API (Anthropic / OpenAI) |
| HuggingFace ecosystem | TGI |
| Local dev | Ollama |
| Mac | llama.cpp |
| Managed self-host | Modal / Anyscale |
| LangChain | LangServe |
| Multi-LoRA | vLLM with LoRA |

## ❌ Anti-patterns

- **Ollama in production**: insufficient throughput.
- **No batching**: the GPU sits idle.
- **No quantization (small GPU)**: OOM.
- **No streaming**: users wait for the full response.
- **No prompt cache**: cost explodes.
- **Single instance + no failover**: an outage takes the service down.
- **No eval at deploy**: regressions.

## 🤖 Hints for LLM use

- vLLM is generally the fastest open-source option.
- TGI is HuggingFace native.
- Modal / Anyscale are managed self-hosting.
- Mix API + self-host.

## 🔗 Related documents

- [[AI_Local_LLM_Inference]]
- [[AI_LLM_Cost_Optimization]]
- [[MLOps_Model_Registry]]