[G1-Sync] Manual knowledge update

This commit is contained in:
Antigravity Agent
2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -0,0 +1,380 @@
---
id: ai-production-deploy
title: AI Production Deploy — vLLM / TGI / LangServe
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, deploy, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI"] }
applied_in: []
aliases: [vLLM, TGI, Text Generation Inference, LangServe, BentoML, GPU inference, model serving]
---
# AI Production Deploy
> Local LLM serving is simple. Production options: **vLLM (fastest), TGI (HuggingFace), LangServe (LangChain), Modal**. GPU + batching + cache.
## 📖 Key Concepts
- Inference engine: determines the cost of every token.
- Batching = higher throughput.
- KV cache = context reuse.
- Quantization = memory ↓.
## 💻 Code Patterns
### vLLM (fastest)
```python
from vllm import LLM, SamplingParams

# Offline (batch) inference: load the model once, then generate for all prompts together.
llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
prompts = ['Hello, ', 'The capital of France is ']
params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)
```
### vLLM API server
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000
```
```bash
# OpenAI-compatible
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
```
→ OpenAI API compatible. Drop-in replacement for OpenAI clients.
### vLLM strengths
```
- PagedAttention (efficient KV cache).
- Continuous batching.
- Well suited to 24/7 serving.
- Fastest (among open-source engines).
→ Production default.
```
### Text Generation Inference (TGI)
```bash
docker run --gpus all -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```
```bash
curl http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hi", "parameters": {"max_new_tokens": 100}}'
```
→ HuggingFace native. The backend behind Inference Endpoints.
### Ollama (local dev)
```bash
ollama pull llama3
ollama run llama3 'Hello'
```
```bash
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
-d '{"model": "llama3", "messages": [...]}'
```
→ For local dev and small use cases. Not for production.
### LangServe (LangChain)
```python
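from fastapi import FastAPI
from langchain_core.runnables import RunnableLambda
from langserve import add_routes

# Placeholder chain for illustration; replace with your own Runnable.
my_chain = RunnableLambda(lambda text: text.upper())
app = FastAPI()
add_routes(app, my_chain, path='/chain')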
```
```bash
uvicorn main:app --host 0.0.0.0
```
→ Exposes a LangChain chain as a REST endpoint.
### BentoML
```python
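import bentoml
from transformers import pipeline

@bentoml.service
class LLMService:
    def __init__(self) -> None:
        # Load the model once per worker, not once per request.
        self.pipe = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct')

    @bentoml.api
    def chat(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=100)[0]['generated_text']
```
```bash
bentoml serve service:LLMService
# Requires a Bento built beforehand with `bentoml build`; `llm:latest` is that Bento's tag.
bentoml containerize llm:latest
```
→ The Docker image is generated automatically.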
### Modal (managed, GPU)
```python
import modal

app = modal.App('llm')
image = modal.Image.debian_slim().pip_install('vllm')

@app.cls(gpu='A100', image=image)
class LLM:
    @modal.enter()
    def load(self):
        # Runs once per container start, so the model is loaded only once.
        from vllm import LLM as VLLMEngine  # alias avoids clashing with this class name
        self.llm = VLLMEngine(model='meta-llama/Meta-Llama-3-8B-Instruct')

    @modal.method()
    def generate(self, prompt: str):
        return self.llm.generate([prompt])[0].outputs[0].text
```
→ Pay per GPU-second. Managed scaling.
### Anyscale / Together / Replicate
```
Managed inference:
- Anyscale (Ray + vLLM).
- Together AI.
- Replicate.
- Hyperbolic.
→ Bring your own model, or use their API.
```
→ An alternative to self-hosting.
### Quantization
```python
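# 4-bit loading with bitsandbytes (GPTQ and AWQ are alternative 4-bit formats)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype='float16')
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)
```
→ About 4x less memory than FP16. Slight quality loss.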
```
Llama-3-8B (weights only, approximate):
- FP16: 16 GB
- INT8: 8 GB
- INT4: 4 GB (fits a single consumer GPU)
- INT2 (extreme): 2 GB (large quality loss)
```
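The figures above follow from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming 8e9 parameters and counting weights only (no KV cache, activations, or quantization metadata); `weight_memory_gb` is a hypothetical helper, not a library function:

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: Llama-3-8B rounded to 8e9 parameters.
for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_memory_gb(8e9, bits):.0f} GB")
```

Real deployments need headroom on top of this for the KV cache and activations.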
### llama.cpp (CPU / Mac)
```bash
# GGUF format
./llama-cli -m model.gguf -p 'Hello'
# Or server
./llama-server -m model.gguf --port 8080
```
→ Works well on Apple Silicon (M1/M2/M3). Low throughput.
### vLLM tensor parallelism
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4
```
→ Shards a large model across 4 GPUs.
### Speculative decoding (faster)
```python
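from vllm import LLM

# Larger target model + smaller draft model.
# Argument names vary across vLLM versions (newer releases take a `speculative_config` dict).
llm = LLM(
    model='meta-llama/Meta-Llama-3-70B',
    speculative_model='meta-llama/Meta-Llama-3-8B',
    num_speculative_tokens=5,  # draft tokens proposed per step
)
```
→ The small model drafts tokens and the large model verifies them. Up to 2-3x faster, depending on how many drafts are accepted.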
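How much speculative decoding helps depends on how often the large model accepts the draft. A rough sketch of that relationship, under the simplifying assumption that each draft token is accepted independently with probability `alpha`; the function name and the sample values are illustrative, and draft-model cost is ignored, so this is an upper bound rather than a measured speedup:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus one token from the verifier
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, k=5), 2))
```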
### Continuous batching
```
Naive: request 1 → process → request 2 → process.
Continuous: every decoding step batches whichever requests are currently active.
→ Default in vLLM / TGI. GPU utilization ↑.
```
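A toy simulation can make the difference concrete. This is only a sketch of the scheduling idea, not how vLLM or TGI implement it: each request is reduced to its number of output tokens, one "step" generates one token for every active request, and both function names are hypothetical.

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes (batch_size=1 is the naive case)."""
    return sum(max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request immediately."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1  # one decoding step: one token per active request
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [2, 8, 2, 8]  # output tokens per request (made-up workload)
print("naive:", static_batch_steps(lengths, 1))
print("static batch of 2:", static_batch_steps(lengths, 2))
print("continuous, 2 slots:", continuous_batch_steps(lengths, 2))
```

Short requests no longer wait for the longest request in their batch, which is where the utilization gain comes from.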
### KV cache
```
Every generated token requires an attention computation.
The K and V of previous tokens are cached.
Longer prompt = larger KV cache = more memory.
```
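The memory cost can be estimated directly from the model shape. A minimal sketch, assuming the Llama-3-8B configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128) and FP16 values; `kv_cache_bytes` is a hypothetical helper:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size for one sequence: a K and a V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed Llama-3-8B shape; FP16 means 2 bytes per value.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
full_context = kv_cache_bytes(8192, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{per_token / 1024:.0f} KiB per token, {full_context / 2**30:.1f} GiB at 8192 tokens")
```

This is per sequence, so the total grows with the number of concurrent requests as well as with prompt length.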
### Prompt caching
```
The same system prompt is sent repeatedly.
- vLLM prefix cache.
- Anthropic / OpenAI prompt cache (up to 90% cost ↓ on cached input).
```
→ [[AI_Prompt_Caching]].
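The idea behind prefix caching can be sketched with a toy block cache. This illustrates the concept only and is not vLLM's implementation: prompts are lists of token IDs, the cache is keyed by full-block prefixes, and `PrefixCache` with its `block_size` is a hypothetical construct.

```python
class PrefixCache:
    """Toy prefix cache: remembers which block-aligned prompt prefixes were seen."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()  # one entry per cached block-aligned prefix

    def _prefixes(self, tokens):
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            yield tuple(tokens[:end])

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache this prompt."""
        hits = 0
        matching = True
        for key in self._prefixes(tokens):
            if matching and key in self.blocks:
                hits += self.block_size
            else:
                matching = False  # reuse stops at the first missing block
                self.blocks.add(key)
        return hits

cache = PrefixCache(block_size=4)
system_prompt = list(range(8))  # stand-in token IDs for a shared system prompt
print(cache.lookup_and_insert(system_prompt + [100, 101]))
print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))
```

The second request reuses the blocks covering the shared system prompt, so only its new suffix needs to be computed.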
### Multi-tenant
```
1 model + N users:
- Per-user permissions.
- Per-user rate limits.
- Per-user logging.
→ vLLM runs as a single instance.
Per-user isolation = handled at the gateway level.
```
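As one example of gateway-level isolation, a per-user rate limit can be a token bucket keyed by user ID. This is a minimal in-memory sketch; the class name, the limits, and the injectable `clock` are assumptions for illustration, and a real gateway would typically keep this state in a shared store.

```python
import time

class PerUserRateLimiter:
    """Token bucket per user: `rate_per_sec` refill, up to `burst` requests at once."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate_per_sec, burst, clock
        self.buckets = {}  # user_id -> (tokens, last_refill_time)

    def allow(self, user_id):
        now = self.clock()
        tokens, last = self.buckets.get(user_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill since last call
        allowed = tokens >= 1
        self.buckets[user_id] = (tokens - 1 if allowed else tokens, now)
        return allowed

limiter = PerUserRateLimiter(rate_per_sec=1, burst=2)
print([limiter.allow("user1") for _ in range(3)])  # third call exceeds the burst
print(limiter.allow("user2"))                      # other users are unaffected
```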
### Per-user model
```
Fine-tuned model per user:
- Only the LoRA adapter differs.
- The base model is shared.
- vLLM supports multi-LoRA.
vllm serve <base-model> --enable-lora --lora-modules user1=path1 user2=path2
```
### Cost
```
GPU rental:
- A100 80GB: $1-3 / hour.
- H100: $3-6 / hour.
Self-hosted:
- A100 server: $30k+, amortized monthly.
API:
- GPT-4o: $2.5 / MTok in, $10 / MTok out.
- Claude Opus: $15 / MTok in, $75 / MTok out.
- Llama-3-8B (Together): $0.20 / MTok.
→ High traffic = self-hosting becomes viable.
Low / variable traffic = API.
```
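Whether self-hosting beats an API price depends mostly on how busy the GPU is. A rough sketch of the per-token cost of a rented GPU; the $2/hour price sits inside the A100 range above, while the 2,000 tokens/s throughput and the helper name are hypothetical assumptions:

```python
def self_host_usd_per_mtok(gpu_usd_per_hour, tokens_per_sec, utilization=1.0):
    """Cost per million tokens for a rented GPU at a given sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1e6

# Assumed numbers: $2/hour A100, 2,000 tokens/s aggregate throughput (hypothetical).
for utilization in (1.0, 0.5, 0.1):
    cost = self_host_usd_per_mtok(2.0, 2000, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} / MTok")
```

At low utilization the idle hours dominate, which is why small or variable traffic favors a pay-per-token API.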
### Latency target
```
Chat: < 2 s to first token, < 50 ms / token after.
Completion: fast is OK.
Search: < 200 ms (low-latency model).
→ Trade-off between model size, GPU, and batching.
```
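The chat targets above map to two measurable numbers: time to first token and average time per subsequent token. A minimal sketch of computing both from token arrival timestamps; the function names are hypothetical and the sample timestamps are made up:

```python
def latency_metrics(request_start, token_times):
    """Return (time to first token, mean gap between later tokens), in seconds."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    per_token = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, per_token

def meets_chat_target(ttft, per_token, max_ttft=2.0, max_per_token=0.050):
    """Check against the chat targets: < 2 s first token, < 50 ms per token after."""
    return ttft < max_ttft and per_token < max_per_token

# Made-up timestamps (seconds since an arbitrary origin) for one streamed response.
ttft, per_token = latency_metrics(0.0, [1.0, 1.03125, 1.0625])
print(ttft, per_token, meets_chat_target(ttft, per_token))
```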
### Streaming
```python
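# OpenAI-compatible; `client` is an openai.AsyncOpenAI instance.
async def stream_chat(client, **request):
    stream = await client.chat.completions.create(**request, stream=True)
    async for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:  # skip empty chunks (e.g. role-only or final chunk)
            yield delta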
```
→ User-perceived latency ↓.
### Failover
```
Primary: vLLM (fastest).
Fallback: API (managed).
→ Fall back when the primary is down.
```
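The primary/fallback pattern can be sketched as a small wrapper. The two backends below are stand-in functions for illustration (hypothetical names); in practice they would be HTTP clients for the self-hosted server and the managed API.

```python
def with_failover(primary, fallback, exceptions=(Exception,)):
    """Return a callable that tries `primary` and falls back on failure."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except exceptions:
            return fallback(*args, **kwargs)
    return call

# Stand-in backends for illustration (hypothetical; real ones would be HTTP clients).
def self_hosted_vllm(prompt):
    raise ConnectionError("primary is down")

def managed_api(prompt):
    return f"[fallback] reply to: {prompt}"

generate = with_failover(self_hosted_vllm, managed_api,
                         exceptions=(ConnectionError, TimeoutError))
print(generate("Hi"))
```

Restricting `exceptions` to connection and timeout errors keeps genuine request errors from being silently retried against the fallback.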
### Eval at deploy
```
Before deploying a new version:
- Latency benchmark.
- Quality eval (golden set).
- Compare vs current.
→ Prevents regressions.
```
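The comparison step can be reduced to a simple gate. A minimal sketch, assuming each version is summarized by a p99 latency and a golden-set quality score; the metric names and the tolerance values are hypothetical and should be tuned per service:

```python
def should_deploy(baseline, candidate, max_latency_regression=0.10, max_quality_drop=0.01):
    """Allow the deploy only if latency and quality stay within tolerance of the current version."""
    latency_ok = candidate["p99_latency_s"] <= baseline["p99_latency_s"] * (1 + max_latency_regression)
    quality_ok = candidate["quality"] >= baseline["quality"] - max_quality_drop
    return latency_ok and quality_ok

# Made-up measurements for the current and the new version.
current = {"p99_latency_s": 1.0, "quality": 0.90}
new_version = {"p99_latency_s": 1.05, "quality": 0.90}
print(should_deploy(current, new_version))
```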
### Monitoring
```
- Latency (p50, p99).
- Throughput (tokens / sec).
- GPU utilization.
- Memory.
- Error rate.
- Cost / 1M token.
→ Datadog / Grafana / Helicone.
```
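Monitoring stacks compute these for you, but the latency percentiles are easy to sanity-check by hand. A small sketch using the nearest-rank definition; the helper name and sample data are illustrative:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up request latencies in milliseconds: mostly fast, with one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 127, 2400]
print("p50:", percentile(latencies_ms, 50), "p99:", percentile(latencies_ms, 99))
```

The outlier barely moves p50 but dominates p99, which is why both are tracked.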
### Scaling
```
Horizontal: more instances.
Vertical: bigger GPUs.
Quantize: smaller memory footprint.
Cache: hit rate ↑.
→ Auto-scale (Modal / K8s + KEDA).
```
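For horizontal scaling, the replica count an autoscaler converges on can be approximated from throughput. A rough sketch; the per-replica throughput, the utilization target, and the function name are assumptions, not measured values:

```python
import math

def replicas_needed(peak_tokens_per_sec, per_replica_tokens_per_sec, target_utilization=0.7):
    """Replicas required to serve peak load while keeping each below the utilization target."""
    capacity = per_replica_tokens_per_sec * target_utilization
    return max(1, math.ceil(peak_tokens_per_sec / capacity))

# Assumed: 10,000 tokens/s at peak, 2,500 tokens/s per replica (hypothetical figures).
print(replicas_needed(10_000, 2_500))
```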
### Production stack example
```
Cloudflare Workers (gateway)
↓
Anthropic / OpenAI (API): 90% of traffic
↓ (failover or cost-sensitive)
Self-hosted vLLM (GPU cluster): 10%
→ Mix.
```
## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| High traffic | vLLM cluster |
| Low / variable traffic | API (Anthropic / OpenAI) |
| HuggingFace | TGI |
| Local dev | Ollama |
| Mac | llama.cpp |
| Managed self-host | Modal / Anyscale |
| LangChain | LangServe |
| Multi-LoRA | vLLM with LoRA |
## ❌ Anti-patterns
- **Ollama in production**: insufficient throughput.
- **No batching**: GPU sits idle.
- **No quantization (on a small GPU)**: OOM.
- **No streaming**: users wait for the full response.
- **No prompt cache**: costs explode.
- **Single instance + no failover**: an outage takes the service down.
- **No eval at deploy**: regressions.
## 🤖 LLM Usage Hints
- vLLM is the fastest open-source option.
- TGI is HuggingFace native.
- Modal / Anyscale offer managed self-hosting.
- Mix API + self-hosting.
## 🔗 Related Documents
- [[AI_Local_LLM_Inference]]
- [[AI_LLM_Cost_Optimization]]
- [[MLOps_Model_Registry]]