---
id: ai-production-deploy
title: AI Production Deploy - vLLM / TGI / LangServe
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, deploy, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI"] }
applied_in: []
aliases: [vLLM, TGI, Text Generation Inference, LangServe, BentoML, GPU inference, model serving]
---

# AI Production Deploy

> Serving an LLM yourself is simple with the right engine: **vLLM (typically the fastest), TGI (HuggingFace), LangServe (LangChain), Modal**. The key ingredients are GPU + batching + cache.

## 📖 Core concepts

- Inference engine: determines the cost of every generated token.
- Batching: serving many requests per forward pass raises throughput.
- KV cache: reuses the attention state of tokens already processed.
- Quantization: lower-precision weights reduce memory.

## 💻 Code patterns

### vLLM (typically the fastest)

```python
from vllm import LLM, SamplingParams

# Loads the weights onto the GPU; gated models need Hugging Face access
llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')

prompts = ['Hello, ', 'The capital of France is ']
params = SamplingParams(temperature=0.8, max_tokens=100)

# All prompts are batched in a single call
outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)
```

### vLLM API server

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000
```

```bash
# OpenAI-compatible
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'
```

→ OpenAI API compatible, so it works as a drop-in replacement for existing OpenAI clients.

### vLLM strengths

```
- PagedAttention (efficient KV cache memory management).
- Continuous batching.
- Well suited to 24/7 serving.
- Among the fastest open-source engines.
→ Production default.
```

### Text Generation Inference (TGI)

```bash
# --shm-size gives the container shared memory for multi-GPU communication
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v /data:/data \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```

```bash
curl http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Hi", "parameters": {"max_new_tokens": 100}}'
```

→ HuggingFace native. The backend behind Inference Endpoints.

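The two `curl` calls above (vLLM's OpenAI-compatible route and TGI's `/generate` route) can also be issued from Python. The following is a minimal sketch using only the standard library; the function names `build_vllm_chat_request`, `build_tgi_generate_request`, and `send` are hypothetical helpers introduced for illustration, and the base URLs are assumed to match the ports used in the examples above.

```python
import json
import urllib.request


def _json_post(url: str, body: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_vllm_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Request for the OpenAI-compatible /v1/chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return _json_post(f"{base_url}/v1/chat/completions", body)


def build_tgi_generate_request(base_url: str, prompt: str, max_new_tokens: int = 100) -> urllib.request.Request:
    """Request for TGI's /generate route."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return _json_post(f"{base_url}/generate", body)


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_vllm_chat_request("http://localhost:8000", model_name, "Hi"))` should return an OpenAI-style response whose text sits under `choices[0].message.content`, where `model_name` is whatever model the server was started with. The TGI variant should return a JSON object with a `generated_text` field.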
### Ollama (local dev)

```bash
ollama pull llama3
ollama run llama3 'Hello'
```

```bash
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3", "messages": [...]}'
```

→ For local dev and small use cases. Not for production.

### LangServe (LangChain)

```python
from fastapi import FastAPI
from langchain_core.runnables import RunnableLambda
from langserve import add_routes

# Placeholder chain; substitute any LangChain Runnable
my_chain = RunnableLambda(lambda text: text.upper())

app = FastAPI()
add_routes(app, my_chain, path='/chain')
```

```bash
uvicorn main:app --host 0.0.0.0
```

→ Exposes a LangChain chain as a REST endpoint.

### BentoML

```python
import bentoml
from transformers import pipeline


@bentoml.service
class LLMService:
    def __init__(self) -> None:
        # Load the model once per worker, not per request
        self.pipe = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct')

    @bentoml.api
    def chat(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=100)[0]['generated_text']
```

```bash
bentoml serve service:LLMService
# Containerizing needs a built Bento first; "llm" assumes that is the Bento's name
bentoml build
bentoml containerize llm:latest
```

→ The Docker image is generated for you.

### Modal (managed, GPU)

```python
import modal

app = modal.App('llm')
image = modal.Image.debian_slim().pip_install('vllm')


@app.cls(gpu='A100', image=image)
class LlamaServer:  # renamed so it does not shadow vllm.LLM
    @modal.enter()
    def load(self):
        # Runs once when the container starts
        from vllm import LLM
        self.llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')

    @modal.method()
    def generate(self, prompt: str):
        return self.llm.generate([prompt])[0].outputs[0].text
```

→ Pay per GPU-second. Managed scaling.

### Anyscale / Together / Replicate

```
Managed inference:
- Anyscale (Ray + vLLM).
- Together AI.
- Replicate.
- Hyperbolic.
→ Bring your own model, or call their API.
```

→ An alternative to self-hosting.

### Quantization

```python
# 4-bit via bitsandbytes (GPTQ and AWQ are alternative 4-bit formats)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)
```

→ About 4x less memory than FP16, at a small cost in quality.

```
Llama-3-8B (approximate, weights only):
- FP16: 16 GB
- INT8: 8 GB
- INT4: 4 GB (fits a single consumer GPU)
- INT2 (extreme): 2 GB (large quality loss)
```

### llama.cpp (CPU / Mac)

```bash
# GGUF format
./llama-cli -m model.gguf -p 'Hello'

# Or run as a server
./llama-server -m model.gguf --port 8080
```

→ Well suited to Apple Silicon Macs (M1/M2/M3).
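Throughput is low.

### vLLM tensor parallelism

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4
```

→ Shards the large model across 4 GPUs.

### Speculative decoding (faster)

```python
# Large target model + small draft model.
# Argument names vary across vLLM versions (newer releases take a
# speculative_config dict); check the docs for your version.
from vllm import LLM

llm = LLM(
    model='meta-llama/Meta-Llama-3-70B',
    speculative_model='meta-llama/Meta-Llama-3-8B',
    num_speculative_tokens=5,  # draft tokens proposed per step
)
```

→ The small model drafts tokens and the large model verifies them. Speedups of roughly 2-3x are possible, depending on how often the drafts are accepted.

### Continuous batching

```
Naive: request 1 → process → request 2 → process.
Continuous: every decoding step serves tokens from several in-flight requests together.
→ Default in vLLM and TGI. Raises GPU utilization.
```

### KV cache

```
Every generated token requires an attention computation.
The K and V tensors of earlier tokens are cached instead of recomputed.
Longer prompt = larger KV cache = more memory.
```

### Prompt caching

```
The same system prompt is sent repeatedly.
- vLLM prefix caching.
- Anthropic / OpenAI prompt caching (up to ~90% off cached input tokens, depending on provider and model).
```

→ [[AI_Prompt_Caching]].

### Multi-tenant

```
1 model + N users:
- Per-user permissions.
- Per-user rate limits.
- Per-user logging.
→ vLLM runs as a single shared instance. Enforce per-user isolation at the gateway level.
```

### Per-user model

```
A fine-tuned model per user:
- Only the LoRA adapter differs.
- The base model is shared.
- vLLM supports multi-LoRA serving.

vllm serve <base-model> --enable-lora --lora-modules user1=path1 user2=path2
```

### Cost

```
GPU rental (approximate):
- A100 80GB: $1-3 / hour.
- H100: $3-6 / hour.

Self-hosted hardware:
- A100 server: $30k+, amortized monthly.

API (list prices, subject to change):
- GPT-4o: $2.5 / MTok in, $10 / MTok out.
- Claude Opus: $15 / MTok in, $75 / MTok out.
- Llama-3-8B (Together): $0.20 / MTok.

→ High, steady traffic: self-hosting becomes viable. Low or variable traffic: use an API.
```

### Latency target

```
Chat: < 2 s to first token, < 50 ms per token after that.
Completion: fast is sufficient.
Search: < 200 ms (use a low-latency model).
→ A trade-off between model size, GPU, and batching.
```

### Streaming

```python
# OpenAI-compatible (assumes `client` is an openai.AsyncOpenAI instance)
async def stream_chat(client, **request):
    stream = await client.chat.completions.create(**request, stream=True)
    async for chunk in stream:
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks carry no content
            yield text
```

→ Lowers user-perceived latency.

### Failover

```
Primary: vLLM (fastest).
Fallback: API (managed).
→ Route to the fallback when the primary is down.
```

### Eval at deploy

```
Before deploying a new version:
- Latency benchmark.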
- Quality eval (golden set).
- Compare against the current version.
→ Prevents regressions.
```

### Monitoring

```
- Latency (p50, p99).
- Throughput (tokens / sec).
- GPU utilization.
- Memory.
- Error rate.
- Cost per 1M tokens.
→ Datadog / Grafana / Helicone.
```

### Scaling

```
Horizontal: more instances.
Vertical: a larger GPU.
Quantize: smaller memory footprint.
Cache: raise the hit rate.
→ Auto-scale (Modal / K8s + KEDA).
```

### Example production stack

```
Cloudflare Workers (gateway)
  ↓
Anthropic / OpenAI (API): 90% of traffic
  ↓ (failover or cost-sensitive)
Self-hosted vLLM (GPU cluster): 10%

→ Mix both.
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| High traffic | vLLM cluster |
| Low / variable traffic | API (Anthropic / OpenAI) |
| HuggingFace | TGI |
| Local dev | Ollama |
| Mac | llama.cpp |
| Managed self-host | Modal / Anyscale |
| LangChain | LangServe |
| Multi-LoRA | vLLM with LoRA |

## ❌ Anti-patterns

- **Ollama in production**: insufficient throughput.
- **No batching**: the GPU sits idle.
- **No quantization (on a small GPU)**: OOM.
- **No streaming**: users wait for the full response.
- **No prompt cache**: costs explode.
- **Single instance + no failover**: an outage whenever the instance goes down.
- **No eval at deploy**: regressions ship unnoticed.

## 🤖 LLM usage hints

- vLLM is typically the fastest open-source engine.
- TGI is HuggingFace native.
- Modal / Anyscale offer managed self-hosting.
- Mix API and self-hosting.

## 🔗 Related documents

- [[AI_Local_LLM_Inference]]
- [[AI_LLM_Cost_Optimization]]
- [[MLOps_Model_Registry]]