2nd/10_Wiki/Topics/Coding/AI_Production_Deploy.md
2026-05-10 22:08:15 +09:00

id: ai-production-deploy
title: AI Production Deploy - vLLM / TGI / LangServe
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, deploy, vibe-coding
tech_stack: language: Python; applicable_to: AI
applied_in:
aliases: vLLM, TGI, Text Generation Inference, LangServe, BentoML, GPU inference, model serving

AI Production Deploy

Local LLM serving is simple: vLLM (fastest), TGI (HuggingFace), LangServe (LangChain), Modal. Key levers: GPU + batching + cache.

📖 Key Concepts

  • Inference engine: sets the cost of every generated token.
  • Batching = higher throughput.
  • KV cache = context reuse.
  • Quantization = less memory.

💻 Code Patterns

vLLM (fastest)

from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Llama-3-8B-Instruct')

prompts = ['Hello, ', 'The capital of France is ']
params = SamplingParams(temperature=0.8, max_tokens=100)

outputs = llm.generate(prompts, params)
for o in outputs:
    print(o.outputs[0].text)

vLLM API server

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --host 0.0.0.0 --port 8000
# OpenAI-compatible
curl http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}]}'

→ OpenAI API compatible. Drop-in replacement.
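
Because the server speaks the OpenAI chat-completions protocol, a client only has to point at a different base URL. A minimal stdlib-only sketch that builds the same request as the curl call above; `build_chat_request` is a hypothetical helper name, and the host, port, and model are the ones used in the command above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST for the OpenAI-compatible /v1/chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000", "meta-llama/Llama-3-8B-Instruct", "Hi")

# Sending it only works while the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping `base_url` for a hosted provider's endpoint (plus an auth header) is the whole migration, which is what makes the drop-in claim practical.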

vLLM strengths

- PagedAttention (efficient KV cache).
- Continuous batching.
- Built for 24/7 serving.
- Fastest among open-source engines.

→ Production default.

Text Generation Inference (TGI)

docker run --gpus all -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Llama-3-8B-Instruct
curl http://localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hi", "parameters": {"max_new_tokens": 100}}'

→ HuggingFace native. The backend behind Inference Endpoints.

Ollama (local dev)

ollama pull llama3
ollama run llama3 'Hello'
# OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
    -d '{"model": "llama3", "messages": [...]}'

→ Local dev / small use cases. Not for production.

LangServe (LangChain)

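from fastapi import FastAPI
from langserve import add_routes

# my_chain: any LangChain Runnable defined elsewhere
app = FastAPI()
add_routes(app, my_chain, path='/chain')

# shell: serve the app (assumes the code above lives in main.py)
uvicorn main:app --host 0.0.0.0

→ Exposes a LangChain chain as a REST endpoint.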

BentoML

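import bentoml
from transformers import pipeline

@bentoml.service
class LLMService:
    def __init__(self) -> None:
        # Load the model once per worker at startup
        self.pipe = pipeline('text-generation', model='meta-llama/Llama-3-8B-Instruct')

    @bentoml.api
    def chat(self, prompt: str) -> str:
        return self.pipe(prompt, max_new_tokens=100)[0]['generated_text']

# shell (assumes the code above lives in service.py and a Bento named llm has been built)
bentoml serve service:LLMService
bentoml containerize llm:latest

→ Docker image is generated automatically.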

Modal (managed, GPU)

import modal

app = modal.App('llm')
image = modal.Image.debian_slim().pip_install('vllm')

@app.cls(gpu='A100', image=image)
class LLM:
    @modal.enter()
    def load(self):
        from vllm import LLM
        self.llm = LLM(model='meta-llama/Llama-3-8B-Instruct')
    
    @modal.method()
    def generate(self, prompt: str):
        return self.llm.generate([prompt])[0].outputs[0].text

→ Pay per GPU-second. Managed scaling.

Anyscale / Together / Replicate

Managed inference:
- Anyscale (Ray + vLLM).
- Together AI.
- Replicate.
- Hyperbolic.

→ Bring your own model, or use their hosted API.

→ An alternative to self-hosting.

Quantization

# 4-bit with bitsandbytes (GPTQ and AWQ are alternative 4-bit formats)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = 'meta-llama/Llama-3-8B-Instruct'
config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype='float16')
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=config)

→ About 4x less memory than FP16. Slight quality drop.

Llama-3-8B (approximate, weights only):
- FP16: 16 GB
- INT8: 8 GB
- INT4: 4 GB (fits a single consumer GPU)
- INT2 (extreme): 2 GB (large quality drop)
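
The Llama-3-8B figures above follow from parameters × bits per weight. A small sketch of that arithmetic, assuming a round 8e9 parameters and counting weights only (KV cache, activations, and quantization metadata come on top); `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumption: round figure for Llama-3-8B

table = {
    name: weight_memory_gb(N_PARAMS, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]
}
for name, gb in table.items():
    print(f"{name}: {gb:.0f} GB")
```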

llama.cpp (CPU / Mac)

# GGUF format
./llama-cli -m model.gguf -p 'Hello'

# Or server
./llama-server -m model.gguf --port 8080

→ Works well on Apple Silicon Macs (M1/M2/M3). Low throughput.

vLLM tensor parallelism

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70B \
    --tensor-parallel-size 4

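→ Shards a large model across 4 GPUs.

Speculative decoding (faster)

# Larger target model + smaller draft model.
# Older vLLM releases accept these keyword arguments; newer ones take a
# speculative_config dict instead, so check the installed version.
from vllm import LLM

llm = LLM(
    model='meta-llama/Llama-3-70B',
    speculative_model='meta-llama/Llama-3-8B',
    num_speculative_tokens=5,  # draft tokens proposed per step
)

→ The small model drafts, the large model verifies. Roughly 2-3x faster when most drafts are accepted.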
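
The draft-and-verify loop can be illustrated without any GPU. The sketch below is a toy greedy variant, not vLLM's implementation: `target` and `draft` are hypothetical stand-in functions mapping a token sequence to the next token, and the draft is deliberately wrong after token 4. The output always matches plain greedy decoding with the target; only the number of target passes changes.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position. A real engine scores
        #    all k positions in one forward pass; the toy calls the function
        #    per position but counts a single pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != tok:        # first mismatch: discard remaining drafts
                break
        else:
            # All k drafts matched: the same pass also yields one bonus token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], target_passes


def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10          # toy "large model": counts modulo 10


def draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10  # wrong after token 4


tokens, passes = speculative_decode(target, draft, prompt=[0], n_new=12, k=3)
print(tokens, passes)
```

Here 12 tokens need 4 target passes instead of 12. The real speedup depends on the acceptance rate and on how cheap the draft model is relative to the target.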

Continuous batching

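Naive: request 1 → process → request 2 → process.
Continuous: every decode step handles tokens from several requests together; finished requests leave and waiting ones join between steps.

→ Default in vLLM / TGI. Higher GPU utilization.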
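
The difference shows up when counting decode steps. The sketch below is a toy scheduler, not the vLLM or TGI scheduler: each request is reduced to its output length, one step produces one token for every running request, and `max_batch` is a hypothetical slot limit.

```python
from collections import deque
from typing import Deque, List


def naive_steps(output_lens: List[int]) -> int:
    """One request at a time: total steps is the sum of all output lengths."""
    return sum(output_lens)


def continuous_steps(output_lens: List[int], max_batch: int) -> int:
    """Each step decodes one token per running request; free slots are
    refilled from the queue before the next step."""
    waiting: Deque[int] = deque(output_lens)
    running: List[int] = []  # remaining tokens for each running request
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [8, 2, 2, 4]
print(naive_steps(lens), continuous_steps(lens, max_batch=2))
```

With two slots the short requests finish early and their slots are reused immediately, so the long request never blocks the queue.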

KV cache

Every generated token requires an attention computation.
The K and V tensors of earlier tokens are cached, so they are not recomputed.

Longer prompt = bigger KV cache = more memory.
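
How much memory a long prompt costs can be estimated from the model shape. A back-of-the-envelope sketch, assuming the published Llama-3-8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and FP16 cache entries; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: K and V per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Assumption: Llama-3-8B shape, FP16 cache (2 bytes per value).
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **LLAMA3_8B)
full_context = kv_cache_bytes(8192, **LLAMA3_8B)
print(f"{per_token / 1024:.0f} KiB per token, {full_context / 2**30:.0f} GiB at 8192 tokens")
```

The cost is per sequence, so a batch of concurrent long requests multiplies it; this is the memory that PagedAttention manages in blocks.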

Prompt caching

The same system prompt repeats across requests.
- vLLM prefix cache.
- Anthropic / OpenAI prompt cache (up to 90% lower cost on cached input).

See AI_Prompt_Caching.
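
The idea behind a prefix cache is that whole blocks of prompt tokens are looked up by the prefix that produced them. The sketch below is a toy illustration only, not vLLM's implementation: `ToyPrefixCache` and `BLOCK_SIZE` are hypothetical, and real engines store hashes and KV tensors rather than raw token tuples.

```python
from typing import List, Set, Tuple

BLOCK_SIZE = 4  # tokens per cache block (assumption for the toy)


class ToyPrefixCache:
    def __init__(self) -> None:
        self._seen: Set[Tuple[int, ...]] = set()

    def process(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (cached_tokens, computed_tokens) for one prompt."""
        cached = 0
        hitting = True
        n_full = len(tokens) // BLOCK_SIZE * BLOCK_SIZE  # only full blocks are cached
        for end in range(BLOCK_SIZE, n_full + 1, BLOCK_SIZE):
            key = tuple(tokens[:end])  # a block is identified by its whole prefix
            if hitting and key in self._seen:
                cached = end
            else:
                hitting = False
                self._seen.add(key)
        return cached, len(tokens) - cached


cache = ToyPrefixCache()
system_prompt = list(range(8))                      # shared 8-token system prompt
first = cache.process(system_prompt + [100, 101])   # cold: everything is computed
second = cache.process(system_prompt + [200, 201, 202])
print(first, second)
```

The second request reuses the two blocks covering the system prompt and only computes its own suffix.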

Multi-tenant

1 model + N users:
- Per-user permissions.
- Per-user rate limits.
- Per-user logging.

→ vLLM runs as a single instance.
Per-user isolation belongs at the gateway level.
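
A gateway-level per-user rate limit can be as small as a token bucket keyed by user id. A minimal in-memory sketch under stated assumptions: `PerUserRateLimiter` is a hypothetical class, the `rate` and `burst` values are illustrative, and a multi-instance gateway would need shared storage instead of a dict.

```python
import time
from typing import Callable, Dict, Tuple


class PerUserRateLimiter:
    """Token bucket per user: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._state: Dict[str, Tuple[float, float]] = {}  # user -> (tokens, last_seen)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(user_id, (float(self.burst), now))
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)  # refill
        allowed = tokens >= 1.0
        self._state[user_id] = (tokens - 1.0 if allowed else tokens, now)
        return allowed


limiter = PerUserRateLimiter(rate=1.0, burst=2)
print([limiter.allow("alice") for _ in range(3)])
```

The gateway calls `allow(user_id)` before forwarding a request to the shared vLLM instance and returns HTTP 429 when it is `False`.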

Per-user model

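A fine-tuned model per user:
- Only the LoRA adapter differs.
- The base model is shared.
- vLLM supports Multi-LoRA.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --enable-lora --lora-modules user1=path1 user2=path2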

Cost

GPU rental:
- A100 80GB: $1-3 / hour.
- H100: $3-6 / hour.

Self-hosted:
- A100 server: $30k+, amortized monthly.

API:
- GPT-4o: $2.5 / MTok in, $10 / MTok out.
- Claude Opus: $15 / MTok in, $75 / MTok out.
- Llama-3-8B (Together): $0.20 / MTok.

→ High traffic = self-hosting becomes viable.
Low / variable traffic = API.
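
Where the crossover sits can be estimated from the rental and API prices above. A rough sketch: the $2/hour A100 price is taken from the listed range, the 1000 tokens/second sustained throughput is a hypothetical figure, and the calculation ignores idle time, engineering cost, and the input/output price split.

```python
def self_host_usd_per_mtok(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """USD per 1M tokens for a rented GPU kept busy at the given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def break_even_tokens_per_second(gpu_usd_per_hour: float, api_usd_per_mtok: float) -> float:
    """Sustained throughput above which the rented GPU is cheaper than the API."""
    return gpu_usd_per_hour / api_usd_per_mtok * 1_000_000 / 3600


GPU_USD_PER_HOUR = 2.0        # assumption: mid-range A100 rental price
ASSUMED_THROUGHPUT = 1000.0   # hypothetical sustained tokens/second

print(f"self-host: ${self_host_usd_per_mtok(GPU_USD_PER_HOUR, ASSUMED_THROUGHPUT):.2f} / MTok")
print(f"break-even vs $0.20 / MTok API: "
      f"{break_even_tokens_per_second(GPU_USD_PER_HOUR, 0.20):.0f} tokens/s")
```

Against a cheap hosted Llama-3-8B the GPU has to stay very busy to win; against frontier-model API prices the break-even throughput is far lower, provided the open model is good enough for the task.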

Latency target

Chat: < 2s first token, < 50 ms / token after.
Completion: just needs to be fast.
Search: < 200 ms (low latency model).

→ Model size + GPU + batching trade-off.
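
The chat target has two parts, time to first token and time per token after that, and both can be measured from any token stream. A minimal sketch: `measure_stream` and `meets_chat_target` are hypothetical helper names, and the thresholds are the chat figures above.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Return (time_to_first_token_s, mean_inter_token_s) for a token stream."""
    start = clock()
    stamps = [clock() for _ in chunks]  # one timestamp as each chunk arrives
    if not stamps:
        raise ValueError("stream produced no chunks")
    ttft = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_gap


def meets_chat_target(ttft_s: float, mean_gap_s: float) -> bool:
    """Chat target from the note: < 2 s first token, < 50 ms per token after."""
    return ttft_s < 2.0 and mean_gap_s < 0.050


# Usage with any iterable of streamed chunks (here a plain list, so timings are near zero):
ttft, gap = measure_stream(["Hel", "lo", "!"])
print(meets_chat_target(ttft, gap))
```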

Streaming

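# OpenAI-compatible; client is an AsyncOpenAI instance, and this runs inside an async generator
stream = await client.chat.completions.create(..., stream=True)
async for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # content is None on role-only and final chunks
        yield delta

→ Lower user-perceived latency.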

Failover

Primary: vLLM (fastest).
Fallback: API (managed).

→ Fall back when the primary is down.
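
The primary/fallback order can be expressed as a list of backends tried in sequence. A minimal sketch with stub backends: `generate_with_failover`, `primary_vllm`, and `fallback_api` are hypothetical names, and a production version would catch timeouts and 5xx errors specifically rather than every exception.

```python
from typing import Callable, List, Sequence


def generate_with_failover(prompt: str,
                           backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful response."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # sketch only: narrow this in production
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")


def primary_vllm(prompt: str) -> str:
    raise ConnectionError("vLLM instance is down")  # stub simulating an outage


def fallback_api(prompt: str) -> str:
    return f"[api] {prompt}"                        # stub for a managed API call


print(generate_with_failover("Hi", [primary_vllm, fallback_api]))
```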

Eval at deploy

Before deploying a new version:
- Latency benchmark.
- Quality eval (golden set).
- Compare vs current.

→ Prevents regressions.
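
The three checks combine into a single deploy gate. A minimal sketch, assuming the latency benchmark and golden-set eval have already produced numbers; `EvalResult`, `safe_to_deploy`, and the 10% / 1-point tolerances are hypothetical choices, not recommended values.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    p99_latency_s: float    # from the latency benchmark
    golden_accuracy: float  # fraction of golden-set cases passed, 0..1


def safe_to_deploy(current: EvalResult, candidate: EvalResult,
                   max_latency_regression: float = 0.10,
                   max_accuracy_drop: float = 0.01) -> bool:
    """Compare the candidate against the current version on both axes."""
    latency_ok = candidate.p99_latency_s <= current.p99_latency_s * (1 + max_latency_regression)
    quality_ok = candidate.golden_accuracy >= current.golden_accuracy - max_accuracy_drop
    return latency_ok and quality_ok


current = EvalResult(p99_latency_s=1.0, golden_accuracy=0.90)
candidate = EvalResult(p99_latency_s=1.05, golden_accuracy=0.895)
print(safe_to_deploy(current, candidate))
```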

Monitoring

- Latency (p50, p99).
- Throughput (tokens / sec).
- GPU utilization.
- Memory.
- Error rate.
- Cost / 1M token.

→ Datadog / Grafana / Helicone.
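
Most of these metrics reduce to a few lines over raw request logs. A small sketch using the nearest-rank percentile definition; `percentile` and `summarize` are hypothetical helpers, and a real setup would export the raw samples to the monitoring tool rather than compute them in-process.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: List[float], total_tokens: int,
              window_s: float, gpu_cost_usd: float) -> Dict[str, float]:
    """Latency percentiles, throughput, and cost per 1M tokens for one window."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p99_s": percentile(latencies_s, 99),
        "tokens_per_s": total_tokens / window_s,
        "usd_per_mtok": gpu_cost_usd / total_tokens * 1_000_000,
    }


# Hypothetical one-hour window: 100 requests, 360k tokens, $2 of GPU time.
stats = summarize([float(i) for i in range(1, 101)], 360_000, 3600.0, 2.0)
print(stats)
```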

Scaling

Horizontal: more instances.
Vertical: bigger GPU.
Quantize: less memory.
Cache: higher hit rate.

→ Auto-scale (Modal / K8s + KEDA).

Production stack example

Cloudflare Workers (gateway)
  ↓
Anthropic / OpenAI (API): 90% of traffic
  ↓ (failover or cost-sensitive)
Self-host vLLM (GPU cluster): 10%

→ Mix.

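🤔 Decision Criteria

| Situation | Recommendation |
| --- | --- |
| High traffic | vLLM cluster |
| Low / variable traffic | API (Anthropic / OpenAI) |
| HuggingFace | TGI |
| Local dev | Ollama |
| Mac | llama.cpp |
| Managed self-host | Modal / Anyscale |
| LangChain | LangServe |
| Multi-LoRA | vLLM with LoRA |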

Anti-patterns

  • Ollama in production: not enough throughput.
  • No batching: GPU sits idle.
  • No quantization (on a small GPU): OOM.
  • No streaming: users wait.
  • No prompt cache: cost explodes.
  • Single instance + no failover: an outage takes the service down.
  • No eval at deploy: regressions.

🤖 LLM Usage Hints

  • vLLM is the fastest open-source option.
  • TGI is HuggingFace native.
  • Modal / Anyscale for managed self-hosting.
  • Mix API + self-host.

🔗 Related Documents