---
id: ai-local-llm-inference
title: Local LLM - Ollama / LM Studio / vLLM
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, local, ollama, vllm, vibe-coding]
tech_stack: { language: "TS / Python / GGUF", applicable_to: ["Backend"] }
applied_in: []
aliases: [Ollama, LM Studio, vLLM, llama.cpp, GGUF, on-prem LLM, quantization]
---

# Local LLM Inference

> Self-hosted inference for privacy, cost, and latency. **Ollama / LM Studio = development and desktop; vLLM / SGLang / TGI = production servers.** Quantization formats: GGUF (CPU/Metal), AWQ/GPTQ (GPU).

## 📖 Key Concepts

- Quantization: shrinks the model by storing weights at lower precision (4-bit / 8-bit).
- GGUF: the llama.cpp model format (CPU / Metal / Apple Silicon).
- vLLM: GPU serving with PagedAttention for fast batching.
- OpenAI-compatible API: the de facto standard endpoint exposed by most local servers.

## 💻 Code Patterns

### Ollama (simplest, desktop)

```bash
brew install ollama        # macOS (Homebrew)
ollama pull llama3.1:8b
ollama run llama3.1:8b
```

```ts
// OpenAI-compatible API
import OpenAI from 'openai';

// Ollama ignores the API key, but the SDK requires one to be set.
const client = new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' });
const r = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello' }],
});
```

### LM Studio (GUI)

- Download the app, pick a model, then chat or start the built-in API server.
- Similar to Ollama, but GUI-first.
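Since both desktop tools speak the same OpenAI-style API, switching between them can come down to a base URL. The sketch below shows one way to centralize that choice. `LocalBackend` and `clientConfig` are hypothetical names, and the LM Studio port (1234) is its usual default rather than something stated in this note, so confirm it against the port shown in the app's server view.

```ts
// Hypothetical helper: maps a local backend to OpenAI-SDK constructor options.
type LocalBackend = 'ollama' | 'lmstudio';

interface ClientConfig {
  baseURL: string;
  apiKey: string;
}

// Assumed defaults: Ollama listens on 11434; LM Studio's server usually on 1234.
const DEFAULT_PORTS: Record<LocalBackend, number> = {
  ollama: 11434,
  lmstudio: 1234,
};

function clientConfig(backend: LocalBackend, host = 'localhost'): ClientConfig {
  const port = DEFAULT_PORTS[backend];
  return {
    baseURL: `http://${host}:${port}/v1`,
    // Local servers generally ignore the key, but the SDK needs one to be set.
    apiKey: backend,
  };
}

console.log(clientConfig('lmstudio').baseURL); // http://localhost:1234/v1
```

The result can be passed straight to the SDK, for example `new OpenAI(clientConfig('lmstudio'))`, so the rest of the calling code stays the same for either tool.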
### vLLM (production GPU)

```bash
pip install vllm

# Start the server (--tensor-parallel-size 2 shards the model across 2 GPUs)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
# Serves an OpenAI-compatible endpoint (port 8000 by default)
```

```ts
// Same `openai` import as the Ollama example; "vllm" is the server's hostname
const client = new OpenAI({ baseURL: 'http://vllm:8000/v1', apiKey: 'EMPTY' });
```

### llama.cpp + GGUF (CPU / Mac)

```bash
brew install llama.cpp

# Download a GGUF model first (for example from Hugging Face)
llama-cli -m llama-3-8b-Q4_K_M.gguf -p "Hello"

# Server
llama-server -m llama-3-8b-Q4_K_M.gguf --port 8080
```

```ts
// Same OpenAI API; the SDK still needs a placeholder apiKey
const client = new OpenAI({ baseURL: 'http://localhost:8080/v1', apiKey: 'none' });
```

### SGLang (vLLM alternative, fast)

```bash
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```

### TGI (Hugging Face)

```bash
# Gated models such as Llama also need a Hugging Face token (-e HF_TOKEN=<token>)
docker run -p 8080:80 -v "$PWD/data:/data" --gpus all \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```

### Quantization comparison (assuming an 8B model)

```
FP16:    16 GB VRAM
INT8:     8 GB
Q4_K_M:   4.5 GB (GGUF), fine on a desktop
AWQ-4:    5 GB GPU
```

→ These figures cover weights only; KV cache and activations come on top. Q4 is close to lossless in most cases, with very small models as the main exception.

### Hardware

```
Apple Silicon M2/M3/M4: GGUF + Metal.
NVIDIA: vLLM + CUDA.
AMD: ROCm + vLLM.
Intel: IPEX-LLM.
```

### Streaming

```ts
// `client` as created above; `messages` is the chat history array
const stream = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages,
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

### Function calling (some models)

```ts
// Llama 3.1+, Qwen, and Hermes models support tool use
const r = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages,
  tools: [{ type: 'function', function: {
    name: 'search',
    // Placeholder schema: add the tool's real argument properties here
    parameters: { type: 'object', properties: {} },
  } }],
});
```

⚠️ Less reliable than cloud APIs, so keep a fallback.

### Cost / latency (rough)

```
Ollama M3 Max (8B Q4):   ~30 tok/s
vLLM A100 (70B):         ~50 tok/s
Cloud API (GPT-4o):      ~80-150 tok/s

Cost:
  Cloud: $0.50-15 per 1M tok
  Local: electricity + hardware depreciation; marginal cost ~$0
```

### Privacy / GDPR

- Local = data never leaves your infrastructure.
- Air-gapped deployment is possible.
- Strong compliance posture.
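### Estimating weight memory (sketch)

The quantization figures earlier in this section follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces that calculation. `weightMemoryGB` is a hypothetical helper, and the 4.5 bits per weight used for Q4 is an illustrative assumption (4-bit formats store scaling metadata alongside the weights); actual file sizes vary by format and model.

```ts
// Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
// Ignores KV cache, activations, and runtime overhead, so real usage is higher.
function weightMemoryGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal GB, matching the round numbers in the table
}

// Assumed effective bit widths; the Q4 value is illustrative, not exact.
const formats: Array<[string, number]> = [
  ['FP16', 16],
  ['INT8', 8],
  ['Q4', 4.5],
];

for (const [name, bits] of formats) {
  const small = weightMemoryGB(8, bits).toFixed(1);
  const large = weightMemoryGB(70, bits).toFixed(1);
  console.log(`${name}: ${small} GB for 8B, ${large} GB for 70B`);
}
```

Under these assumptions an 8B model comes out at 16, 8, and 4.5 GB, and a 70B model at 4-bit lands near 40 GB of weights, which is why it does not fit on a single 24 GB GPU without offloading.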
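### Model selection (as of 2026)

```
Llama 3.3 70B:        strong; about 40 GB VRAM at 4-bit (a single 24 GB GPU needs CPU offload)
Llama 3.1 8B:         balanced; 8 GB
Qwen 2.5 7B / 14B:    strong at Korean / code
Mistral Small 3 24B:  strong reasoning
Gemma 2 9B:           small and fast
DeepSeek-R1 distill:  reasoning
```

## 🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| Development / experiments | Ollama / LM Studio |
| Privacy / internal enterprise use | vLLM self-hosted |
| Mac dev | llama.cpp / Ollama |
| High-throughput production | vLLM / SGLang |
| Edge device | llama.cpp + small model (1-3B) |
| High cloud cost | Local + cloud fallback |

## ❌ Anti-patterns

- **Assuming cloud-level accuracy**: small local models are weaker. Validate against your use case.
- **Quantizing too aggressively (Q2)**: quality collapses.
- **Too little GPU memory for a large model**: OOM. Use a smaller model or stronger quantization.
- **Using the OpenAI API as-is**: tool calling and structured output behave inconsistently across models and servers. Validate the output.
- **Single-instance production**: no HA. Use a load balancer + N replicas.
- **Streaming into a synchronous app**: blocking adds latency. Consume the stream asynchronously.
- **Not tracking updates**: new models ship almost weekly. Re-evaluate on a schedule.

## 🤖 LLM Usage Hints

- Start with Ollama.
- Production = vLLM (GPU).
- 8B Q4 is usually enough.
- OpenAI-compatible API → switching backends needs only a new base URL and model name, not new code.

## 🔗 Related Documents

- [[AI_Prompt_Engineering_Patterns]]
- [[AI_Function_Calling_Deep]]
- [[AI_Streaming_LLM_Response]]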