[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,192 @@
---
id: ai-local-llm-inference
title: Local LLM — Ollama / LM Studio / vLLM
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, local, ollama, vllm, vibe-coding]
tech_stack: { language: "TS / Python / GGUF", applicable_to: ["Backend"] }
applied_in: []
aliases: [Ollama, LM Studio, vLLM, llama.cpp, GGUF, on-prem LLM, quantization]
---
# Local LLM Inference
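> Run inference yourself for privacy, cost, and latency. **Ollama / LM Studio = development + desktop, vLLM / SGLang / TGI = production server**. Quantization formats include GGUF (CPU/Metal) and AWQ/GPTQ (GPU).
## 📖 Core Concepts
- Quantization: shrinks the model by storing weights at lower precision (4-bit / 8-bit).
- GGUF: the llama.cpp model format (CPU / Metal / Apple Silicon).
- vLLM: GPU serving with PagedAttention for fast batching.
- OpenAI-compatible API: the common endpoint shape that the servers below expose.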
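## 💻 Code Patterns
### Ollama (simplest, desktop)
```bash
brew install ollama
# If the server is not already running, start it first: ollama serve
# Llama 3.2 text models ship as 1B/3B only; the 8B model is Llama 3.1
ollama pull llama3.1:8b
ollama run llama3.1:8b
```
```ts
// OpenAI-compatible API
import OpenAI from 'openai';
// Ollama ignores the API key, but the SDK requires a non-empty value
const client = new OpenAI({ baseURL: 'http://localhost:11434/v1', apiKey: 'ollama' });
const r = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(r.choices[0].message.content);
```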
### LM Studio (GUI)
- Download the app, pick a model, then chat or start the local API server.
- Similar to Ollama, but GUI-friendly.
### vLLM (production GPU)
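```bash
pip install vllm
# Start the server (--tensor-parallel-size 2 shards the model across 2 GPUs)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
# Exposes an OpenAI-compatible endpoint (port 8000 by default)
```
```ts
import OpenAI from 'openai';
// 'vllm' is the server hostname here (e.g. a Docker service name).
// vLLM accepts any key unless the server was started with --api-key.
const client = new OpenAI({ baseURL: 'http://vllm:8000/v1', apiKey: 'EMPTY' });
```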
### llama.cpp + GGUF (CPU / Mac)
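```bash
brew install llama.cpp
# Download a GGUF model from Hugging Face first
llama-cli -m llama-3-8b-Q4_K_M.gguf -p "Hello"
# Server
llama-server -m llama-3-8b-Q4_K_M.gguf --port 8080
```
```ts
// Same OpenAI API
import OpenAI from 'openai';
// The SDK throws without an API key, so pass a placeholder
const client = new OpenAI({ baseURL: 'http://localhost:8080/v1', apiKey: 'sk-no-key-required' });
```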
### SGLang (fast vLLM alternative)
```bash
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
```
### TGI (Hugging Face)
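```bash
# Gated models such as Llama need a Hugging Face token
docker run --gpus all --shm-size 1g -p 8080:80 -v "$PWD/data:/data" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```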
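### Quantization comparison (assuming an 8B model)
```
FP16:    ~16 GB VRAM
INT8:    ~8 GB
Q4_K_M:  ~4.5-5 GB (GGUF), fine on a desktop
AWQ-4:   ~5 GB GPU
(weights only; KV cache and activations need extra memory)
```
→ Q4 loses very little quality, except on small models, which degrade more noticeably.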
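The sizes above follow from simple arithmetic: parameter count times bits per weight. Below is a minimal sketch of that calculation. `estimateWeightGB` and `fitsInVram` are hypothetical helpers, the 20% headroom default is an assumption rather than a measured figure, and the estimate covers weights only.

```typescript
// Hypothetical helper: approximate memory needed for the model weights alone.
// paramsBillions: parameter count in billions (8 for an 8B model)
// bitsPerWeight: 16 for FP16, 8 for INT8, roughly 4 to 5 for 4-bit formats
function estimateWeightGB(paramsBillions: number, bitsPerWeight: number): number {
  const bytesPerWeight = bitsPerWeight / 8;
  // (1e9 params * bytes per param) / 1e9 bytes per GB = billions of params * bytes
  return paramsBillions * bytesPerWeight;
}

// Rough check against a VRAM budget, leaving headroom for KV cache and activations.
// The 20% default headroom is an assumption; tune it for your context length.
function fitsInVram(
  paramsBillions: number,
  bitsPerWeight: number,
  vramGB: number,
  headroom = 0.2,
): boolean {
  return estimateWeightGB(paramsBillions, bitsPerWeight) <= vramGB * (1 - headroom);
}

console.log(estimateWeightGB(8, 16)); // 16
console.log(estimateWeightGB(8, 8)); // 8
console.log(estimateWeightGB(8, 4.5)); // 4.5
console.log(fitsInVram(70, 4.5, 24)); // false
```

Real 4-bit formats carry scales and other metadata, so their effective bits per weight sit somewhat above 4. Treat the result as a lower bound.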
### Hardware
```
Apple Silicon M2/M3/M4: GGUF + Metal.
NVIDIA: vLLM + CUDA.
AMD: ROCm + vLLM.
Intel: IPEX-LLM.
```
### Streaming
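```ts
// Reuses `client` from above; `messages` is your chat history array
const stream = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages,
  stream: true,
});
for await (const chunk of stream) {
  // Each chunk carries a small text delta; some chunks have none
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```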
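Applications often need the full text as well as the live tokens. Below is a minimal sketch of a collector. `collectStream` and `fakeStream` are hypothetical names, and `ChatChunk` models only the fields the streaming loop reads, not the full SDK type.

```typescript
// Minimal chunk shape: only the fields the streaming loop reads.
interface ChatChunk {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Hypothetical helper: forward each text delta to `onToken` and return the full text.
async function collectStream(
  stream: AsyncIterable<ChatChunk>,
  onToken: (token: string) => void = () => {},
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) {
      onToken(token);
      full += token;
    }
  }
  return full;
}

// Stand-in for a real server stream so the sketch runs without a model.
async function* fakeStream(): AsyncGenerator<ChatChunk> {
  yield { choices: [{ delta: { content: 'Hel' } }] };
  yield { choices: [{ delta: {} }] }; // a chunk without text, e.g. role-only
  yield { choices: [{ delta: { content: 'lo' } }] };
}

collectStream(fakeStream()).then((text) => console.log(text)); // Hello
```

With a real client, the value returned by `client.chat.completions.create({ ..., stream: true })` should be usable in place of `fakeStream()`, since it is also an async iterable of chunks.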
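### Function calling (some models)
```ts
// Llama 3.1+, Qwen, and Hermes models support tool use
const r = await client.chat.completions.create({
  model: 'llama3.1:8b',
  messages,
  tools: [{
    type: 'function',
    function: {
      name: 'search',
      // Illustrative JSON Schema; replace with the tool's real parameters
      parameters: { type: 'object', properties: { query: { type: 'string' } }, required: ['query'] },
    },
  }],
});
```
⚠️ Not as reliable as cloud APIs, so keep a fallback.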
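Because local models can emit malformed or unexpected tool calls, it helps to validate each call before executing it. Below is a minimal sketch. `parseSearchCall` is a hypothetical helper, and the `query` argument is an assumed parameter for the `search` tool, not something the source defines.

```typescript
// Shape of a tool call as returned in OpenAI-compatible chat responses.
interface ToolCall {
  function: { name: string; arguments: string };
}

// Hypothetical validator for the `search` tool. Returns parsed arguments, or
// null when the model produced an unknown tool, invalid JSON, or a bad shape.
function parseSearchCall(call: ToolCall): { query: string } | null {
  if (call.function.name !== 'search') return null;
  try {
    const args: unknown = JSON.parse(call.function.arguments);
    if (typeof args === 'object' && args !== null) {
      const query = (args as Record<string, unknown>).query;
      if (typeof query === 'string' && query.trim() !== '') return { query };
    }
  } catch {
    // Malformed JSON from the model: fall through and return null.
  }
  return null;
}

const good = parseSearchCall({ function: { name: 'search', arguments: '{"query":"vLLM"}' } });
const bad = parseSearchCall({ function: { name: 'search', arguments: '{"query": ' } });
console.log(good); // { query: 'vLLM' }
console.log(bad); // null
```

A `null` result is the signal to retry, re-prompt, or hand the request to the fallback path.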
### Cost / latency (approximate)
```
Ollama, M3 Max (8B Q4):  ~30 tok/s
vLLM, A100 (70B):        ~50 tok/s
Cloud API (GPT-4o):      ~80-150 tok/s
Cost:
  Cloud: $0.50-15 per 1M tokens
  Local: electricity + hardware depreciation, ~$0 per token
```
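To turn these rough figures into a decision, compare the hardware price with the per-token saving. Below is a minimal sketch. `breakEvenTokens` and `generationSeconds` are hypothetical helpers, and the $3000 hardware price in the example is an assumption for illustration only.

```typescript
// Hypothetical helper: tokens that must be generated before buying hardware
// beats paying per token. All prices are inputs, not measured values.
function breakEvenTokens(
  hardwareUsd: number,
  cloudUsdPerMTok: number,
  localUsdPerMTok = 0,
): number {
  const savingPerMTok = cloudUsdPerMTok - localUsdPerMTok;
  if (savingPerMTok <= 0) return Infinity; // local never pays off
  return (hardwareUsd / savingPerMTok) * 1e6;
}

// Time to generate a response at a given decode speed (ignores prompt processing).
function generationSeconds(outputTokens: number, tokensPerSecond: number): number {
  return outputTokens / tokensPerSecond;
}

// Assumed $3000 machine versus a $15 per 1M token cloud model.
console.log(breakEvenTokens(3000, 15)); // 200000000
// A 600-token answer at ~30 tok/s.
console.log(generationSeconds(600, 30)); // 20
```

At the cheap end of the cloud range ($0.50 per 1M tokens) the same machine needs 6 billion tokens to break even, so throughput and privacy often matter more than price.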
### Privacy / GDPR
- Local = data never leaves your infrastructure.
- Air-gapped deployment is possible.
- Strong compliance posture.
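### Model selection (as of 2026)
```
Llama 3.3 70B:        strong; ~40 GB+ at 4-bit (a single 24 GB GPU needs CPU offload)
Llama 3.1 8B:         balanced; fits in 8 GB at 4-bit
Qwen 2.5 7B / 14B:    strong at Korean / code
Mistral Small 3 24B:  strong reasoning
Gemma 2 9B:           small and fast
DeepSeek-R1 distill:  reasoning
```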
## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| Development / experiments | Ollama / LM Studio |
| Privacy / internal enterprise use | vLLM self-host |
| Mac dev | llama.cpp / Ollama |
| High-throughput prod | vLLM / SGLang |
| Edge device | llama.cpp + small model (1-3B) |
| High cloud cost | Local + cloud fallback |
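The "local + cloud fallback" row can be sketched as a small wrapper. This is one possible shape under stated assumptions, not a definitive implementation: `withFallback` is a hypothetical helper, the 10 second default timeout is arbitrary, and the stubs stand in for real client calls.

```typescript
// Hypothetical wrapper: try the local backend first, then fall back to the
// cloud on an error or when the local call exceeds `timeoutMs`.
async function withFallback<T>(
  local: () => Promise<T>,
  cloud: () => Promise<T>,
  timeoutMs = 10000,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('local backend timed out')), timeoutMs);
  });
  try {
    return await Promise.race([local(), timeout]);
  } catch {
    // Local failed or timed out; note that the local request is not cancelled here.
    return await cloud();
  } finally {
    clearTimeout(timer);
  }
}

// Stubs so the sketch runs on its own.
const failingLocal = async (): Promise<string> => {
  throw new Error('OOM');
};
const cloudStub = async (): Promise<string> => 'cloud answer';

withFallback(failingLocal, cloudStub).then((answer) => console.log(answer)); // cloud answer
```

In practice each callback would wrap a `client.chat.completions.create(...)` call against a different base URL. Keep in mind that falling back sends the prompt to the cloud, which may conflict with the privacy reasons for running locally.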
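## ❌ Anti-patterns
- **Assuming cloud-level accuracy**: small local models are weaker. Validate against your use case.
- **Quantizing too aggressively (Q2)**: quality drops sharply.
- **Too little GPU memory for a large model**: OOM. Use a smaller model or stronger quantization.
- **Using the OpenAI API as-is**: tool calling and structured output behave inconsistently across models and servers. Validate the responses.
- **Single-instance prod**: no HA. Put N replicas behind a load balancer.
- **Streaming in a synchronous app**: blocks and adds latency. Consume the stream asynchronously.
- **Not tracking updates**: new models ship almost weekly, so re-evaluate regularly.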
## 🤖 LLM Usage Hints
- Start with Ollama.
- Production = vLLM (GPU).
- An 8B Q4 model is usually sufficient.
- OpenAI-compatible API → no code changes beyond the base URL and model name.
## 🔗 Related Documents
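Since every backend here speaks the same API, switching between them can be reduced to configuration. Below is a minimal sketch. `BACKENDS` and `clientOptions` are hypothetical names, and the URLs are the ones used in the examples in this note, so adjust hosts and ports to your setup.

```typescript
// Base URLs taken from the examples in this note.
const BACKENDS = {
  ollama: 'http://localhost:11434/v1',
  vllm: 'http://vllm:8000/v1',
  llamacpp: 'http://localhost:8080/v1',
} as const;

type Backend = keyof typeof BACKENDS;

// Hypothetical helper: turn a backend name (e.g. from an env var) into client
// options. Falls back to Ollama when the name is missing or unknown.
function clientOptions(name: string | undefined): { baseURL: string; apiKey: string } {
  const known = name !== undefined && Object.prototype.hasOwnProperty.call(BACKENDS, name);
  const key: Backend = known ? (name as Backend) : 'ollama';
  // Local servers usually ignore the key unless configured otherwise,
  // but the OpenAI SDK requires a non-empty string.
  return { baseURL: BACKENDS[key], apiKey: 'not-needed' };
}

console.log(clientOptions('vllm').baseURL); // http://vllm:8000/v1
console.log(clientOptions(undefined).baseURL); // http://localhost:11434/v1
```

The result can be passed straight to the SDK constructor, for example `new OpenAI(clientOptions(process.env.LLM_BACKEND))`, where `LLM_BACKEND` is an assumed environment variable name.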
- [[AI_Prompt_Engineering_Patterns]]
- [[AI_Function_Calling_Deep]]
- [[AI_Streaming_LLM_Response]]