---
id: wiki-2026-0508-open-source-ai-ecosystem
title: Open Source AI Ecosystem
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [OSS AI, Open Source LLM, Open Weights, Llama Ecosystem]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [open-source, llm, llama, mistral, qwen, huggingface, vllm, ollama]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: transformers/vllm/ollama }
---

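# Open Source AI Ecosystem

## In one line
As of 2026, the open-source AI ecosystem is built from open weights (Llama, Mistral, Qwen, DeepSeek) plus infrastructure (HuggingFace, vLLM, Ollama). It reaches roughly 70-90% of closed-model performance while offering an advantage in cost and autonomy.
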
## Key points
- **Open Weights**: Llama 4, Mistral Large 3, Qwen 3, DeepSeek-V4, Phi-5.
- **Hub**: HuggingFace (models + datasets + Spaces).
- **Inference**: vLLM (throughput), TGI, llama.cpp (CPU/GGUF), MLX (Apple).
- **Local**: Ollama (one-line run), LM Studio, LocalAI.
- **Fine-tune**: Unsloth, axolotl, LLaMA-Factory, TRL.
- **Eval**: lm-eval-harness, lighteval, MTEB.
- **Agents**: LangGraph, LlamaIndex, smolagents, dspy.
- **License**: Apache-2.0 and MIT are safe for commercial use; the Llama Community License is conditional.

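The license point can be turned into a small pre-deployment gate. The sketch below is a minimal illustration, not legal advice. `check_license`, `LicenseVerdict`, and the license identifiers it matches are hypothetical names introduced here. The 700M MAU threshold is the Llama Community License figure this note cites under anti-patterns.

```python
from dataclasses import dataclass

# Assumption: SPDX-style identifiers; this permissive set is illustrative only.
PERMISSIVE = {"apache-2.0", "mit"}
# Monthly-active-user threshold of the Llama Community License as cited in this note.
LLAMA_MAU_LIMIT = 700_000_000


@dataclass(frozen=True)
class LicenseVerdict:
    allowed: bool
    reason: str


def check_license(license_id: str, monthly_active_users: int) -> LicenseVerdict:
    """Classify a model license for commercial deployment (hypothetical helper)."""
    lid = license_id.strip().lower()
    if lid in PERMISSIVE:
        return LicenseVerdict(True, "permissive license")
    if lid.startswith("llama"):
        if monthly_active_users > LLAMA_MAU_LIMIT:
            return LicenseVerdict(False, "exceeds 700M MAU; separate license needed")
        return LicenseVerdict(True, "conditional: review the license terms")
    # Anything unrecognized is rejected so that a human looks at it.
    return LicenseVerdict(False, "unknown license; manual review required")
```

Rejecting unknown identifiers by default keeps the gate conservative: a new license has to be classified explicitly before a model using it can ship.
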
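## 💻 Patterns

```bash
# 1. Ollama: run a model locally with one command
# Replace llama4:8b with any tag available in the Ollama library.
brew install ollama
ollama serve &
ollama run llama4:8b "Explain MoE in 2 sentences."
# REST ("stream": false returns a single JSON response instead of a stream):
# curl http://localhost:11434/api/generate -d '{"model":"llama4:8b","prompt":"hi","stream":false}'
```
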
```python
# 2. HuggingFace transformers: quick load and generate
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, torch_dtype=torch.bfloat16, device_map="auto"
)

msgs = [{"role": "user", "content": "Summarize transformers in 1 line."}]
# add_generation_prompt=True appends the assistant turn marker so the model answers
inp = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inp, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))
```

```python
# 3. vLLM server (OpenAI-compatible API, high throughput)
# pip install vllm
# CLI:
#   vllm serve mistralai/Mistral-Large-3-Instruct \
#     --tensor-parallel-size 4 --max-model-len 32768
#
# Client:
from openai import OpenAI

# vLLM does not check the API key unless the server is started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

```python
# 4. llama.cpp / GGUF: run on CPU or a small GPU
# pip install llama-cpp-python
# huggingface-cli download bartowski/Llama-4-8B-Instruct-GGUF llama-4-8b-instruct-q4_k_m.gguf --local-dir .
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-4-8b-instruct-q4_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU only
)
print(llm("Q: What is RAG?\nA:", max_tokens=128)["choices"][0]["text"])
```

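```python
# 5. Unsloth: LoRA fine-tuning (the project advertises roughly 2x faster training)
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from datasets import load_dataset
from trl import SFTTrainer

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Llama-4-8B-Instruct",
    max_seq_length=4096, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "v_proj"], lora_alpha=16,
)

# Assumption: train.jsonl is a placeholder file whose records have a "text" field
train_ds = load_dataset("json", data_files="train.jsonl", split="train")

# Newer TRL releases expect dataset_text_field and the sequence length in SFTConfig
trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=train_ds, dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()
```
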
```python
# 6. HuggingFace Hub: share datasets and models
from huggingface_hub import HfApi, snapshot_download

# Download
snapshot_download("microsoft/Phi-5-mini-instruct", local_dir="./phi5")

# Upload (push a model); requires a prior `huggingface-cli login` or HF_TOKEN
api = HfApi()
# upload_folder needs an existing repo, so create it first (no-op if it exists)
api.create_repo("myuser/my-llama-finetune", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./my-finetuned",
    repo_id="myuser/my-llama-finetune",
    repo_type="model",
)
```

```bash
# 7. lm-eval-harness: standard benchmarks
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-8B \
  --tasks mmlu,gsm8k,arc_challenge \
  --batch_size 8 --output_path results/
```

```python
# 8. MTEB: evaluate embedding models
import mteb
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Resolve task names to task objects; passing raw strings to MTEB() is deprecated
tasks = mteb.get_tasks(tasks=["STSBenchmark", "Banking77Classification"])
mteb.MTEB(tasks=tasks).run(m, output_folder="mteb_out")
```

## Decision criteria

| Scenario | Recommended stack |
|---|---|
| Local experiments and demos | Ollama / LM Studio |
| Fast on a Mac | MLX + Llama 4 |
| CPU only / Edge | llama.cpp (GGUF q4) |
| Production serving | vLLM + tensor-parallel |
| Fine-tuning with few GPUs | Unsloth + QLoRA |
| Multi-node training | axolotl / LLaMA-Factory |
| Embedding/RAG | bge / e5 / nomic-embed |
| Evaluation | lm-eval-harness, MTEB |

## 🔗 Graph
- Related: `[[LLM_Optimization_and_Deployment_Strategies|Quantization]]`, `[[RAG]]`, `[[LLM_Optimization_and_Deployment_Strategies|vLLM]]`, `[[Ollama]]`, `[[LoRA]]`

## 🤖 LLM usage
- API cost reduction: run frequent routine tasks on a local Qwen3-8B and reserve GPT-5 for hard cases.
- Data sensitivity: medical/financial workloads → on-prem vLLM.
- Fine-tuning enables domain adaptation, a major advantage over closed models.

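The cost and sensitivity rules above amount to a routing decision, which might be sketched as follows. Everything here is an assumption made for illustration: the backend labels, the `looks_hard` keyword heuristic, and the `route` function are hypothetical. A real system would likely use a classifier or confidence signal instead of keywords.

```python
from typing import Callable

# Hypothetical backend labels; the note names Qwen3-8B (local) and GPT-5 (hosted).
LOCAL_MODEL = "qwen3-8b-local"
HOSTED_MODEL = "gpt-5"


def looks_hard(prompt: str, max_routine_words: int = 200) -> bool:
    """Crude stand-in heuristic: long prompts or reasoning keywords count as hard."""
    keywords = ("prove", "step by step", "refactor", "legal")
    text = prompt.lower()
    return len(text.split()) > max_routine_words or any(k in text for k in keywords)


def route(prompt: str, sensitive: bool = False,
          is_hard: Callable[[str], bool] = looks_hard) -> str:
    """Pick a backend; sensitive data always stays on the local model."""
    if sensitive:
        return LOCAL_MODEL
    return HOSTED_MODEL if is_hard(prompt) else LOCAL_MODEL


if __name__ == "__main__":
    print(route("Translate 'hello' to French"))
    print(route("Prove that sqrt(2) is irrational"))
```

The sensitivity check runs before the difficulty check on purpose, so a hard prompt over medical or financial data is still kept on-prem.
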
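## ❌ Anti-patterns
- Deploying commercially without checking the Llama license terms (restriction above 700M MAU).
- Using q2 quantization and ignoring the resulting collapse in quality.
- Serving production traffic with transformers `generate()` instead of vLLM (slow).
- Pushing a model to HuggingFace with secrets included.
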
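For the last anti-pattern, a pre-push scan of the model folder is one possible guard. The sketch below is a minimal illustration using only the standard library. `find_secrets` and the three regular expressions are assumptions introduced here, and they cover only a few common token shapes, so a clean result does not prove the folder is safe.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real scanner would cover many more formats.
SECRET_PATTERNS = {
    "huggingface_token": re.compile(r"hf_[A-Za-z0-9]{30,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}
# Skip large binary weight files; secrets usually hide in configs and scripts.
SKIP_SUFFIXES = {".safetensors", ".bin", ".gguf", ".pt"}


def find_secrets(folder: str) -> list[tuple[str, str]]:
    """Return (relative_path, pattern_name) for every suspicious match."""
    hits = []
    root = Path(folder)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix in SKIP_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(text):
                hits.append((str(path.relative_to(root)), name))
    return hits
```

Calling `find_secrets("./my-finetuned")` before `upload_folder` and aborting on a non-empty result is one way to wire it in.
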
## 🧪 Verification
- Response time after `ollama run` is under 5 s.
- Latency stays stable when vLLM `--max-num-seqs` is increased.
- Quantify the base vs. fine-tuned difference with lm-eval.

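Two of these checks can be scripted with a few lines of Python. The sketch below is a rough illustration. `timed` and `score_deltas` are hypothetical helpers. The `{task: {metric: value}}` input shape is an assumption modeled on the `results` section of lm-eval's JSON output, so the exact keys should be checked against the files in `results/`.

```python
import time
from typing import Callable


def timed(fn: Callable[[], object]) -> tuple[object, float]:
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def score_deltas(base: dict, tuned: dict) -> dict:
    """Per-task, per-metric difference (tuned - base) for numeric metrics in both runs."""
    deltas = {}
    for task, metrics in base.items():
        for name, value in metrics.items():
            other = tuned.get(task, {}).get(name)
            # Non-numeric entries (for example task aliases) are skipped.
            if isinstance(value, (int, float)) and isinstance(other, (int, float)):
                deltas[(task, name)] = other - value
    return deltas
```

Wrapping a single request in `timed` and asserting that the elapsed time is below 5 covers the first check. `score_deltas` gives a per-metric view of what fine-tuning changed.
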
## 🕓 Changelog
- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; updated vLLM/Unsloth/MTEB.