id: wiki-2026-0508-open-source-ai-ecosystem
title: Open Source AI Ecosystem
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: OSS AI, Open Source LLM, Open Weights, Llama Ecosystem
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: open-source, llm, llama, mistral, qwen, huggingface, vllm, ollama
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: transformers/vllm/ollama

Open Source AI Ecosystem

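In one line

As of 2026, the open-source AI stack consists of open weights (Llama, Mistral, Qwen, DeepSeek) plus infrastructure (HuggingFace, vLLM, Ollama). It reaches roughly 70-90% of closed-model performance while offering advantages in cost and autonomy.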

Key points

  • Open Weights: Llama 4, Mistral Large 3, Qwen 3, DeepSeek-V4, Phi-5.
  • Hub: HuggingFace (models + datasets + Spaces).
  • Inference: vLLM (throughput), TGI, llama.cpp (CPU/GGUF), MLX (Apple).
  • Local: Ollama (one-line run), LM Studio, LocalAI.
  • Fine-tune: Unsloth, axolotl, LLaMA-Factory, TRL.
  • Eval: lm-eval-harness, lighteval, MTEB.
  • Agents: LangGraph, LlamaIndex, smolagents, dspy.
  • Licenses: Apache-2.0 and MIT are permissive and generally safe for commercial use; the Llama Community License is conditional.
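
The licensing rule in the last bullet can be sketched as a small lookup. This is a minimal illustration and not legal guidance: the function name `license_risk` and the identifier strings are assumptions made for the example, and anything outside the permissive set is simply flagged for manual review.

```python
# Hypothetical helper: classify a license identifier before commercial use.
# The identifier strings are assumptions for illustration; always read the
# license text that ships with the model.
PERMISSIVE = {"apache-2.0", "mit"}


def license_risk(license_id: str) -> str:
    """Return 'permissive' for Apache-2.0/MIT, otherwise 'review-required'."""
    normalized = license_id.strip().lower()
    if normalized in PERMISSIVE:
        return "permissive"
    # Conditional licenses (e.g. the Llama Community License) and unknown
    # identifiers both need a human to read the terms.
    return "review-required"


if __name__ == "__main__":
    for lic in ["Apache-2.0", "MIT", "llama-community"]:
        print(lic, "->", license_risk(lic))
```

Defaulting to "review-required" keeps the check conservative: a new or misspelled identifier is never silently treated as safe.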

💻 Patterns

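# 1. Ollama: run a model locally with one command
brew install ollama
ollama serve &
ollama run llama4:8b "Explain MoE in 2 sentences."
# REST ("stream": false returns one JSON object instead of a stream of chunks):
# curl http://localhost:11434/api/generate -d '{"model":"llama4:8b","prompt":"hi","stream":false}'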
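# 2. HuggingFace transformers: quick load
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "Qwen/Qwen3-8B"  # the chat-tuned Qwen3 8B repo is published without an "-Instruct" suffix
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, torch_dtype=torch.bfloat16, device_map="auto"
)

msgs = [{"role": "user", "content": "Summarize transformers in 1 line."}]
# add_generation_prompt=True appends the assistant turn header so the model answers
inp = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inp, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0][inp["input_ids"].shape[-1]:], skip_special_tokens=True))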
# 3. vLLM server (OpenAI-compatible API, high throughput)
# pip install vllm
# CLI:
#   vllm serve mistralai/Mistral-Large-3-Instruct \
#       --tensor-parallel-size 4 --max-model-len 32768
#
# Client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
# 4. llama.cpp / GGUF: run on CPU or a small GPU
# pip install llama-cpp-python
# huggingface-cli download bartowski/Llama-4-8B-Instruct-GGUF llama-4-8b-instruct-q4_k_m.gguf --local-dir .
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-4-8b-instruct-q4_k_m.gguf",
    n_ctx=8192, n_gpu_layers=-1,
)
print(llm("Q: What is RAG?\nA:", max_tokens=128)["choices"][0]["text"])
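# 5. Unsloth: LoRA fine-tuning (advertised as roughly 2x faster)
from unsloth import FastLanguageModel  # import unsloth before trl so its patches apply
from trl import SFTTrainer
from datasets import load_dataset

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Llama-4-8B-Instruct",
    max_seq_length=4096, load_in_4bit=True,
)
# Attach LoRA adapters to the attention query/value projections
model = FastLanguageModel.get_peft_model(
    model, r=16, target_modules=["q_proj", "v_proj"], lora_alpha=16,
)

# Assumption: train.jsonl is a placeholder file with one {"text": ...} record per line
train_ds = load_dataset("json", data_files="train.jsonl", split="train")

# Note: newer TRL releases move dataset_text_field/max_seq_length into SFTConfig
trainer = SFTTrainer(
    model=model, tokenizer=tok,
    train_dataset=train_ds, dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()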
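# 6. HuggingFace Hub: sharing datasets and models
from huggingface_hub import HfApi, snapshot_download

# Download a full model repo into a local folder
snapshot_download("microsoft/Phi-5-mini-instruct", local_dir="./phi5")

# Upload (push a model); requires a prior login, e.g. `huggingface-cli login`
api = HfApi()
# Create the target repo first if it does not exist yet
api.create_repo("myuser/my-llama-finetune", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./my-finetuned",
    repo_id="myuser/my-llama-finetune",
    repo_type="model",
)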
# 7. lm-eval-harness: standard benchmarks
pip install lm-eval
lm_eval --model hf \
        --model_args pretrained=Qwen/Qwen3-8B \
        --tasks mmlu,gsm8k,arc_challenge \
        --batch_size 8 --output_path results/
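# 8. MTEB: embedding model evaluation
import mteb
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Resolve task names to task objects (later mteb 1.x releases deprecate passing bare strings)
tasks = mteb.get_tasks(tasks=["STSBenchmark", "Banking77Classification"])
mteb.MTEB(tasks=tasks).run(m, output_folder="mteb_out")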
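
Returning to pattern 1, the REST call shown there as a curl comment can also be made from Python with only the standard library. The sketch below assumes a local `ollama serve` on the default port; the helper names `build_generate_request` and `generate` are hypothetical, and the model tag is whatever is installed locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from pattern 1


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        # With "stream": false the body is a single JSON object.
        return json.loads(resp.read())["response"]
```

With the server running, `generate("llama4:8b", "hi")` would return the completion text. Keeping request construction separate from the network call makes the payload easy to inspect without a server.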

Decision criteria

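| Scenario | Recommended stack |
| --- | --- |
| Local experiments and demos | Ollama / LM Studio |
| Fast iteration on a Mac | MLX + Llama 4 |
| CPU only / Edge | llama.cpp (GGUF q4) |
| Production serving | vLLM + tensor-parallel |
| Fine-tuning with few GPUs | Unsloth + QLoRA |
| Multi-node training | axolotl / LLaMA-Factory |
| Embedding/RAG | bge / e5 / nomic-embed |
| Evaluation | lm-eval-harness, MTEB |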

🔗 Graph

  • Related: [[LLM_Optimization_and_Deployment_Strategies|Quantization]], [[RAG]], [[LLM_Optimization_and_Deployment_Strategies|vLLM]], [[Ollama]], [[LoRA]]

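🤖 Using LLMs

  • Cutting API costs: run frequent routine tasks on a local Qwen3-8B and reserve GPT-5 for the hard cases.
  • Data sensitivity: medical and financial workloads go to on-prem vLLM.
  • Fine-tunable weights enable domain adaptation, a major advantage over closed models.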
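
The cost and sensitivity bullets above amount to a routing rule, which can be sketched as follows. This is an illustration under assumptions: the `Task` fields, the `route` function, and the backend labels are all hypothetical, and how a task gets marked "hard" or "sensitive" is left to the caller.

```python
from dataclasses import dataclass

# Backend labels are placeholders, not real endpoint names.
LOCAL = "local-qwen3-8b"
ONPREM = "onprem-vllm"
HOSTED = "hosted-frontier-api"


@dataclass
class Task:
    prompt: str
    sensitive: bool = False  # e.g. medical or financial data
    hard: bool = False       # the caller's own judgment of difficulty


def route(task: Task) -> str:
    """Pick a backend: sensitivity first, then difficulty, then cost."""
    if task.sensitive:
        return ONPREM   # sensitive data stays on premises
    if task.hard:
        return HOSTED   # pay for the closed model only when needed
    return LOCAL        # routine work stays on the cheap local model


if __name__ == "__main__":
    print(route(Task("Summarize this changelog")))
    print(route(Task("Patient discharge note", sensitive=True, hard=True)))
```

Checking sensitivity before difficulty means a hard but sensitive task still never leaves the on-prem deployment.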

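Anti-patterns

  • Deploying commercially without checking the Llama license terms (products with more than 700M monthly active users need a separate license from Meta).
  • Using q2 quantization and ignoring the resulting quality collapse.
  • Serving production traffic with transformers generate() instead of vLLM (slow).
  • Pushing a model repo that contains secrets to HuggingFace.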
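
The last anti-pattern can be guarded against with a pre-upload scan of the folder that is about to be pushed. The sketch below is deliberately simple and is not a substitute for a dedicated secret scanner: the file-name list, the token regex, and the function name `find_secret_candidates` are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path

# Illustrative patterns only; both lists are assumptions, not a complete policy.
SUSPICIOUS_NAMES = {".env", "id_rsa", "credentials.json"}
TOKEN_RE = re.compile(r"hf_[A-Za-z0-9]{20,}")  # assumed shape of a Hub token
MAX_TEXT_BYTES = 1_000_000  # skip large files such as weight shards


def find_secret_candidates(folder: str) -> list:
    """Return relative paths of files that look like they contain secrets."""
    root = Path(folder)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if path.name in SUSPICIOUS_NAMES:
            hits.append(rel)
        elif path.stat().st_size <= MAX_TEXT_BYTES:
            # Read small files as text and look for token-like strings.
            if TOKEN_RE.search(path.read_text(errors="ignore")):
                hits.append(rel)
    return hits


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo:
        (Path(demo) / ".env").write_text("API_KEY=placeholder")
        (Path(demo) / "config.json").write_text("{}")
        print(find_secret_candidates(demo))
```

A caller would run this on the upload folder and abort the push if the returned list is non-empty.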

🧪 Verification

  • ollama run responds in under 5 s.
  • Latency stays stable as vLLM's --max-num-seqs is increased.
  • Quantify the base vs. fine-tuned difference with lm-eval.
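
Two of these checks can be scripted: timing a call against the 5 s budget, and computing per-task deltas between a base and a fine-tuned run. The sketch assumes the caller supplies a zero-argument callable that performs one request, and has already extracted one score per task from the lm-eval output. The helper names are hypothetical and the scores in the demo are made-up placeholders, not measured results.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(call: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Time `call` several times; return median and max latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}


def score_deltas(base: Dict[str, float], tuned: Dict[str, float]) -> Dict[str, float]:
    """Per-task difference (tuned - base) for tasks present in both runs."""
    shared = sorted(base.keys() & tuned.keys())
    return {task: round(tuned[task] - base[task], 4) for task in shared}


if __name__ == "__main__":
    # Stand-in for a real request; replace with a call to the model server.
    stats = measure_latency(lambda: time.sleep(0.01))
    print("within 5 s budget:", stats["max_s"] < 5.0)
    # Placeholder scores for illustration only.
    print(score_deltas({"gsm8k": 0.50, "mmlu": 0.60}, {"gsm8k": 0.58, "mmlu": 0.59}))
```

Reporting the maximum alongside the median makes a single slow cold-start request visible instead of averaging it away.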

🕓 Changelog

  • 2026-05-08 Phase 1: initial draft.
  • 2026-05-10 Manual cleanup: 8 patterns; vLLM/Unsloth/MTEB updated.