---
id: wiki-2026-0508-open-source-ai-ecosystem
title: Open Source AI Ecosystem
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [OSS AI, Open Source LLM, Open Weights, Llama Ecosystem]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [open-source, llm, llama, mistral, qwen, huggingface, vllm, ollama]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: transformers/vllm/ollama }
---

# Open Source AI Ecosystem

## One Line

As of 2026, the OSS AI ecosystem consists of open weights (Llama, Mistral, Qwen, DeepSeek) plus infrastructure (HuggingFace, vLLM, Ollama); it reaches roughly 70-90% of closed-model performance while holding the advantage on cost and autonomy.

## Key Points

- **Open Weights**: Llama 4, Mistral Large 3, Qwen 3, DeepSeek-V4, Phi-5.
- **Hub**: HuggingFace (models + datasets + Spaces).
- **Inference**: vLLM (throughput), TGI, llama.cpp (CPU/GGUF), MLX (Apple).
- **Local**: Ollama (one-line run), LM Studio, LocalAI.
- **Fine-tune**: Unsloth, axolotl, LLaMA-Factory, TRL.
- **Eval**: lm-eval-harness, lighteval, MTEB.
- **Agents**: LangGraph, LlamaIndex, smolagents, dspy.
- Licenses: Apache-2.0 and MIT are permissive and generally safe; the Llama Community License is conditional.

## 💻 Patterns

```bash
# 1. Ollama: run a model locally in one line
brew install ollama
ollama serve &
ollama run llama4:8b "Explain MoE in 2 sentences."
# REST ("stream": false returns one JSON object instead of a token stream):
curl http://localhost:11434/api/generate \
  -d '{"model":"llama4:8b","prompt":"hi","stream":false}'
```

```python
# 2. HuggingFace transformers: quick load
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

mid = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, torch_dtype=torch.bfloat16, device_map="auto"
)

msgs = [{"role": "user", "content": "Summarize transformers in 1 line."}]
# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

```python
# 3. vLLM server (OpenAI-compatible API, high throughput)
# pip install vllm
# CLI:
#   vllm serve mistralai/Mistral-Large-3-Instruct \
#     --tensor-parallel-size 4 --max-model-len 32768
#
# Client:
from openai import OpenAI

# vLLM does not check the key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Large-3-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

```python
# 4. llama.cpp / GGUF: run on CPU or a small GPU
# pip install llama-cpp-python
# huggingface-cli download bartowski/Llama-4-8B-Instruct-GGUF \
#   llama-4-8b-instruct-q4_k_m.gguf --local-dir .
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-4-8b-instruct-q4_k_m.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)
print(llm("Q: What is RAG?\nA:", max_tokens=128)["choices"][0]["text"])
```

```python
# 5. Unsloth: LoRA fine-tune (advertised as about 2x faster)
from unsloth import FastLanguageModel  # import before trl so Unsloth can patch it
from datasets import load_dataset
from trl import SFTTrainer

model, tok = FastLanguageModel.from_pretrained(
    "unsloth/Llama-4-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=16,
)

# Hypothetical dataset file: one JSON object per line with a "text" field.
train_ds = load_dataset("json", data_files="train.jsonl", split="train")

# Newer TRL versions expect dataset_text_field and the sequence length
# inside an SFTConfig passed as args=.
trainer = SFTTrainer(
    model=model,
    tokenizer=tok,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=4096,
)
trainer.train()
```

```python
# 6. HuggingFace Hub: share datasets and models
from huggingface_hub import HfApi, snapshot_download

# Download
snapshot_download("microsoft/Phi-5-mini-instruct", local_dir="./phi5")

# Upload (push a model); requires a prior `huggingface-cli login`
api = HfApi()
api.create_repo("myuser/my-llama-finetune", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./my-finetuned",
    repo_id="myuser/my-llama-finetune",
    repo_type="model",
)
```

```bash
# 7. lm-eval-harness: standard benchmarks
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-8B \
  --tasks mmlu,gsm8k,arc_challenge \
  --batch_size 8 --output_path results/
```

```python
# 8. MTEB: embedding model evaluation
import mteb
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Resolve task names to task objects; passing bare strings is deprecated.
tasks = mteb.get_tasks(tasks=["STSBenchmark", "Banking77Classification"])
mteb.MTEB(tasks=tasks).run(m, output_folder="mteb_out")
```

## Decision Criteria

| Scenario | Recommended stack |
|---|---|
| Local experiments and demos | Ollama / LM Studio |
| Fast on a Mac | MLX + Llama 4 |
| CPU only / edge | llama.cpp (GGUF q4) |
| Production serving | vLLM + tensor parallelism |
| Fine-tuning with few GPUs | Unsloth + QLoRA |
| Multi-node training | axolotl / LLaMA-Factory |
| Embedding / RAG | bge / e5 / nomic-embed |
| Evaluation | lm-eval-harness, MTEB |

## 🔗 Graph

- Related: `[[LLM_Optimization_and_Deployment_Strategies|Quantization]]`, `[[RAG]]`, `[[LLM_Optimization_and_Deployment_Strategies|vLLM]]`, `[[Ollama]]`, `[[LoRA]]`

## 🤖 LLM Usage

- API cost reduction: run frequent routine tasks on a local Qwen3-8B and send only the hard cases to GPT-5.
- Data sensitivity: medical and financial workloads → on-prem vLLM.
- Fine-tuning is possible, which enables domain adaptation (a major advantage over closed models).

## ❌ Anti-patterns

- Deploying Llama commercially without checking the license terms (products above 700M monthly active users need a separate license from Meta).
- Using q2 quantization and ignoring the resulting quality collapse.
- Serving production traffic with transformers `generate()` instead of vLLM (slow, no continuous batching).
- Pushing a model repo that contains secrets to HuggingFace.

## 🧪 Validation

- Response time after `ollama run` is under 5 s.
- Latency stays stable when vLLM `--max-num-seqs` is increased.
- Quantify the base vs fine-tuned difference with lm-eval.

## 🕓 Changelog

- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; vLLM/Unsloth/MTEB updated.
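
## 💻 Routing sketch

The cost-saving idea under LLM usage (routine tasks on a local model, hard cases on a hosted one) could be sketched as a small router. This is a minimal illustration, not a recommended policy: the `Backend` type, the endpoint URLs, the keyword list, and the 2000-character cutoff are all hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str      # label used in logs
    base_url: str  # OpenAI-compatible endpoint
    model: str     # model ID sent with each request


# Both entries are illustrative assumptions, not real deployment values.
LOCAL = Backend("local", "http://localhost:8000/v1", "Qwen/Qwen3-8B")
HOSTED = Backend("hosted", "https://api.example.com/v1", "gpt-5")

# Hypothetical markers of a "hard" request.
HARD_KEYWORDS = ("prove", "multi-step", "legal", "architecture")


def is_hard(prompt: str, max_routine_chars: int = 2000) -> bool:
    """Crude heuristic: long prompts or prompts with hard keywords count as hard."""
    text = prompt.lower()
    return len(prompt) > max_routine_chars or any(k in text for k in HARD_KEYWORDS)


def pick_backend(prompt: str) -> Backend:
    """Return the local backend for routine prompts and the hosted one otherwise."""
    return HOSTED if is_hard(prompt) else LOCAL


if __name__ == "__main__":
    for p in ("Summarize this email in one line.", "Prove that the algorithm terminates."):
        print(pick_backend(p).name, "<-", p)
```

The chosen `Backend` can be passed to an OpenAI-compatible client, as in pattern 3, via its `base_url` and `model` fields. A keyword heuristic is only a starting point; a sensitivity check (for example, forcing medical or financial prompts to the on-prem backend) would fit in the same function.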