---
id: [[P-Reinforce|P-Reinforce]]-AUTO-DFWK-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, vllm, tensorrt-llm, ollama, serving, inference-engine]
last_reinforced: 2026-05-04
---

# [[Deployment Frameworks|Deployment Frameworks]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The field command center for deploying the latest AI: high-performance inference engines that accelerate research-stage models to production-ready speed and connect infrastructure with software so a service can handle thousands of concurrent users."

## 📖 Structured Knowledge (Synthesized Content)

These frameworks are optimized for running and serving LLMs efficiently across a range of hardware environments.

1. **[[vLLM|vLLM]]**:
    * **Strengths**: Pioneered [[PagedAttention|PagedAttention]], which gives it excellent memory efficiency and throughput. It is one of the most widely used serving engines in the open-source community.
    * **Best for**: General-purpose LLM serving and handling requests from many concurrent users.
2. **TensorRT-LLM (NVIDIA)**:
    * **Strengths**: A low-level acceleration library optimized for NVIDIA hardware. Its C++ core delivers strong performance and heavily optimized kernels.
    * **Best for**: Enterprise-grade high-performance services and NVIDIA-only cloud infrastructure.
3. **Ollama**:
    * **Strengths**: A user-friendly tool that runs LLMs on a local PC (macOS, Linux, Windows) right away, with no complex configuration.
    * **Best for**: Local development, personal AI assistants, and lightweight test environments.
4. **TGI (Text Generation Inference)**:
    * **Strengths**: A production inference engine developed by Hugging Face, notable for its stability and broad model support.

## ⚖️ Trade-offs & Caveats

* **Flexibility vs. performance**: Ollama is very easy to use but offers little room for fine-grained tuning, while TensorRT-LLM delivers top-tier performance at the cost of a very complex build and configuration process.
* **Hardware lock-in**: TensorRT-LLM runs only on NVIDIA GPUs. vLLM is expanding its AMD GPU support, but most of its optimization work still targets NVIDIA.

## 🔗 Knowledge Links (Graph)

* **Core techniques**: [[PagedAttention|PagedAttention]], [[Continuous Batching|Continuous Batching]], [[Quantization|Quantization]]
* **Related infrastructure**: [[GPU Infrastructure|GPU Infrastructure]], [[Docker|Docker]]
* **Project applications**: Local development agent ([[Ollama|Ollama]]), high-performance RAG serving engine ([[vLLM|vLLM]])

---

*Last updated: 2026-05-04*