---
id: [[P-Reinforce|P-Reinforce]]-AUTO-VLLM-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, vllm, llm-serving, throughput-optimization, paged-attention]
last_reinforced: 2026-05-04
---

# [[vLLM|vLLM]]

## 📌 One-Line Insight (The Karpathy Summary)
> "A game changer for serving performance: the standard inference engine that was the first in the world to introduce PagedAttention, reached 10 to 20 times or more the throughput of earlier systems, and brought the era of practical LLM services forward."

## 📖 Structured Knowledge (Synthesized Content)
vLLM (Virtual Large Language Model) is an open-source library designed for high-performance LLM inference and serving. It focuses on memory efficiency and on maximizing throughput.

1. **Core techniques**:
    * **[[PagedAttention|PagedAttention]]**: Stores the KV cache in fixed-size blocks that need not be contiguous in memory. This removes most memory fragmentation and sharply raises KV cache utilization.
    * **Continuous Batching**: Instead of waiting for every request in a batch to finish, the scheduler works at the granularity of a single token-generation step and slots new requests into the batch as soon as capacity frees up, which keeps GPU utilization high.
2. **Key features**:
    * **High throughput**: Delivers substantially higher throughput than Hugging Face Transformers or Text Generation Inference (TGI).
    * **Versatility**: Supports most recent open-source models, including Llama, Mistral, and Gemma, and exposes an OpenAI-compatible API, so integration is straightforward.
3. **Significance**:
    * It is one of the standard frameworks considered first when building a production-grade LLM service.

## ⚖️ Trade-offs & Caveats
* **VRAM occupancy**: For performance, vLLM pre-allocates most of the available VRAM for the KV cache, which makes it hard to share a GPU with other processes.
* **TTFT vs Throughput**: Overall throughput is excellent, but under extreme batching the time to first token (Time-to-First-Token) can increase slightly.

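
The PagedAttention idea and the pre-allocation caveat above can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class `PagedKVCache`, its methods, the `OutOfBlocks` exception, and the `BLOCK_SIZE` value are all hypothetical names chosen for illustration, and no real tensors are involved. It only shows the bookkeeping: a fixed pool of blocks is reserved up front, each sequence maps to a list of (possibly non-contiguous) blocks, and unused space is limited to the tail of each sequence's last block.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block; illustrative value for this sketch


class OutOfBlocks(RuntimeError):
    """Raised when the pre-allocated pool has no free block left."""


@dataclass
class PagedKVCache:
    """Toy block allocator (hypothetical, for illustration only)."""

    num_blocks: int                                    # pool size, fixed up front
    free_blocks: list = field(init=False)              # ids of unused physical blocks
    block_tables: dict = field(default_factory=dict)   # seq_id -> list of block ids
    seq_lens: dict = field(default_factory=dict)       # seq_id -> tokens stored

    def __post_init__(self):
        # Pre-allocation: the whole pool exists before any request arrives.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise OutOfBlocks(f"no free block for sequence {seq_id!r}")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Token slots reserved but unused (only the tail of each last block)."""
        return sum(
            len(table) * BLOCK_SIZE - self.seq_lens[seq_id]
            for seq_id, table in self.block_tables.items()
        )


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8)
    for _ in range(20):
        cache.append_token("a")
    for _ in range(5):
        cache.append_token("b")
    print("block tables:", cache.block_tables)
    print("free blocks:", len(cache.free_blocks))
    print("wasted slots:", cache.wasted_slots())
```

Because a finished sequence returns whole blocks to the pool, any waiting request can reuse them immediately, which is also what lets continuous batching admit new requests between decoding steps. In vLLM itself, the share of GPU memory reserved for this kind of pool is governed by the `gpu_memory_utilization` engine argument; lowering it is the usual way to leave room for other processes, at the cost of fewer concurrent sequences.
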
## 🔗 Knowledge Links (Graph)
* **Core foundations**: [[PagedAttention|PagedAttention]], [[Key-Value (KV) Cache|Key-Value (KV) Cache]]
* **Competing/alternative technologies**: [[TensorRT-LLM|TensorRT-LLM]], [[TGI|TGI]], [[Ollama|Ollama]]
* **Optimization techniques**: [[Quantization|Quantization]], [[Speculative Decoding|Speculative Decoding]]

---
*Last updated: 2026-05-04*