---
id: [[P-Reinforce|P-Reinforce]]-AUTO-KVCH-001
category: Unified
confidence_score: 1.00
tags: [auto-reinforced, kv-cache, transformer-inference, memory-bottleneck, llm-performance]
last_reinforced: 2026-05-04
---

# [[Key-Value (KV) Cache|Key-Value (KV) Cache]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The model's short-term memory: during transformer inference, the Key and Value results computed for earlier tokens are stored in memory and reused, which eliminates the waste of recomputing them from scratch at every step and dramatically speeds up generation. It is the heart of inference optimization."

## 📖 Structured Knowledge (Synthesized Content)

The KV cache (Key-Value cache) is a technique in which a large language model (LLM) keeps the Key and Value tensors of already-processed tokens in memory while it generates text. This removes the redundant computation that otherwise occurs during autoregressive generation.

1. **Why it is needed**:
   * A transformer must attend to all previous tokens when it predicts the next token.
   * Without a KV cache, generating the $n$-th token means reprocessing tokens $1$ through $n-1$ at every step, so the total computation grows superlinearly (at least quadratically) with sequence length.
2. **How it works**:
   * **Prefill phase**: The input prompt is processed in a single pass, and the K and V values of every prompt token are computed and stored in the cache.
   * **Decoding phase**: Each time a new token is generated, only that token's K and V values are computed and appended to the cache; the earlier values are read back from memory.
3. **Bottlenecks**:
   * **Memory pressure**: As the context length grows, the VRAM occupied by the KV cache grows linearly.
     (For example, when thousands of users hold long conversations at the same time, this is a leading cause of GPU out-of-memory (OOM) errors.)
   * **I/O bottleneck**: Inference speed ends up being determined by how fast the cached data can be read from memory (memory bandwidth) rather than by the computation itself.

## ⚖️ Trade-offs & Caveats

* **Capacity vs. speed**: Caching more makes generation faster but consumes more memory. Shrinking the cache (Compression/Quantization) allows longer sequences to be processed, but accuracy may drop slightly.
* **Fragmentation**: When a fixed-size region of memory is allocated in advance, part of it goes unused, which leads to memory fragmentation. [[PagedAttention|PagedAttention]] was introduced to address this.

## 🔗 Knowledge Links (Graph)

* **Parent concepts**: [[Attention Mechanisms|Attention Mechanisms]], [[LLM Inference Optimization|LLM Inference Optimization]]
* **Optimization techniques**: [[PagedAttention|PagedAttention]], [[KV Cache Compression|KV Cache Compression]], [[KV Cache Quantization|KV Cache Quantization]], [[Grouped-Query Attention (GQA)|GQA]]
* **Frameworks**: [[vLLM|vLLM]], [[TensorRT-LLM|TensorRT-LLM]]

---

*Last updated: 2026-05-04*
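
The prefill and decoding phases, and the linear growth of cache memory with context length, can be illustrated with a small sketch. This is a toy under stated assumptions, not how vLLM or TensorRT-LLM implement it: it uses a single attention head, plain Python lists, and hypothetical names (`SingleHeadKVCache`, `attend_without_cache`, `kv_cache_bytes`) invented for this note. The size formula assumes one K and one V vector per layer, per KV head, per token, and ignores any allocator overhead.

```python
import math
import random


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes; the leading 2 counts K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


class SingleHeadKVCache:
    """Toy single-head attention layer that keeps K and V for past tokens."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def prefill(self, prompt):
        """Compute and cache K, V for every prompt token in one pass."""
        self.keys = [matvec(x, self.Wk) for x in prompt]
        self.values = [matvec(x, self.Wv) for x in prompt]
        return attend(matvec(prompt[-1], self.Wq), self.keys, self.values)

    def decode_step(self, x):
        """Compute K, V only for the new token, append them, then attend."""
        self.keys.append(matvec(x, self.Wk))
        self.values.append(matvec(x, self.Wv))
        return attend(matvec(x, self.Wq), self.keys, self.values)


def attend_without_cache(tokens, Wq, Wk, Wv):
    """Baseline: recompute K and V for the whole sequence at every step."""
    keys = [matvec(x, Wk) for x in tokens]
    values = [matvec(x, Wv) for x in tokens]
    return attend(matvec(tokens[-1], Wq), keys, values)


if __name__ == "__main__":
    rng = random.Random(0)
    d = 4
    Wq, Wk, Wv = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
                  for _ in range(3))
    tokens = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

    layer = SingleHeadKVCache(Wq, Wk, Wv)
    layer.prefill(tokens[:3])
    for t in range(3, 5):
        cached = layer.decode_step(tokens[t])
        full = attend_without_cache(tokens[:t + 1], Wq, Wk, Wv)
        assert all(math.isclose(a, b) for a, b in zip(cached, full))

    # Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128, FP16.
    print(kv_cache_bytes(32, 32, 128, seq_len=4096) / 2**30, "GiB")
```

The cached and uncached paths produce the same attention output, but `decode_step` performs only one K and one V projection per new token. With the illustrative shape above, the formula gives 2 GiB for a single 4096-token sequence, which is why many concurrent long conversations exhaust VRAM quickly. Lowering `num_kv_heads`, as Grouped-Query Attention does, or `bytes_per_elem`, as cache quantization does, shrinks the result proportionally.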