--- id: [[P-Reinforce|P-Reinforce]]-AUTO-KVCP-001 category: Unified confidence_score: 1.00 tags: [auto-reinforced, kv-cache-compression, attention-optimization, thin-kv, eviction-policy] last_reinforced: 2026-05-04 --- # [[KV Cache Compression|KV Cache Compression]] ## ๐Ÿ“Œ ํ•œ ์ค„ ํ†ต์ฐฐ (The Karpathy Summary) > "๊ธฐ์–ต์˜ ๋‹ค์ด์–ดํŠธ: ๋ชจ๋“  ์ •๋ณด๋ฅผ ๋ฌด์ž‘์ • ๋“ค๊ณ  ์žˆ๋Š” ๋Œ€์‹ , ๋งฅ๋ฝ์— ๋œ ์ค‘์š”ํ•œ ํ† ํฐ์„ ์„ ๋ณ„์ ์œผ๋กœ ์‚ญ์ œํ•˜๊ฑฐ๋‚˜ ์••์ถ•ํ•จ์œผ๋กœ์จ ํ•œ์ •๋œ VRAM ์•ˆ์—์„œ ๋ฌดํ•œ์— ๊ฐ€๊นŒ์šด ๋ฌธ๋งฅ์„ ์ˆ˜์šฉํ•˜๋ ค๋Š” ๊ณ ๋„์˜ ์ตœ์ ํ™” ์ „๋žต." ## ๐Ÿ“– ๊ตฌ์กฐํ™”๋œ ์ง€์‹ (Synthesized Content) KV ์บ์‹œ ์••์ถ•(KV Cache Compression)์€ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์—ฌ ๋” ๊ธด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ฑฐ๋‚˜ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๋†’์ด๊ธฐ ์œ„ํ•ด, ์ค‘์š”๋„๊ฐ€ ๋‚ฎ์€ KV ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ฑฐํ•˜๊ฑฐ๋‚˜ ์š”์•ฝํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. 1. **์ฃผ์š” ์ „๋žต**: * **์ถ•์ถœ (Eviction)**: ์–ดํ…์…˜ ์ ์ˆ˜๊ฐ€ ๋‚ฎ๊ฑฐ๋‚˜ ์ •๋ณด ๊ฐ€์น˜๊ฐ€ ์ ์€ ํ† ํฐ์˜ K, V ๊ฐ’์„ ์บ์‹œ์—์„œ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค. (์˜ˆ: StreamingLLM, H2O) * **๋ณ‘ํ•ฉ (Merging/Pooling)**: ์œ ์‚ฌํ•œ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง„ ์—ฌ๋Ÿฌ ํ† ํฐ์˜ KV ๊ฐ’์„ ํ•˜๋‚˜๋กœ ํ•ฉ์ณ์„œ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค. * **๋™์  ์„ ํƒ**: ์ถ”๋ก  ์‹œ ๋ชจ๋ธ์ด ์Šค์Šค๋กœ ์–ด๋–ค ์ •๋ณด๋ฅผ ๊ธฐ์–ตํ•˜๊ณ  ์–ด๋–ค ์ •๋ณด๋ฅผ ์žŠ์„์ง€ ๊ฒฐ์ •ํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. 2. **ThinKV (์ตœ์‹  ์‚ฌ๋ก€)**: * ๋…ผ๋ฆฌ์  '์ƒ๊ฐ(Thought)'์˜ ์ค‘์š”๋„์— ๋”ฐ๋ผ ๋œ ์ค‘์š”ํ•œ KV ์บ์‹œ ํ† ํฐ์„ ์„ ์ œ์ ์œผ๋กœ ๋น„์šฐ๊ณ , ๋ณ„๋„์˜ ์••์ถ• ์˜ค๋ฒ„ํ—ค๋“œ ์—†์ด ๋ฉ”๋ชจ๋ฆฌ ์Šฌ๋กฏ์„ ์ œ์ž๋ฆฌ์—์„œ ์žฌ์‚ฌ์šฉ(In-place reuse)ํ•˜๋Š” ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์••์ถ• ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค. 3. **์žฅ์ **: * ๋ฉ”๋ชจ๋ฆฌ ํ’‹ํ”„๋ฆฐํŠธ๋ฅผ 50%~90% ์ด์ƒ ํš๊ธฐ์ ์œผ๋กœ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. * ํ•˜๋“œ์›จ์–ด ์ฆ์„ค ์—†์ด ์†Œํ”„ํŠธ์›จ์–ด๋งŒ์œผ๋กœ ๋” ๊ธด ์ปจํ…์ŠคํŠธ ์œˆ๋„์šฐ๋ฅผ ํ™•๋ณดํ•ฉ๋‹ˆ๋‹ค. ## โš–๏ธ Trade-offs & Caveats * **์ •ํ™•๋„ ์†์‹ค**: ์ค‘์š”ํ•œ ํ† ํฐ์ด ์ถ•์ถœ๋  ๊ฒฝ์šฐ ๋ชจ๋ธ์˜ ์ถ”๋ก  ๋…ผ๋ฆฌ๊ฐ€ ๊นจ์ง€๊ฑฐ๋‚˜ ํ™˜๊ฐ(Hallucination)์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. * **์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ**: ์–ด๋–ค ํ† ํฐ์„ ๋ฒ„๋ฆด์ง€ ๊ณ„์‚ฐํ•˜๋Š” ๊ณผ์ • ์ž์ฒด๊ฐ€ ์ถ”๊ฐ€์ ์ธ ์ง€์—ฐ ์‹œ๊ฐ„(Latency)์„ ๋ฐœ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ## ๐Ÿ”— ์ง€์‹ ์—ฐ๊ฒฐ (Graph) * **์ƒ์œ„ ๊ฐœ๋…**: [[Key-Value (KV) Cache|Key-Value (KV) Cache]] * **์—ฐ๊ด€ ๊ธฐ์ˆ **: [[Sparse Attention|Sparse Attention]], [[KV Cache Quantization|KV Cache Quantization]], [[ThinKV|ThinKV]], [[StreamingLLM|StreamingLLM]] --- *Last updated: 2026-05-04*