---
id: wiki-2026-0508-kv-cache-compression
title: KV Cache Compression
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-KVCP-001]
duplicate_of: none
source_trust_level: A
confidence_score: 1.0
tags: [auto-reinforced, kv-cache-compression, attention-optimization, thin-kv, eviction-policy]
raw_sources: []
last_reinforced: 2026-05-04
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[KV Cache Compression|KV Cache Compression]]

## 📌 One-Line Insight (The Karpathy Summary)

> "A diet for memory: rather than holding on to every piece of information indiscriminately, selectively delete or compress the tokens that matter less to the context, so that a limited amount of VRAM can accommodate a near-unbounded context."

## 📖 Structured Knowledge (Synthesized Content)

KV cache compression is a technique that removes or summarizes low-importance KV data to reduce memory usage, so that longer sequences can be processed or throughput can be increased.

1. **Main strategies**:
    * **Eviction**: Deletes from the cache the K and V values of tokens that have low attention scores or carry little information. (Examples: StreamingLLM, H2O)
    * **Merging/Pooling**: Combines the KV values of several semantically similar tokens into a single stored entry.
    * **Dynamic selection**: Lets the model itself decide, at inference time, which information to remember and which to forget.
2. **ThinKV (recent example)**:
    * A hybrid compression framework that proactively evicts less important KV cache tokens according to the importance of each logical "thought", and reuses the freed memory slots in place without incurring separate compression overhead.
3. **Advantages**:
    * Can cut the memory footprint by 50% to 90% or more, depending on the method and how aggressive the cache budget is.
    * Provides a longer context window through software alone, without adding hardware.

## ⚠️ Contradictions & Updates

* **Accuracy loss**: If an important token is evicted, the model's chain of reasoning can break down or hallucination can occur.
* **Compute overhead**: Deciding which tokens to discard is itself a computation and can add latency.

## 🔗 Knowledge Links (Graph)

* **Parent concept**: [[Key-Value (KV) Cache|Key-Value (KV) Cache]]
* **Related techniques**: [[Sparse Attention|Sparse Attention]], [[KV Cache Quantization|KV Cache Quantization]], [[ThinKV|ThinKV]], [[StreamingLLM|StreamingLLM]]

---
*Last updated: 2026-05-04*

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*

## 🧬 Duplicate Check

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template and missing fields backfilled.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns

**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*

```text
# TODO
```

## 🤔 Decision Criteria

**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
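
Returning to the eviction strategy listed among the main strategies: the following is a minimal sketch, not the implementation used by StreamingLLM, H2O, or ThinKV. It assumes that a cumulative attention score is already tracked for every cached token, and it combines two ideas loosely inspired by those methods: always keep a few initial tokens and the most recent tokens, then fill the remaining budget with the highest-scoring tokens in between. All names here (`select_kept_positions`, `evict`, `cum_attention`, `budget`, `n_sink`, `n_recent`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def select_kept_positions(
    cum_attention: Sequence[float],
    budget: int,
    n_sink: int = 2,
    n_recent: int = 4,
) -> List[int]:
    """Return the sorted cache positions to keep under a fixed token budget.

    Assumption: cum_attention[i] is the attention mass that cached token i
    has received so far. How that score is gathered is outside this sketch.
    """
    n = len(cum_attention)
    if budget >= n:
        return list(range(n))  # cache already fits, nothing to evict
    if budget < n_sink + n_recent:
        raise ValueError("budget must cover the sink and recent tokens")

    # Always keep the first n_sink and the last n_recent positions.
    keep = set(range(n_sink)) | set(range(n - n_recent, n))

    # Fill the remaining budget with the highest-scoring middle positions.
    middle = [i for i in range(n) if i not in keep]
    middle.sort(key=lambda i: cum_attention[i], reverse=True)
    keep.update(middle[: budget - len(keep)])
    return sorted(keep)


def evict(
    keys: Sequence,
    values: Sequence,
    cum_attention: Sequence[float],
    budget: int,
    **kwargs,
) -> Tuple[list, list, list]:
    """Drop every cache entry whose position was not selected."""
    kept = select_kept_positions(cum_attention, budget, **kwargs)
    return (
        [keys[i] for i in kept],
        [values[i] for i in kept],
        [cum_attention[i] for i in kept],
    )


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.1, 0.7, 0.05, 0.6, 0.2, 0.3, 0.1, 0.4]
    print(select_kept_positions(scores, budget=6, n_sink=2, n_recent=2))
    # [0, 1, 3, 5, 8, 9]
```

In this toy run the cache shrinks from 10 entries to 6, a 40% reduction. The sort over the middle positions is also a small instance of the compute overhead noted under Contradictions & Updates: choosing what to discard costs time on every eviction step.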