---
id: wiki-2026-0508-key-value-kv-cache
title: Key-Value (KV) Cache
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [KV cache, key-value cache, paged attention, vLLM, prefix caching]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [llm, kv-cache, inference, paged-attention, vllm, optimization]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / CUDA
  framework: vLLM / TensorRT-LLM / SGLang
---

# Key-Value (KV) Cache

## One-line summary

> **"A store for the K and V tensors computed during transformer inference."** It reduces the per-step cost of token generation from quadratic to linear in sequence length. Modern variants: paged attention (vLLM, Kwon 2023), prefix caching, quantized KV.

## Core concepts

### Motivation

- **Without cache**: every step recomputes attention over all previous tokens → O(N²) per step.
- **With cache**: K and V are stored, so each step computes attention for the single new token only → O(N) per step.
- **Memory cost**: batch × seq × layers × kv_heads × head_dim × 2 (K + V) × dtype bytes.

### Applications

1. Inference acceleration.
2. Long-context generation.
3. Streaming response.
4. Multi-turn chat.

### Modern techniques

- **Paged Attention** (vLLM 2023): OS-style paging of the cache into fixed-size blocks.
- **Prefix caching**: reuse the cache for a shared prefix.
- **GQA + KV** (Llama): fewer KV heads, so a smaller cache.
- **MLA** (DeepSeek-V2): latent compression of K and V.
- **Quantized KV** (FP8, INT4).
- **Sliding window** (Mistral): truncate the cache to the most recent tokens.
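The OS-style paging idea from the list above can be illustrated with a toy block table. This is a minimal sketch, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the default `block_size` are hypothetical names chosen for illustration. It only tracks which physical block holds each token slot; real systems also store the K/V tensors in those blocks and run attention kernels over the block table.

```python
class PagedKVAllocator:
    """Toy block table (hypothetical): maps each sequence to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


# Usage: three tokens with block_size=2 occupy two blocks.
alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
```

Because blocks are allocated on demand and returned on completion, memory is not reserved up front for the maximum sequence length, which is the fragmentation problem paging is meant to address.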
## 💻 Patterns

### Basic KV cache (educational)

```python
import torch

class CachedAttention:
    def __init__(self):
        self.k_cache = []  # list of [batch, n_heads, head_dim], one entry per token
        self.v_cache = []

    def forward(self, q_new, k_new, v_new):
        # q_new, k_new, v_new: [B, H, D] for the single new token
        self.k_cache.append(k_new)
        self.v_cache.append(v_new)
        # stack along a new time axis
        k_all = torch.stack(self.k_cache, dim=2)  # [B, H, T, D]
        v_all = torch.stack(self.v_cache, dim=2)
        # attention for the new query only (1 token)
        q = q_new.unsqueeze(2)  # [B, H, 1, D] so matmul batches over B and H
        attn = (q @ k_all.transpose(-1, -2) / k_all.size(-1) ** 0.5).softmax(-1)
        return (attn @ v_all).squeeze(2)  # [B, H, D]
```

### Memory size calculation

```python
def kv_cache_bytes(batch, seq, layers, kv_heads, head_dim, dtype_bytes=2):
    return batch * seq * layers * kv_heads * head_dim * 2 * dtype_bytes

# Llama 70B GQA: 80 layers, 8 KV heads, 128 head_dim
# batch=1, seq=4096, bf16
size_gb = kv_cache_bytes(1, 4096, 80, 8, 128, 2) / 1e9
# ≈ 1.34 GB
```

### vLLM (production)

```python
from vllm import LLM, SamplingParams

prompts = ['Explain the KV cache in one sentence.']  # example input

llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', max_num_seqs=64)
sampling = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, sampling)
# internally: paged attention + dynamic batching
```

### Prefix cache (vLLM)

```python
# Reuse the cache for requests that share the same system prompt.
# In vLLM: enable_prefix_caching=True
llm = LLM(model='...', enable_prefix_caching=True)
```

### Anthropic prompt caching

```python
from anthropic import Anthropic

long_book_content = '...'  # placeholder: large static context to cache
question = '...'           # placeholder: user question

client = Anthropic()
response = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=1024,
    system=[
        {'type': 'text', 'text': 'You are an expert.', 'cache_control': {'type': 'ephemeral'}},
        {'type': 'text', 'text': long_book_content, 'cache_control': {'type': 'ephemeral'}},
    ],
    messages=[{'role': 'user', 'content': question}],
)
# A second call with the same prefix reads it from the cache:
# about 90% lower cost for the cached input tokens.
```

### Sliding window (Mistral-style)

```python
def sliding_window_kv(k_cache, v_cache, window=4096):
    # k_cache, v_cache: [B, H, T, D]; keep only the last `window` positions
    if k_cache.size(2) > window:
        k_cache = k_cache[:, :, -window:]
        v_cache = v_cache[:, :, -window:]
    return k_cache, v_cache
```

### GQA-aware cache

```python
def gqa_kv_cache_size(batch, seq, layers, n_q_heads, n_kv_heads, head_dim):
    """GQA: fewer KV heads than Q heads → smaller cache.

    Returns the element count (multiply by dtype bytes for the size).
    n_q_heads does not affect the cache; it is kept for comparison only.
    """
    return batch * seq * layers * n_kv_heads * head_dim * 2

# Llama 70B: 8 KV heads vs 64 Q heads = 8x smaller cache
```

### MLA (DeepSeek)

```python
import torch
import torch.nn as nn

class MLA(nn.Module):
    """Simplified MLA: cache a low-rank latent instead of full K and V."""
    def __init__(self, d_model, d_latent=512):
        super().__init__()
        self.W_dkv = nn.Linear(d_model, d_latent)  # down-projection; only its output is cached
        self.W_uk = nn.Linear(d_latent, d_model)   # reconstruct K
        self.W_uv = nn.Linear(d_latent, d_model)   # reconstruct V

    def forward(self, x, kv_cache):
        # x: [B, T_new, d_model]; kv_cache: list of latents [B, T_i, d_latent]
        c = self.W_dkv(x)
        kv_cache.append(c)  # small: d_latent values per token
        c_all = torch.cat(kv_cache, dim=1)
        # reconstruct K and V on the fly from the cached latents
        return self.W_uk(c_all), self.W_uv(c_all)
```

### KV quantization

```python
import torch

def quantize_kv(k, v, bits=8):
    """Symmetric per-tensor quantization of K and V to signed integers (INT8 storage)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    k_scale = k.abs().max().clamp_min(1e-8) / qmax
    v_scale = v.abs().max().clamp_min(1e-8) / qmax
    k_q = (k / k_scale).round().clamp(-qmax, qmax).to(torch.int8)
    v_q = (v / v_scale).round().clamp(-qmax, qmax).to(torch.int8)
    return k_q, v_q, k_scale, v_scale  # dequantize with k_q.float() * k_scale
```

### Streaming generation

```python
def stream_generate(model, prompt, max_tokens=200):
    # tokenize, sample, and model.forward(..., kv_cache=...) are placeholders
    input_ids = tokenize(prompt)
    # prefill: process the whole prompt once and build the cache
    out, kv_cache = model.forward(input_ids, kv_cache=None)
    for _ in range(max_tokens):
        next_token = sample(out[:, -1])
        yield next_token
        # decode: forward only the single new token, reusing the cache
        out, kv_cache = model.forward(next_token.unsqueeze(-1), kv_cache=kv_cache)
```

### Continuous batching (vLLM)

```python
# Requests at different stages are batched together.
# Used by vLLM; iteration-level scheduling comes from Orca (Yu 2022).
# Prefill and decode are mixed in the same batch.
class ContinuousBatcher:
    def step(self, requests):
        # prefill new requests + decode existing ones
        return run_iteration(requests)  # run_iteration: placeholder for the model step
```

### Memory profile (debug)

```python
import torch

def profile_kv_cache(model, input_ids, seq_lens):
    # input_ids: [B, T] token tensor on the GPU, with T >= max(seq_lens)
    for seq in seq_lens:
        torch.cuda.reset_peak_memory_stats()
        model.generate(input_ids[:1, :seq])
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f'seq={seq}: peak {peak:.2f} GB')
```

## Decision criteria

| Situation | Approach |
|---|---|
| Production serving | vLLM (paged) |
| Long context | GQA + MLA |
| Multi-turn chat | Prefix caching |
| Memory tight | Quantized KV |
| Very long | Sliding window (Mistral) |
| API | Anthropic prompt cache |

**Default**: vLLM + GQA + prefix caching + dynamic batching. For long context, add quantized KV or MLA.

## 🔗 Graph

- Parent: [[Transformer]]
- Variant: [[Paged-Attention]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Flash Attention]] · [[Grouped-Query Attention (GQA)]] · [[Foundation-Models]]

## 🤖 LLM usage

**When**: inference, including all production LLM serving.
**When not**: single-shot research only.

## ❌ Anti-patterns

- **No KV cache**: quadratic cost; unusable for long sequences.
- **Static batching**: GPU underutilization.
- **Full-precision KV at scale**: wasted memory.
- **Recompute prefix every call**: higher cost.

## 🧪 Verification / duplicates

- Verified (Vaswani 2017, Kwon vLLM 2023, Yu Orca 2022, DeepSeek-V2 MLA).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KV cache + vLLM / prefix / MLA / sliding / quantize code |