---
id: wiki-2026-0508-key-value-kv-cache
title: Key-Value (KV) Cache
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [KV cache, key-value cache, paged attention, vLLM, prefix caching]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [llm, kv-cache, inference, paged-attention, vllm, optimization]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / CUDA
  framework: vLLM / TensorRT-LLM / SGLang
---
# Key-Value (KV) Cache
## In one line
> **"A store for the K and V tensors computed during transformer inference."** Caching them reduces the attention cost of generating each token from quadratic to linear in sequence length. Modern variants: paged attention (vLLM, Kwon et al. 2023), prefix caching, and quantized KV.
## Core concepts
### Motivation
- **Without cache**: every decode step recomputes attention over all previous tokens → O(N²) per step.
- **With cache**: K and V are stored, so each step computes attention for the single new token only → O(N) per step.
- **Memory cost**: batch × seq × layers × kv_heads × head_dim × 2 (K + V) × dtype bytes.
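
The per-step complexity gap can be made concrete by counting query-key dot products. This is a minimal sketch, assuming a single attention head and a causal mask; the function names are illustrative and do not come from any library.

```python
def scores_without_cache(n_tokens: int) -> int:
    """Dot products at one step when attention is recomputed for every position.

    With a causal mask, position i attends to i + 1 keys, so a full recompute
    over n_tokens positions costs n(n+1)/2 dot products: O(N^2).
    """
    return n_tokens * (n_tokens + 1) // 2


def scores_with_cache(n_tokens: int) -> int:
    """Dot products at one step when K and V are cached.

    Only the newest query is evaluated, against all n_tokens cached keys: O(N).
    """
    return n_tokens


for n in (16, 256, 4096):
    print(n, scores_without_cache(n), scores_with_cache(n))
```

The count covers attention scores only; recomputing the K and V projections themselves adds further work in the uncached case.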
### Applications
1. Inference acceleration.
2. Long-context generation.
3. Streaming response.
4. Multi-turn chat.
### Modern techniques
- **Paged Attention** (vLLM, 2023): OS-style paging of the cache into fixed-size blocks.
- **Prefix caching**: reuse the cache across requests that share a prefix.
- **GQA + KV** (Llama): fewer KV heads, so a smaller cache.
- **MLA** (DeepSeek-V2): compress K and V into a low-rank latent.
- **Quantized KV** (FP8, INT4).
- **Sliding window** (Mistral): truncate the cache to the most recent tokens.
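
The OS-style paging idea can be illustrated with a toy allocator that hands out fixed-size blocks and keeps a per-sequence block table. This is a conceptual sketch only: `BlockAllocator` and its methods are hypothetical names, and it tracks slot bookkeeping rather than real tensors or vLLM internals.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks and a per-sequence block table."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> tuple:
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
slots = [alloc.append_token("a") for _ in range(3)]
print(slots)
```

Because a sequence only ever wastes the unused tail of its last block, memory is not reserved up front for the maximum possible length.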
## 💻 Patterns
### Basic KV cache (educational)
```python
import torch

class CachedAttention:
    def __init__(self):
        self.k_cache = []  # list of per-token tensors, each [batch, n_heads, head_dim]
        self.v_cache = []

    def forward(self, q_new, k_new, v_new):
        # q_new, k_new, v_new: [batch, n_heads, head_dim] for the single new token
        # Append the new token's K and V
        self.k_cache.append(k_new)
        self.v_cache.append(v_new)
        # Stack along a new time axis
        k_all = torch.stack(self.k_cache, dim=2)  # [B, H, T, D]
        v_all = torch.stack(self.v_cache, dim=2)
        # Attention for the new query only (1 token)
        q = q_new.unsqueeze(2)  # [B, H, 1, D] so matmul batches over B and H
        attn = (q @ k_all.transpose(-1, -2) / k_all.size(-1) ** 0.5).softmax(-1)
        return (attn @ v_all).squeeze(2)  # [B, H, D]
```
### Memory size calculation
```python
def kv_cache_bytes(batch, seq, layers, kv_heads, head_dim, dtype_bytes=2):
    # The factor 2 accounts for storing both K and V
    return batch * seq * layers * kv_heads * head_dim * 2 * dtype_bytes

# Llama 70B with GQA: 80 layers, 8 KV heads, head_dim 128
# batch=1, seq=4096, bf16 (2 bytes per element)
size_gb = kv_cache_bytes(1, 4096, 80, 8, 128, 2) / 1e9
# ≈ 1.34 GB
```
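
The same formula can be inverted to estimate how many tokens fit in a given memory budget. This is a rough sketch: the 20 GB budget is a hypothetical figure, both function names are illustrative, and the estimate ignores model weights, activations, and allocator overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of K plus V stored for one token across all layers."""
    return layers * kv_heads * head_dim * 2 * dtype_bytes


def max_cached_tokens(budget_bytes: int, layers: int, kv_heads: int,
                      head_dim: int, dtype_bytes: int = 2) -> int:
    """Total tokens (summed over the batch) that fit in budget_bytes."""
    return budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes)


per_token = kv_bytes_per_token(80, 8, 128)          # 80 layers, 8 KV heads, head_dim 128
tokens = max_cached_tokens(20 * 10**9, 80, 8, 128)  # hypothetical 20 GB KV budget
print(per_token, tokens)
```

The token count is shared across the whole batch, so it bounds batch size × sequence length rather than either one alone.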
### vLLM (production)
```python
from vllm import LLM, SamplingParams

prompts = ['Explain the KV cache in one sentence.']  # placeholder input
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', max_num_seqs=64)
sampling = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(prompts, sampling)
# Internally: paged attention + dynamic (continuous) batching
```
### Prefix cache (vLLM)
```python
from vllm import LLM

# Requests that share the same system prompt reuse its cached KV blocks
# Enabled in vLLM with enable_prefix_caching=True
llm = LLM(model='...', enable_prefix_caching=True)
```
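
How a server can recognize a shared prefix is easiest to see with chained block hashes: each full block of tokens is hashed together with the hash of the block before it, so a hit on block *i* implies the whole prefix up to *i* matches. This is a conceptual sketch, not vLLM's actual implementation; `block_hashes`, `cached_prefix_len`, and the block size of 4 are assumptions made for illustration.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_len(token_ids, cache, block_size=4):
    """Number of leading tokens whose KV blocks are already in `cache`."""
    hit = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hit += block_size
    return hit


cache = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # first request fills the cache
second = [1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45]        # shares an 8-token prefix
print(cached_prefix_len(second, cache))
```

Only full blocks are reusable in this scheme, which is why the ninth token of the first request contributes nothing to the cache.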
### Anthropic prompt caching
```python
from anthropic import Anthropic

client = Anthropic()
long_book_content = '...'  # placeholder: the large, reusable context
question = '...'           # placeholder: the per-call user question
response = client.messages.create(
    model='claude-opus-4-7',
    max_tokens=1024,
    system=[
        {'type': 'text', 'text': 'You are an expert.', 'cache_control': {'type': 'ephemeral'}},
        {'type': 'text', 'text': long_book_content, 'cache_control': {'type': 'ephemeral'}},
    ],
    messages=[{'role': 'user', 'content': question}],
)
# A second call with the same prefix reads it from cache: ~90% lower cost for the cached input tokens
```
### Sliding window (Mistral-style)
```python
def sliding_window_kv(k_cache, v_cache, window=4096):
    # k_cache, v_cache: [B, H, T, D]; keep only the last `window` positions
    if k_cache.size(2) > window:
        k_cache = k_cache[:, :, -window:]
        v_cache = v_cache[:, :, -window:]
    return k_cache, v_cache
```
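
Slicing creates a new view or copy at every step. An alternative is a fixed-size rolling buffer in which the token at position `i` overwrites slot `i % window`. The sketch below uses plain Python objects in place of tensors; `RollingKVCache` is a hypothetical name and the layout is an assumption for illustration, not a specific library's data structure.

```python
class RollingKVCache:
    """Fixed-size ring buffer: the token at position i is stored in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window  # each slot holds one token's (k, v) pair
        self.next_pos = 0             # absolute position of the next token

    def append(self, k, v) -> None:
        """Store a new pair, overwriting the oldest one once the buffer is full."""
        self.slots[self.next_pos % self.window] = (k, v)
        self.next_pos += 1

    def ordered(self) -> list:
        """Return cached (k, v) pairs from oldest to newest."""
        start = max(0, self.next_pos - self.window)
        return [self.slots[pos % self.window] for pos in range(start, self.next_pos)]


cache = RollingKVCache(window=3)
for pos in range(5):
    cache.append(f"k{pos}", f"v{pos}")
print(cache.ordered())
```

Memory stays constant at `window` entries no matter how long generation runs, at the price of losing direct access to tokens older than the window.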
### GQA-aware cache
```python
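def gqa_kv_cache_size(batch, seq, layers, n_q_heads, n_kv_heads, head_dim):
    """GQA: fewer KV heads than Q heads, so a smaller cache (returned in elements, not bytes)."""
    # n_q_heads does not affect the cache size; only n_kv_heads does
    return batch * seq * layers * n_kv_heads * head_dim * 2

# Llama 70B: 8 KV heads vs 64 Q heads = 8x smaller cache than full multi-head attention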
```
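
The saving comes from several query heads reading the same cached KV head. A minimal sketch of that mapping, assuming query heads are grouped contiguously (a common layout, but an assumption here); `kv_head_for_query_head` is an illustrative name.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head shared by query head q_head under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size


# 64 query heads sharing 8 KV heads: each KV head serves 8 query heads
mapping = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
print(mapping[:10])
```

Setting `n_kv_heads` equal to `n_q_heads` recovers standard multi-head attention, and setting it to 1 gives multi-query attention.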
### MLA (DeepSeek)
```python
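import torch
import torch.nn as nn

class MLA(nn.Module):
    """Simplified sketch: cache a low-rank latent instead of full K and V."""
    def __init__(self, d_model, d_latent=512):
        super().__init__()
        self.W_dkv = nn.Linear(d_model, d_latent)  # down-projection; only its output is cached
        self.W_uk = nn.Linear(d_latent, d_model)   # reconstruct K
        self.W_uv = nn.Linear(d_latent, d_model)   # reconstruct V

    def forward(self, x, kv_cache):
        c = self.W_dkv(x)   # x: [B, T_new, d_model] -> c: [B, T_new, d_latent]
        kv_cache.append(c)  # small compared with full K and V
        # Reconstruct K and V on the fly from all cached latents
        c_all = torch.cat(kv_cache, dim=1)
        return self.W_uk(c_all), self.W_uv(c_all)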
```
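
To see why caching a latent helps, compare the elements stored per token per layer. This is a back-of-the-envelope sketch: the 32-head, 128-dim configuration is a hypothetical baseline, the function names are illustrative, and real MLA implementations may cache additional per-token state that this ignores.

```python
def full_kv_elements_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer when K and V are stored directly."""
    return 2 * n_kv_heads * head_dim


def latent_elements_per_token(d_latent: int) -> int:
    """Elements cached per token per layer when only the latent is stored."""
    return d_latent


full = full_kv_elements_per_token(n_kv_heads=32, head_dim=128)  # hypothetical MHA baseline
latent = latent_elements_per_token(512)                          # d_latent=512
print(full, latent, full / latent)
```

The trade-off is extra compute at decode time, since K and V must be rebuilt from the latent through the up-projections.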
### KV quantization
```python
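import torch

def quantize_kv(k, v, bits=8):
    """Store K and V as INT8 with one symmetric scale per tensor (only bits=8 is handled)."""
    k_scale = k.abs().max().clamp(min=1e-8)  # guard against an all-zero tensor
    v_scale = v.abs().max().clamp(min=1e-8)
    k_int8 = (k * 127 / k_scale).round().to(torch.int8)  # round instead of truncating
    v_int8 = (v * 127 / v_scale).round().to(torch.int8)
    return k_int8, v_int8, k_scale, v_scale  # dequantize with x_int8 * scale / 127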
```
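
The round trip and its error bound can be checked without tensors. The sketch below is a pure-Python version of symmetric per-tensor quantization; `quantize` and `dequantize` are illustrative names, and per-tensor scaling is an assumption (production systems often use per-channel or per-token scales).

```python
def quantize(values, bits=8):
    """Symmetric per-tensor quantization of a list of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1                          # 127 for 8 bits
    scale = (max(abs(x) for x in values) / qmax) or 1.0  # avoid a zero scale
    return [round(x / scale) for x in values], scale


def dequantize(q_values, scale):
    """Map stored integers back to approximate floats."""
    return [q * scale for q in q_values]


values = [0.5, -1.27, 0.031, 1.0]
q, scale = quantize(values)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(q, scale, max_err)
```

With round-to-nearest, the reconstruction error of each element is at most half a quantization step, that is `scale / 2`.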
### Streaming generation
```python
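def stream_generate(model, prompt, max_tokens=200):
    # tokenize, sample, and model.forward are placeholders for your own stack
    input_ids = tokenize(prompt)
    # Prefill: process the whole prompt once and build the cache
    out, kv_cache = model.forward(input_ids, kv_cache=None)
    for _ in range(max_tokens):
        next_token = sample(out[:, -1])  # one token id per sequence: [B]
        yield next_token
        # Decode: forward only the single new token, reusing the cache
        out, kv_cache = model.forward(next_token.unsqueeze(-1), kv_cache=kv_cache)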
```
### Continuous batching (vLLM)
```python
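# Requests at different stages are batched together in the same iteration
# Used by vLLM; iteration-level scheduling comes from Orca (Yu et al. 2022)
# Prefill and decode work are mixed in one batch
class ContinuousBatcher:
    def step(self, requests):
        # Prefill new requests and decode existing ones; run_iteration is a placeholder
        return run_iteration(requests)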
```
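
A toy simulation shows why refilling the batch at every iteration helps. It is a sketch under simplifying assumptions: each request needs only a known number of decode steps, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(output_lens, max_batch):
    """Iterations needed when finished requests are replaced at every step."""
    waiting, running, steps = deque(output_lens), [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit new requests
            running.append(waiting.popleft())
        running = [r - 1 for r in running]           # one decode step for all
        running = [r for r in running if r > 0]      # retire finished requests
        steps += 1
    return steps


def static_batching_steps(output_lens, max_batch):
    """Iterations needed when each batch runs until its longest request ends."""
    return sum(max(output_lens[i:i + max_batch])
               for i in range(0, len(output_lens), max_batch))


lens = [2, 8, 2, 8, 2, 2]  # hypothetical tokens to generate per request
print(static_batching_steps(lens, 2), continuous_batching_steps(lens, 2))
```

In the static case, short requests sit idle in their slot until the longest request in the batch finishes; the continuous scheduler hands that slot to the next waiting request immediately.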
### Memory profile (debug)
```python
import torch

def profile_kv_cache(model, input_ids, seq_lens):
    # input_ids: [B, T] prompt tokens with T >= max(seq_lens)
    for seq in seq_lens:
        torch.cuda.reset_peak_memory_stats()
        out = model.generate(input_ids[:1, :seq])
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f'seq={seq}: peak {peak:.2f} GB')
```
## Decision criteria
| Situation | Approach |
|---|---|
| Production serving | vLLM (paged) |
| Long context | GQA or MLA |
| Multi-turn chat | Prefix caching |
| Memory tight | Quantized KV |
| Very long | Sliding window (Mistral) |
| API | Anthropic prompt cache |
**Default**: vLLM + GQA + prefix caching + dynamic batching. For long context, add quantized KV or MLA.
## 🔗 Graph
- Parent: [[Transformer]]
- Variant: [[Paged-Attention]]
- Application: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Flash Attention]] · [[Grouped-Query Attention (GQA)]] · [[Foundation-Models]]
## 🤖 LLM usage
**When to use**: inference, in all production LLM serving.
**When not to use**: single-shot research runs only.
## ❌ Anti-patterns
- **No KV cache**: quadratic cost per step; unusable for long sequences.
- **Static batching**: GPU underutilization.
- **Full-precision KV at scale**: wasted memory.
- **Recompute prefix every call**: higher cost.
## 🧪 Verification / duplicates
- Verified (Vaswani 2017, Kwon vLLM 2023, Yu Orca 2022, DeepSeek-V2 MLA).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KV cache + vLLM / prefix / MLA / sliding / quantize code |