---
id: wiki-2026-0508-inference-coupled-persistence
title: Inference-Coupled Persistence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ICP, Inference-Time Memory, Coupled Persistence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [llm, memory, inference, kv-cache, persistence]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: vllm
---

# Inference-Coupled Persistence

## In one line
> **"Spill the KV-cache to disk, resume the conversation later."** Inference-Coupled Persistence (ICP) is a pattern that couples an LLM serving system's inference state (KV-cache, attention states) to durable storage. As of 2026, vLLM 0.7+ and SGLang support it natively, which makes long conversations cost-effective.

## Key points

### Why ICP
- The KV-cache for a 1M-token context consumes ~100 GB of GPU memory.
- Conversations sit idle for hours or days, and holding GPU memory for that long is cost-prohibitive.
- ICP: evict to disk when idle and reload on resume, for a 5-50x cost reduction.

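The ~100 GB figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below is illustrative only: the model shape (80 layers, 8 KV heads, head dimension 128) is an assumed 70B-class configuration with grouped-query attention, not something stated in this note, and real engines add block-allocation overhead on top.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int) -> int:
    """Bytes needed to hold keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Assumed (hypothetical) model shape: 80 layers, 8 KV heads, head_dim 128.
size_fp8 = kv_cache_bytes(80, 8, 128, 1_000_000, bytes_per_elem=1)
size_fp16 = kv_cache_bytes(80, 8, 128, 1_000_000, bytes_per_elem=2)

print(f"fp8:  {size_fp8 / 1e9:.1f} GB")
print(f"fp16: {size_fp16 / 1e9:.1f} GB")
```

Under these assumptions a 1M-token context lands in the 100-350 GB range depending on dtype, the same order of magnitude as the figure above. The exact number varies a lot with model shape.
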
### Storage tiers
- **L0 (HBM)**: active inference, < 1 ms access.
- **L1 (CPU RAM)**: idle for minutes, ~10 ms reload.
- **L2 (NVMe)**: idle for hours, ~100 ms reload.
- **L3 (object store / S3)**: idle for days, ~1-5 s reload.

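One way to use these numbers is to ask how far down the hierarchy a session may be pushed while still meeting a resume-latency budget. The following is a minimal sketch: `Tier`, `TIERS`, and `coldest_tier_within` are hypothetical names, and the latencies are the nominal values from the list above (upper bound for S3), not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    reload_ms: float  # nominal reload latency taken from the tier list


# Ordered from fastest (most expensive) to slowest (cheapest).
TIERS = [
    Tier("HBM", 1.0),
    Tier("CPU", 10.0),
    Tier("NVMe", 100.0),
    Tier("S3", 5000.0),
]


def coldest_tier_within(budget_ms: float) -> Optional[Tier]:
    """Return the slowest tier whose nominal reload time fits the budget."""
    fitting = [t for t in TIERS if t.reload_ms <= budget_ms]
    return fitting[-1] if fitting else None
```

A session that must resume within 150 ms could live on NVMe, while one with a budget under 10 ms has to stay in HBM.
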
### Coupling guarantees
- **Bit-exact resume**: the KV-cache is serialized in a quantization-aware way, so the restored cache matches the saved one exactly.
- **Causal consistency**: the KV entry for token N reflects only tokens up to and including N, never later ones.
- **Atomic checkpoint**: partial writes are detected, which makes crash recovery possible.

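The atomic-checkpoint guarantee can be illustrated with a write-then-rename scheme plus a checksum. This is a sketch under assumptions, not the mechanism of any particular engine: `write_checkpoint` and `read_checkpoint` are hypothetical helpers, and the payload stands in for already-serialized KV bytes.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional

DIGEST_LEN = 32  # SHA-256 digest size in bytes


def write_checkpoint(path: Path, payload: bytes) -> None:
    """Write a checksum-prefixed payload to a temp file, then publish it atomically."""
    tmp = path.with_name(path.name + ".tmp")
    digest = hashlib.sha256(payload).digest()
    with open(tmp, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    os.replace(tmp, path)  # atomic within a single filesystem


def read_checkpoint(path: Path) -> Optional[bytes]:
    """Return the payload, or None if the file is missing, truncated, or corrupted."""
    try:
        blob = path.read_bytes()
    except FileNotFoundError:
        return None
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if len(digest) < DIGEST_LEN or hashlib.sha256(payload).digest() != digest:
        return None
    return payload
```

A reader never sees a half-written file under the final name, and a checkpoint damaged after publication fails the checksum instead of being loaded silently.
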
### Applications
1. Long-running coding agent (multi-day session).
2. Customer support bot (hours-long conversation history).
3. Research assistant (multi-week project context).
4. Multi-tenant LLM serving (100K concurrent idle sessions).

## 💻 Patterns

### Pattern 1: vLLM KV-cache offload (2026)
```python
from vllm import LLM

# Engine options are passed straight to LLM(); sizes are GiB per GPU and must fit in host RAM.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,  # reuse KV blocks across requests that share a prefix
    kv_cache_dtype="fp8",        # store KV in fp8, roughly half the size of fp16
    cpu_offload_gb=200,          # CPU RAM used to offload model weights, freeing HBM for KV
    swap_space=400,              # CPU swap space for KV blocks of preempted requests (host RAM, not NVMe)
    block_size=32,               # tokens per KV block
)
```

### Pattern 2: Conversation checkpoint
```python
import os
from pathlib import Path

import torch


class ConversationCheckpoint:
    def __init__(self, store_dir: Path):
        self.store = store_dir
        self.store.mkdir(exist_ok=True, parents=True)

    def save(self, conv_id: str, kv_blocks: list[torch.Tensor], tokens: list[int]) -> None:
        path = self.store / f"{conv_id}.pt"
        tmp = path.with_suffix(".tmp")
        with open(tmp, "wb") as f:
            torch.save({
                "kv": [b.cpu() for b in kv_blocks],  # move to host memory before writing
                "tokens": tokens,
                "version": 2,
            }, f)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before publishing
        os.replace(tmp, path)  # atomic; also overwrites an existing checkpoint

    def load(self, conv_id: str) -> dict | None:
        path = self.store / f"{conv_id}.pt"
        if not path.exists():
            return None
        # weights_only=True limits unpickling to tensors and plain containers
        return torch.load(path, map_location="cuda", weights_only=True)
```

### Pattern 3: Tiered eviction policy
```python
from dataclasses import dataclass
from time import time


@dataclass
class Session:
    id: str
    last_access: float  # Unix timestamp of the last request
    size_gb: float


def evict_tier(sessions: list[Session], capacity_gb: float) -> list[tuple[Session, str]]:
    """LRU eviction: return (session, target_tier) pairs until usage fits capacity."""
    by_age = sorted(sessions, key=lambda s: s.last_access)  # oldest first; input is not mutated
    used = sum(s.size_gb for s in by_age)
    evicted = []
    now = time()
    for s in by_age:
        if used <= capacity_gb:
            break
        idle_min = (now - s.last_access) / 60
        if idle_min < 5:
            # Hot sessions stay in HBM; all remaining sessions are at least as recent.
            break
        if idle_min < 60:
            target = "CPU"
        elif idle_min < 1440:
            target = "NVMe"
        else:
            target = "S3"
        evicted.append((s, target))
        used -= s.size_gb
    return evicted
```

### Pattern 4: Resume with prefix matching
```python
def resume_with_prefix(checkpoint: dict, new_prompt: str, tokenizer) -> tuple[list, list[int]]:
    """Reuse the checkpointed KV for the shared token prefix; with no shared prefix, start from scratch."""
    saved_tokens = checkpoint["tokens"]
    new_tokens = tokenizer.encode(new_prompt)
    common = 0
    for saved, new in zip(saved_tokens, new_tokens):
        if saved != new:
            break
        common += 1
    # Keep at least one token to feed the model so it can produce next-token logits.
    common = min(common, max(len(new_tokens) - 1, 0))
    if common == 0:
        return [], new_tokens
    # Assumes each KV tensor has the token (sequence) axis at dim 1.
    kept_kv = [k[:, :common] for k in checkpoint["kv"]]
    return kept_kv, new_tokens[common:]
```

### Pattern 5: Quantized serialization
```python
import numpy as np
import torch


def serialize_kv_int8(kv: torch.Tensor) -> tuple[bytes, dict]:
    """Quantize fp16 KV to int8 with one per-tensor scale: about 50% smaller, but lossy."""
    kv = kv.detach().float().cpu()  # quantize in fp32 on the host; .numpy() needs a CPU tensor
    scale = (kv.abs().amax() / 127).clamp(min=1e-8)  # guard against an all-zero tensor
    q = (kv / scale).round().clamp(-128, 127).to(torch.int8)
    return q.numpy().tobytes(), {"scale": scale.item(), "shape": list(q.shape)}


def deserialize_kv_int8(data: bytes, meta: dict) -> torch.Tensor:
    # np.frombuffer returns a read-only view; copy so torch gets a writable array
    arr = np.frombuffer(data, dtype=np.int8).reshape(meta["shape"]).copy()
    return (torch.from_numpy(arr).float() * meta["scale"]).to(torch.float16)
```

## Decision criteria
| Situation | Approach |
|---|---|
| Conversation < 5 min idle | HBM only. |
| Long conversation, hours idle | NVMe tier. |
| Multi-day project context | S3 + prefix cache. |
| Cost-sensitive multi-tenant | Aggressive 4-tier ICP. |
| Latency-sensitive (< 10 ms) | HBM only; do not use ICP. |

**Default**: 4-tier (HBM → CPU → NVMe → S3) with LRU eviction, fp8 KV-cache, prefix caching enabled.

## 🔗 Graph
- Parent: [[KV-Cache]]
- Variants: [[Prefix-Caching]] · [[Paged-Attention]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Continuous-Batching]] · [[FlashAttention]]

## 🤖 LLM usage
**When**: production LLM serving with multi-hour conversations, cost optimization, multi-tenant deployments with 100K+ sessions.
**When not**: single-shot inference (no persistence needed), strict-latency real-time systems (< 10 ms first token).

## ❌ Anti-patterns
- **Naive pickle of KV**: quantization-unaware, so checkpoints are 5-10x bigger than needed.
- **No atomic write**: a crash leaves a corrupted checkpoint that cannot be recovered.
- **Per-token checkpoint**: causes an IOPS storm; batch writes every N tokens instead.
- **Resume without prefix check**: silent correctness bug.

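The fix for per-token checkpointing is to buffer tokens and write once per batch. A minimal sketch, assuming a caller-supplied `flush` callback that persists one batch of token IDs; `BatchedCheckpointer` and the default of 256 tokens are hypothetical, not taken from any engine.

```python
from typing import Callable, List


class BatchedCheckpointer:
    """Buffer new tokens and issue one checkpoint write per `every_n` tokens."""

    def __init__(self, flush: Callable[[List[int]], None], every_n: int = 256):
        self._flush_fn = flush
        self.every_n = every_n
        self.pending: List[int] = []

    def on_token(self, token_id: int) -> None:
        """Record a generated token; write only when the batch is full."""
        self.pending.append(token_id)
        if len(self.pending) >= self.every_n:
            self.flush()

    def flush(self) -> None:
        """Write whatever is buffered; call this at end of turn as well."""
        if self.pending:
            self._flush_fn(self.pending)
            self.pending = []  # fresh list, so the flushed batch is not mutated
```

With `every_n=256`, a 10,000-token generation issues about 40 writes instead of 10,000. The trade-off is that up to `every_n - 1` tokens have to be recomputed after a crash.
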
## 🧪 Verification / duplicates
- Verified: vLLM 0.7 docs (2025), SGLang RadixAttention paper (2024), Mooncake architecture (2024).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with vLLM 2026 patterns, tiered eviction, quantized serialization |