---
id: wiki-2026-0508-inference-coupled-persistence
title: Inference-Coupled Persistence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ICP, Inference-Time Memory, Coupled Persistence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [llm, memory, inference, kv-cache, persistence]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: vllm
---

# Inference-Coupled Persistence

## One-liner

> **"Spill the KV-cache to disk, resume the conversation."** Inference-Coupled Persistence (ICP) is a pattern that couples the inference state of an LLM serving system (KV-cache, attention states) to durable storage. As of 2026, vLLM 0.7+ and SGLang support it natively, which makes long conversations cost-effective.

## Core concepts

### Why ICP

- The KV-cache for a 1M-token context consumes roughly 100 GB of GPU memory (the exact figure depends on the model).
- Conversations sit idle for hours or days, and holding GPU memory for that long is cost-prohibitive.
- ICP evicts the cache to disk when a conversation goes idle and reloads it on resume, for a 5-50x cost reduction.

### Storage tiers

- **L0 (HBM)**: active inference, < 1 ms access.
- **L1 (CPU RAM)**: idle for minutes, ~10 ms reload.
- **L2 (NVMe)**: idle for hours, ~100 ms reload.
- **L3 (Object store / S3)**: idle for days, ~1-5 s reload.

### Coupling guarantees

- **Bit-exact resume**: the KV-cache is serialized in a quantization-aware format.
- **Causal consistency**: the KV for token N depends strictly on tokens before N […]

```python
# The beginning of this block is missing from the source. Only the tail of a
# checkpoint-loading method survives; the method name `load` and the
# `conv_id: str` parameter are assumed, the return type and body are as given.
def load(self, conv_id: str) -> dict | None:
    path = self.store / f"{conv_id}.pt"
    if not path.exists():
        return None  # no checkpoint for this conversation
    return torch.load(path, map_location="cuda")
```

### Pattern 3: Tiered eviction policy

```python
from dataclasses import dataclass
from time import time


@dataclass
class Session:
    id: str
    last_access: float  # Unix timestamp of the last request
    size_gb: float


def evict_tier(sessions: list[Session], capacity_gb: float) -> list[tuple[Session, str]]:
    """LRU eviction: return (session, target_tier) pairs until usage fits capacity_gb."""
    used = sum(s.size_gb for s in sessions)
    evicted = []
    now = time()
    # Least recently used first; sorted() leaves the caller's list untouched.
    for s in sorted(sessions, key=lambda s: s.last_access):
        if used <= capacity_gb:
            break
        idle_min = (now - s.last_access) / 60
        # An evicted session has to leave HBM to free capacity,
        # so CPU RAM is the shallowest possible target.
        if idle_min < 60:
            target = "CPU"
        elif idle_min < 1440:
            target = "NVMe"
        else:
            target = "S3"
        evicted.append((s, target))
        used -= s.size_gb
    return evicted
```

### Pattern 4: Resume with prefix matching

```python
def resume_with_prefix(checkpoint: dict, new_prompt: str, tokenizer) -> tuple[list, list]:
    """Reuse the checkpoint's KV for the shared token prefix; start from scratch if nothing matches."""
    saved_tokens = checkpoint["tokens"]
    new_tokens = tokenizer.encode(new_prompt)
    # Length of the longest common token prefix.
    common = 0
    for saved, new in zip(saved_tokens, new_tokens):
        if saved != new:
            break
        common += 1
    if common == 0:
        return [], new_tokens
    # Assumes each per-layer KV tensor has the sequence axis at dim 1.
    kept_kv = [k[:, :common] for k in checkpoint["kv"]]
    return kept_kv, new_tokens[common:]
```

### Pattern 5: Quantized serialization

```python
import numpy as np
import torch


def serialize_kv_int8(kv: torch.Tensor) -> tuple[bytes, dict]:
    """Quantize an fp16 KV tensor to int8 with one per-tensor scale, halving storage."""
    # Lossy: the restored tensor only approximates the original fp16 values.
    # The clamp guards against division by zero on an all-zero tensor.
    scale = (kv.abs().amax().float() / 127).clamp(min=1e-8)
    q = (kv.float() / scale).round().clamp(-128, 127).to(torch.int8)
    # .cpu() is required before .numpy() when the cache lives on the GPU.
    return q.cpu().numpy().tobytes(), {"scale": scale.item(), "shape": list(q.shape)}


def deserialize_kv_int8(data: bytes, meta: dict) -> torch.Tensor:
    # np.frombuffer returns a read-only view; copy so torch gets a writable array.
    arr = np.frombuffer(data, dtype=np.int8).reshape(meta["shape"]).copy()
    return torch.from_numpy(arr).to(torch.float16) * meta["scale"]
```

## Decision criteria

| Situation | Approach |
|---|---|
| Conversation idle < 5 min | HBM only. |
| Long conversation, idle for hours | NVMe tier. |
| Multi-day project context | S3 + prefix cache. |
| Cost-sensitive multi-tenant | Aggressive 4-tier ICP. |
| Latency-sensitive (< 10 ms) | HBM only, no ICP. |

**Default**: 4-tier (HBM → CPU → NVMe → S3) with LRU eviction, fp8 KV-cache, prefix caching enabled.

## 🔗 Graph

- Parent: [[KV-Cache]]
- Variants: [[Prefix-Caching]] · [[PagedAttention]]
- Applications: [[LLM_Optimization_and_Deployment_Strategies|vLLM]]
- Adjacent: [[Continuous-Batching]] · [[Flash Attention]]

## 🤖 LLM usage

**When to use**: production LLM serving with multi-hour conversations, cost optimization, multi-tenant deployments with 100K+ sessions.

**When not to use**: single-shot inference (no persistence needed), strict-latency real-time systems (< 10 ms first token).

## ❌ Anti-patterns

- **Naive pickle of KV**: quantization-unaware, 5-10x bigger than needed.
- **No atomic write**: a crash leaves a corrupted, unrecoverable checkpoint.
- **Per-token checkpoint**: causes an IOPS storm; batch every N tokens instead.
- **Resume without prefix check**: silent correctness bug.

## 🧪 Verification / duplicates

- Verified: vLLM 0.7 docs (2025), SGLang RadixAttention paper (2024), Mooncake architecture (2024).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with vLLM 2026 patterns, tiered eviction, quantized serialization |
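
## Sketch: atomic checkpoint write

The "No atomic write" anti-pattern can be avoided with the usual write-to-temp-then-rename approach. The following is a minimal sketch, assuming the checkpoint has already been serialized to bytes. `atomic_write_bytes` is a hypothetical helper introduced here for illustration, not part of vLLM or SGLang.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, payload: bytes) -> None:
    """Write payload so a reader sees either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, path)  # atomic rename on POSIX filesystems
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the process crashes mid-write, only the temp file is affected, so the last complete checkpoint stays loadable. Stale `.tmp` files left by a crash would need a separate cleanup pass, which this sketch does not include.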