id: wiki-2026-0508-inference-coupled-persistence
title: Inference-Coupled Persistence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: ICP, Inference-Time Memory, Coupled Persistence
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: llm, memory, inference, kv-cache, persistence
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python, framework: vllm

Inference-Coupled Persistence

One-line summary

"Spill the KV-cache to disk, then resume the conversation later." Inference-Coupled Persistence (ICP) is a pattern that couples an LLM serving system's inference state (KV-cache, attention states) to durable storage. As of 2026, vLLM 0.7+ and SGLang support KV-cache offloading natively, which makes long conversations cost-effective.

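Key points

Why ICP

  • The KV-cache for a 1M-token context consumes on the order of 100 GB of GPU memory; the exact figure depends on the model's layer count, KV heads, head dimension, and cache dtype.
  • Conversations sit idle for hours or days, and holding GPU memory for that long is cost-prohibitive.
  • ICP evicts the cache to disk when a session goes idle and reloads it on resume, for a cited 5-50x cost reduction.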
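
The memory figure above can be sanity-checked with a back-of-the-envelope calculation. The sketch below is a rough estimate, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the example dimensions (80 layers, 8 KV heads, head dimension 128, roughly a 70B-class model with grouped-query attention) are illustrative assumptions, not values from the source.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int) -> int:
    """Approximate KV-cache size: two tensors (K and V) per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

# Hypothetical 70B-class model with grouped-query attention (assumed dimensions).
per_token_fp16 = kv_cache_bytes(80, 8, 128, num_tokens=1, bytes_per_elem=2)
total_fp8 = kv_cache_bytes(80, 8, 128, num_tokens=1_000_000, bytes_per_elem=1)

print(f"fp16 per token: {per_token_fp16 / 1024:.0f} KiB")
print(f"fp8, 1M tokens: {total_fp8 / 1e9:.0f} GB")
```

Under these assumptions the estimate is about 320 KiB per token in fp16 and about 164 GB for 1M tokens in fp8, the same order of magnitude as the ~100 GB figure. Smaller models or more aggressive quantization land lower.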

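Storage tiers

  • L0 (HBM): active inference, < 1ms access.
  • L1 (CPU RAM): idle for minutes, ~10ms reload.
  • L2 (NVMe): idle for hours, ~100ms reload.
  • L3 (Object store / S3): idle for days, ~1-5s reload.

The reload figures are order-of-magnitude access latencies; restoring a large session also takes transfer time proportional to its size.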
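
To see how session size interacts with the tiers, a simple model is first-access latency plus size divided by bandwidth. The sketch below uses the latencies from the tier list; the bandwidth numbers, the `Tier` class, and `reload_seconds` are hypothetical placeholders that depend entirely on the hardware and should be replaced with measured values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    latency_s: float           # first-access latency, taken from the tier list
    bandwidth_gb_per_s: float  # ASSUMED sustained bandwidth (placeholder value)

# Bandwidths are illustrative assumptions, not measurements.
TIERS = [
    Tier("CPU RAM", 0.010, 25.0),
    Tier("NVMe", 0.100, 5.0),
    Tier("Object store", 1.0, 1.0),
]

def reload_seconds(tier: Tier, size_gb: float) -> float:
    """Estimated time to bring size_gb of KV-cache back into HBM."""
    return tier.latency_s + size_gb / tier.bandwidth_gb_per_s

for tier in TIERS:
    print(f"{tier.name}: {reload_seconds(tier, 2.0):.2f} s for a 2 GB session")
```

For small sessions the latency term dominates; for multi-gigabyte sessions the bandwidth term does, which is one reason to keep recently active sessions in the warmer tiers.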

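Coupling guarantees

  • Bit-exact resume: the KV-cache is serialized in a quantization-aware way, i.e. stored in its in-memory dtype (for example fp8) together with any scales, so the restored tensors match the originals exactly.
  • Causal consistency: the KV entry for token N depends only on tokens up to and including N, never on later tokens, so any saved prefix stays valid.
  • Atomic checkpoint: partial writes are detected, so crash recovery never loads a corrupted checkpoint.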

Applications

  1. Long-running coding agent (multi-day session).
  2. Customer support bot (hours-long conversation history).
  3. Research assistant (multi-week project context).
  4. Multi-tenant LLM serving (100K concurrent idle sessions).

💻 Patterns

Pattern 1: vLLM KV-cache offload (2026)

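from vllm import LLM

# Engine arguments are passed to LLM as keyword arguments.
# The sizes below are illustrative; tune them to the host.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,  # reuse KV blocks across requests that share a prefix
    kv_cache_dtype="fp8",        # roughly halves KV memory relative to fp16
    swap_space=400,              # CPU swap space per GPU in GiB (host RAM, not NVMe)
    cpu_offload_gb=200,          # offloads model weights (not KV blocks) to CPU RAM
    block_size=32,
)
# NVMe and object-store tiers are not configured by these flags; they need an
# external KV-cache connector or application-level checkpointing (see Pattern 2).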

Pattern 2: Conversation checkpoint

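import torch
from pathlib import Path

class ConversationCheckpoint:
    """Save and restore one conversation's KV blocks and token ids on local disk."""

    def __init__(self, store_dir: Path):
        self.store = store_dir
        self.store.mkdir(exist_ok=True, parents=True)

    def save(self, conv_id: str, kv_blocks: list[torch.Tensor], tokens: list[int]) -> None:
        path = self.store / f"{conv_id}.pt"
        tmp = path.with_suffix(".tmp")  # temp file in the same directory
        torch.save({
            "kv": [b.cpu() for b in kv_blocks],  # move blocks off the GPU before writing
            "tokens": tokens,
            "version": 2,
        }, tmp)
        tmp.replace(path)  # atomic on the same filesystem; overwrites an existing file

    def load(self, conv_id: str) -> dict | None:
        path = self.store / f"{conv_id}.pt"
        if not path.exists():
            return None
        # weights_only=True restricts unpickling to tensors and plain containers.
        return torch.load(path, map_location="cuda", weights_only=True)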

Pattern 3: Tiered eviction policy

from dataclasses import dataclass
from time import time

@dataclass
class Session:
    id: str
    last_access: float  # Unix timestamp of the last request
    size_gb: float      # KV-cache footprint in HBM

def evict_tier(sessions: list[Session], capacity_gb: float) -> list[tuple[Session, str]]:
    """LRU eviction: return (session, target_tier) pairs until usage fits capacity_gb."""
    used = sum(s.size_gb for s in sessions)
    evicted: list[tuple[Session, str]] = []
    now = time()
    # Oldest sessions first; sorted() leaves the caller's list untouched.
    for s in sorted(sessions, key=lambda s: s.last_access):
        if used <= capacity_gb:
            break
        idle_min = (now - s.last_access) / 60
        # An evicted session leaves HBM, so the warmest possible target is CPU RAM.
        if idle_min < 60:
            target = "CPU"
        elif idle_min < 1440:
            target = "NVMe"
        else:
            target = "S3"
        evicted.append((s, target))
        used -= s.size_gb
    return evicted

Pattern 4: Resume with prefix matching

def resume_with_prefix(checkpoint: dict, new_prompt: str, tokenizer) -> tuple[list, list[int]]:
    """Reuse the checkpoint's KV for the shared token prefix; with no shared prefix, start from scratch."""
    saved_tokens = checkpoint["tokens"]
    new_tokens = tokenizer.encode(new_prompt)
    common = 0
    for saved, new in zip(saved_tokens, new_tokens):
        if saved != new:
            break
        common += 1
    # Keep at least one token to feed the model, even on a full prefix hit.
    common = min(common, len(new_tokens) - 1)
    if common <= 0:
        return [], new_tokens
    # Assumes each KV tensor has the token axis at dim 1.
    kept_kv = [k[:, :common] for k in checkpoint["kv"]]
    return kept_kv, new_tokens[common:]

Pattern 5: Quantized serialization

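import numpy as np
import torch

def serialize_kv_int8(kv: torch.Tensor) -> tuple[bytes, dict]:
    """Quantize an fp16 KV tensor to int8, halving storage. Lossy: the round trip is not bit-exact."""
    kv32 = kv.detach().float().cpu()  # compute the scale in fp32 on the host
    # Symmetric per-tensor scale; the clamp avoids division by zero for an all-zero tensor.
    scale = kv32.abs().amax().clamp(min=1e-8) / 127
    q = (kv32 / scale).round().clamp(-128, 127).to(torch.int8)
    return q.numpy().tobytes(), {"scale": scale.item(), "shape": list(q.shape)}

def deserialize_kv_int8(data: bytes, meta: dict) -> torch.Tensor:
    # np.frombuffer returns a read-only view, so copy before handing it to torch.
    arr = np.frombuffer(data, dtype=np.int8).reshape(meta["shape"]).copy()
    return (torch.from_numpy(arr).float() * meta["scale"]).to(torch.float16)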

Decision criteria

| Situation | Approach |
| --- | --- |
| Conversation idle < 5 min | HBM only. |
| Long conversation, idle for hours | NVMe tier. |
| Multi-day project context | S3 + prefix cache. |
| Cost-sensitive multi-tenant | Aggressive 4-tier ICP. |
| Latency-sensitive (< 10ms) | HBM only; do not use ICP. |

Default: 4-tier (HBM → CPU → NVMe → S3) with LRU eviction, fp8 KV-cache, and prefix caching enabled.

🔗 Graph

🤖 LLM usage

When to use: production LLM serving with multi-hour conversations, cost optimization, and multi-tenant deployments with 100K+ sessions.

When not to use: single-shot inference (no persistence needed) and strict-latency real-time systems (< 10ms first-token).

Anti-patterns

  • Naive pickle of KV: ignores quantization, so checkpoints end up 5-10x bigger than needed.
  • No atomic write: a crash leaves a corrupted checkpoint that cannot be recovered.
  • Per-token checkpoint: causes an IOPS storm; checkpoint in batches of N tokens instead.
  • Resume without prefix check: silent correctness bug.
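
The per-token anti-pattern can be avoided with a small counter that persists state only every N tokens and once more when the session goes idle. The sketch below is one possible shape for this; `BatchedCheckpointer` and its `flush` callback are hypothetical names, and in a real system the callback would persist the KV blocks alongside the token ids.

```python
from typing import Callable

class BatchedCheckpointer:
    """Flush a checkpoint every `every_n_tokens` tokens instead of on every token."""

    def __init__(self, every_n_tokens: int, flush: Callable[[list[int]], None]):
        if every_n_tokens < 1:
            raise ValueError("every_n_tokens must be >= 1")
        self.every_n_tokens = every_n_tokens
        self.flush = flush            # hypothetical callback that persists state
        self.tokens: list[int] = []   # full token history of the conversation
        self.unsaved = 0              # tokens generated since the last flush

    def on_token(self, token_id: int) -> None:
        self.tokens.append(token_id)
        self.unsaved += 1
        if self.unsaved >= self.every_n_tokens:
            self.checkpoint()

    def checkpoint(self) -> None:
        """Persist now; also call this when the session goes idle."""
        if self.unsaved:
            self.flush(list(self.tokens))
            self.unsaved = 0

# Usage: 10 tokens with N=4 produce 3 writes instead of 10.
writes: list[list[int]] = []
cp = BatchedCheckpointer(every_n_tokens=4, flush=writes.append)
for t in range(10):
    cp.on_token(t)
cp.checkpoint()  # final flush for the 2 leftover tokens
print(len(writes))
```

The trade-off is that a crash can lose up to N - 1 tokens of KV state, which can be recomputed from the token history on resume.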

🧪 Verification / duplicates

  • Verified: vLLM 0.7 docs (2025), SGLang RadixAttention paper (2024), Mooncake architecture (2024).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with vLLM 2026 patterns, tiered eviction, quantized serialization |