2nd/10_Wiki/Topics/Coding/AI_Prompt_Caching.md

---
id: ai-prompt-caching
title: "Prompt Caching: Anthropic / OpenAI / 50-90% Cost Reduction"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, cache, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [prompt cache, ephemeral cache, cache_control, KV cache, context cache]
---

# Prompt Caching

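> Repeating a large system prompt or context on every call makes costs balloon. **Anthropic is explicit (`cache_control`), OpenAI is implicit (automatic), Gemini uses a context cache**. Expect roughly 50-90% lower cost on the cached portion plus a large latency reduction.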

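## 📖 Key Concepts
- KV cache: the attention key/value tensors computed for a prompt prefix, kept on the GPU so they are not recomputed.
- TTL: 5 minutes by default (Anthropic ephemeral), 1 hour optionally.
- Identical prefix = cache hit.
- A cache write costs slightly more than a normal input (1.25× base) and a read costs far less (0.1× base), so a single use is a small loss and caching pays off from the second use onward.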

## 💻 Code Patterns

### Anthropic (explicit)
```ts
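import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();  // reads ANTHROPIC_API_KEY from the environment

const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: hugeSystemPrompt,  // ~10,000 tokens
      cache_control: { type: 'ephemeral' },  // marks the end of the cacheable prefix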
    },
  ],
  messages: [{ role: 'user', content: 'Hello' }],
});

console.log(r.usage);
// { cache_creation_input_tokens: 10000, cache_read_input_tokens: 0, ... }
```

```ts
// Called again within 5 minutes
const r2 = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,  // required by the Messages API
  system: [
    { type: 'text', text: hugeSystemPrompt, cache_control: { type: 'ephemeral' } },
  ],
  messages: [{ role: 'user', content: 'New question' }],
});
// r2.usage: { cache_read_input_tokens: 10000, cache_creation_input_tokens: 0, ... }
// The cached portion is billed at a 90% discount.
```
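To tell the two calls above apart programmatically, the `usage` fields can be folded into a single label. This is a minimal sketch: `classifyCacheResult` and the `CacheOutcome` labels are hypothetical names, and only the usage field names come from the responses shown above.

```ts
// Hypothetical helper; only the usage field names come from the API responses above.
type CacheUsage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type CacheOutcome = 'hit' | 'write' | 'partial' | 'none';

function classifyCacheResult(usage: CacheUsage): CacheOutcome {
  const written = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  if (read > 0 && written > 0) return 'partial'; // reused a prefix and extended the cache
  if (read > 0) return 'hit';                    // prefix served from cache
  if (written > 0) return 'write';               // first call: cache entry created
  return 'none';                                 // nothing was cached or read
}
```

With the usage objects shown above, the first call would classify as `'write'` and the second as `'hit'`.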

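### Where cache breakpoints can go
```
1. system (large instructions)
2. tools (large list)
3. messages (long history): cache_control on the last block you want cached

Prefix order inside a request: tools, then system, then messages.
```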

```ts
// Put cache_control on the last content block of the long history
const messages = [
  { role: 'user', content: 'first' },
  { role: 'assistant', content: 'response 1' },
  // ... many more messages
  {
    role: 'user',
    content: [
      {
        type: 'text',
        text: 'recent',
        // cache_control belongs on the content block, not on the message;
        // everything up to and including this block is cached
        cache_control: { type: 'ephemeral' },
      },
    ],
  },
  { role: 'user', content: 'newest question' },
];
```
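In a growing conversation the breakpoint has to move to the end of the history on every turn. The helper below is one possible sketch of that step; `withCacheBreakpoint` and the simplified `Message` / `TextBlock` types are hypothetical stand-ins for the SDK's own types, and it does not remove breakpoints already present earlier in the history.

```ts
// Simplified, hypothetical stand-ins for the SDK's message types.
type TextBlock = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Message = { role: 'user' | 'assistant'; content: string | TextBlock[] };

// Returns a copy of `history` with a cache breakpoint on its final content block.
function withCacheBreakpoint(history: Message[]): Message[] {
  if (history.length === 0) return [];
  const copy: Message[] = history.map((m) => ({ ...m }));
  const last = copy[copy.length - 1];
  // A plain string has nowhere to carry cache_control, so normalize it to a block array.
  const blocks: TextBlock[] =
    typeof last.content === 'string'
      ? [{ type: 'text', text: last.content }]
      : last.content.map((b) => ({ ...b }));
  if (blocks.length > 0) {
    blocks[blocks.length - 1].cache_control = { type: 'ephemeral' };
  }
  last.content = blocks;
  return copy; // the caller appends the newest question after this
}
```

The input array is left untouched, so the same history can be stored as-is and marked only at request time.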

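### Anthropic pricing (as of 2026)
```
Cache write:  base × 1.25 (25% surcharge on the write)
Cache read:   base × 0.10 (90% discount)

→ Pays off from the second use onward.
   N uses (1 write + N-1 reads) cost (1.25 + 0.1 × (N - 1)) × base
   instead of N × base, saving (0.9 × N - 1.15) × base.
```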

### Typical cost
```
Without cache:
  10K tokens × 100 calls = 1M tokens × $15/M = $15

With cache (1 write + 99 read):
  Write:  10K × $18.75/M = $0.19
  Read:   10K × 99 × $1.5/M = $1.49
  Total:  $1.68

→ About 89% savings.
```
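The arithmetic above can be reproduced with a small function. This is a sketch under the note's own assumptions: the 1.25× write and 0.1× read multipliers, a $15/M base input rate, and every call after the first landing inside the TTL. `prefixCost` and `breakEvenCalls` are hypothetical names.

```ts
// Multipliers quoted in this note; treat them as assumptions and check current pricing.
const WRITE_MULTIPLIER = 1.25;
const READ_MULTIPLIER = 0.1;

// Cost in USD of sending the same prefix `calls` times.
function prefixCost(prefixTokens: number, calls: number, basePerMTok: number, cached: boolean): number {
  const perCall = (prefixTokens / 1_000_000) * basePerMTok;
  if (!cached || calls === 0) return perCall * calls;
  // One write, then (calls - 1) reads, assuming every later call hits the cache.
  return perCall * (WRITE_MULTIPLIER + READ_MULTIPLIER * (calls - 1));
}

// Smallest number of calls at which caching is cheaper than not caching.
function breakEvenCalls(): number {
  let n = 1;
  while (WRITE_MULTIPLIER + READ_MULTIPLIER * (n - 1) >= n) n++;
  return n;
}

const uncached = prefixCost(10_000, 100, 15, false);
const cached = prefixCost(10_000, 100, 15, true);
console.log(uncached.toFixed(2), cached.toFixed(2), breakEvenCalls());
```

For the example above this gives $15.00 uncached against about $1.67 cached (the $1.68 total comes from summing the rounded line items), with break-even at the second call.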

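### OpenAI (automatic, implicit)
```ts
// Prefixes of 1024+ tokens are cached automatically when eligible.
// The cached portion is discounted automatically (50% here; the rate varies by model).
// `openai` is an already constructed client; the model name is only an example.
const r = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'system', content: hugeSystemPrompt }, { role: 'user', content: 'Hello' }],
});

// Response
console.log(r.usage);
// { prompt_tokens: 10000, prompt_tokens_details: { cached_tokens: 9984 }, ... }
```

→ Automatic, nothing to configure, but not as powerful as Anthropic's explicit caching: a smaller discount in the figures above and no control over what gets cached.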
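To sanity-check a `cached_tokens` value, it helps to know the most that could have been cached. The sketch below assumes the behaviour described in OpenAI's prompt caching guide (a 1024-token minimum, then hits in 128-token increments); `maxCachedTokens` is a hypothetical helper, and actual hits also depend on routing and cache eviction.

```ts
// Assumptions taken from OpenAI's prompt caching guide; verify against current docs.
const MIN_CACHEABLE_TOKENS = 1024;
const CACHE_INCREMENT = 128;

// Upper bound on cached_tokens, given how many leading tokens match an earlier request.
function maxCachedTokens(promptTokens: number, matchingPrefixTokens: number): number {
  if (promptTokens < MIN_CACHEABLE_TOKENS) return 0;
  const matched = Math.min(promptTokens, matchingPrefixTokens);
  if (matched < MIN_CACHEABLE_TOKENS) return 0;
  // Round down to the nearest full increment.
  return Math.floor(matched / CACHE_INCREMENT) * CACHE_INCREMENT;
}

console.log(maxCachedTokens(10_000, 10_000)); // 9984
```

A reported `cached_tokens` far below this bound on repeated calls usually points at a prefix that is less stable than expected.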

### Gemini context cache
```python
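import datetime

import google.generativeai as genai
from google.generativeai import caching

cache = caching.CachedContent.create(
    model='models/gemini-1.5-pro-001',  # caching expects a version-pinned model name
    system_instruction='...',
    contents=['Long context'],  # plain strings are accepted as content parts
    ttl=datetime.timedelta(hours=1),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)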
response = model.generate_content('Question')
```

→ Explicit cache object with a TTL you choose (1h here).

### Use cases
```
1. RAG context (long documents):
   - Large retrieved chunks in the system prompt → cache them
   - Follow-up questions reuse the same context

2. Long conversation:
   - Cache the older messages, append the new message

3. Tool definitions (large list):
   - Same tools on every call → cache them

4. Few-shot examples:
   - Large example set → cache it

5. Code review (whole file):
   - Large file → cache it
   - Ask several review questions against it
```

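### TTL strategy
```
Default: 5min
Long: 1h (Anthropic; 1h writes are billed at 2× base instead of 1.25×)

→ Reuse within 5 minutes: the default ephemeral TTL is enough.
→ Frequent, steady reuse with longer gaps: 1h pays off more.
```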

```ts
cache_control: { type: 'ephemeral', ttl: '1h' }
```
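The choice between the two TTLs can be reduced to the typical gap between calls that share a prefix. The sketch below is a rough heuristic, not an Anthropic recommendation: `chooseTtl` and `cacheControlFor` are hypothetical names, and the `'5m'` label is internal to the helper (only `ttl: '1h'` is sent, as in the snippet above).

```ts
type Ttl = '5m' | '1h';

// Pick the shortest TTL that still covers the typical gap between calls.
function chooseTtl(typicalGapSeconds: number): Ttl | null {
  if (typicalGapSeconds < 5 * 60) return '5m';
  if (typicalGapSeconds < 60 * 60) return '1h';
  return null; // gaps of an hour or more: the entry expires before it is reused
}

// Builds the cache_control value, or undefined when caching is not worthwhile.
function cacheControlFor(typicalGapSeconds: number) {
  const ttl = chooseTtl(typicalGapSeconds);
  if (ttl === null) return undefined;
  return ttl === '5m'
    ? { type: 'ephemeral' as const }       // default TTL, no ttl field needed
    : { type: 'ephemeral' as const, ttl }; // explicit 1h TTL
}

console.log(cacheControlFor(30), cacheControlFor(600), cacheControlFor(7200));
```

A production version would also weigh the higher write cost of the longer TTL against the number of reads expected within the hour.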

### Cache invalidation
```
Prefix changed = miss.
A single character changed in the middle = miss.
Order changed = miss.

→ A stable prefix is the key.
```

```ts
// ❌ A different timestamp at the start on every call
const systemBad = `Today: ${new Date()}\n${hugePrompt}`;  // always a miss

// ✅ Stable prefix
const systemGood = [
  { type: 'text', text: hugePrompt, cache_control: { type: 'ephemeral' } },
  { type: 'text', text: `Today: ${new Date()}` },  // dynamic part goes last
];
```
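Why the position of the dynamic part matters can be seen by measuring the shared prefix of two consecutive prompts. This sketch compares characters as a stand-in for the token-level matching the providers actually perform; `commonPrefixLength` and the sample strings are illustrative only.

```ts
// Length of the shared leading run of two strings (stand-in for token-level prefix matching).
function commonPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const samplePrompt = 'You are a support agent. '.repeat(4); // 100 characters

// Dynamic part first: the prompts diverge inside the date.
const bad1 = `Today: 2026-05-09\n${samplePrompt}`;
const bad2 = `Today: 2026-05-10\n${samplePrompt}`;

// Dynamic part last: the whole static prompt is shared.
const good1 = `${samplePrompt}Today: 2026-05-09`;
const good2 = `${samplePrompt}Today: 2026-05-10`;

console.log(commonPrefixLength(bad1, bad2));   // 15
console.log(commonPrefixLength(good1, good2)); // 115
```

With the timestamp first, none of the large prompt falls inside the shared prefix; with it last, all of it does.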

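### Multi-cache (4 breakpoints)
```ts
// Anthropic allows at most 4 cache_control breakpoints per request
system: [
  { type: 'text', text: companyKnowledge, cache_control: { type: 'ephemeral' } },  // breakpoint: system
],
tools: [
  ...allTools.slice(0, -1),
  // cache_control on the last tool caches the whole tool list
  { ...allTools[allTools.length - 1], cache_control: { type: 'ephemeral' } },  // breakpoint: tools
],
messages: [
  {
    role: 'user',
    // cache_control goes on the content block, not on the message
    content: [{ type: 'text', text: longHistory, cache_control: { type: 'ephemeral' } }],  // breakpoint: history
  },
  { role: 'user', content: latest },
],
```

→ Each breakpoint caches the prefix up to that point, so a change in a later layer (such as the history) still hits the cache for the earlier ones.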
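Since exceeding the limit fails the request, a pre-flight check can catch it earlier. The sketch below walks any request-shaped object and counts `cache_control` markers; `countBreakpoints` and `assertBreakpointLimit` are hypothetical helpers, and the limit of 4 is the figure stated in this note.

```ts
const MAX_BREAKPOINTS = 4; // limit quoted in this note

// Recursively counts objects that carry a cache_control property.
function countBreakpoints(value: unknown): number {
  if (Array.isArray(value)) {
    return value.reduce((n: number, v) => n + countBreakpoints(v), 0);
  }
  if (value === null || typeof value !== 'object') return 0;
  const obj = value as Record<string, unknown>;
  const own = 'cache_control' in obj ? 1 : 0;
  return own + Object.entries(obj)
    .filter(([key]) => key !== 'cache_control') // do not descend into the marker itself
    .reduce((n, [, v]) => n + countBreakpoints(v), 0);
}

// Throws before the request is sent if too many breakpoints are present.
function assertBreakpointLimit(params: unknown): void {
  const n = countBreakpoints(params);
  if (n > MAX_BREAKPOINTS) {
    throw new Error(`${n} cache_control breakpoints; max is ${MAX_BREAKPOINTS}`);
  }
}
```

Calling `assertBreakpointLimit(params)` just before `messages.create(params)` turns a remote error into a local one with a clearer message.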

### Monitoring
```ts
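// `metrics` is your metrics client (not defined here).
function logCache(usage: { input_tokens: number; cache_creation_input_tokens: number; cache_read_input_tokens: number }) {
  // Total input = uncached tokens + cache writes + cache reads
  const total = usage.input_tokens + usage.cache_creation_input_tokens + usage.cache_read_input_tokens;
  // Guard against division by zero on empty usage
  metrics.gauge('llm.cache_hit_rate', total > 0 ? usage.cache_read_input_tokens / total : 0);
  metrics.counter('llm.cache_creation_tokens', usage.cache_creation_input_tokens);
}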
```
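A per-request gauge is noisy, so alerting usually works better on a rolling window. The class below is a minimal sketch of that idea with no metrics backend attached; `HitRateMonitor`, its window size, and its threshold are all hypothetical choices.

```ts
type UsageSample = {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
};

// Tracks the token-weighted cache hit rate over the last `windowSize` requests.
class HitRateMonitor {
  private samples: { read: number; total: number }[] = [];
  private windowSize: number;
  private minHitRate: number;

  constructor(windowSize: number, minHitRate: number) {
    this.windowSize = windowSize;
    this.minHitRate = minHitRate;
  }

  record(usage: UsageSample): void {
    const read = usage.cache_read_input_tokens;
    const total = usage.input_tokens + usage.cache_creation_input_tokens + read;
    this.samples.push({ read, total });
    if (this.samples.length > this.windowSize) this.samples.shift(); // drop the oldest
  }

  hitRate(): number {
    const read = this.samples.reduce((n, s) => n + s.read, 0);
    const total = this.samples.reduce((n, s) => n + s.total, 0);
    return total > 0 ? read / total : 0;
  }

  // Only alert once the window is full, to avoid firing on the first cold write.
  shouldAlert(): boolean {
    return this.samples.length === this.windowSize && this.hitRate() < this.minHitRate;
  }
}
```

Waiting for a full window before alerting is a deliberate choice: the first request after a deploy is always a write and should not page anyone.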

### Difference from a semantic cache
```
Prompt cache:   identical prefix (textual) → reuses the KV cache
Semantic cache: similar query → cached answer (embedding comparison)

→ Different layers. Both can be used together.
```
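For contrast with prefix matching, a semantic cache looks up answers by embedding similarity. The toy below is only a sketch of that lookup: embeddings are passed in as plain arrays (no embedding model is called), and `SemanticCache`, the linear scan, and the 0.95 threshold are hypothetical choices.

```ts
type Entry = { embedding: number[]; answer: string };

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class SemanticCache {
  private entries: Entry[] = [];
  private threshold: number;

  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  put(embedding: number[], answer: string): void {
    this.entries.push({ embedding, answer });
  }

  // Returns the stored answer of the most similar entry at or above the threshold.
  get(embedding: number[]): string | undefined {
    let best: Entry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosine(entry.embedding, embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.answer;
  }
}
```

A hit here skips the model call entirely, whereas a prompt-cache hit still runs the model and only discounts the shared prefix.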

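### When NOT to cache
```
- Prompt differs on every call: no benefit.
- Small prompt (below the minimum cacheable length, 1024 tokens on most Anthropic models): not cached at all.
- Single-shot (no reuse): you only pay the write cost.
- Very infrequent use (the 5min TTL expires between calls): miss.
```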
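The rules above can be collapsed into one guard that runs before a breakpoint is attached. This is a sketch: `shouldCache`, the `CachePlan` shape, and the cut-offs are hypothetical, with the 1024-token minimum and the 1h maximum TTL taken from this note.

```ts
type CachePlan = {
  prefixTokens: number;      // size of the stable prefix
  expectedCalls: number;     // how many requests will share it
  typicalGapSeconds: number; // usual time between those requests
};

const MIN_CACHEABLE = 1024;      // figure quoted in this note; some models may require more
const MAX_TTL_SECONDS = 60 * 60; // longest TTL mentioned in this note

function shouldCache(plan: CachePlan): { cache: boolean; reason: string } {
  if (plan.prefixTokens < MIN_CACHEABLE) {
    return { cache: false, reason: 'prefix below the minimum cacheable length' };
  }
  if (plan.expectedCalls < 2) {
    return { cache: false, reason: 'single use only pays the write surcharge' };
  }
  if (plan.typicalGapSeconds >= MAX_TTL_SECONDS) {
    return { cache: false, reason: 'entries expire before they are reused' };
  }
  return { cache: true, reason: 'stable prefix reused within the TTL' };
}

console.log(shouldCache({ prefixTokens: 10_000, expectedCalls: 100, typicalGapSeconds: 20 }));
```

Returning a reason alongside the decision makes it easy to log why a given call path was left uncached.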

### Common pattern (Anthropic)
```ts
async function chatWithKnowledge(userMsg: string) {
  return await anthropic.messages.create({
    model: 'claude-opus-4-7',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: COMPANY_KNOWLEDGE,  // large (~50K tokens)
        cache_control: { type: 'ephemeral', ttl: '1h' },
      },
    ],
    messages: [{ role: 'user', content: userMsg }],
  });
}

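// Assuming a $15/M base input rate:
// First call:       50K write ≈ $1.50 (1h TTL writes are 2× base; ≈ $0.94 with the default 5min TTL)
// Subsequent calls: 50K read  ≈ $0.075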
```

### Cost calculator
```ts
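type Usage = {
  input_tokens: number; output_tokens: number;
  cache_creation_input_tokens: number; cache_read_input_tokens: number;
};
type Rates = { input: number; output: number; cacheWrite: number; cacheRead: number };
// USD per token, keyed by model; fill in from the provider's current price list.
declare const MODEL_RATES: Record<string, Rates>;

function calcCost(usage: Usage, model: string): number {
  const rates = MODEL_RATES[model];
  if (!rates) throw new Error(`No rates configured for model: ${model}`);
  return (
    usage.cache_creation_input_tokens * rates.cacheWrite +
    usage.cache_read_input_tokens * rates.cacheRead +
    usage.input_tokens * rates.input +  // uncached input only
    usage.output_tokens * rates.output
  );
}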
```

## 🤔 Decision Criteria
| Situation | Use |
|---|---|
| Large system prompt repeated | Cache (90% savings) |
| Same RAG context reused | Cache |
| Long conversation | Cache older messages |
| One-off prompt | No cache |
| Test/eval prompts that differ every run | No cache |
| Code review of a single file | Cache file content |
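## ❌ Anti-patterns
- **Different prefix on every call (timestamp at the start)**: always a miss.
- **Cache write for a prompt used once**: net cost increase.
- **More than 4 `cache_control` breakpoints (Anthropic)**: request error.
- **Assuming hits and ignoring misses**: surprises on the bill.
- **Short TTL with infrequent use**: a miss every time.
- **Assuming OpenAI auto-caching with under 1024 tokens**: nothing is cached.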

## 🤖 LLM Usage Hints
- Anthropic = explicit (large savings).
- OpenAI = automatic (for prompts of 1024+ tokens).
- Design for a stable prefix.
- Monitor the hit rate and alert on drops.

## 🔗 Related Documents
- [[AI_LLM_Cost_Optimization]]
- [[AI_RAG_Pattern_Basics]]
- [[AI_Prompt_Engineering_Patterns]]