[G1-Sync] Manual knowledge update

2026-05-09 21:08:02 +09:00
parent f0befc887a
commit 93ec7e9056
363 changed files with 68333 additions and 64 deletions
@@ -0,0 +1,227 @@
+---
+id: ai-llm-cost-optimization
+title: LLM Cost Optimization - Cache / Routing / Batch
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [ai, llm, cost, optimization, vibe-coding]
+tech_stack: { language: "TS", applicable_to: ["Backend"] }
+applied_in: []
+aliases: [LLM cost, prompt cache, batch API, model routing, semantic cache]
+---
+
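+# LLM Cost Optimization
+
+> Per-token charges ($/1M tokens) add up quickly. Five levers: **prompt cache / semantic cache / model routing / batch API / small-model fallback**. Combined, they can cut cost by up to roughly 80% on favorable workloads.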
+
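+## 📖 Core Concepts
+- Prompt cache: the provider reuses a repeated prompt prefix (automatic on OpenAI; opt-in via `cache_control` on Anthropic).
+- Semantic cache: queries with the same meaning are served from a cached result.
+- Model routing: simple requests go to a small model, complex ones to a large model.
+- Batch API: accept a completion window of up to 24h in exchange for a 50% discount.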
+
+## 💻 Code Patterns
+
+### Prompt cache (Anthropic, opt-in via `cache_control`)
+```ts
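+import Anthropic from '@anthropic-ai/sdk';
+
+const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
+
+// hugeSystemPrompt and messages are assumed to be defined by the caller.
+const r = await anthropic.messages.create({
+  model: 'claude-opus-4-7',
+  max_tokens: 1024, // required by the Messages API
+  system: [
+    { type: 'text', text: hugeSystemPrompt, cache_control: { type: 'ephemeral' } },
+  ],
+  messages,
+});
+
+// A second call with the same system prefix reads it from the cache:
+// the cached portion is billed at about 10% of the base input price (a 90% discount).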
+```
+
+→ 5-minute lifetime by default, refreshed on each cache hit. Most effective with a large system prompt.
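
To make the discount concrete, here is a minimal arithmetic sketch. It assumes a cache write is billed at 1.25x the base input price and a cache read at 0.1x; both multipliers are assumptions to check against current pricing, and `prefixCost` is a hypothetical helper, not part of any SDK.

```typescript
// Assumed multipliers relative to the base input price (verify against current pricing).
const CACHE_WRITE_MULTIPLIER = 1.25; // the first call writes the prefix to the cache
const CACHE_READ_MULTIPLIER = 0.1; // later calls within the lifetime read it back

/** Input cost in USD of sending one shared prefix on `calls` requests. */
function prefixCost(
  prefixTokens: number,
  calls: number,
  pricePerMTok: number,
  useCache: boolean,
): number {
  const base = (prefixTokens / 1_000_000) * pricePerMTok;
  if (!useCache || calls === 0) return base * calls;
  // One write, then (calls - 1) reads, assuming every later call lands within the lifetime.
  return base * CACHE_WRITE_MULTIPLIER + base * CACHE_READ_MULTIPLIER * (calls - 1);
}

// 100 calls sharing a 5,000-token system prompt at $15 per 1M input tokens.
const withoutCache = prefixCost(5_000, 100, 15, false);
const withCache = prefixCost(5_000, 100, 15, true);
console.log(withoutCache.toFixed(2), withCache.toFixed(2));
```

Under these assumptions the prefix costs about $7.50 uncached and about $0.84 cached. A single call is more expensive with caching (the write surcharge), so the cache pays off from the second call onward.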
+
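+### OpenAI prompt cache (automatic)
+```
+Prompts with a prefix of 1024+ tokens are cached automatically (when the conditions are met).
+The cached portion is billed at a 50% discount (gpt-4o pricing; the rate differs on other models).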
+```
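
Because the match is on the prompt prefix, the static parts should come first and the per-request parts last. The sketch below shows that ordering plus a way to read back how much of a prompt was served from cache. `buildMessages` and `cachedFraction` are hypothetical helpers; the `Usage` type mirrors only the `prompt_tokens` and `prompt_tokens_details.cached_tokens` fields of a chat completion response.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Static content first (system prompt, few-shot examples), dynamic content last,
// so consecutive requests share the longest possible prefix.
function buildMessages(
  staticSystem: string,
  fewShot: ChatMessage[],
  userQuery: string,
): ChatMessage[] {
  return [
    { role: 'system', content: staticSystem },
    ...fewShot,
    { role: 'user', content: userQuery },
  ];
}

// Subset of the usage object relevant to caching.
type Usage = {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
};

/** Share of the prompt tokens that were read from the cache (0 to 1). */
function cachedFraction(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}
```

Logging `cachedFraction(r.usage)` per request is one way to notice when a prompt change (for example a timestamp placed at the top of the system prompt) has silently broken the shared prefix.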
+
+### Semantic cache
+```ts
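+import { OpenAI } from 'openai';
+
+const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
+
+// `redis.queryNearest` and `redis.storeWithEmbedding` are hypothetical wrappers
+// around a vector index; they are not part of any Redis client API.
+async function semanticCache(query: string): Promise<string | null> {
+  const emb = await openai.embeddings.create({ model: 'text-embedding-3-small', input: query });
+  // Treat the nearest stored query as a hit only at similarity >= 0.95.
+  const hit = await redis.queryNearest(emb.data[0].embedding, { threshold: 0.95 });
+  return hit ? hit.answer : null;
+}
+
+async function answer(query: string): Promise<string> {
+  const cached = await semanticCache(query);
+  if (cached !== null) return cached;
+  const r = await openai.chat.completions.create({...});
+  const content = r.choices[0].message.content ?? '';
+  if (content) await redis.storeWithEmbedding(query, content); // skip empty answers
+  return content;
+}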
+```
+
+⚠️ Do not cache when the answer depends on the user or on the conversation context.
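
The nearest-neighbour lookup and the 0.95 threshold can be sketched without any vector store. The version below is an in-memory illustration using cosine similarity and a linear scan; `InMemorySemanticCache` is a hypothetical class for explanation only, and a real deployment would use a vector index instead.

```typescript
type CacheEntry = { embedding: number[]; answer: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  /** Returns the answer of the most similar entry, or null below the threshold. */
  lookup(embedding: number[]): string | null {
    let bestAnswer: string | null = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= bestScore) {
        bestScore = score;
        bestAnswer = entry.answer;
      }
    }
    return bestAnswer;
  }

  store(embedding: number[], answer: string): void {
    this.entries.push({ embedding, answer });
  }
}
```

A lower threshold raises the hit rate but also the risk of returning an answer to a different question, so the value is worth tuning against real query pairs.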
+
+### Model routing
+```ts
+type Model = 'gpt-4o' | 'gpt-4o-mini';
+
+// looksLikeMath and looksLikeCode are assumed helper predicates.
+async function route(query: string): Promise<Model> {
+  // Cheap heuristics first
+  if (query.length < 100) return 'gpt-4o-mini';
+  if (looksLikeMath(query) || looksLikeCode(query)) return 'gpt-4o';
+
+  // Otherwise ask a small classifier model
+  const r = await openai.chat.completions.create({
+    model: 'gpt-4o-mini',
+    messages: [{ role: 'system', content: 'Classify: simple/complex' },
+               { role: 'user', content: query }],
+    max_tokens: 5,
+  });
+  const label = r.choices[0].message.content?.toLowerCase() ?? '';
+  return label.includes('complex') ? 'gpt-4o' : 'gpt-4o-mini';
+}
+```
+
+### Batch API (50% discount)
+```ts
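+import OpenAI, { toFile } from 'openai';
+import { setTimeout as sleep } from 'node:timers/promises';
+
+const openai = new OpenAI();
+
+// 1. Build the JSONL payload (one request per line)
+const lines = items.map(i => JSON.stringify({
+  custom_id: i.id,
+  method: 'POST',
+  url: '/v1/chat/completions',
+  body: { model: 'gpt-4o', messages: [{ role: 'user', content: i.text }] },
+}));
+
+// 2. Upload (the file name is arbitrary)
+const file = await openai.files.create({
+  file: await toFile(Buffer.from(lines.join('\n')), 'batch.jsonl'),
+  purpose: 'batch',
+});
+
+// 3. Create the job
+let batch = await openai.batches.create({
+  input_file_id: file.id,
+  endpoint: '/v1/chat/completions',
+  completion_window: '24h',
+});
+
+// 4. Poll until the job reaches a terminal state
+const terminal = ['completed', 'failed', 'expired', 'cancelled'];
+while (!terminal.includes(batch.status)) {
+  await sleep(60_000);
+  batch = await openai.batches.retrieve(batch.id);
+}
+if (batch.status !== 'completed' || !batch.output_file_id) {
+  throw new Error(`batch ended with status ${batch.status}`);
+}
+
+// 5. Download the results (JSONL, one response per line)
+const out = await (await openai.files.content(batch.output_file_id)).text();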
+```
+
+→ Halves the cost. A good fit for work that can wait up to 24h (daily analytics, embeddings).
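
The downloaded output still has to be matched back to the inputs. The sketch below assumes each output line is a JSON object with `custom_id`, a `response` holding `status_code` and `body`, and an `error` field, and that lines may arrive in a different order than the input. `parseBatchOutput` is a hypothetical helper and the `BatchOutputLine` type covers only the fields used here.

```typescript
type BatchOutputLine = {
  custom_id: string;
  response: {
    status_code: number;
    body: { choices: { message: { content: string | null } }[] };
  } | null;
  error: { code: string; message: string } | null;
};

/** Maps custom_id to the completion text, or null for failed requests. */
function parseBatchOutput(jsonl: string): Map<string, string | null> {
  const results = new Map<string, string | null>();
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // tolerate a trailing newline
    const row = JSON.parse(line) as BatchOutputLine;
    const res = row.response;
    let content: string | null = null;
    if (res && res.status_code === 200) {
      content = res.body.choices[0]?.message.content ?? null;
    }
    results.set(row.custom_id, content);
  }
  return results;
}
```

Keying by `custom_id` rather than by line position keeps the join correct even if the output order differs from the input order; entries that map to `null` are candidates for a retry.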
+
+### Compression (system prompt)
+```ts
+// ❌ A 5000-token system prompt sent on every call
+const verboseSystem = `You are an expert ... [long]`;
+
+// ✅ Keep it short, and cache it
+const conciseSystem = 'You are a concise customer support agent. Be brief.';
+// Or keep the long prompt but send it as a cached prefix
+```
+
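+### Try a cheaper model → eval → decide
+```ts
+// callModel and evalCorrectness are assumed helpers.
+async function answerWithFallback(query: string): Promise<string> {
+  const cheap = await callModel('gpt-4o-mini', query);
+  const correct = await evalCorrectness(cheap);
+  if (correct) return cheap;
+  // Fall back to the larger model only when the cheap answer fails the eval
+  return await callModel('gpt-4o', query);
+}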
+```
+
+### Token counting (pre-flight estimate)
+```ts
+import { encoding_for_model } from 'tiktoken';
+
+const enc = encoding_for_model('gpt-4o');
+const tokens = enc.encode(prompt).length;
+enc.free(); // release the WASM-backed encoder
+const estimatedCost = tokens * 0.0000025; // gpt-4o input: $2.50 per 1M tokens = $0.0000025 per token
+
+if (tokens > 100_000) throw new Error('too expensive, split the input');
+```
+
+### Truncate / summarize history
+```ts
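+function trimHistory(messages: Message[], maxTokens: number, keepRecent = 10): Message[] {
+  const total = messages.reduce((s, m) => s + countTokens(m.content), 0);
+  if (total <= maxTokens) return messages;
+  // Too short to split into system + old + recent: nothing to summarize.
+  if (messages.length <= keepRecent + 1) return messages;
+  // 1. Keep the first (system) message
+  // 2. Fold the oldest user/assistant turns into one "Summary: ..." message
+  // 3. Keep the most recent turns verbatim
+  // countTokens and summarize are assumed helpers.
+  const [system, ...rest] = messages;
+  return [system, summarize(...rest.slice(0, -keepRecent)), ...rest.slice(-keepRecent)];
+}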
+```
+
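+### Reducing LLM calls: when retrieval alone is enough
+```ts
+// isSimpleLookup, db.faq.find and ragAnswer are assumed helpers.
+async function handleQuery(query: string) {
+  // Simple lookup → query the DB directly (no LLM)
+  if (isSimpleLookup(query)) return db.faq.find(query);
+
+  // Complex → RAG + LLM
+  return await ragAnswer(query);
+}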
+```
+
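+### Cost tracking / alarm
+```ts
+class CostTracker {
+  static daily = new Map<string, number>();
+  static record(userId: string, cost: number) {
+    const day = new Date().toISOString().slice(0, 10); // UTC date, YYYY-MM-DD
+    const key = `${userId}:${day}`;
+    const before = CostTracker.daily.get(key) ?? 0;
+    const after = before + cost;
+    CostTracker.daily.set(key, after);
+    // Fire once, when the total first crosses $10 (alarm is an assumed helper)
+    if (before <= 10 && after > 10) alarm(`user ${userId} spent over $10 today`);
+  }
+}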
+```
+
+### Model cost table (2026)
+```
+USD per 1M tokens, input / output
+gpt-4o:            $2.50 / $10
+gpt-4o-mini:       $0.15 / $0.60
+claude-opus-4-7:   $15 / $75
+claude-sonnet-4-6: $3 / $15
+claude-haiku-4-5:  $1 / $5
+gemini-2.5-pro:    $1.25 / $10 (prompts up to 200k tokens; $2.50 / $15 above)
+gemini-2.5-flash:  $0.30 / $2.50
+```
+
+→ Choose the model that fits each request.
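
As a worked example of reading the table, the sketch below prices a single request on two models. `PRICES` copies only the gpt-4o and gpt-4o-mini rows, and `requestCost` is a hypothetical helper; the figures are only as current as the table itself.

```typescript
type Price = { inputPerM: number; outputPerM: number };

// USD per 1M tokens, copied from the table above.
const PRICES: Record<string, Price> = {
  'gpt-4o': { inputPerM: 2.5, outputPerM: 10 },
  'gpt-4o-mini': { inputPerM: 0.15, outputPerM: 0.6 },
};

/** Cost in USD of one request with the given token counts. */
function requestCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (p === undefined) throw new Error(`no price listed for ${model}`);
  return (inputTokens * p.inputPerM + outputTokens * p.outputPerM) / 1_000_000;
}

// A request with 2,000 input tokens and 500 output tokens.
const large = requestCost('gpt-4o', 2_000, 500);
const small = requestCost('gpt-4o-mini', 2_000, 500);
console.log(large.toFixed(4), small.toFixed(4), (large / small).toFixed(1));
```

For this request shape the large model comes to about $0.01 and the small one to about $0.0006, a ratio of roughly 17x, which is the gap that routing and small-model fallback try to capture.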
+
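+## 🤔 Decision Criteria
+| Situation | Recommendation |
+|---|---|
+| Same system prompt repeated | Prompt cache |
+| Frequent queries with the same meaning | Semantic cache |
+| Mixed difficulty | Model routing |
+| Large, non-real-time batch | Batch API |
+| Large system prompt or few-shot examples | Prefix cache |
+| Simple lookup | Direct DB query (no LLM) |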
+
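+## ❌ Anti-patterns
+- **Large model for every task**: simple requests pay large-model rates; by the cost table above, gpt-4o is about 17x the price of gpt-4o-mini per token.
+- **Ignoring the cache**: resending the same system prompt uncached on every call raises cost.
+- **No token-count estimate**: unbounded retries blow up the bill.
+- **No embedding cache**: the same query is re-embedded every time.
+- **Sync calls where batch would do**: 2x the cost.
+- **Streaming after the user has left**: tokens are billed through to the end of the response.
+- **Unbounded history**: cost rises every turn.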
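
The embedding-cache item in the list above can be sketched as a small bounded map keyed by the normalized query text. `EmbeddingCache` is a hypothetical class; the whitespace normalization and the least-recently-used eviction are illustrative choices, not requirements.

```typescript
class EmbeddingCache {
  private store = new Map<string, number[]>();
  private maxEntries: number;

  constructor(maxEntries = 10_000) {
    this.maxEntries = maxEntries;
  }

  // Collapse whitespace so trivially different spellings share one entry.
  private static key(text: string): string {
    return text.trim().replace(/\s+/g, ' ');
  }

  get(text: string): number[] | undefined {
    const k = EmbeddingCache.key(text);
    const hit = this.store.get(k);
    if (hit !== undefined) {
      // Re-insert to mark as most recently used (Map keeps insertion order).
      this.store.delete(k);
      this.store.set(k, hit);
    }
    return hit;
  }

  set(text: string, embedding: number[]): void {
    const k = EmbeddingCache.key(text);
    this.store.delete(k);
    if (this.store.size >= this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(k, embedding);
  }
}
```

A caller would check `get(query)` before calling the embeddings endpoint and `set(query, embedding)` after a miss, so that repeated queries cost one embedding call instead of one per request.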
+
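+## 🤖 LLM Usage Hints
+- Combining the five levers (prompt cache / semantic cache / routing / batch / small-model fallback) can save up to about 80%.
+- Estimate tokens up front and alarm on a spending limit.
+- Update the token cost table often.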
+
+## 🔗 Related Documents
+- [[AI_Local_LLM_Inference]]
+- [[AI_Fine_Tuning_vs_Prompting]]
+- [[AI_LLM_Eval_Patterns]]