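---
id: ai-llm-cost-optimization
title: LLM Cost Optimization - Cache / Routing / Batch
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, cost, optimization, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [LLM cost, prompt cache, batch API, model routing, semantic cache]
---

# LLM Cost Optimization

> Per-token charges ($ per 1M tokens) add up quickly. Five levers: **prompt cache / semantic cache / model routing / batch API / small-model fallback**. Combined, they can cut cost by up to about 80%.

## 📖 Core concepts

- Prompt cache: the provider reuses a repeated prompt prefix (automatic on OpenAI; on Anthropic, enabled by marking the prefix with `cache_control`).
- Semantic cache: queries with the same meaning are served from a cached result.
- Model routing: simple requests go to a small model, complex ones to a large model.
- Batch API: accept up to a 24h delay in exchange for a 50% discount.

## 💻 Code patterns

### Prompt cache (Anthropic)

```ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

// `hugeSystemPrompt` and `messages` are assumed to be defined by the caller.
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  system: [
    { type: 'text', text: hugeSystemPrompt, cache_control: { type: 'ephemeral' } },
  ],
  messages,
});
// A second call with the same system prompt reads the cached prefix at about 90% off
```

→ The cache entry lives for 5 minutes by default. Most effective with a large system prompt.

### OpenAI prompt cache (automatic)

```
Prefixes of 1024+ tokens are cached automatically (when the conditions are met).
About 50% discount on the cached portion.
```

### Semantic cache

```ts
import { OpenAI } from 'openai';

const openai = new OpenAI();
// `redis` is a placeholder for a vector-store wrapper; queryNearest and
// storeWithEmbedding are hypothetical helpers, not Redis client methods.

async function semanticCache(query: string): Promise<string | null> {
  const emb = await openai.embeddings.create({ model: 'text-embedding-3-small', input: query });
  const hit = await redis.queryNearest(emb.data[0].embedding, { threshold: 0.95 });
  return hit ? hit.answer : null;
}

async function answer(query: string): Promise<string> {
  const cached = await semanticCache(query);
  if (cached) return cached; // cache hit: no completion call

  // The request body (model, messages) is elided here
  const r = await openai.chat.completions.create({...});
  const text = r.choices[0].message.content!;
  await redis.storeWithEmbedding(query, text);
  return text;
}
```

⚠️ Answers that differ per user or per context cannot be served from a shared cache.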
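To make the nearest-neighbor lookup and the per-user warning concrete, here is a minimal in-memory sketch. It assumes embeddings are already computed (for instance by an embeddings API) and compares them with cosine similarity. `SemanticCache`, `scope`, and `cosineSimilarity` are illustrative names, not a library API, and the 0.95 threshold simply mirrors the snippet above. Scoping each entry by user or tenant is one possible way to keep one user's answer from being served to another.

```typescript
// Hypothetical in-memory semantic cache: a sketch, not a production store.
type Entry = { scope: string; embedding: number[]; answer: string };

// Cosine similarity of two equal-length vectors; returns 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: Entry[] = [];
  private threshold: number;

  constructor(threshold: number = 0.95) {
    this.threshold = threshold;
  }

  // Returns the best-matching answer within `scope`, or null below the threshold.
  lookup(scope: string, embedding: number[]): string | null {
    let best: Entry | null = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      if (entry.scope !== scope) continue; // never cross user/tenant boundaries
      const score = cosineSimilarity(entry.embedding, embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.answer : null;
  }

  store(scope: string, embedding: number[], answer: string): void {
    this.entries.push({ scope, embedding, answer });
  }
}

// Usage with tiny hand-written vectors standing in for real embeddings.
const demo = new SemanticCache(0.95);
demo.store('tenant-1', [1, 0, 0], 'Refunds take 5 days.');
console.log(demo.lookup('tenant-1', [0.99, 0.05, 0])); // hit: same scope, similarity above 0.95
console.log(demo.lookup('tenant-2', [1, 0, 0]));       // null: different scope
```

A linear scan is fine for a sketch; at scale the same contract would sit on top of a vector index. The threshold is a trade-off: lower values raise the hit rate but also the risk of returning an answer to a question that only looks similar.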
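### Model routing

```ts
// `looksLikeMath` and `looksLikeCode` are placeholder heuristics defined elsewhere.
async function route(query: string): Promise<string> {
  // Heuristic
  if (query.length < 100) return 'gpt-4o-mini';
  if (looksLikeMath(query) || looksLikeCode(query)) return 'gpt-4o';

  // Or use a small model as a classifier
  const r = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Classify the query as "simple" or "complex". Reply with one word.' },
      { role: 'user', content: query },
    ],
    max_tokens: 5,
  });
  const label = r.choices[0].message.content?.toLowerCase() ?? '';
  return label.includes('complex') ? 'gpt-4o' : 'gpt-4o-mini';
}
```

### Batch API (50% discount)

```ts
import OpenAI, { toFile } from 'openai';
import { setTimeout as sleep } from 'node:timers/promises';

const openai = new OpenAI();
// `items` ({ id, text }[]) is assumed to be defined by the caller.

// 1. Build the JSONL payload, one request per line
const lines = items.map(i => JSON.stringify({
  custom_id: i.id,
  method: 'POST',
  url: '/v1/chat/completions',
  body: { model: 'gpt-4o', messages: [{ role: 'user', content: i.text }] },
}));

// 2. Upload
const file = await openai.files.create({
  file: await toFile(Buffer.from(lines.join('\n')), 'batch.jsonl'),
  purpose: 'batch',
});

// 3. Create the job
let batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

// 4. Poll until the job leaves its in-progress states
while (['validating', 'in_progress', 'finalizing', 'cancelling'].includes(batch.status)) {
  await sleep(60_000);
  batch = await openai.batches.retrieve(batch.id);
}
if (batch.status !== 'completed') throw new Error(`batch ended as ${batch.status}`);

// 5. Download the results (JSONL, one response per line)
const out = await openai.files.content(batch.output_file_id!);
const results = await out.text();
```

→ About 50% cheaper. Suited to work that can wait up to 24h (daily analysis, embeddings).

### Compression (system prompt)

```ts
// ❌ A 5000-token system prompt sent on every call
const verboseSystem = `You are an expert ... [long]`;

// ✅ Keep it short and cache it
const conciseSystem = 'You are a concise customer support agent. Be brief.';
// or keep the long prompt and mark it as a cached prefix
```

### Try the cheaper model → eval → decide

```ts
// `callModel` and `evalCorrectness` are placeholder helpers defined elsewhere.
async function answerWithFallback(query: string): Promise<string> {
  const cheap = await callModel('gpt-4o-mini', query);
  const correct = await evalCorrectness(cheap);
  if (correct) return cheap;
  // Fall back to the larger model only when the cheap answer fails the check
  return await callModel('gpt-4o', query);
}
```

### Token counting (estimate up front)

```ts
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode(prompt).length;
enc.free(); // release the WASM-backed encoder
const estimatedCost = tokens * 0.0000025; // USD: gpt-4o input at $2.50 per 1M tokens
if (tokens > 100_000) throw new Error('too expensive: split the input');
```

### Truncate / summarize history

```ts
// `countTokens` and `summarize` are placeholder helpers; `summarize` is assumed
// to return a single Message such as "Summary: ...".
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  const total = messages.reduce((s, m) => s + countTokens(m.content), 0);
  if (total < maxTokens) return messages;

  const [system, ...rest] = messages; // 1. keep the leading system message
  const recent = rest.slice(-10);     // 2. keep the 10 most recent turns
  const old = rest.slice(0, -10);     // 3. summarize everything older
  if (old.length === 0) return [system, ...recent];
  return [system, summarize(old), ...recent];
}
```

### Cutting LLM calls: when a lookup or retrieval is enough

```ts
async function handle(query: string) {
  // Simple lookup → query the DB directly (no LLM)
  if (isSimpleLookup(query)) return db.faq.find(query);
  // Complex → RAG + LLM
  return await ragAnswer(query);
}
```

### Cost tracking / alarm

```ts
class CostTracker {
  static daily = new Map<string, number>();

  static record(userId: string, cost: number) {
    const key = `${userId}:${new Date().toDateString()}`;
    const total = (CostTracker.daily.get(key) ?? 0) + cost;
    CostTracker.daily.set(key, total);
    // `alarm` is a placeholder notifier; it fires on every record past the limit
    if (total > 10) alarm(`${userId} spent over $10 today`);
  }
}
```

### Model cost table (2026)

```
Model               Input / Output (USD per 1M tokens)
gpt-4o              $2.50 / $10
gpt-4o-mini         $0.15 / $0.60
claude-opus-4-7     $15   / $75
claude-sonnet-4-6   $3    / $15
claude-haiku-4-5    $0.80 / $4
gemini-2.5-pro      $2.50 / $15
gemini-2.5-flash    $0.30 / $2.50
```

→ Pick the appropriate model per request. Treat these prices as a snapshot and check them against current provider pricing.

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Same system prompt repeated | Prompt cache |
| Frequent queries with the same meaning | Semantic cache |
| Mixed difficulty | Model routing |
| Large, non-real-time batch | Batch API |
| Large system prompt or examples | Prefix cache |
| Simple lookup | Query the DB directly (no LLM) |

## ❌ Anti-patterns

- **Large model for every task**: heavy overpayment. By the price table above, gpt-4o costs about 17x gpt-4o-mini per token.
- **Ignoring the cache**: resending the same system prompt uncached means paying full price on every call.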
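- **No token-count estimate**: unbounded retries make the bill explode.
- **No embedding cache**: the same query is re-embedded every time.
- **Synchronous calls where batch would do**: 2x the cost.
- **Streaming to a user who is no longer watching**: tokens are billed through to the end of the generation.
- **Unbounded history**: cost rises with every turn.

## 🤖 LLM usage hints

- Combining the five levers (cache / semantic cache / routing / batch / small-model fallback) can cut cost by about 80%.
- Estimate tokens up front and alarm on spending limits.
- Update the token cost table often.

## 🔗 Related documents

- [[AI_Local_LLM_Inference]]
- [[AI_Fine_Tuning_vs_Prompting]]
- [[AI_LLM_Eval_Patterns]]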