id: ai-llm-cost-optimization
title: LLM Cost Optimization: Cache / Routing / Batch
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, llm, cost, optimization, vibe-coding
tech_stack: language TS; applicable_to Backend
applied_in:
aliases: LLM cost, prompt cache, batch API, model routing, semantic cache

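LLM Cost Optimization

Charges priced in $ per 1M tokens add up quickly. There are five levers: prompt cache (automatic on OpenAI, opt-in via cache_control on Anthropic), semantic cache, model routing, batch API, and small-model fallback. Combined, they can cut cost by as much as 80%, depending on the workload.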

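📖 Key concepts

  • Prompt cache: the provider reuses a repeated prompt prefix (Anthropic / OpenAI).
  • Semantic cache: queries with the same meaning are served from a result cache.
  • Model routing: simple requests go to a small model, complex ones to a large model.
  • Batch API: accept results within 24h in exchange for a 50% discount.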

💻 Code patterns

Prompt cache (Anthropic, opt-in via cache_control)

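import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API; the value here is illustrative
  system: [
    // cache_control marks the end of the prefix to cache
    { type: 'text', text: hugeSystemPrompt, cache_control: { type: 'ephemeral' } },
  ],
  messages,
});

// A second call with the same system prompt reads the cached prefix at about 90% off the input price

→ 5-minute lifetime by default; most effective with a large system prompt.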
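
To see when the cache pays off, here is a minimal sketch of the arithmetic. It assumes Anthropic's published multipliers for the 5-minute cache (cache writes billed at 1.25x the base input price, cache reads at 0.1x) and that every call after the first arrives within the cache lifetime. The function names and parameters are illustrative, not part of any SDK.

```typescript
// Input cost in USD for `calls` requests that share one cached prefix.
// Assumption: the first call writes the cache (1.25x), later calls read it (0.1x).
function cachedPrefixCost(prefixTokens: number, calls: number, pricePerMTok: number): number {
  if (calls <= 0) return 0;
  const base = (prefixTokens / 1_000_000) * pricePerMTok;
  const write = base * 1.25;
  const reads = base * 0.1 * (calls - 1);
  return write + reads;
}

// Same prefix sent at full price on every call.
function uncachedPrefixCost(prefixTokens: number, calls: number, pricePerMTok: number): number {
  return (prefixTokens / 1_000_000) * pricePerMTok * Math.max(calls, 0);
}

// Example: a 10,000-token prefix at $3 per 1M input tokens, sent 10 times.
const withCache = cachedPrefixCost(10_000, 10, 3);
const withoutCache = uncachedPrefixCost(10_000, 10, 3);
console.log(withCache.toFixed(4), withoutCache.toFixed(4));
```

Under these assumptions a single call costs more with caching than without (the write premium), and the cache breaks even from the second call onward.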

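OpenAI prompt cache (automatic)

Prompts whose prefix is 1024+ tokens are cached automatically when the conditions are met.
The cached portion is billed at a 50% discount (the exact discount depends on the model).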
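
Because the caching is automatic, it is worth checking whether it actually happened. The sketch below reads the `prompt_tokens_details.cached_tokens` field that the Chat Completions API reports in `usage`. The `UsageLike` interface and `cacheHitRatio` helper are illustrative names, declared locally so the snippet does not depend on the SDK types.

```typescript
// Minimal shape of the `usage` object returned with a chat completion.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of prompt tokens that were served from the prompt cache (0 to 1).
function cacheHitRatio(usage: UsageLike): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

// Example with a hand-written usage object.
const ratio = cacheHitRatio({ prompt_tokens: 2048, prompt_tokens_details: { cached_tokens: 1024 } });
console.log(ratio);
```

A ratio that stays near zero on repeated calls usually means the shared prefix is shorter than the minimum or changes between calls (for example, a timestamp at the top of the system prompt).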

Semantic cache

import { OpenAI } from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// `redis` is assumed to be a vector-store wrapper exposing queryNearest / storeWithEmbedding.

// Returns a cached answer when a stored query is semantically close enough, else null
async function semanticCache(query: string): Promise<string | null> {
  const emb = await openai.embeddings.create({ model: 'text-embedding-3-small', input: query });
  const hit = await redis.queryNearest(emb.data[0].embedding, { threshold: 0.95 });
  return hit ? hit.answer : null;
}

async function answer(query: string): Promise<string> {
  const cached = await semanticCache(query);
  if (cached !== null) return cached;
  const r = await openai.chat.completions.create({...});
  const text = r.choices[0].message.content ?? '';
  await redis.storeWithEmbedding(query, text);
  return text;
}

⚠️ Do not cache when the answer depends on the user or on the conversation context.
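
The nearest-match lookup behind a semantic cache can be sketched in memory with cosine similarity. This is an illustration of the idea, not a Redis API: `CacheEntry`, `cosineSimilarity`, and this `queryNearest` are hypothetical names, and a production system would use a vector index rather than a linear scan.

```typescript
interface CacheEntry {
  embedding: number[];
  answer: string;
}

// Cosine similarity of two equal-length vectors; 0 when either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns the most similar entry whose similarity is at least `threshold`, else null.
function queryNearest(entries: CacheEntry[], embedding: number[], threshold: number): CacheEntry | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}

// Toy 2-dimensional embeddings, for illustration only.
const entries: CacheEntry[] = [
  { embedding: [1, 0], answer: 'a' },
  { embedding: [0, 1], answer: 'b' },
];
console.log(queryNearest(entries, [0.99, 0.05], 0.95)?.answer);
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits.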

Model routing

type Model = 'gpt-4o' | 'gpt-4o-mini';

async function route(query: string): Promise<Model> {
  // Heuristics first (looksLikeMath / looksLikeCode are helpers assumed to exist)
  if (query.length < 100) return 'gpt-4o-mini';
  if (looksLikeMath(query) || looksLikeCode(query)) return 'gpt-4o';

  // Otherwise, ask a small classifier model
  const r = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'system', content: 'Classify: simple/complex' },
               { role: 'user', content: query }],
    max_tokens: 5,
  });
  return r.choices[0].message.content?.includes('complex') ? 'gpt-4o' : 'gpt-4o-mini';
}
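
The heuristic helpers are not defined in the note. One possible shape, purely as an assumption, is a pair of regular-expression checks like the ones below. The patterns are illustrative and will misclassify some inputs (dates such as 2024-05-09 look like arithmetic, for instance), so they should be tuned against real traffic.

```typescript
// Hypothetical heuristic: does the query contain code-like tokens?
function looksLikeCode(query: string): boolean {
  return /\b(function|const|def|async|await)\b|=>|[{};]/.test(query);
}

// Hypothetical heuristic: does the query contain arithmetic or math vocabulary?
function looksLikeMath(query: string): boolean {
  return /\d\s*[-+*\/^=]\s*\d|\b(integral|derivative|prove|solve|equation)\b/i.test(query);
}

console.log(looksLikeMath('What is 12 * 7?'), looksLikeCode('const x = () => 1;'));
```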

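Batch API (50% discount)

import { setTimeout as sleep } from 'node:timers/promises';
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// 1. Build the JSONL input (one request per line)
const lines = items.map(i => JSON.stringify({
  custom_id: i.id,
  method: 'POST',
  url: '/v1/chat/completions',
  body: { model: 'gpt-4o', messages: [{ role: 'user', content: i.text }] },
}));

// 2. Upload (toFile gives the in-memory buffer a filename)
const file = await openai.files.create({
  file: await toFile(Buffer.from(lines.join('\n')), 'batch.jsonl'),
  purpose: 'batch',
});

// 3. Create the batch job
let batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

// 4. Poll until the batch reaches a terminal state
const terminal = ['completed', 'failed', 'expired', 'cancelled'];
while (!terminal.includes(batch.status)) {
  await sleep(60_000);
  batch = await openai.batches.retrieve(batch.id);
}
if (batch.status !== 'completed' || !batch.output_file_id) {
  throw new Error(`batch ended with status ${batch.status}`);
}

// 5. Download the results (JSONL, one line per request)
const out = await openai.files.content(batch.output_file_id);
const resultsJsonl = await out.text();

→ 50% lower cost. Suited to work that can wait up to 24h (daily analytics, embeddings).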
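
The downloaded output is JSONL, and its lines are not guaranteed to be in input order, so results are matched back by `custom_id`. The sketch below parses that text into a map. `BatchOutputLine` models only the fields used here, and `parseBatchOutput` is an illustrative name; failed rows are skipped rather than retried to keep the example short.

```typescript
// Only the fields this sketch reads from each output line.
interface BatchOutputLine {
  custom_id: string;
  response: {
    status_code: number;
    body: { choices: { message: { content: string | null } }[] };
  } | null;
  error: { code: string; message: string } | null;
}

// Maps custom_id to the assistant's text for every successful request.
function parseBatchOutput(jsonl: string): Map<string, string> {
  const results = new Map<string, string>();
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue;
    const row = JSON.parse(line) as BatchOutputLine;
    if (row.error || !row.response || row.response.status_code !== 200) continue;
    const content = row.response.body.choices[0]?.message.content;
    if (content != null) results.set(row.custom_id, content);
  }
  return results;
}

// Hand-written sample: one success, one failure.
const sample = [
  JSON.stringify({ id: 'r1', custom_id: 'a', response: { status_code: 200, body: { choices: [{ message: { content: 'hello' } }] } }, error: null }),
  JSON.stringify({ id: 'r2', custom_id: 'b', response: null, error: { code: 'x', message: 'bad' } }),
].join('\n');
console.log(parseBatchOutput(sample).get('a'));
```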

Compression (system prompt)

// ❌ A 5000-token system prompt sent on every call
const verboseSystem = `You are an expert ... [long]`;

// ✅ Keep it short, and cache it
const system = 'You are a concise customer support agent. Be brief.';
// Or keep the long text but send it as a cached prefix

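Try the cheaper model → eval → decide

// callModel and evalCorrectness are helpers assumed to exist
async function answerWithFallback(query: string): Promise<string> {
  const cheap = await callModel('gpt-4o-mini', query);
  const correct = await evalCorrectness(cheap);
  if (correct) return cheap;
  // Fall back to the larger model
  return await callModel('gpt-4o', query);
}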
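
The note does not say how correctness is evaluated. When the task has structured output, one inexpensive option is a deterministic check that costs no extra tokens. The sketch below is an assumption along those lines: `isAcceptable` is a hypothetical helper that accepts the cheap model's output only if it is a JSON object containing every required key with a non-empty value.

```typescript
// Hypothetical acceptance check for structured (JSON) outputs.
function isAcceptable(output: string, requiredKeys: string[]): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch {
    return false; // not valid JSON, so escalate to the larger model
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) return false;
  const record = parsed as Record<string, unknown>;
  return requiredKeys.every(
    key => record[key] !== undefined && record[key] !== null && record[key] !== '',
  );
}

console.log(isAcceptable('{"category":"billing","priority":2}', ['category', 'priority']));
```

Free-form answers need a different check (for example a grader model), which adds its own cost and should be included when comparing against calling the large model directly.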

Token counting (up-front estimate)

import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');
const tokens = enc.encode(prompt).length;
enc.free(); // release the WASM-backed encoder
const estimatedCost = tokens * 0.0000025; // $2.50 per 1M input tokens = $0.0000025 per token

if (tokens > 100_000) throw new Error('too expensive: split the request');

Truncate / summarize history

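function trimHistory(messages: Message[], maxTokens: number): Message[] {
  const total = messages.reduce((s, m) => s + countTokens(m.content), 0);
  if (total <= maxTokens) return messages;
  const keepRecent = 10;
  // Too short to split into system + old + recent without overlap
  if (messages.length <= keepRecent + 1) return messages;
  // 1. Keep the first (system) message
  // 2. Collapse the oldest user/assistant turns into one "Summary: ..." message
  // 3. Keep the most recent turns verbatim
  const [system, ...rest] = messages;
  return [system, summarize(...rest.slice(0, -keepRecent)), ...rest.slice(-keepRecent)];
}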
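
The truncation alternative (dropping the oldest turns instead of summarizing them) can be sketched without any model call. The example assumes a rough estimate of four characters per token, which is only a heuristic and not a real tokenizer; `Message`, `approxTokens`, and `dropOldest` are illustrative names.

```typescript
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough token estimate (assumption: about 4 characters per token).
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Keeps the first message plus as many of the most recent messages as fit in maxTokens.
function dropOldest(messages: Message[], maxTokens: number): Message[] {
  if (messages.length === 0) return [];
  const [first, ...rest] = messages;
  let budget = maxTokens - approxTokens(first.content);
  const kept: Message[] = [];
  // Walk backwards from the newest message until the budget runs out.
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = approxTokens(rest[i].content);
    if (cost > budget) break;
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [first, ...kept];
}

// Example: a system message plus five 40-character turns, trimmed to about 25 tokens.
const history: Message[] = [
  { role: 'system', content: 'sys' },
  ...[0, 1, 2, 3, 4].map((i): Message => ({ role: 'user', content: `${i}`.padEnd(40, 'x') })),
];
console.log(dropOldest(history, 25).length);
```

Unlike the summarizing variant, this version guarantees the result fits the budget (by the same estimate), at the price of losing older context entirely.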

Reducing LLM calls: when retrieval alone is enough

// Simple lookup → query the DB directly (no LLM)
if (isSimpleLookup(query)) return db.faq.find(query);

// Complex → RAG + LLM
return await ragAnswer(query);

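Cost tracking / alarm

class CostTracker {
  // key: `${userId}:${date}` → USD spent that day
  static daily = new Map<string, number>();
  static record(userId: string, cost: number) {
    const key = `${userId}:${new Date().toDateString()}`;
    const before = CostTracker.daily.get(key) ?? 0;
    const after = before + cost;
    CostTracker.daily.set(key, after);
    // Fire once, when the user crosses the $10 threshold (alarm is assumed to exist)
    if (before <= 10 && after > 10) alarm('user spent $10 today');
  }
}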

Model cost table (2026)

model:             input / output (USD per 1M tokens)
gpt-4o:            $2.50 / $10
gpt-4o-mini:       $0.15 / $0.60
claude-opus-4-7:   $15 / $75
claude-sonnet-4-6: $3 / $15
claude-haiku-4-5:  $0.80 / $4
gemini-2.5-pro:    $2.50 / $15
gemini-2.5-flash:  $0.30 / $2.50

→ Pick the appropriate model for each request.
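
Turning the table into a lookup makes per-request cost easy to log. The sketch below copies the prices from the table above as-is (they are the note's figures, not independently verified); `PRICES` and `costUSD` are illustrative names.

```typescript
// USD per 1M tokens, copied from the table above.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
  'claude-opus-4-7': { input: 15, output: 75 },
  'claude-sonnet-4-6': { input: 3, output: 15 },
  'claude-haiku-4-5': { input: 0.8, output: 4 },
  'gemini-2.5-pro': { input: 2.5, output: 15 },
  'gemini-2.5-flash': { input: 0.3, output: 2.5 },
};

// Cost in USD of one request, given its input and output token counts.
function costUSD(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`unknown model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Example: the same 100k-in / 10k-out request on two models.
console.log(costUSD('gpt-4o', 100_000, 10_000), costUSD('gpt-4o-mini', 100_000, 10_000));
```

Throwing on an unknown model is deliberate: a silently missing price would make the tracked spend look lower than it is.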

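🤔 Decision criteria

| Situation | Recommendation |
| --- | --- |
| Same system prompt repeated | Prompt cache |
| Frequent queries with the same meaning | Semantic cache |
| Mixed difficulty | Model routing |
| Large, non-real-time batches | Batch API |
| Large system prompt or examples | Prefix cache |
| Simple lookup | Direct DB query (no LLM) |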

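Anti-patterns

  • Large model for every task: overpays on easy requests (per the cost table, gpt-4o costs roughly 17x gpt-4o-mini per token).
  • Ignoring the cache: the same system prompt is resent at full price on every call.
  • No token estimate: unbounded retries blow up the bill.
  • No embedding cache: the same query is re-embedded every time.
  • Synchronous calls where batch would do: 2x the cost.
  • Streaming when the user is no longer watching: tokens are billed to the end of the stream.
  • Unbounded history: cost grows with every turn.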
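
For the embedding-cache item, a small exact-match cache in front of the embeddings call is often enough. The following is a minimal in-memory sketch under stated assumptions: `cacheKey` and `EmbeddingCache` are hypothetical names, normalization is limited to case and whitespace, and eviction simply drops the oldest entry.

```typescript
// Normalizes a query so trivially different spellings share one cache entry.
function cacheKey(query: string): string {
  return query.trim().toLowerCase().replace(/\s+/g, ' ');
}

class EmbeddingCache {
  private store = new Map<string, number[]>();
  private maxEntries: number;

  constructor(maxEntries = 10_000) {
    this.maxEntries = maxEntries;
  }

  get(query: string): number[] | undefined {
    return this.store.get(cacheKey(query));
  }

  set(query: string, embedding: number[]): void {
    const key = cacheKey(query);
    // Evict the oldest entry (Map preserves insertion order) when full.
    if (!this.store.has(key) && this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, embedding);
  }
}

// Usage: check the cache first, call the embeddings API only on a miss.
const embeddingCache = new EmbeddingCache(2);
embeddingCache.set('Hello   World', [0.1, 0.2]);
console.log(embeddingCache.get(' hello world '));
```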

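🤖 Hints for LLM use

  • Combining the five techniques (prompt cache / semantic cache / routing / batch / small-model fallback) can cut cost by up to 80%.
  • Estimate tokens up front and alarm on a spending limit.
  • Update the token cost table often.

🔗 Related documents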