Files
2nd/10_Wiki/Topics/Coding/AI_Token_Budget_Patterns.md
T
2026-05-09 22:47:42 +09:00

8.6 KiB
Raw Blame History

id, title, category, status, source_trust_level, verification_status, created_at, updated_at, tags, tech_stack, applied_in, aliases
id title category status source_trust_level verification_status created_at updated_at tags tech_stack applied_in aliases
ai-token-budget-patterns Token Budget — context limit / truncation / window Coding draft B conceptual 2026-05-09 2026-05-09
ai
llm
tokens
vibe-coding
language applicable_to
TS / Python
Backend
AI
token budget
context window
truncation
token counting
tiktoken
prompt size

Token Budget Patterns

LLM 가 input + output token 합한 limit. Track + truncate + summarize + dynamic budget. Cost + latency 가 token 수 비례. Smart RAG / message pruning / summary cascade.

📖 핵심 개념

  • Context window: 입력 + 출력 limit (e.g. 200k tokens).
  • Per-call cost = input × $/1k + output × $/1k.
  • Tokenizer 가 model 별 다름.
  • Output limit 이 input 보다 작음 (e.g. 200k in / 8k out).

💻 코드 패턴

Token counting (Anthropic / OpenAI)

// Anthropic
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();

const { input_tokens } = await client.messages.countTokens({
  model: 'claude-opus-4-7',
  messages,
});
// OpenAI tiktoken
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4');
const tokens = enc.encode('Hello world');
console.log(tokens.length);  // 2
enc.free();

Approximate (no API)

// 근사: 1 token ≈ 4 char (English)
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// 한글 = 1 token ≈ 1-2 char (worse)

→ 정확 = tokenizer. 근사 = quick budget.

Budget split

const MAX_CONTEXT = 200_000;
const MAX_OUTPUT = 8_192;

const budget = {
  system: 1_000,        // fixed prompt
  rag: 50_000,          // retrieval
  conversation: 100_000, // history
  user: 5_000,          // current message
  output: MAX_OUTPUT,
};

const sum = Object.values(budget).reduce((a, b) => a + b);
console.assert(sum <= MAX_CONTEXT);

→ 각 piece 의 limit 정함. 넘으면 truncate.

Conversation pruning

function prune(messages: Message[], maxTokens: number): Message[] {
  const result: Message[] = [];
  let used = 0;
  
  // 최신 → 옛 (최신 우선)
  for (let i = messages.length - 1; i >= 0; i--) {
    const t = countTokens(messages[i]);
    if (used + t > maxTokens) break;
    result.unshift(messages[i]);
    used += t;
  }
  
  return result;
}

→ Sliding window. 옛 message 잃음.

Summarization cascade

async function summarize(messages: Message[]): Promise<string> {
  const r = await llm.complete({
    system: 'Summarize this conversation in 200 tokens.',
    messages,
  });
  return r.text;
}

// 너무 길면 요약
if (count(messages) > 50_000) {
  const old = messages.slice(0, -10);
  const recent = messages.slice(-10);
  
  const summary = await summarize(old);
  return [
    { role: 'system', content: `Previous: ${summary}` },
    ...recent,
  ];
}

→ Old context lost detail, recent intact.

Hierarchical summary

1주: 매 10 message → 요약
1개월: 매 hour → 요약
1년: 매 day → 요약

→ Long-term memory tree.

Truncation strategy

type Strategy = 'head' | 'tail' | 'middle' | 'summary';

function truncate(text: string, maxTokens: number, strategy: Strategy = 'tail') {
  const tokens = enc.encode(text);
  if (tokens.length <= maxTokens) return text;
  
  switch (strategy) {
    case 'head':
      return enc.decode(tokens.slice(0, maxTokens));
    case 'tail':
      return enc.decode(tokens.slice(-maxTokens));
    case 'middle':
      const half = maxTokens / 2;
      return enc.decode(tokens.slice(0, half)) + '\n...[truncated]...\n' + enc.decode(tokens.slice(-half));
    case 'summary':
      return await summarize(text);
  }
}

Dynamic context (RAG)

async function buildContext(query: string, budget: number) {
  const candidates = await vectorSearch(query, k: 50);
  
  let used = 0;
  const selected = [];
  
  for (const doc of candidates) {
    const t = estimateTokens(doc.text);
    if (used + t > budget) break;
    selected.push(doc);
    used += t;
  }
  
  return selected;
}

→ Top-K → token budget 까지.

Prompt caching (Anthropic / OpenAI)

// Anthropic prompt caching
const r = await client.messages.create({
  model: 'claude-opus-4-7',
  system: [
    { type: 'text', text: BIG_SYSTEM_PROMPT, cache_control: { type: 'ephemeral' } },
  ],
  messages,
});

→ 같은 system / RAG → 90% cost ↓.

Cost calculation

const PRICING = {
  'claude-opus-4-7': { input: 15, output: 75 },  // $/MTok
  'claude-sonnet-4-6': { input: 3, output: 15 },
  'gpt-4o': { input: 2.5, output: 10 },
};

function cost(model: string, input: number, output: number) {
  const p = PRICING[model];
  return (input * p.input + output * p.output) / 1_000_000;
}

console.log(cost('claude-opus-4-7', 50_000, 5_000));  // $1.125

Streaming + early stop

const stream = await llm.stream({ messages });
let used = 0;
for await (const chunk of stream) {
  process.stdout.write(chunk.text);
  used += chunk.tokens;
  if (used > MAX_OUTPUT) break;  // safety
}

Stop sequences

await llm.complete({
  messages,
  stop_sequences: ['\n\n###', 'END'],
});
// → 만나면 stop, output token 안 씀

→ Output 의 boilerplate 줄이는 trick.

Output JSON 줄이기

❌ "Please reply with detailed JSON including..."
"{\n  \"answer\": \"...\",\n  ...\n}"

✅ "Reply: {answer, confidence}"
{"answer":"...","confidence":0.9}

→ Compact JSON, no whitespace.

큰 doc + 여러 query (split)

// Map-reduce
async function bigDoc(doc: string, query: string) {
  const chunks = split(doc, 50_000);
  const partials = await Promise.all(
    chunks.map(c => llm.complete({ system: query, messages: [{ role: 'user', content: c }] }))
  );
  
  // Reduce
  const combined = partials.map(p => p.text).join('\n\n---\n\n');
  return llm.complete({ system: 'Combine partial answers', messages: [{ role: 'user', content: combined }] });
}

Refine (sequential)

let answer = '';
for (const chunk of chunks) {
  answer = await llm.complete({
    system: `Refine answer. Current: ${answer}`,
    messages: [{ role: 'user', content: chunk }],
  });
}

Token-aware chunking (text)

function chunkByTokens(text: string, maxTokens: number, overlap: number) {
  const tokens = enc.encode(text);
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += maxTokens - overlap) {
    chunks.push(enc.decode(tokens.slice(i, i + maxTokens)));
  }
  return chunks;
}

→ Word boundary 안 맞을 수 있음 (overlap = sentence 보호).

Visualizer

function visualize(messages: Message[], max: number) {
  const counts = messages.map(m => ({ role: m.role, t: countTokens(m) }));
  const sum = counts.reduce((a, b) => a + b.t, 0);
  
  console.log(`Total: ${sum} / ${max} (${(sum / max * 100).toFixed(0)}%)`);
  for (const c of counts) {
    const bar = '█'.repeat(Math.floor(c.t / max * 50));
    console.log(`${c.role.padEnd(10)} ${c.t.toString().padStart(6)} ${bar}`);
  }
}

LangChain / LlamaIndex 자동

from langchain.memory import ConversationSummaryBufferMemory
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=2000,
    return_messages=True,
)

→ 자동 prune + summarize.

Context optimization

순서:
1. System (always)
2. RAG (relevant docs)
3. Conversation summary
4. Recent messages
5. Current user

각 = budget. 넘으면 truncate.

Long context vs RAG

Long context (200k+):
- Simple, 모두 in
- Cost 큼, slow

RAG:
- Embed + retrieve top-K
- Cost 작음, 빠름
- Tuning 필요

→ <50k = long context. >50k = RAG.

🤔 의사결정 기준

상황 추천
Token count Tokenizer (정확) / 4-char approx
Context > limit Prune / summarize
같은 system 자주 Prompt caching
큰 doc 1 query Map-reduce / refine
Long history Hierarchical summary
Cost 줄이기 Cache + smaller model + stop seq
Real-time Stream + early stop

안티패턴

  • Token count 안 추적: 한도 넘으면 error.
  • 모든 history 보냄: cost 폭발.
  • Truncation 없음: 한 자라도 over → 실패.
  • Cache 안 씀: 매번 system prompt full $.
  • Verbose JSON output: token 낭비.
  • 모든 doc RAG 보냄: noise + cost.
  • Output limit 무시: 잘림.

🤖 LLM 활용 힌트

  • Tokenizer (model 별) 항상 count.
  • Prompt caching = 큰 cost 절감.
  • Hierarchical summary = long memory.
  • RAG vs long context = size dependent.

🔗 관련 문서