id: ai-long-context-management
title: "Long Context: 1M+ token usage / Compression / Chunk"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, llm, context, vibe-coding
tech_stack:
  language: TS
  applicable_to: Backend
applied_in:
aliases: long context, context window, lost in the middle, recency bias, compression

Long Context Management

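Models with 1M+ token context windows now exist (Gemini, Claude). However, the "lost in the middle" effect remains: the start and the end of the context are attended to most. RAG, compression, and hierarchical approaches therefore still have value.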

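📖 Key concepts

  • Context window: 1M+ (Gemini 2.5 Pro), 200K (Claude Opus).
  • Lost in the middle: tokens in the middle of the context are the most likely to be forgotten.
  • Recency bias: content near the end has the most influence.
  • Token cost: a larger context means a higher cost.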

💻 Code patterns

Long-context models (2026)

Gemini 2.5 Pro:    1M+ tokens
Claude Opus 4.7:   1M tokens
GPT-4.1:           1M tokens
Llama 3.3:         128K tokens

→ Enough for a whole book plus a large codebase.

Lost in the middle

Test:
"Somewhere in this document there is an 'X'. What is 'X'?"

Accuracy by position:
- Start: 95%
- 25%:   75%
- 50%:   60%
- 75%:   80%
- End:   95%

→ Data placed in the middle is poorly used.
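
The position test above can be set up by inserting the same fact at different relative depths of a filler document. A minimal sketch, assuming hypothetical helper names (`insertNeedle`, `buildNeedlePrompts`) and a filler text that is already split into sentences:

```typescript
// Hypothetical helper: place a "needle" sentence at a relative depth inside filler text.
// depth = 0 puts it at the start, 1 at the end, 0.5 in the middle.
function insertNeedle(fillerSentences: string[], needle: string, depth: number): string {
  const clamped = Math.min(1, Math.max(0, depth));
  const index = Math.round(clamped * fillerSentences.length);
  const withNeedle = [
    ...fillerSentences.slice(0, index),
    needle,
    ...fillerSentences.slice(index),
  ];
  return withNeedle.join(' ');
}

// Build one test prompt per depth so accuracy can be compared by position.
function buildNeedlePrompts(
  fillerSentences: string[],
  needle: string,
  question: string,
  depths: number[] = [0, 0.25, 0.5, 0.75, 1],
): { depth: number; prompt: string }[] {
  return depths.map((depth) => ({
    depth,
    prompt: `${insertNeedle(fillerSentences, needle, depth)}\n\nQuestion: ${question}`,
  }));
}
```

Each prompt would then be sent to the model under test and the answers checked against the needle. The model call is left out here because it depends on the provider.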

Strategy 1: Put important data at the end

const messages = [
  { role: 'system', content: SYSTEM_PROMPT },
  { role: 'user', content: `
${largeContext}

# Recent / important context
${importantStuff}

# Question
${userQuery}
` },
];

→ The model attends more to the end.

Strategy 2: Retrieval + small context

Long context (1M tokens) is consistently expensive and prone to losing information.
RAG with roughly 5K tokens of relevant chunks is often better.

→ Relevance matters more than length.
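
One way to picture "relevance over length" is to rank chunks and keep only what fits a small token budget. The sketch below is illustrative only: `overlapScore` is a keyword-overlap stand-in for embedding similarity, and `approxTokens` assumes about 4 characters per token. Both names are hypothetical.

```typescript
interface ScoredChunk { text: string; score: number; }

// Rough token estimate (assumption: ~4 characters per token for English text).
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Stand-in relevance score: fraction of query words found in the chunk.
// A real system would use embedding similarity instead.
function overlapScore(query: string, chunk: string): number {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  if (words.length === 0) return 0;
  const chunkWords = new Set(chunk.toLowerCase().split(/\W+/).filter(Boolean));
  const hits = words.filter((w) => chunkWords.has(w)).length;
  return hits / words.length;
}

// Pick the highest-scoring chunks until the token budget is used up.
function selectChunks(query: string, chunks: string[], budgetTokens: number): string[] {
  const ranked: ScoredChunk[] = chunks
    .map((text) => ({ text, score: overlapScore(query, text) }))
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score);

  const selected: string[] = [];
  let used = 0;
  for (const { text } of ranked) {
    const cost = approxTokens(text);
    if (used + cost > budgetTokens) continue; // skip chunks that do not fit
    selected.push(text);
    used += cost;
  }
  return selected;
}
```

Chunks with no overlap are dropped entirely, so an irrelevant chunk never consumes budget just because space is left over.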

Strategy 3: Hierarchical

1. Summarize each chunk (with a small LLM)
2. Use the summaries as the context
3. Request a specific chunk when needed

[chunk 1 summary] [chunk 2 summary] ... [chunk 100 summary]
↓
"Need detail of chunk 47" → fetch full

→ Navigation for long documents.
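
The three steps above can be sketched as a small summary index. This is a minimal illustration, not a definitive design: `firstSentence` is a stand-in for the small-LLM summarizer, and `buildIndex`, `renderIndex`, and `fetchChunk` are hypothetical names.

```typescript
interface ChunkIndexEntry { id: number; summary: string; }

// Stand-in summarizer (assumption): keep the first sentence of the chunk.
// In practice this would be a call to a small LLM.
function firstSentence(chunk: string): string {
  const match = chunk.match(/^[^.!?]*[.!?]/);
  return (match ? match[0] : chunk).trim();
}

// Step 1: one summary per chunk, with 1-based ids.
function buildIndex(
  chunks: string[],
  summarize: (chunk: string) => string = firstSentence,
): ChunkIndexEntry[] {
  return chunks.map((chunk, i) => ({ id: i + 1, summary: summarize(chunk) }));
}

// Step 2: render the summaries as the compact context shown to the model.
function renderIndex(index: ChunkIndexEntry[]): string {
  return index.map((e) => `[chunk ${e.id}] ${e.summary}`).join('\n');
}

// Step 3: resolve a "need detail of chunk N" request back to the full text.
function fetchChunk(chunks: string[], id: number): string | undefined {
  return chunks[id - 1];
}
```

The model only ever sees `renderIndex(...)` up front. The full text of a chunk is paid for only when the model asks for it by id.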

Strategy 4: Multi-step

// Step 1: Question understanding
const questionType = await llm.analyze(query);

// Step 2: Relevant section (small model)
const sections = await llm.identify(largeDoc, questionType);

// Step 3: Detailed answer (big model)
const answer = await llm.complete({
  context: sections,
  query,
});

→ Retrieval and reasoning are separated.

Strategy 5: Compression

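// LLMLingua / LongLLMLingua
// Original: 10K tokens
// Compressed: ~3K tokens at ratio 0.3 (key info only)

// Assumption: a JS binding that exposes compress(). LLMLingua is published as a
// Python library, so verify the package name and API before relying on this import.
import { compress } from 'llmlingua-js';
const compressed = await compress(longText, { ratio: 0.3 });

→ About 70% fewer tokens, with accuracy largely maintained.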

Sliding window (chat history)

function trimHistory(messages: Message[], maxTokens: number): Message[] {
  if (messages.length === 0) return [];

  // Always keep the system message, and count it against the budget
  const hasSystem = messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  let total = hasSystem ? countTokens(messages[0].content) : 0;

  // Walk backwards from the newest message until the budget is exhausted
  const recent: Message[] = [];
  for (let i = messages.length - 1; i >= (hasSystem ? 1 : 0); i--) {
    const tokens = countTokens(messages[i].content);
    if (total + tokens > maxTokens) break;
    total += tokens;
    recent.unshift(messages[i]);  // prepend to preserve chronological order
  }

  return [...system, ...recent];
}

Summarizing old messages

async function condenseHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length < 20) return messages;

  // Keep the original system prompt out of the summary so it is not lost
  const head = messages[0].role === 'system' ? [messages[0]] : [];
  const body = messages.slice(head.length);
  const old = body.slice(0, -10);
  const recent = body.slice(-10);

  const summary = await llm.complete({
    system: 'Summarize this conversation in 200 words. Keep key facts.',
    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
  });

  return [
    ...head,
    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
    ...recent,
  ];
}

→ The conversation stays within the context window.

Caching (Anthropic)

// The same large context is sent repeatedly → cache it
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,  // required by the Messages API
  system: [
    {
      type: 'text',
      text: hugeDoc,  // 200K tokens
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: question }],
});

→ Roughly 90% lower input cost on subsequent calls that hit the cache.

See AI_Prompt_Caching.

Chunking strategy

Fixed size: simple, but cuts through meaning.
Sentence: natural.
Paragraph: a unit of meaning.
Section (heading): a large boundary.
Semantic: an LLM decides the boundary.

→ Prefer the most meaningful boundary.

function smartChunk(doc: string, maxTokens = 1000): string[] {
  // Split before each "## " heading; the lookahead keeps the heading with its section
  const sections = doc.split(/\n(?=##\s)/);
  
  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Too large: split further by paragraph
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
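
`smartChunk` relies on a `splitByParagraph` helper that is not shown. One possible shape is sketched below, assuming paragraphs are separated by blank lines and using a rough `approxTokens` estimate (about 4 characters per token, a hypothetical stand-in for a real tokenizer).

```typescript
// Rough token estimate (assumption: ~4 characters per token); swap in a real tokenizer.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack paragraphs into chunks of at most maxTokens.
// A single paragraph larger than maxTokens is kept whole as its own chunk,
// and the blank-line separators are not counted toward the budget.
function splitByParagraph(section: string, maxTokens: number): string[] {
  const paragraphs = section.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const paragraph of paragraphs) {
    const tokens = approxTokens(paragraph);
    // Flush the current chunk if adding this paragraph would exceed the budget
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current.join('\n\n'));
      current = [];
      currentTokens = 0;
    }
    current.push(paragraph);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```

Keeping an oversized paragraph whole is a deliberate simplification. A stricter version could fall back to sentence-level splitting for that case.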

Semantic chunking

async function semanticChunk(text: string): Promise<string[]> {
  // Split after sentence-ending punctuation, keeping the punctuation itself
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  if (sentences.length === 0) return [];
  const embeddings = await Promise.all(sentences.map((s) => embed(s)));
  
  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  
  for (let i = 1; i < sentences.length; i++) {
    const sim = cosine(embeddings[i - 1], embeddings[i]);
    if (sim < 0.7) {
      // Similarity dropped: close the current chunk and start a new one
      chunks.push(current.join(' '));
      current = [sentences[i]];
    } else {
      current.push(sentences[i]);
    }
  }
  chunks.push(current.join(' '));
  
  return chunks;
}

→ A shift in meaning marks a chunk boundary.
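
The `cosine` helper used by `semanticChunk` is not defined in this note. A minimal sketch of it follows, together with a hypothetical `findBoundaries` function that isolates the boundary rule so it can be checked without calling an embedding API. The 0.7 threshold is the value used above and would need tuning per embedding model.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero magnitude, to avoid dividing by zero.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Vector length mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Indices where similarity between neighbouring sentences drops below the threshold.
// Index i means "a new chunk starts at sentence i".
function findBoundaries(embeddings: number[][], threshold = 0.7): number[] {
  const boundaries: number[] = [];
  for (let i = 1; i < embeddings.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) boundaries.push(i);
  }
  return boundaries;
}
```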

Map-reduce (long doc)

// Map: summarize each chunk
const summaries = await Promise.all(chunks.map(chunk => 
  llm.summarize(chunk)
));

// Reduce: combine the summaries
const final = await llm.complete({
  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
});

→ Distributed (parallel) processing.

Refine (iterative)

let answer = '';
for (const chunk of chunks) {
  answer = await llm.complete({
    system: `Refine the answer based on new info.\nCurrent: ${answer}`,
    user: `New info: ${chunk}\nQuestion: ${query}`,
  });
}

→ Incremental improvement.

Context window calculation

import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');

// tiktoken implements OpenAI tokenizers; counts for other providers are approximate
function countTokens(text: string): number {
  return enc.encode(text).length;
}

function fitsInContext(text: string, max: number): boolean {
  return countTokens(text) <= max;
}

// Each model has a different budget
const BUDGETS = {
  'gpt-4o': 128_000 - 16_000,  // 16K reserved for output
  'claude-opus-4-7': 200_000 - 16_000,
  'gemini-2.5-pro': 1_000_000 - 64_000,
};

Cost estimation

function estimateCost(tokens: number, model: string): number {
  const rates: Record<string, [number, number]> = {
    'gpt-4o': [2.5, 10],  // $/1M (input, output)
    'claude-opus-4-7': [15, 75],
    'gemini-2.5-pro': [2.5, 15],
  };
  const rate = rates[model];
  if (!rate) throw new Error(`Unknown model: ${model}`);
  const [inputRate] = rate;  // input-side estimate only; the output rate is not used here
  return (tokens / 1_000_000) * inputRate;
}

// 1M tokens × Claude = $15 input
// → Cache reads cut that by about 90%
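
The 90% figure applies to cache reads only, so the overall saving depends on how many calls reuse the prefix. The arithmetic can be sketched as below. The multipliers in `examplePricing` are assumptions for illustration (a write premium and a read discount relative to the base input rate) and should be checked against the provider's current pricing page. All names are hypothetical.

```typescript
interface CachePricing {
  inputRatePerMTok: number;   // base input price in $ per 1M tokens
  writeMultiplier: number;    // cache write price relative to base input
  readMultiplier: number;     // cache read price relative to base input
}

// Cost of sending the same prefix `calls` times without caching.
function uncachedCost(prefixTokens: number, calls: number, p: CachePricing): number {
  return (prefixTokens / 1_000_000) * p.inputRatePerMTok * calls;
}

// Cost with caching: one cache write, then (calls - 1) cache reads.
// Assumes every later call hits the cache before the TTL expires.
function cachedCost(prefixTokens: number, calls: number, p: CachePricing): number {
  if (calls <= 0) return 0;
  const base = (prefixTokens / 1_000_000) * p.inputRatePerMTok;
  return base * p.writeMultiplier + base * p.readMultiplier * (calls - 1);
}

// Example values are assumptions for illustration only.
const examplePricing: CachePricing = {
  inputRatePerMTok: 15,
  writeMultiplier: 2,    // assumed premium for a 1-hour cache write
  readMultiplier: 0.1,   // assumed cache read discount (the "90%" figure)
};

const withoutCache = uncachedCost(200_000, 10, examplePricing);
const withCache = cachedCost(200_000, 10, examplePricing);
console.log(withoutCache.toFixed(2), withCache.toFixed(2));
```

Under these assumed numbers, ten calls over a 200K-token prefix come to about $30 uncached versus about $8.70 cached, a saving of roughly 71% rather than 90%. With a single call, caching costs more than not caching because of the write premium.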

Long context use case

✅ Analyzing one large document (book, codebase, log)
✅ Code review (whole file)
✅ Document Q&A (single doc)
✅ Comparison (multi doc)

⚠️ High latency (1M tokens = 30s+)
⚠️ High cost
⚠️ Lost in the middle

Long context vs RAG

Long context:
+ Simple: inject everything
+ Precise (nothing is cherry-picked)
- Expensive
- Slow
- Lost in the middle

RAG:
+ Fast
+ Cheap
+ Scales (large corpus)
- Depends on retrieval quality
- Wrong chunk = wrong answer

→ Mix the two depending on the situation.

Hybrid

async function answer(query: string, document: string) {
  if (countTokens(document) < 50_000) {
    // Small enough — direct
    return await llm.complete({ context: document, query });
  } else {
    // Large — RAG first
    const chunks = await chunkAndEmbed(document);  // embedding calls are typically async
    const relevant = await semanticSearch(query, chunks, 10);
    return await llm.complete({ context: relevant.join('\n'), query });
  }
}

Streaming + long context

// Long context = large input, but the output can still be streamed
const stream = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [...],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}

Eval (long context)

- Needle in a haystack: one fact placed at N positions; measure accuracy
- Multi-needle: several facts
- Reasoning across: connecting facts from different chunks
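
The first two evaluation types above could be scored along these lines. This is a sketch only: it assumes answers are checked by case-insensitive substring match, which is a crude proxy for correctness, and the names `needleRecall` and `accuracyByDepth` are hypothetical.

```typescript
interface NeedleResult { depth: number; correct: boolean; }

// Multi-needle: fraction of expected facts that appear in the model's answer.
function needleRecall(answer: string, expectedFacts: string[]): number {
  if (expectedFacts.length === 0) return 1;
  const haystack = answer.toLowerCase();
  const found = expectedFacts.filter((f) => haystack.includes(f.toLowerCase())).length;
  return found / expectedFacts.length;
}

// Single-needle: aggregate per-run results into accuracy for each depth.
function accuracyByDepth(results: NeedleResult[]): Map<number, number> {
  const totals = new Map<number, { correct: number; total: number }>();
  for (const r of results) {
    const entry = totals.get(r.depth) ?? { correct: 0, total: 0 };
    entry.total += 1;
    if (r.correct) entry.correct += 1;
    totals.set(r.depth, entry);
  }
  const accuracy = new Map<number, number>();
  totals.forEach((value, depth) => {
    accuracy.set(depth, value.correct / value.total);
  });
  return accuracy;
}
```

Cross-chunk reasoning is harder to score mechanically and usually needs a reference answer or a grader model, so it is left out of this sketch.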

Token budget allocation

const TOTAL = 128_000;
const RESPONSE = 16_000;
const SYSTEM = 2_000;
const HISTORY = 30_000;
const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY;

// If the document is larger than CONTEXT: chunk + retrieve
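
The allocation above can be turned into a small decision helper. A minimal sketch, assuming a fixed chunk size and hypothetical names (`Budget`, `contextBudget`, `planContext`):

```typescript
interface Budget { total: number; response: number; system: number; history: number; }

// Tokens left for document context after the fixed reservations.
function contextBudget(b: Budget): number {
  return Math.max(0, b.total - b.response - b.system - b.history);
}

type Plan = { mode: 'direct' } | { mode: 'retrieve'; maxChunks: number };

// Decide whether the document fits directly, or how many chunks of a
// given size can be retrieved into the remaining budget.
function planContext(docTokens: number, chunkTokens: number, b: Budget): Plan {
  if (chunkTokens <= 0) throw new Error('chunkTokens must be positive');
  const available = contextBudget(b);
  if (docTokens <= available) return { mode: 'direct' };
  return { mode: 'retrieve', maxChunks: Math.floor(available / chunkTokens) };
}

// Same numbers as the constants above: 128K total leaves 80K for context.
const budget: Budget = { total: 128_000, response: 16_000, system: 2_000, history: 30_000 };
console.log(contextBudget(budget), planContext(500_000, 1_000, budget));
```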

Continual chat

class ChatSession {
  private messages: Message[] = [];
  private maxTokens = 100_000;
  
  async send(userMsg: string) {
    this.messages.push({ role: 'user', content: userMsg });
    
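    // Condense if the history exceeds the budget (countTokens takes a string, so sum per message)
    const total = this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
    if (total > this.maxTokens) {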
      this.messages = await condenseHistory(this.messages);
    }
    
    const r = await llm.complete({ messages: this.messages });
    this.messages.push({ role: 'assistant', content: r });
    return r;
  }
}

🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K) | Direct + cache |
| Large (200K+) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Single doc, in depth | Direct (long context) |
| Long conversation | Sliding + summarize |
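
The table above could be encoded as a lookup function. The sketch below is one possible reading: the order in which overlapping rows are checked (conversation first, then multiple documents, then in-depth single-document analysis, then size) is an assumption, since the table itself does not rank them. The `Situation` shape and `recommend` name are hypothetical.

```typescript
type Recommendation =
  | 'direct'
  | 'direct + cache'
  | 'rag'
  | 'direct (long context)'
  | 'sliding window + summarize';

interface Situation {
  docTokens: number;
  multipleDocs?: boolean;
  deepSingleDoc?: boolean;      // in-depth analysis of one document
  longConversation?: boolean;
}

// Precedence between overlapping rows is an assumption of this sketch.
function recommend(s: Situation): Recommendation {
  if (s.longConversation) return 'sliding window + summarize';
  if (s.multipleDocs) return 'rag';
  if (s.deepSingleDoc) return 'direct (long context)';
  if (s.docTokens < 30_000) return 'direct';
  if (s.docTokens <= 200_000) return 'direct + cache';
  return 'rag';
}
```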

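Anti-patterns

  • Injecting everything and assuming the context is used perfectly: lost in the middle.
  • Critical info in the middle: move it to the end.
  • No cache while the same context is sent repeatedly: wasted cost.
  • Unbounded history: token explosion.
  • Treating RAG vs. long context as either/or: use a hybrid.
  • Chunking that cuts through sentences: meaning is lost.
  • Ignoring token counts: errors / cost shock.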

🤖 LLM usage hints

  • Lost in the middle: place key content near the end.
  • Cache large contexts.
  • RAG + long context = best.
  • Measure token counts in advance with tiktoken.

🔗 Related documents