| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-long-context-management | Long Context: 1M+ token usage / Compression / Chunking | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Long Context Management
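Models with 1M+ token context windows exist (Gemini, Claude). However, they suffer from "lost in the middle": the start and the end of the context receive the most attention. RAG, compression, and hierarchical approaches therefore remain valuable.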
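📖 Core Concepts
- Context window: 1M+ (Gemini 2.5 Pro), 200K (Claude Opus).
- Lost in the middle: tokens in the middle of the context are the most likely to be overlooked.
- Recency bias: content near the end has the most influence.
- Token cost: a larger context means a higher cost.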
💻 Code Patterns
Long context model (2026)
Gemini 2.5 Pro: 1M+ tokens
Claude Opus 4.7: 1M tokens
GPT-4.1: 1M tokens
Llama 3.3: 128K tokens
→ Enough to hold a whole book plus a large codebase.
Lost in the middle
Test:
"Somewhere in this document there is an 'X'. What is 'X'?"
Accuracy by position of 'X' in the context:
- Start: 95%
- 25%: 75%
- 50%: 60%
- 75%: 80%
- End: 95%
→ Data placed in the middle is used poorly.
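The position test above can be reproduced by planting a known sentence at a chosen depth in filler text. The following is a minimal sketch; `buildNeedlePrompt` and `DEPTHS` are hypothetical names, and sending the prompt to a model and grading the reply are left out.

```typescript
// Hypothetical helper: place a "needle" sentence at a relative depth
// (0 = start, 1 = end) inside filler paragraphs and build the test prompt.
function buildNeedlePrompt(
  filler: string[],
  needle: string,
  depth: number,
  question: string,
): string {
  if (depth < 0 || depth > 1) {
    throw new RangeError('depth must be between 0 and 1');
  }
  // Index at which the needle is inserted among the filler paragraphs.
  const index = Math.round(depth * filler.length);
  const paragraphs = [...filler.slice(0, index), needle, ...filler.slice(index)];
  return `${paragraphs.join('\n\n')}\n\nQuestion: ${question}`;
}

// The five depths listed in the test above.
const DEPTHS = [0, 0.25, 0.5, 0.75, 1];
```

Running the same question once per entry in `DEPTHS` gives one accuracy figure per position, which is how a curve like the one above would be measured.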
Strategy 1: Put important data at the end
const messages = [
{ role: 'system', content: SYSTEM_PROMPT },
{ role: 'user', content: `
${largeContext}
# Recent / important context
${importantStuff}
# Question
${userQuery}
` },
];
→ The model attends more to the end of the prompt.
Strategy 2: Retrieval + small context
A full long context (1M tokens) is consistently expensive and still loses information.
RAG (about 5K tokens of relevant chunks) is often better.
→ Relevance matters more than length.
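One way to act on "relevance over length" is to keep only the highest-scoring chunks that fit a small token budget. This is a sketch under assumptions: `ScoredChunk` and `selectChunks` are hypothetical names, and the relevance scores and token counts are expected to come from whatever retriever and tokenizer are already in use.

```typescript
interface ScoredChunk {
  text: string;
  score: number;  // relevance to the query, higher is better
  tokens: number; // precomputed token count
}

// Greedily keep the most relevant chunks that fit the token budget,
// then restore original document order so the prompt reads naturally.
function selectChunks(chunks: ScoredChunk[], budget: number): ScoredChunk[] {
  const ranked = chunks
    .map((chunk, position) => ({ chunk, position }))
    .sort((a, b) => b.chunk.score - a.chunk.score);

  const picked: { chunk: ScoredChunk; position: number }[] = [];
  let used = 0;
  for (const item of ranked) {
    if (used + item.chunk.tokens > budget) continue; // skip chunks that do not fit
    picked.push(item);
    used += item.chunk.tokens;
  }
  return picked.sort((a, b) => a.position - b.position).map((p) => p.chunk);
}
```

The greedy pass is a simplification; it does not search for the best possible combination, only a reasonable one.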
Strategy 3: Hierarchical
1. Summarize each chunk (with a small LLM)
2. Use the summaries as the context
3. Request a specific chunk when needed
[chunk 1 summary] [chunk 2 summary] ... [chunk 100 summary]
↓
"Need detail of chunk 47" → fetch full
→ A way to navigate a long document.
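The two-level lookup above can be sketched as a small index. `HierarchicalIndex`, `ChunkEntry`, and `requestedChunkId` are hypothetical names, the summaries are assumed to have been generated beforehand, and the reply format "chunk N" is an assumption about how the model is prompted to ask for detail.

```typescript
interface ChunkEntry {
  id: number;
  summary: string; // assumed to be produced beforehand by a small model
  full: string;
}

class HierarchicalIndex {
  private readonly byId = new Map<number, ChunkEntry>();

  constructor(entries: ChunkEntry[]) {
    for (const entry of entries) this.byId.set(entry.id, entry);
  }

  // Level 1: compact overview that goes into the prompt.
  overview(): string {
    return Array.from(this.byId.values())
      .map((e) => `[chunk ${e.id}] ${e.summary}`)
      .join('\n');
  }

  // Level 2: full text, fetched only when the model asks for a chunk.
  detail(id: number): string {
    const entry = this.byId.get(id);
    if (!entry) throw new Error(`Unknown chunk id: ${id}`);
    return entry.full;
  }
}

// Parse a model reply such as "Need detail of chunk 47" into a chunk id.
function requestedChunkId(reply: string): number | null {
  const match = /chunk\s+(\d+)/i.exec(reply);
  return match ? Number(match[1]) : null;
}
```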
Strategy 4: Multi-step
// `llm.analyze`, `llm.identify`, and `llm.complete` are placeholder client methods, not a real SDK
// Step 1: Understand the question
const questionType = await llm.analyze(query);
// Step 2: Find the relevant sections (small model)
const sections = await llm.identify(largeDoc, questionType);
// Step 3: Detailed answer (big model)
const answer = await llm.complete({
  context: sections,
  query,
});
→ Separates retrieval from reasoning.
Strategy 5: Compression
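// LLMLingua / LongLLMLingua (prompt compression)
// Original: 10K tokens
// Compressed: ~3K tokens at ratio 0.3 (key information only)
// NOTE: `llmlingua-js` and this `compress` signature are placeholders;
// the reference LLMLingua implementation is a Python package.
import { compress } from 'llmlingua-js';
const compressed = await compress(longText, { ratio: 0.3 });
→ Cuts about 70% of the tokens while aiming to keep accuracy.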
Sliding window (chat history)
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  if (messages.length === 0) return [];
  // Decide once whether the first message is a system prompt
  const hasSystem = messages[0].role === 'system';
  const head = hasSystem ? [messages[0]] : [];
  let total = hasSystem ? countTokens(messages[0].content) : 0;
  // Walk backwards so the most recent messages are kept first
  const tail: Message[] = [];
  for (let i = messages.length - 1; i >= head.length; i--) {
    const tokens = countTokens(messages[i].content);
    if (total + tokens > maxTokens) break;
    total += tokens;
    tail.unshift(messages[i]); // restore chronological order
  }
  return [...head, ...tail];
}
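Summarizing old messages
async function condenseHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length < 20) return messages;
  // Keep the original system prompt out of the summary so it is not lost
  const hasSystem = messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  const body = hasSystem ? messages.slice(1) : messages;
  const old = body.slice(0, -10);
  const recent = body.slice(-10);
  // `llm.complete` is a placeholder client call assumed to return the summary text
  const summary = await llm.complete({
    system: 'Summarize this conversation in 200 words. Keep key facts.',
    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
  });
  return [
    ...system,
    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
    ...recent,
  ];
}
→ The conversation stays within the context window.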
Caching (Anthropic)
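// The same large context is sent repeatedly → cache it
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  system: [
    {
      type: 'text',
      text: hugeDoc, // 200K tokens
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: question }],
});
→ Cache reads are billed at roughly 10% of the base input price, so follow-up calls cost about 90% less for the cached portion. The first call pays a premium to write the cache.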
Chunking strategy
Fixed size: simple, but cuts through meaning.
Sentence: natural.
Paragraph: a unit of meaning.
Section (heading): a large boundary.
Semantic: an LLM decides the boundary.
→ Prefer the most meaningful boundary available.
function smartChunk(doc: string, maxTokens = 1000): string[] {
  // Split at markdown "## " headers first; the lookahead keeps each header with its section
  const sections = doc.split(/\n(?=##\s+)/);
  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Too large: split further by paragraph (`splitByParagraph` is an assumed helper)
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
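`smartChunk` calls `splitByParagraph` without defining it. The following is one possible sketch of that helper, not a canonical version. The `approxTokens` counter (one token per whitespace-separated word) is a stand-in assumption so the block runs on its own; a real tokenizer can be passed as the third argument.

```typescript
// Rough token estimate (assumption: ~1 token per whitespace-separated word).
// Swap in a real tokenizer for production use.
function approxTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Pack consecutive paragraphs into chunks of at most maxTokens.
// A single paragraph larger than maxTokens becomes its own oversized chunk.
function splitByParagraph(
  section: string,
  maxTokens: number,
  count: (text: string) => number = approxTokens,
): string[] {
  const paragraphs = section.split(/\n\s*\n/).filter((p) => p.trim() !== '');
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const paragraph of paragraphs) {
    const tokens = count(paragraph);
    // Flush the running chunk before it would exceed the budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current.join('\n\n'));
      current = [];
      currentTokens = 0;
    }
    current.push(paragraph);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```

Oversized single paragraphs are passed through rather than cut, on the assumption that a sentence-level splitter would handle them in a later pass.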
Semantic chunking
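async function semanticChunk(text: string): Promise<string[]> {
  // Split after sentence-ending punctuation, keeping the punctuation
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  if (sentences.length === 0) return [];
  // `embed` and `cosine` are assumed helpers (embedding call, cosine similarity)
  const embeddings = await Promise.all(sentences.map(s => embed(s)));
  const chunks: string[] = [];
  let current: string[] = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    const sim = cosine(embeddings[i - 1], embeddings[i]);
    if (sim < 0.7) {
      // Similarity drop: start a new chunk
      chunks.push(current.join(' '));
      current = [sentences[i]];
    } else {
      current.push(sentences[i]);
    }
  }
  chunks.push(current.join(' '));
  return chunks;
}
→ A shift in meaning marks a chunk boundary.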
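`semanticChunk` depends on a `cosine` helper that is not shown. A minimal sketch follows, together with a hypothetical `findBoundaries` function that isolates the boundary rule so it can be checked against precomputed vectors without calling an embedding API. The 0.7 default mirrors the threshold used above and would need tuning per embedding model.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors are treated as dissimilar
}

// Indices i where a new chunk starts because sentence i is
// too dissimilar from sentence i - 1.
function findBoundaries(embeddings: number[][], threshold = 0.7): number[] {
  const boundaries: number[] = [];
  for (let i = 1; i < embeddings.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) boundaries.push(i);
  }
  return boundaries;
}
```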
Map-reduce (long doc)
// Map: summarize each chunk (`llm.summarize` is a placeholder client method)
const summaries = await Promise.all(chunks.map(chunk =>
  llm.summarize(chunk)
));
// Reduce: combine the summaries
const final = await llm.complete({
  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
});
→ Chunks are processed in parallel.
Refine (iterative)
let answer = '';
for (const chunk of chunks) {
answer = await llm.complete({
system: `Refine the answer based on new info.\nCurrent: ${answer}`,
user: `New info: ${chunk}\nQuestion: ${query}`,
});
}
→ The answer improves incrementally.
Context window calculation
import { encoding_for_model } from 'tiktoken';
// tiktoken implements OpenAI tokenizers; counts for Claude or Gemini are only estimates
const enc = encoding_for_model('gpt-4o');
function countTokens(text: string): number {
  return enc.encode(text).length;
}
function fitsInContext(text: string, max: number): boolean {
  return countTokens(text) <= max;
}
// Each model has a different budget
const BUDGETS = {
'gpt-4o': 128_000 - 16_000, // 16K reserved for output
'claude-opus-4-7': 200_000 - 16_000,
'gemini-2.5-pro': 1_000_000 - 64_000,
};
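Cost estimation
function estimateCost(inputTokens: number, model: string, outputTokens = 0): number {
  const rates: Record<string, [number, number]> = {
    'gpt-4o': [2.5, 10], // $/1M (input, output)
    'claude-opus-4-7': [15, 75],
    'gemini-2.5-pro': [2.5, 15], // long-prompt (>200K) tier
  };
  const rate = rates[model];
  if (!rate) throw new Error(`Unknown model: ${model}`);
  const [input, output] = rate;
  return (inputTokens / 1_000_000) * input + (outputTokens / 1_000_000) * output;
}
// 1M input tokens × Claude = $15
// → Caching cuts the cost of repeated input by up to 90%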
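The 90% figure applies to cache reads only, so the saving over a whole session depends on how many calls reuse the cached prefix. The arithmetic below is a sketch. The multipliers (2x base input to write a 1-hour cache entry, 0.1x to read it) and the $15 per 1M rate are assumptions to be checked against the provider's current pricing page, and all names are hypothetical.

```typescript
interface CachePricing {
  inputPerMTok: number;    // base input price, $ per 1M tokens
  writeMultiplier: number; // cache write price relative to base input
  readMultiplier: number;  // cache read price relative to base input
}

// Input-side cost of sending the same context `calls` times without caching.
function costWithoutCache(tokens: number, calls: number, p: CachePricing): number {
  return (tokens / 1_000_000) * p.inputPerMTok * calls;
}

// First call writes the cache; the remaining calls read from it.
// Assumes every later call arrives before the cache entry expires.
function costWithCache(tokens: number, calls: number, p: CachePricing): number {
  if (calls <= 0) return 0;
  const base = (tokens / 1_000_000) * p.inputPerMTok;
  return base * p.writeMultiplier + base * p.readMultiplier * (calls - 1);
}

// Assumed figures: $15 per 1M input, 2x for a 1-hour cache write, 0.1x for reads.
const pricing: CachePricing = { inputPerMTok: 15, writeMultiplier: 2, readMultiplier: 0.1 };
const plain = costWithoutCache(200_000, 10, pricing); // 0.2 * 15 * 10 = $30
const cached = costWithCache(200_000, 10, pricing);   // 3 * 2 + 3 * 0.1 * 9 = $8.70
```

Under these assumptions, ten calls over a 200K-token document save about 71% rather than 90%. The saving approaches 90% only as the number of cache hits grows.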
Long context use cases
✅ Analyzing one large document (book, codebase, log)
✅ Code review (whole file)
✅ Document Q&A (single doc)
✅ Comparison (multi doc)
⚠️ High latency (1M tokens = 30s+)
⚠️ High cost
⚠️ Lost in the middle
Long context vs RAG
Long context:
+ Simple: inject everything
+ Precise (nothing is cherry-picked out)
- Expensive
- Slow
- Lost in the middle
RAG:
+ Fast
+ Cheap
+ Scales (large corpus)
- Depends on retrieval quality
- Wrong chunk = wrong answer
→ Mix the two depending on the situation.
Hybrid
// `llm.complete`, `chunkAndEmbed`, and `semanticSearch` are assumed helpers
async function answer(query: string, document: string) {
  if (countTokens(document) < 50_000) {
    // Small enough: pass the document directly
    return await llm.complete({ context: document, query });
  } else {
    // Large: retrieve first (RAG)
    const chunks = await chunkAndEmbed(document);
    const relevant = await semanticSearch(query, chunks, 10);
    return await llm.complete({ context: relevant.join('\n'), query });
  }
}
Streaming + long context
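// Long context means a large input, but the output can still be streamed
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const stream = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [...],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}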
Eval (long context)
- Needle in a haystack: place one fact at N positions and measure accuracy
- Multi-needle: several facts at once
- Reasoning across chunks: connect facts that sit in different chunks
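For the multi-needle case, a simple score is the fraction of planted facts that show up in the answer. The sketch below uses a hypothetical `needleRecall` function with case-insensitive substring matching, which is an assumption that only works when needles are short literal values such as codes or names.

```typescript
// Fraction of expected facts ("needles") that appear in the model's answer.
// Matching is a case-insensitive substring check, which is a simplification:
// paraphrased answers would need a fuzzier or model-based judge.
function needleRecall(answer: string, needles: string[]): number {
  if (needles.length === 0) return 1; // nothing to find counts as full recall
  const haystack = answer.toLowerCase();
  const found = needles.filter((n) => haystack.includes(n.toLowerCase()));
  return found.length / needles.length;
}
```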
Token budget allocation
const TOTAL = 128_000;
const RESPONSE = 16_000;
const SYSTEM = 2_000;
const HISTORY = 30_000;
const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY;
// If the document is larger than CONTEXT: chunk + retrieve
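The same subtraction can be wrapped in a function that also guards against over-allocation and makes the inject-or-retrieve decision explicit. This is a sketch; `Budget`, `contextBudget`, and `planDocument` are hypothetical names.

```typescript
interface Budget {
  total: number;    // model context window
  response: number; // reserved for the model's output
  system: number;   // system prompt
  history: number;  // chat history
}

// Tokens left for retrieved or injected document context.
function contextBudget(b: Budget): number {
  const remaining = b.total - b.response - b.system - b.history;
  if (remaining <= 0) {
    throw new RangeError('Reserved tokens exceed the context window');
  }
  return remaining;
}

// Decide whether a document can be injected whole or must be chunked.
function planDocument(docTokens: number, b: Budget): 'inject' | 'chunk-and-retrieve' {
  return docTokens <= contextBudget(b) ? 'inject' : 'chunk-and-retrieve';
}
```

With the constants above (128K total, 16K response, 2K system, 30K history), 80K tokens remain for document context.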
Continual chat
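class ChatSession {
  private messages: Message[] = [];
  private maxTokens = 100_000;
  // Total tokens across all messages in the session
  private totalTokens(): number {
    return this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
  }
  async send(userMsg: string) {
    this.messages.push({ role: 'user', content: userMsg });
    // Condense if over budget
    if (this.totalTokens() > this.maxTokens) {
      this.messages = await condenseHistory(this.messages);
    }
    // `llm.complete` is a placeholder client call assumed to return the reply text
    const r = await llm.complete({ messages: this.messages });
    this.messages.push({ role: 'assistant', content: r });
    return r;
  }
}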
🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K) | Direct + cache |
| Large (200K+) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Single doc, in depth | Direct (long context) |
| Long conversation | Sliding window + summarize |
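The table can be encoded as a small routing function. The sketch below is one possible reading: the table does not say which row wins when several apply, so the precedence (conversation first, then multiple docs, then the in-depth flag, then size) is an assumption, as are all the names.

```typescript
type Strategy = 'direct' | 'direct+cache' | 'rag' | 'sliding+summarize';

interface Workload {
  tokens: number;           // total tokens of the material to consult
  documentCount: number;    // how many separate documents
  isConversation?: boolean; // true for long-running chat history
  deepRead?: boolean;       // true when one document must be read in depth
}

function chooseStrategy(w: Workload): Strategy {
  if (w.isConversation) return 'sliding+summarize';
  if (w.documentCount > 1) return 'rag';
  if (w.deepRead) return 'direct'; // single doc, in depth
  if (w.tokens < 30_000) return 'direct';
  if (w.tokens <= 200_000) return 'direct+cache';
  return 'rag';
}
```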
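❌ Anti-patterns
- Injecting everything and assuming the context is used perfectly: lost in the middle.
- Critical info in the middle: move it to the end.
- Repeating the same context without caching: wasted cost.
- Unbounded history: token explosion.
- Treating RAG vs long context as either/or: use a hybrid.
- Chunking that cuts mid-sentence: loses meaning.
- Ignoring token counts: errors / cost shock.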
🤖 LLM Usage Hints
- Lost in the middle: put key content near the end.
- Cache large contexts.
- RAG + long context = best.
- Measure token counts in advance with tiktoken.