[G1-Sync] Manual knowledge update
@@ -0,0 +1,436 @@
---
id: ai-long-context-management
title: "Long Context: 1M+ Token Usage / Compression / Chunking"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, context, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [long context, context window, lost in the middle, recency bias, compression]
---

# Long Context Management

> Models with 1M+ token windows exist (Gemini, Claude). **However, "lost in the middle" still applies: the start and the end of the context get the most attention.** RAG, compression, and hierarchical approaches remain valuable.
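
## 📖 Key Concepts
- Context window: 1M+ (Gemini 2.5 Pro), 200K (Claude Opus standard window; larger windows depend on model version and tier).
- Lost in the middle: tokens in the middle of the context are the most likely to be overlooked.
- Recency bias: content near the end has the strongest influence.
- Token cost: a larger context means a larger bill.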

## 💻 Code Patterns

### Long context models (2026)
```
Gemini 2.5 Pro:  1M tokens
Claude Opus 4.7: 1M tokens
GPT-4.1:         1M tokens
Llama 3.3:       128K tokens
```

→ Enough to fit a whole book or a large codebase in one prompt. Limits vary by tier and change often, so confirm them against vendor documentation.

### Lost in the middle
```
Test:
"Somewhere in this document there is an 'X'. What is 'X'?"

Accuracy by position (example figures):
- Start: 95%
- 25%:   75%
- 50%:   60%
- 75%:   80%
- End:   95%
```

→ Data placed in the middle is used less reliably.
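
A position test like the one above can be set up with a small prompt builder. This is a minimal sketch: `placeNeedle`, `buildProbes`, and `DEPTHS` are illustrative names, and sending each prompt to a model and grading the reply is left out.

```ts
/** Insert `needle` into the filler sentences at a relative depth in [0, 1]. */
function placeNeedle(filler: string[], needle: string, depth: number): string {
  const clamped = Math.min(1, Math.max(0, depth));
  const index = Math.round(clamped * filler.length);
  return [...filler.slice(0, index), needle, ...filler.slice(index)].join(' ');
}

// Depths matching the positions listed above: start, 25%, 50%, 75%, end.
const DEPTHS = [0, 0.25, 0.5, 0.75, 1];

/** One prompt per depth; every prompt is identical except for where the needle sits. */
function buildProbes(filler: string[], needle: string, question: string) {
  return DEPTHS.map(depth => ({
    depth,
    prompt: `${placeNeedle(filler, needle, depth)}\n\nQuestion: ${question}`,
  }));
}
```

Running the same question against each probe and comparing accuracy per depth shows whether a given model actually has a mid-context dip on your data.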

### Strategy 1: Put important data at the end
```ts
// Bulk reference material goes first; key facts and the question go last.
const messages = [
  { role: 'system', content: SYSTEM_PROMPT },
  { role: 'user', content: `
${largeContext}

# Recent / important context
${importantStuff}

# Question
${userQuery}
` },
];
```

→ The model attends more to the end of the prompt.

### Strategy 2: Retrieval + small context
```
Long context (1M) is consistently expensive and prone to losing details.
RAG (about 5K tokens of relevant chunks) often works better.

→ Relevance matters more than length.
```
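
To make "relevance over length" concrete, here is a minimal retrieval sketch. It scores chunks by keyword overlap, which is a deliberately crude stand-in for embedding similarity; `tokenize`, `overlapScore`, and `topK` are illustrative names, and the ASCII-only word regex is an assumption that would need changing for other scripts.

```ts
interface ScoredChunk { chunk: string; score: number }

/** Unique lowercase ASCII words; a crude stand-in for an embedding. */
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.filter((w, i) => words.indexOf(w) === i);
}

/** Score = fraction of query words that also appear in the chunk. */
function overlapScore(query: string, chunk: string): number {
  const q = tokenize(query);
  if (q.length === 0) return 0;
  const c = tokenize(chunk);
  const hits = q.filter(w => c.indexOf(w) !== -1).length;
  return hits / q.length;
}

/** Keep only the k best-matching chunks; zero-score chunks are dropped. */
function topK(query: string, chunks: string[], k: number): ScoredChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: overlapScore(query, chunk) }))
    .filter(s => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Only the chunks returned by `topK` go into the prompt, so the model sees a few thousand relevant tokens instead of the whole corpus.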

### Strategy 3: Hierarchical
```
1. Summarize each chunk (small LLM)
2. Use the summaries as the context
3. Request a specific chunk when needed

[chunk 1 summary] [chunk 2 summary] ... [chunk 100 summary]
        ↓
"Need detail of chunk 47" → fetch full
```

→ Navigation for long documents.
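
The three steps above can be sketched as a small summary index. The names (`IndexedChunk`, `buildIndex`, `renderSummaries`, `fetchFull`) are illustrative, and `firstSentence` is only a placeholder for the small-LLM summarizer so that the example runs without a model.

```ts
interface IndexedChunk { id: number; summary: string; full: string }

/** Placeholder summarizer: first sentence only. Swap in a small-LLM call in practice. */
function firstSentence(text: string): string {
  const match = text.match(/^[^.!?]*[.!?]/);
  return match ? match[0].trim() : text.trim();
}

/** Step 1: summarize every chunk and remember the full text under a 1-based id. */
function buildIndex(
  chunks: string[],
  summarize: (t: string) => string = firstSentence,
): IndexedChunk[] {
  return chunks.map((full, i) => ({ id: i + 1, summary: summarize(full), full }));
}

/** Step 2: the compact "table of contents" that goes into the prompt. */
function renderSummaries(index: IndexedChunk[]): string {
  return index.map(c => `[chunk ${c.id}] ${c.summary}`).join('\n');
}

/** Step 3: resolve a "need detail of chunk N" request from the model. */
function fetchFull(index: IndexedChunk[], id: number): string | undefined {
  for (const c of index) {
    if (c.id === id) return c.full;
  }
  return undefined;
}
```

How the model signals which chunk it wants (a tool call, a structured reply, a regex over free text) is a separate design choice not covered here.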

### Strategy 4: Multi-step
```ts
// `llm` is an illustrative client wrapper, not a specific SDK.

// Step 1: Understand the question
const questionType = await llm.analyze(query);

// Step 2: Find the relevant sections (small model)
const sections = await llm.identify(largeDoc, questionType);

// Step 3: Produce the detailed answer (big model)
const answer = await llm.complete({
  context: sections,
  query,
});
```

→ Separates retrieval from reasoning.
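
### Strategy 5: Compression
```ts
// LLMLingua / LongLLMLingua (prompt compression)
// Original:   10K tokens
// Compressed: about 3K tokens at ratio 0.3 (key information only)

// NOTE: 'llmlingua-js' and this compress() signature are placeholders.
// The reference LLMLingua implementation is a Python package, so verify
// the binding you actually use before copying this call.
import { compress } from 'llmlingua-js';
const compressed = await compress(longText, { ratio: 0.3 });
```

→ Cuts roughly 70% of tokens. Accuracy is often largely preserved, but measure it on your own workload.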

### Sliding window (chat history)
```ts
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  const result: Message[] = [];
  let total = 0;

  // Always keep the system message, if there is one
  const hasSystem = messages[0]?.role === 'system';
  if (hasSystem) {
    result.push(messages[0]);
    total += countTokens(messages[0].content);
  }
  const firstIndex = hasSystem ? 1 : 0;

  // Walk backwards from the newest message until the budget is used up
  for (let i = messages.length - 1; i >= firstIndex; i--) {
    const tokens = countTokens(messages[i].content);
    if (total + tokens > maxTokens) break;
    total += tokens;
    // Insert right after the system message so chronological order is preserved
    result.splice(firstIndex, 0, messages[i]);
  }

  return result;
}
```

### Summarizing old messages
```ts
async function condenseHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length < 20) return messages;

  // Keep the original system prompt out of the summary so it is not lost
  const system = messages[0].role === 'system' ? [messages[0]] : [];
  const old = messages.slice(system.length, -10);
  const recent = messages.slice(-10);

  const summary = await llm.complete({
    system: 'Summarize this conversation in 200 words. Keep key facts.',
    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
  });

  return [
    ...system,
    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
    ...recent,
  ];
}
```

→ Keeps the conversation within the context window.

### Caching (Anthropic)
```ts
// The large context is often identical across calls → cache it
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  system: [
    {
      type: 'text',
      text: hugeDoc, // about 200K tokens
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: question }],
});
```

→ About 90% lower input cost on follow-up calls that hit the cache.

→ See [[AI_Prompt_Caching]].
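
Caching is not free on the first call, so it helps to check the break-even point. The sketch below assumes a cache write costs 2× the base input price (1-hour TTL) and a cache read costs 0.1×; these multipliers and the $15 per 1M token base rate are assumptions to verify against current pricing, and `CachePricing`, `uncachedCost`, and `cachedCost` are illustrative names.

```ts
interface CachePricing {
  inputPerMTok: number;    // base input price, $ per 1M tokens
  writeMultiplier: number; // cache write premium (assumed 2 for a 1h TTL)
  readMultiplier: number;  // cache read discount (assumed 0.1)
}

/** Cost of sending the same prefix on every call with no caching. */
function uncachedCost(prefixTokens: number, calls: number, p: CachePricing): number {
  return (prefixTokens / 1_000_000) * p.inputPerMTok * calls;
}

/** Cost when the first call writes the cache and the rest read it (no expiry in between). */
function cachedCost(prefixTokens: number, calls: number, p: CachePricing): number {
  if (calls <= 0) return 0;
  const base = (prefixTokens / 1_000_000) * p.inputPerMTok;
  return base * p.writeMultiplier + base * p.readMultiplier * (calls - 1);
}
```

With these assumed numbers, a 200K-token prefix costs $3.00 per uncached call, while caching costs $6.00 on the first call and $0.30 on each later one, so caching pays off from the third call onward.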

### Chunking strategy
```
Fixed size:        simple, but cuts through meaning.
Sentence:          natural.
Paragraph:         a unit of meaning.
Section (heading): large boundary.
Semantic:          a model decides the boundary.

→ Choose the most meaningful boundary available.
```

```ts
function smartChunk(doc: string, maxTokens = 1000): string[] {
  // Split at level-2 markdown headings first; the lookahead keeps the "## " marker
  const sections = doc.split(/\n(?=##\s)/);

  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Too large: split further by paragraph
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
```
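
`splitByParagraph` is called above but not defined in this note. One possible version is sketched here: it packs consecutive paragraphs greedily until the budget is reached. `approxTokens` (about 4 characters per token) is a rough assumption used only to keep the example self-contained; pass a real tokenizer as `count` in practice.

```ts
/** Rough estimate: about 4 characters per token. Replace with a real tokenizer. */
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function splitByParagraph(
  section: string,
  maxTokens: number,
  count: (t: string) => number = approxTokens,
): string[] {
  // Paragraphs are separated by one or more blank lines
  const paragraphs = section
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (count(candidate) <= maxTokens) {
      current = candidate; // still fits: keep packing
    } else {
      if (current) chunks.push(current);
      current = para; // an oversized single paragraph becomes its own chunk
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A single paragraph larger than `maxTokens` is emitted as-is rather than cut mid-sentence; splitting it further by sentence would be a reasonable extension.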

### Semantic chunking
```ts
// `embed` (text → vector) and `cosine` (vector similarity) are assumed helpers.
async function semanticChunk(text: string, threshold = 0.7): Promise<string[]> {
  // Split after sentence-ending punctuation, keeping the punctuation itself
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  if (sentences.length === 0) return [];
  const embeddings = await Promise.all(sentences.map(s => embed(s)));

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    const sim = cosine(embeddings[i - 1], embeddings[i]);
    if (sim < threshold) {
      // Similarity dropped: treat it as a topic boundary
      chunks.push(current.join(' '));
      current = [sentences[i]];
    } else {
      current.push(sentences[i]);
    }
  }
  chunks.push(current.join(' '));

  return chunks;
}
```

→ A change in meaning marks a chunk boundary.
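
The `cosine` helper used above is not defined in this note. A minimal version is sketched below; returning 0 for a zero vector is a choice made here to avoid dividing by zero, not something the original specifies.

```ts
/** Cosine similarity of two equal-length vectors, in the range [-1, 1]. */
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: define similarity as 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The 0.7 boundary threshold depends heavily on the embedding model, so it is worth tuning on a sample of real documents.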

### Map-reduce (long doc)
```ts
// Map: summarize each chunk (in parallel)
const summaries = await Promise.all(chunks.map(chunk =>
  llm.summarize(chunk)
));

// Reduce: combine the summaries into one answer
const final = await llm.complete({
  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
});
```

→ Distributed processing.
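
For very long documents the combined summaries may themselves exceed the reduce call's budget. One way to handle that, sketched here under the assumption of a rough 4-characters-per-token estimate, is to group the summaries into batches, reduce each batch, and repeat until a single batch remains. `batchByBudget` is an illustrative name.

```ts
/** Group items into consecutive batches whose estimated token total stays within maxTokens. */
function batchByBudget(
  items: string[],
  maxTokens: number,
  count: (t: string) => number = t => Math.ceil(t.length / 4), // rough estimate
): string[][] {
  const batches: string[][] = [];
  let current: string[] = [];
  let used = 0;

  for (const item of items) {
    const tokens = count(item);
    // Start a new batch when adding this item would overflow the current one
    if (current.length > 0 && used + tokens > maxTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(item); // an oversized item still gets a batch of its own
    used += tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Each batch would then go through the same reduce prompt, and the resulting partial answers are batched again if they still do not fit.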

### Refine (iterative)
```ts
let answer = '';
for (const chunk of chunks) {
  // Each pass sees the running answer plus one new chunk
  answer = await llm.complete({
    system: `Refine the answer based on new info.\nCurrent: ${answer}`,
    user: `New info: ${chunk}\nQuestion: ${query}`,
  });
}
```

→ Incremental improvement, at the cost of one sequential call per chunk.

### Context window calculation
```ts
import { encoding_for_model } from 'tiktoken';

// tiktoken implements OpenAI tokenizers; counts for Claude or Gemini are only approximate
const enc = encoding_for_model('gpt-4o');

function countTokens(text: string): number {
  return enc.encode(text).length;
}

function fitsInContext(text: string, max: number): boolean {
  return countTokens(text) <= max;
}

// Each model has a different input budget (window minus reserved output)
const BUDGETS = {
  'gpt-4o': 128_000 - 16_000, // 16K reserved for output
  'claude-opus-4-7': 200_000 - 16_000, // assumes the standard 200K window
  'gemini-2.5-pro': 1_000_000 - 64_000,
};
```

### Cost estimation
```ts
// $ per 1M tokens as [input, output]; prices change, so confirm on the vendor pricing pages
const RATES: Record<string, [number, number]> = {
  'gpt-4o': [2.5, 10],
  'claude-opus-4-7': [15, 75],
  'gemini-2.5-pro': [2.5, 15],
};

function estimateCost(inputTokens: number, model: string, outputTokens = 0): number {
  const rate = RATES[model];
  if (!rate) throw new Error(`Unknown model: ${model}`);
  const [input, output] = rate;
  return (inputTokens / 1_000_000) * input + (outputTokens / 1_000_000) * output;
}

// 1M input tokens × Claude rate = $15
// → Cache reads cut the cached portion by about 90%
```

### Long context use cases
```
✅ Analyzing one large document (book, codebase, log)
✅ Code review (whole file)
✅ Document Q&A (single doc)
✅ Comparison (multiple docs)

⚠️ Slow (1M tokens can take 30s or more)
⚠️ Expensive
⚠️ Lost in the middle
```

### Long context vs RAG
```
Long context:
+ Simple: inject everything
+ Precise (nothing is cherry-picked out)
- Expensive
- Slow
- Lost in the middle

RAG:
+ Fast
+ Cheap
+ Scales (large corpus)
- Depends on retrieval quality
- Wrong chunk = wrong answer

→ Mix them depending on the situation.
```

### Hybrid
```ts
async function answer(query: string, document: string) {
  // 50K is a tunable threshold, not a fixed rule
  if (countTokens(document) < 50_000) {
    // Small enough: send the whole document directly
    return await llm.complete({ context: document, query });
  } else {
    // Large: retrieve first (RAG), then answer from the top chunks
    const chunks = await chunkAndEmbed(document);
    const relevant = await semanticSearch(query, chunks, 10);
    return await llm.complete({ context: relevant.join('\n'), query });
  }
}
```

### Streaming + long context
```ts
// Long context means a large input, but the output can still be streamed
const stream = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages, // the long-context messages array built elsewhere
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

### Eval (long context)
```
- Needle in a haystack: one fact placed at N positions, measure accuracy
- Multi-needle: several facts
- Reasoning across chunks: connect facts from different chunks
```
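
Scoring for the needle and multi-needle checks can be as simple as string matching. The sketch below assumes each planted fact is a short literal string that a correct answer must contain; `scoreNeedles` and `recallByDepth` are illustrative names, and exact-substring matching is a simplification that will miss paraphrased answers.

```ts
interface NeedleResult { recall: number; missing: string[] }

/** Case-insensitive check of which expected facts appear in the model's answer. */
function scoreNeedles(answer: string, expected: string[]): NeedleResult {
  const haystack = answer.toLowerCase();
  const missing = expected.filter(fact => haystack.indexOf(fact.toLowerCase()) === -1);
  const recall =
    expected.length === 0 ? 1 : (expected.length - missing.length) / expected.length;
  return { recall, missing };
}

/** Mean recall per needle depth across many runs, sorted by depth. */
function recallByDepth(
  runs: { depth: number; recall: number }[],
): { depth: number; mean: number }[] {
  const depths = runs
    .map(r => r.depth)
    .filter((d, i, all) => all.indexOf(d) === i)
    .sort((a, b) => a - b);
  return depths.map(depth => {
    const atDepth = runs.filter(r => r.depth === depth);
    const mean = atDepth.reduce((sum, r) => sum + r.recall, 0) / atDepth.length;
    return { depth, mean };
  });
}
```

A dip in `mean` around the middle depths is the signature of the lost-in-the-middle effect for that model and prompt length.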

### Token budget allocation
```ts
const TOTAL = 128_000;    // model context window
const RESPONSE = 16_000;  // reserved for output
const SYSTEM = 2_000;     // system prompt
const HISTORY = 30_000;   // chat history
const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY; // 80_000 left for documents

// If the document is larger than CONTEXT: chunk + retrieve
```
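
The same arithmetic can be wrapped in a small helper that also makes the direct-versus-retrieve decision. This is a sketch only; `Budget`, `contextBudget`, and `planFor` are illustrative names.

```ts
interface Budget { total: number; response: number; system: number; history: number }

/** Tokens left for documents after the fixed reservations are subtracted. */
function contextBudget(b: Budget): number {
  const remaining = b.total - b.response - b.system - b.history;
  if (remaining <= 0) throw new Error('Reserved tokens exceed the context window');
  return remaining;
}

type Plan = 'direct' | 'chunk-and-retrieve';

/** Inject the document directly if it fits; otherwise fall back to chunk + retrieve. */
function planFor(documentTokens: number, b: Budget): Plan {
  return documentTokens <= contextBudget(b) ? 'direct' : 'chunk-and-retrieve';
}
```

With the figures above (128K total, 16K response, 2K system, 30K history) the document budget is 80,000 tokens, so an 80,001-token document already falls into the chunk-and-retrieve branch.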

### Continuous chat
```ts
class ChatSession {
  private messages: Message[] = [];
  private maxTokens = 100_000;

  async send(userMsg: string) {
    this.messages.push({ role: 'user', content: userMsg });

    // Condense once the whole history exceeds the budget
    const used = this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
    if (used > this.maxTokens) {
      this.messages = await condenseHistory(this.messages);
    }

    const r = await llm.complete({ messages: this.messages });
    this.messages.push({ role: 'assistant', content: r });
    return r;
  }
}
```

## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K) | Direct + cache |
| Large (200K+) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Single doc, in depth | Direct (long context) |
| Long conversation | Sliding window + summarize |
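
The table can be encoded as a small routing function. The precedence used here (conversation first, then multiple documents, then size) is an assumption, since the rows overlap, and the "single doc, in depth" row is left as a caller override rather than encoded. `Situation` and `chooseStrategy` are illustrative names.

```ts
type Strategy = 'direct' | 'direct+cache' | 'rag' | 'sliding-window+summarize';

interface Situation {
  tokens: number;          // total document tokens
  docCount: number;        // number of documents
  isConversation: boolean; // long-running chat rather than document Q&A
}

function chooseStrategy(s: Situation): Strategy {
  if (s.isConversation) return 'sliding-window+summarize';
  if (s.docCount > 1) return 'rag';
  if (s.tokens < 30_000) return 'direct';
  if (s.tokens <= 200_000) return 'direct+cache';
  return 'rag';
}
```

The thresholds mirror the table and should be tuned per model and budget.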

## ❌ Anti-patterns
- **Injecting everything and assuming the context is used perfectly**: lost in the middle.
- **Critical info in the middle**: move it to the end.
- **No cache while repeating the same context**: wasted cost.
- **Unbounded history**: token explosion.
- **Treating RAG vs long context as either/or**: use a hybrid.
- **Chunking that cuts mid-sentence**: loses meaning.
- **Ignoring token counts**: errors or cost shock.

## 🤖 LLM Usage Hints
- Lost in the middle: place key content near the end.
- Cache large contexts.
- RAG + long context together usually works best.
- Measure token counts up front with tiktoken.

## 🔗 Related Documents
- [[AI_RAG_Pattern_Basics]]
- [[AI_Prompt_Caching]]
- [[AI_RAG_Advanced]]