---
id: ai-long-context-management
title: Long Context - 1M+ token usage / Compression / Chunk
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, context, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [long context, context window, lost in the middle, recency bias, compression]
---

# Long Context Management

> 1M+ token models exist (Gemini, Claude). **But "lost in the middle" still applies: the start and the end of the context get the most attention.** RAG, compression, and hierarchical summaries remain valuable.

## 📖 Key Concepts

- Context window: 1M+ (Gemini 2.5 Pro), 200K (Claude Opus).
- Lost in the middle: tokens in the middle of the context are the most likely to be ignored.
- Recency bias: content near the end has the most influence.
- Token cost: a larger context means a higher cost.

## 💻 Code Patterns

### Long context model (2026)

```
Gemini 2.5 Pro:   1M tokens
Claude Opus 4.7:  1M tokens
GPT-4.1:          1M tokens
Llama 3.3:        128K tokens
```

→ Enough for a whole book or a large codebase. Limits vary by model version and access tier, so check the provider's current documentation.

### Lost in the middle

```
Test: "Somewhere in this document there is 'X'. What is 'X'?"

Accuracy by position of the fact (illustrative):
- Start: 95%
- 25%:   75%
- 50%:   60%
- 75%:   80%
- End:   95%
```

→ Data placed in the middle is used poorly.

### Strategy 1: Important data at the end

```ts
const messages = [
  { role: 'system', content: SYSTEM_PROMPT },
  {
    role: 'user',
    content: `
${largeContext}

# Recent / important context
${importantStuff}

# Question
${userQuery}
`,
  },
];
```

→ The model attends more to the end.

### Strategy 2: Retrieval + small context

```
Long context (1M) is consistently expensive and loses information.
RAG (5K tokens of relevant chunks) is often better.

→ Relevance matters more than length.
```

### Strategy 3: Hierarchical

```
1. Summarize each chunk (small LLM)
2. Use the summaries as context
3. Request a specific chunk when needed

[chunk 1 summary]
[chunk 2 summary]
...
[chunk 100 summary]
   ↓
"Need detail of chunk 47" → fetch full
```

→ Navigation for long documents.
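
The hierarchical flow above can be sketched with three small helpers. This is a minimal sketch, not a library API: `ChunkEntry`, `renderIndex`, `parseDetailRequests`, and `expandChunks` are hypothetical names, the summaries are assumed to have been produced ahead of time by a small model, and the "Need detail of chunk N" phrasing is an assumed convention that the system prompt would have to establish.

```typescript
// Hypothetical types and helpers for illustration only.
interface ChunkEntry {
  id: number;      // 1-based chunk number shown to the model
  summary: string; // short summary produced ahead of time by a small model
  full: string;    // original chunk text, kept out of the prompt until requested
}

// Render the compact index that goes into the prompt instead of the full document.
function renderIndex(entries: ChunkEntry[]): string {
  return entries.map(e => `[chunk ${e.id}] ${e.summary}`).join('\n');
}

// Extract unique chunk ids from a reply such as "Need detail of chunk 47".
function parseDetailRequests(reply: string): number[] {
  const pattern = /need detail of chunk (\d+)/gi;
  const ids: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(reply)) !== null) {
    const id = Number(match[1]);
    if (ids.indexOf(id) === -1) ids.push(id);
  }
  return ids;
}

// Look up the full text for the requested ids; ids that do not exist are skipped.
function expandChunks(entries: ChunkEntry[], ids: number[]): string {
  const parts: string[] = [];
  for (const id of ids) {
    const entry = entries.find(e => e.id === id);
    if (entry) parts.push(`# chunk ${entry.id} (full)\n${entry.full}`);
  }
  return parts.join('\n\n');
}
```

In use, the first call sends `renderIndex(entries)` plus the question. If the reply contains detail requests, a second call appends `expandChunks(entries, ids)` so that only the requested chunks ever enter the context at full length.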

### Strategy 4: Multi-step

```ts
// `llm` is a placeholder client, not a specific SDK

// Step 1: Question understanding
const questionType = await llm.analyze(query);

// Step 2: Relevant sections (small model)
const sections = await llm.identify(largeDoc, questionType);

// Step 3: Detailed answer (big model)
const answer = await llm.complete({
  context: sections,
  query,
});
```

→ Separates retrieval from reasoning.

### Strategy 5: Compression

```ts
// LLMLingua / LongLLMLingua
// Original:   10K tokens
// Compressed: ~3K tokens at ratio 0.3 (key info only)
// NOTE: LLMLingua is a Python library; 'llmlingua-js' and `compress`
// are placeholders for whatever binding or service wraps it.
import { compress } from 'llmlingua-js';

const compressed = await compress(longText, { ratio: 0.3 });
```

→ About 70% fewer tokens. Accuracy is largely preserved on many tasks, but measure it on your own data.

### Sliding window (chat history)

```ts
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  if (messages.length === 0) return [];

  // Always keep the system message, if present
  const hasSystem = messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  let total = hasSystem ? countTokens(messages[0].content) : 0;

  // Walk from newest to oldest; stop at the first message that does not fit
  const recent: Message[] = [];
  for (let i = messages.length - 1; i >= (hasSystem ? 1 : 0); i--) {
    const tokens = countTokens(messages[i].content);
    if (total + tokens > maxTokens) break;
    total += tokens;
    recent.unshift(messages[i]); // keep chronological order
  }

  return [...system, ...recent];
}
```

### Summarizing old messages

```ts
async function condenseHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length < 20) return messages;

  // Keep the original system prompt out of the summary so it is not lost
  const hasSystem = messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  const rest = hasSystem ? messages.slice(1) : messages;

  const old = rest.slice(0, -10);
  const recent = rest.slice(-10);

  const summary = await llm.complete({
    system: 'Summarize this conversation in 200 words. Keep key facts.',
    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
  });

  return [
    ...system,
    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
    ...recent,
  ];
}
```

→ Stays within the context window.
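
The summarization step above keeps a fixed number of recent messages. A variation is to pick the cut point by token budget instead, so that a few very long messages cannot blow past the window. The sketch below is one possible way to do that; `splitHistory`, `HistorySplit`, and `approxTokens` are hypothetical names, and the 4-characters-per-token estimate is a rough assumption that should be replaced by a real tokenizer wherever limits or billing matter.

```typescript
interface Message {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough estimate (assumption): about 4 characters per token for English text.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface HistorySplit {
  system: Message[]; // leading system message, if any (never summarized)
  old: Message[];    // candidates for summarization
  recent: Message[]; // kept verbatim, in chronological order
}

// Split history so that `recent` holds the newest messages that fit in recentBudget.
function splitHistory(messages: Message[], recentBudget: number): HistorySplit {
  const hasSystem = messages.length > 0 && messages[0].role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  const rest = hasSystem ? messages.slice(1) : messages;

  // Walk backwards from the newest message until the budget is used up.
  let used = 0;
  let cut = rest.length;
  while (cut > 0) {
    const cost = approxTokens(rest[cut - 1].content);
    if (used + cost > recentBudget) break;
    used += cost;
    cut--;
  }

  return { system, old: rest.slice(0, cut), recent: rest.slice(cut) };
}
```

The `old` part would be passed to the summarizer and the result reassembled as system prompt, summary, then `recent`, mirroring the fixed-count version.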

### Caching (Anthropic)

```ts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The same large context is sent repeatedly → cache it
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024, // required by the Messages API
  system: [
    {
      type: 'text',
      text: hugeDoc,  // 200K tokens
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: question }],
});
```

→ Cached reads are billed at roughly 10% of the normal input price, so follow-up calls cost about 90% less for the cached portion. The initial cache write costs more than normal input. → [[AI_Prompt_Caching]].

### Chunking strategy

```
Fixed size: simple, but cuts through meaning.
Sentence: natural.
Paragraph: a unit of meaning.
Section (heading): a large boundary.
Semantic: a model (LLM or embeddings) decides the boundary.

→ Prefer the most meaningful boundary available.
```

```ts
function smartChunk(doc: string, maxTokens = 1000): string[] {
  // Split before each level-2 markdown header; the lookahead keeps "## " in the chunk
  const sections = doc.split(/\n(?=##\s)/);
  const chunks: string[] = [];

  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Too large: split further by paragraph (splitByParagraph is an assumed helper)
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }

  return chunks;
}
```

### Semantic chunking

```ts
async function semanticChunk(text: string): Promise<string[]> {
  // Split after sentence-ending punctuation; the lookbehind keeps the punctuation
  const sentences = text.split(/(?<=[.!?])\s+/);
  // embed() and cosine() are assumed helpers (embedding call, cosine similarity)
  const embeddings = await Promise.all(sentences.map(s => embed(s)));

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    const sim = cosine(embeddings[i - 1], embeddings[i]);
    if (sim < 0.7) {  // Boundary (0.7 is a starting point; tune per embedding model)
      chunks.push(current.join(' '));
      current = [sentences[i]];
    } else {
      current.push(sentences[i]);
    }
  }
  chunks.push(current.join(' '));

  return chunks;
}
```

→ A change in meaning marks a chunk boundary.

### Map-reduce (long doc)

```ts
// Map: summarize each chunk
const summaries = await Promise.all(chunks.map(chunk =>
  llm.summarize(chunk)
));

// Reduce: combine the summaries
const final = await llm.complete({
  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
});
```

→ Distributed processing.

### Refine (iterative)

```ts
let answer = '';
for (const chunk of chunks) {
  answer = await llm.complete({
    system: `Refine the answer based on new info.\nCurrent: ${answer}`,
    user: `New info: ${chunk}\nQuestion: ${query}`,
  });
}
```

→ Incremental improvement.
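
The chunking snippets above (`smartChunk`, `semanticChunk`) call `splitByParagraph` and `cosine` without defining them. The following is one possible minimal implementation, offered as a sketch rather than the intended one: the greedy packing policy, the optional `count` parameter, and its default 4-characters-per-token estimate are all assumptions.

```typescript
// Cosine similarity between two equal-length embedding vectors.
// Returns 0 when either vector has zero magnitude.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Greedily pack paragraphs (separated by blank lines) into chunks of at most maxTokens.
// A single paragraph larger than maxTokens is emitted as its own oversized chunk.
function splitByParagraph(
  section: string,
  maxTokens: number,
  count: (s: string) => number = s => Math.ceil(s.length / 4), // rough estimate (assumption)
): string[] {
  const paragraphs = section
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (current && count(candidate) > maxTokens) {
      // Adding this paragraph would overflow: close the current chunk first.
      chunks.push(current);
      current = p;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Passing the document's own tokenizer-based `countTokens` as the third argument keeps chunk sizes consistent with the rest of the budget calculations.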

### Context window calculation

```ts
import { encoding_for_model } from 'tiktoken';

// tiktoken is OpenAI's tokenizer; counts for Claude or Gemini are only approximate
const enc = encoding_for_model('gpt-4o');

function countTokens(text: string): number {
  return enc.encode(text).length;
}

function fitsInContext(text: string, max: number): boolean {
  return countTokens(text) <= max;
}

// Each model has a different budget
const BUDGETS = {
  'gpt-4o': 128_000 - 16_000,  // 16K reserved for output
  'claude-opus-4-7': 200_000 - 16_000,
  'gemini-2.5-pro': 1_000_000 - 64_000,
};
```

### Cost estimation

```ts
// Returns the estimated input cost in USD (output tokens are not included).
// Prices change; verify against each provider's pricing page.
function estimateCost(tokens: number, model: string): number {
  const rates: Record<string, [number, number]> = {
    'gpt-4o': [2.5, 10],  // $/1M (input, output)
    'claude-opus-4-7': [15, 75],
    'gemini-2.5-pro': [2.5, 15],
  };
  const rate = rates[model];
  if (!rate) throw new Error(`Unknown model: ${model}`);
  const [input] = rate;
  return (tokens / 1_000_000) * input;
}

// 1M tokens × Claude = $15 input
// → Caching cuts the cost of repeated input by up to 90%
```

### Long context use case

```
✅ Analyzing one large document (book, codebase, log)
✅ Code review (whole file)
✅ Document Q&A (single doc)
✅ Comparison (multi doc)

⚠️ Slow latency (1M tokens = 30s+)
⚠️ High cost
⚠️ Lost in the middle
```

### Long context vs RAG

```
Long context:
+ Simple: inject everything
+ Precise (nothing is cherry-picked away)
- Expensive
- Slow
- Lost in the middle

RAG:
+ Fast
+ Cheap
+ Scales (large corpus)
- Depends on retrieval quality
- Wrong chunk = wrong answer

→ Mix the two depending on the situation.
```

### Hybrid

```ts
async function answer(query: string, document: string) {
  if (countTokens(document) < 50_000) {
    // Small enough: pass the document directly
    return await llm.complete({ context: document, query });
  } else {
    // Large: retrieve first (RAG)
    const chunks = await chunkAndEmbed(document);
    const relevant = await semanticSearch(query, chunks, 10);
    return await llm.complete({ context: relevant.join('\n'), query });
  }
}
```

### Streaming + long context

```ts
// Long context = large input, but the output can still be streamed
const stream = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages, // conversation built elsewhere
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

### Eval (long context)

```
- Needle in a haystack: one fact placed at N positions; measure accuracy
- Multi-needle: several facts
- Reasoning across: connecting facts from different chunks
```

### Token budget allocation

```ts
const TOTAL = 128_000;
const RESPONSE = 16_000;
const SYSTEM = 2_000;
const HISTORY = 30_000;
const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY;

// If the document is larger than CONTEXT: chunk + retrieve
```

### Continual chat

```ts
class ChatSession {
  private messages: Message[] = [];
  private maxTokens = 100_000;

  async send(userMsg: string) {
    this.messages.push({ role: 'user', content: userMsg });

    // Condense if the history exceeds the budget
    const total = this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
    if (total > this.maxTokens) {
      this.messages = await condenseHistory(this.messages);
    }

    const r = await llm.complete({ messages: this.messages });
    this.messages.push({ role: 'assistant', content: r });
    return r;
  }
}
```

## 🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K) | Direct + cache |
| Large (200K+) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Single doc, in depth | Direct (long context) |
| Long conversation | Sliding + summarize |

## ❌ Anti-patterns

- **Inject everything and assume the context is used perfectly**: lost in the middle.
- **Critical info in the middle**: move it to the end.
- **No cache while repeating the same context**: wasted cost.
- **Unbounded history**: token explosion.
- **Treating RAG vs long context as either/or**: use a hybrid.
- **Chunking that cuts mid-sentence**: loses meaning.
- **Ignoring token counts**: errors / cost shock.

## 🤖 LLM Usage Hints

- Lost in the middle: place key content near the end.
- Cache large contexts.
- RAG + long context = best.
- Measure token counts in advance with tiktoken.

## 🔗 Related Documents

- [[AI_RAG_Pattern_Basics]]
- [[AI_Prompt_Caching]]
- [[AI_RAG_Advanced]]