---
id: ai-long-context-management
title: Long Context - 1M+ Token Usage / Compression / Chunking
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, llm, context, vibe-coding]
tech_stack: { language: "TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [long context, context window, lost in the middle, recency bias, compression]
---

# Long Context Management

> 1M+ token models are here (Gemini, Claude). **But "lost in the middle" still applies: the start and end of the context get the most attention.** RAG, compression, and hierarchical approaches remain valuable.

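## 📖 Key Concepts
- Context window: 1M+ tokens (Gemini 2.5 Pro), 200K (Claude Opus). Limits vary by model version, so check the provider docs.
- Lost in the middle: information in the middle of the context is recalled worst.
- Recency bias: content near the end has the most influence.
- Token cost: a larger context means a higher cost.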

## 💻 Code Patterns

### Long context model (2026)
```
Gemini 2.5 Pro:    1M tokens
Claude Opus 4.7:   1M tokens
GPT-4.1:           1M tokens
Llama 3.3:         128K tokens
```

→ Enough to fit a whole book plus a large codebase.

### Lost in the middle
```
Test:
"Somewhere in this document there is an 'X'. What is 'X'?"

Accuracy by position of 'X' (illustrative):
- Start: 95%
- 25%:   75%
- 50%:   60%
- 75%:   80%
- End:   95%
```

→ Data placed in the middle tends to go unused.

### Strategy 1: Put important data at the end
```ts
const messages = [
  { role: 'system', content: SYSTEM_PROMPT },
  { role: 'user', content: `
${largeContext}

# Recent / important context
${importantStuff}

# Question
${userQuery}
` },
];
```

→ The model attends more to the end of the prompt.
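
When the large context is itself a set of retrieved chunks with relevance scores, the same idea can be applied to their order: put the strongest chunks at the edges and the weakest in the middle. The sketch below is one possible way to do that. `ScoredChunk` and `orderForEdges` are hypothetical names, and the scores are assumed to come from whatever retriever is in use.

```typescript
// Hypothetical shape for a retrieved chunk; not part of any library.
interface ScoredChunk {
  text: string;
  score: number; // higher = more relevant
}

// Put the most relevant chunks at the start and end of the context
// and the least relevant ones in the middle.
function orderForEdges(chunks: ScoredChunk[]): ScoredChunk[] {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const front: ScoredChunk[] = [];
  const back: ScoredChunk[] = [];
  sorted.forEach((chunk, rank) => {
    // Alternate sides: the best chunk goes last (closest to the question),
    // the second best goes first, and so on toward the middle.
    if (rank % 2 === 0) back.unshift(chunk);
    else front.push(chunk);
  });
  return [...front, ...back];
}
```

With scores 5, 4, 3, 2, 1 this yields the order 4, 2, 1, 3, 5, so the weakest chunk sits in the middle where it is least likely to matter.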

### Strategy 2: Retrieval + small context
```
A full long context (1M tokens) is consistently expensive and still loses information.
RAG (about 5K tokens of relevant chunks) is often better.

→ Relevance matters more than length.
```

### Strategy 3: Hierarchical
```
1. Summarize each chunk (small LLM)
2. Use the summaries as the context
3. Request a specific chunk when more detail is needed

[chunk 1 summary] [chunk 2 summary] ... [chunk 100 summary]
↓
"Need detail of chunk 47" → fetch full
```

→ A way to navigate long documents.
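
The three steps above can be sketched as a small index that holds one summary per chunk and hands back the full chunk on request. Everything here is hypothetical naming (`SummaryIndex`, `Summarizer`), and the summarizer is synchronous only to keep the sketch short; a real LLM call would be async.

```typescript
// Hypothetical types and names, for illustration only.
type Summarizer = (text: string) => string;

class SummaryIndex {
  private readonly chunks: string[];
  private readonly summaries: string[];

  constructor(chunks: string[], summarize: Summarizer) {
    this.chunks = chunks;
    // Steps 1-2: one short summary per chunk; the summaries become the context
    this.summaries = chunks.map(chunk => summarize(chunk));
  }

  // Compact overview for the prompt, numbered so the model can ask for a chunk
  overview(): string {
    return this.summaries.map((s, i) => `[chunk ${i + 1}] ${s}`).join('\n');
  }

  // Step 3: return the full text of one chunk (1-based id)
  fetch(id: number): string {
    const chunk = this.chunks[id - 1];
    if (chunk === undefined) throw new RangeError(`No chunk ${id}`);
    return chunk;
  }
}
```

The model first sees only `overview()`; when it answers with something like "Need detail of chunk 47", the caller passes `fetch(47)` into the next turn.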

### Strategy 4: Multi-step
```ts
// Step 1: Question understanding
const questionType = await llm.analyze(query);

// Step 2: Find the relevant sections (small model)
const sections = await llm.identify(largeDoc, questionType);

// Step 3: Detailed answer (big model)
const answer = await llm.complete({
  context: sections,
  query,
});
```

→ Separates retrieval from reasoning.

### Strategy 5: Compression
```ts
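// LLMLingua / LongLLMLingua are Python libraries.
// Original: 10K tokens
// Compressed: 3K tokens (key info only)

// NOTE: 'llmlingua-js' and this compress() signature are placeholders,
// not a verified package; adapt to whichever binding you actually use.
import { compress } from 'llmlingua-js';
const compressed = await compress(longText, { ratio: 0.3 });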
```

→ Cuts tokens by about 70% (ratio 0.3) while keeping most of the accuracy.

### Sliding window (chat history)
```ts
function trimHistory(messages: Message[], maxTokens: number): Message[] {
  // Always keep a leading system message, if there is one
  const hasSystem = messages[0]?.role === 'system';
  const system = hasSystem ? [messages[0]] : [];
  let total = hasSystem ? countTokens(messages[0].content) : 0;

  // Walk backwards from the newest message until the budget runs out
  const recent: Message[] = [];
  for (let i = messages.length - 1; i >= system.length; i--) {
    const tokens = countTokens(messages[i].content);
    if (total + tokens > maxTokens) break;
    total += tokens;
    recent.unshift(messages[i]); // keep chronological order
  }

  return [...system, ...recent];
}
```

### Summarizing old messages
```ts
async function condenseHistory(messages: Message[]): Promise<Message[]> {
  if (messages.length < 20) return messages;
  
  const old = messages.slice(0, -10);
  const recent = messages.slice(-10);
  
  const summary = await llm.complete({
    system: 'Summarize this conversation in 200 words. Keep key facts.',
    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
  });
  
  return [
    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
    ...recent,
  ];
}
```

→ The conversation stays within the context window.

### Caching (Anthropic)
```ts
// The same large context is sent repeatedly, so cache it
const r = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,  // required by the Messages API
  system: [
    {
      type: 'text',
      text: hugeDoc,  // 200K tokens
      cache_control: { type: 'ephemeral', ttl: '1h' },
    },
  ],
  messages: [{ role: 'user', content: question }],
});
```

→ Subsequent calls that hit the cache pay roughly 90% less for the cached input tokens.

→ See [[AI_Prompt_Caching]].

### Chunking strategy
```
Fixed size: simple, but cuts through meaning.
Sentence: natural.
Paragraph: a unit of meaning.
Section (heading): a large boundary.
Semantic: a model decides the boundary.

→ Pick the most meaningful boundary.
```

```ts
function smartChunk(doc: string, maxTokens = 1000): string[] {
  // Split by markdown header first
  const sections = doc.split(/\n##\s+/);
  
  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Still too large: split further by paragraph
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
```
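
`smartChunk` relies on a `splitByParagraph` helper that this note does not define. One possible version is sketched below. It packs paragraphs greedily and uses a rough 4-characters-per-token estimate (`approxTokens`, an assumption made for this sketch) instead of a real tokenizer.

```typescript
// Rough token estimate: about 4 characters per token for English text.
// This is an assumption for the sketch; swap in a real tokenizer for accuracy.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack paragraphs into chunks that stay within maxTokens.
// A single paragraph larger than maxTokens is emitted as its own oversized chunk.
function splitByParagraph(section: string, maxTokens: number): string[] {
  const paragraphs = section
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && approxTokens(candidate) > maxTokens) {
      // Adding this paragraph would overflow: close the current chunk
      chunks.push(current);
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Packing whole paragraphs keeps each chunk on a natural boundary, at the cost of uneven chunk sizes.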

### Semantic chunking
```ts
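async function semanticChunk(text: string): Promise<string[]> {
  // Split after sentence-ending punctuation, keeping the punctuation itself
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  if (sentences.length === 0) return [];
  const embeddings = await Promise.all(sentences.map(s => embed(s)));

  const chunks: string[] = [];
  let current: string[] = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    // Compare each sentence with the one before it
    const sim = cosine(embeddings[i - 1], embeddings[i]);
    if (sim < 0.7) {
      // Similarity dropped below the threshold (tune per embedding model): new chunk
      chunks.push(current.join(' '));
      current = [sentences[i]];
    } else {
      current.push(sentences[i]);
    }
  }
  chunks.push(current.join(' '));

  return chunks;
}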
```

→ A shift in meaning marks a chunk boundary.
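
The `semanticChunk` example calls `cosine` without defining it. A minimal version of cosine similarity, assuming embeddings are plain `number[]` vectors of equal length, might look like this:

```typescript
// Cosine similarity between two equal-length vectors.
// Returns a value in [-1, 1]; higher means the vectors point in more similar directions.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as unrelated to everything rather than dividing by zero
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Many embedding APIs return unit-length vectors, in which case the dot product alone gives the same ranking; the full formula is kept here so the helper works either way.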

### Map-reduce (long doc)
```ts
// Map: summarize each chunk
const summaries = await Promise.all(chunks.map(chunk =>
  llm.summarize(chunk)
));

// Reduce: combine the summaries
const final = await llm.complete({
  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
});
```

→ Chunks are processed in parallel, then merged.

### Refine (iterative)
```ts
let answer = '';
for (const chunk of chunks) {
  answer = await llm.complete({
    system: `Refine the answer based on new info.\nCurrent: ${answer}`,
    user: `New info: ${chunk}\nQuestion: ${query}`,
  });
}
```

→ The answer improves incrementally, one chunk at a time.

### Calculating context window usage
```ts
import { encoding_for_model } from 'tiktoken';

const enc = encoding_for_model('gpt-4o');

function countTokens(text: string): number {
  return enc.encode(text).length;
}

function fitsInContext(text: string, max: number): boolean {
  return countTokens(text) <= max;
}

// Each model has its own budget (context window minus reserved output).
// tiktoken only approximates token counts for non-OpenAI models.
const BUDGETS = {
  'gpt-4o': 128_000 - 16_000,  // 16K reserved for output
  'claude-opus-4-7': 200_000 - 16_000,
  'gemini-2.5-pro': 1_000_000 - 64_000,
};
```

### Cost estimation
```ts
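function estimateCost(inputTokens: number, model: string, outputTokens = 0): number {
  const rates: Record<string, [number, number]> = {
    'gpt-4o': [2.5, 10],  // $/1M (input, output)
    'claude-opus-4-7': [15, 75],
    'gemini-2.5-pro': [2.5, 15],
  };
  const rate = rates[model];
  if (!rate) throw new Error(`Unknown model: ${model}`);
  const [input, output] = rate;
  return (inputTokens / 1_000_000) * input + (outputTokens / 1_000_000) * output;
}

// 1M input tokens on Claude at $15/1M = $15
// → Cache hits cut the input cost by about 90%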
```
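
The 90% figure applies to each call that reads from the cache, so the overall saving depends on how many calls reuse the same context. The sketch below works that out for repeated calls. `inputCostUSD` and `CachePricing` are hypothetical names, and the multipliers (2x to write a 1-hour cache entry, 0.1x to read it) are assumptions to check against the provider's pricing page.

```typescript
// Multipliers relative to the base input price. The values used below are
// assumptions for this sketch; check the provider's current pricing.
interface CachePricing {
  writeMultiplier: number; // first call, which writes the cache
  readMultiplier: number;  // later calls, which read from it
}

// Input cost of sending the same context `calls` times, with or without caching.
// Assumes every call after the first lands within the cache TTL.
function inputCostUSD(
  tokens: number,
  calls: number,
  ratePerMillion: number,
  cache?: CachePricing,
): number {
  if (calls <= 0) return 0;
  const perCall = (tokens / 1_000_000) * ratePerMillion;
  if (!cache) return perCall * calls;
  return perCall * cache.writeMultiplier + perCall * cache.readMultiplier * (calls - 1);
}

// 200K tokens sent 10 times at $15 per 1M input tokens
const uncached = inputCostUSD(200_000, 10, 15); // about $30
const cached = inputCostUSD(200_000, 10, 15, { writeMultiplier: 2, readMultiplier: 0.1 }); // about $8.70
```

Under these assumptions ten calls cost about 71% less overall, and the saving approaches 90% as the number of cache hits grows.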

### Long context use case
```
✅ Analyzing one large document (book, codebase, log)
✅ Code review (whole file)
✅ Document Q&A (single doc)
✅ Comparison (multiple docs)

⚠️ Slow (1M tokens can take 30s+)
⚠️ Expensive
⚠️ Lost in the middle
```

### Long context vs RAG
```
Long context:
+ Simple: inject everything
+ Precise (nothing is cherry-picked away)
- Expensive
- Slow
- Lost in the middle

RAG:
+ Fast
+ Cheap
+ Scales (large corpus)
- Depends on retrieval quality
- Wrong chunk = wrong answer

→ Mix the two depending on the situation.
```

### Hybrid
```ts
async function answer(query: string, document: string) {
  if (countTokens(document) < 50_000) {
    // Small enough — direct
    return await llm.complete({ context: document, query });
  } else {
    // Large — RAG first
    const chunks = await chunkAndEmbed(document);
    const relevant = await semanticSearch(query, chunks, 10);
    return await llm.complete({ context: relevant.join('\n'), query });
  }
}
```

### Streaming + long context
```ts
// Long context means a large input, but the output can still be streamed
const stream = await openai.chat.completions.create({
  model: 'gpt-4.1',
  messages: [...],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
```

### Eval (long context)
```
- Needle in a haystack: place one fact at N positions and measure accuracy
- Multi-needle: several facts at once
- Reasoning across chunks: connect facts that live in different chunks
```

### Token budget allocation
```ts
const TOTAL = 128_000;
const RESPONSE = 16_000;
const SYSTEM = 2_000;
const HISTORY = 30_000;
const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY;

// If the document is larger than CONTEXT: chunk + retrieve
```
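
The same arithmetic can be wrapped in a small check that decides whether a document fits as-is. This is a sketch under the allocation above; `Budget`, `contextBudget`, and `planContext` are hypothetical names.

```typescript
// Token allocation for one request; field names are illustrative.
interface Budget {
  total: number;    // model context window
  response: number; // reserved for the model's output
  system: number;   // system prompt
  history: number;  // chat history
}

// Tokens left for document context after the fixed reservations.
function contextBudget(b: Budget): number {
  const remaining = b.total - b.response - b.system - b.history;
  if (remaining <= 0) throw new Error('No room left for document context');
  return remaining;
}

// Decide whether the document can be injected directly or must be chunked and retrieved.
function planContext(docTokens: number, b: Budget): 'direct' | 'chunk-and-retrieve' {
  return docTokens <= contextBudget(b) ? 'direct' : 'chunk-and-retrieve';
}

const budget: Budget = { total: 128_000, response: 16_000, system: 2_000, history: 30_000 };
const available = contextBudget(budget); // 80_000 tokens for documents
```

Failing loudly when the reservations already exceed the window is a deliberate choice here, since a silent negative budget would only surface later as an API error.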

### Continual chat
```ts
class ChatSession {
  private messages: Message[] = [];
  private maxTokens = 100_000;
  
  async send(userMsg: string) {
    this.messages.push({ role: 'user', content: userMsg });
    
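    // Condense once the running history exceeds the token budget
    const used = this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
    if (used > this.maxTokens) {
      this.messages = await condenseHistory(this.messages);
    }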
    
    const r = await llm.complete({ messages: this.messages });
    this.messages.push({ role: 'assistant', content: r });
    return r;
  }
}
```

## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K) | Direct + cache |
| Large (200K+) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Deep analysis of a single doc | Direct (long context) |
| Long conversation | Sliding window + summarization |
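
The table can be read as a simple rule chain. The sketch below encodes the size-based rows plus the multi-document and conversation rows. `chooseStrategy` and its types are hypothetical, and the "deep analysis of a single doc" row is left to the caller since it overrides the size thresholds by judgment.

```typescript
type Strategy = 'direct' | 'direct+cache' | 'rag' | 'sliding+summarize';

// Illustrative description of the situation; field names are assumptions.
interface Situation {
  docTokens: number;
  docCount: number;
  isLongConversation?: boolean;
}

// Walk the decision table from the most specific row to the most general.
function chooseStrategy(s: Situation): Strategy {
  if (s.isLongConversation) return 'sliding+summarize';
  if (s.docCount > 1) return 'rag';
  if (s.docTokens < 30_000) return 'direct';
  if (s.docTokens <= 200_000) return 'direct+cache';
  return 'rag';
}
```

For example, a single 120K-token document maps to `'direct+cache'`, while the same content split across several documents maps to `'rag'`.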

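## ❌ Anti-patterns
- **Injecting everything and assuming the context is used perfectly**: lost in the middle.
- **Critical info in the middle**: move it to the end.
- **Repeating the same context without caching**: wasted cost.
- **Unbounded history**: token explosion.
- **Treating RAG vs long context as either/or**: use a hybrid.
- **Chunking that cuts mid-sentence**: loses meaning.
- **Ignoring token counts**: errors and cost shock.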

## 🤖 LLM Usage Hints
- Lost in the middle: keep key content near the end.
- Cache large contexts.
- RAG + long context together works best.
- Measure token counts up front with tiktoken.

## 🔗 Related Documents
- [[AI_RAG_Pattern_Basics]]
- [[AI_Prompt_Caching]]
- [[AI_RAG_Advanced]]