[G1-Sync] Manual knowledge update

2026-05-09 22:47:42 +09:00
parent 93ec7e9056
commit 21ac3ed255
56 changed files with 22043 additions and 43 deletions
@@ -0,0 +1,436 @@
+---
+id: ai-long-context-management
+title: Long Context - Using 1M+ Tokens / Compression / Chunking
+category: Coding
+status: draft
+source_trust_level: B
+verification_status: conceptual
+created_at: 2026-05-09
+updated_at: 2026-05-09
+tags: [ai, llm, context, vibe-coding]
+tech_stack: { language: "TS", applicable_to: ["Backend"] }
+applied_in: []
+aliases: [long context, context window, lost in the middle, recency bias, compression]
+---
+
+# Long Context Management
+
+> 1M+ token models exist (Gemini, Claude). **But "lost in the middle" still applies: the start and end of the context get the most attention.** RAG, compression, and hierarchical approaches remain valuable.
+
+## 📖 Key concepts
- Context window: 1M+ (Gemini 2.5 Pro), 200K (Claude Opus).
- Lost in the middle: tokens in the middle of the context are recalled worst.
- Recency bias: content near the end has the most influence.
- Token cost: a larger context means a larger bill.
+
+## 💻 Code patterns
+
+### Long context models (2026)
+```
+Gemini 2.5 Pro:    1M tokens
+Claude Opus 4.7:   1M tokens
+GPT-4.1:           1M tokens
+Llama 3.3:         128K tokens
+```
+
+→ Enough for an entire book or a large codebase in one request (vendor-stated limits; they change between releases).
+
+### Lost in the middle
+```
+Test:
+"Somewhere in this document there is an 'X'. What is 'X'?"
+
+Accuracy by position of 'X' (illustrative numbers):
+- Start: 95%
+- 25%:   75%
+- 50%:   60%
+- 75%:   80%
+- End:   95%
+```
+
+→ Data placed in the middle is used poorly.
+
+### Strategy 1: Put important data at the end
+```ts
+const messages = [
+  { role: 'system', content: SYSTEM_PROMPT },
+  { role: 'user', content: `
+${largeContext}
+
+# Recent / important context
+${importantStuff}
+
+# Question
+${userQuery}
+` },
+];
+```
+
+→ The model attends more to the end of the prompt.
+
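When several retrieved chunks are injected instead of one blob, the same positional effect suggests an ordering rule: put the strongest chunks at the two ends and let the weakest fall in the middle. A minimal sketch, assuming the input is already sorted most-relevant first; `reorderForEdges` is a hypothetical helper name, not part of any library.

```typescript
// Hypothetical helper: `chunks` must already be sorted most-relevant first.
// Even ranks are appended to the front half, odd ranks are prepended to the
// back half, so the least relevant items end up in the middle.
function reorderForEdges<T>(chunks: readonly T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  chunks.forEach((chunk, rank) => {
    if (rank % 2 === 0) {
      front.push(chunk);
    } else {
      back.unshift(chunk);
    }
  });
  return [...front, ...back];
}
```

With five ranked chunks `A` to `E`, this yields `A, C, E, D, B`: the top two sit at the start and the end, and the weakest (`E`, `D`) land in the middle.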
+### Strategy 2: Retrieval + small context
+```
+Long context (1M tokens) is consistently expensive and still loses information.
+RAG (about 5K tokens of relevant chunks) is often better.
+
+→ Relevance matters more than length.
+```
+
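The retrieval step can be as small as a cosine-similarity ranking over precomputed embeddings. The sketch below assumes each chunk already carries a `vector` produced by some embedding model (not shown); `EmbeddedChunk` and `topK` are illustrative names, not a library API.

```typescript
interface EmbeddedChunk {
  text: string;
  vector: number[]; // assumed to come from an embedding model of your choice
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank chunks by similarity to the query vector and keep the best k.
function topK(queryVector: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosine(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(scored => scored.chunk);
}
```

Only the `k` returned chunks go into the prompt, which is what keeps the context near the 5K-token range mentioned above.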
+### Strategy 3: Hierarchical
+```
+1. Summarize each chunk (small LLM)
+2. Use the summaries as the context
+3. Request a specific chunk when detail is needed
+
+[chunk 1 summary] [chunk 2 summary] ... [chunk 100 summary]
+↓
+"Need detail of chunk 47" → fetch full
+```
+
+→ Navigation for long documents.
+
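The navigation loop above can be sketched with three small functions: render the summaries as a numbered overview, detect a "need detail of chunk N" request in the model's reply, and return the full chunk. The reply format and the names `buildOverview`, `parseDetailRequest`, and `fetchDetail` are assumptions for illustration; the summaries themselves would come from a separate LLM call that is not shown.

```typescript
// Render one line per chunk summary, numbered from 1 like the diagram above.
function buildOverview(summaries: string[]): string {
  return summaries.map((summary, i) => `[chunk ${i + 1}] ${summary}`).join('\n');
}

// Assumed reply convention: the model mentions "chunk <number>" when it wants detail.
function parseDetailRequest(reply: string): number | null {
  const match = /chunk\s+(\d+)/i.exec(reply);
  return match ? Number(match[1]) : null;
}

// Return the full text of the requested chunk, or null if the request is
// missing or out of range.
function fetchDetail(chunks: string[], reply: string): string | null {
  const requested = parseDetailRequest(reply);
  if (requested === null || requested < 1 || requested > chunks.length) return null;
  return chunks[requested - 1] ?? null;
}
```

The overview is what the model sees first; the full chunk is appended only after it asks, so most of the document never enters the context.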
+### Strategy 4: Multi-step
+```ts
+// `llm` is a placeholder client; analyze/identify/complete are illustrative method names.
+// Step 1: Understand the question
+const questionType = await llm.analyze(query);
+
+// Step 2: Locate the relevant sections (small model)
+const sections = await llm.identify(largeDoc, questionType);
+
+// Step 3: Write the detailed answer (big model)
+const answer = await llm.complete({
+  context: sections,
+  query,
+});
+```
+
+→ Separates retrieval from reasoning.
+
+### Strategy 5: Compression
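+```ts
+// LLMLingua / LongLLMLingua style prompt compression
+// Original:   10K tokens
+// Compressed:  3K tokens at ratio 0.3 (key information only)
+
+// NOTE: `llmlingua-js` and `compress` are placeholder names. The reference
+// LLMLingua implementation is a Python library; substitute the binding you use.
+import { compress } from 'llmlingua-js';
+const compressed = await compress(longText, { ratio: 0.3 });
+```
+
+→ Cuts roughly 70% of tokens. Accuracy usually holds up, but measure it on your own task.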
+
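To make the "keep only key information" idea concrete without any dependency, here is a deliberately naive extractive stand-in. It is not LLMLingua (which scores tokens with a small language model); it simply keeps the sentences that share the most words with the query. `naiveCompress` is a hypothetical name, and `ratio` here is the fraction of sentences kept, not tokens.

```typescript
// Naive query-aware compression: keep the top `ratio` share of sentences by
// word overlap with the query, then restore their original order.
function naiveCompress(text: string, query: string, ratio: number): string {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map(s => s.trim())
    .filter(s => s.length > 0);
  const queryTerms = new Set(query.toLowerCase().match(/\w+/g) ?? []);
  const keep = Math.max(1, Math.round(sentences.length * ratio));

  const ranked = sentences
    .map((sentence, index) => {
      const words = sentence.toLowerCase().match(/\w+/g) ?? [];
      const overlap = words.filter(w => queryTerms.has(w)).length;
      return { sentence, index, overlap };
    })
    // Highest overlap first; ties keep the earlier sentence.
    .sort((a, b) => b.overlap - a.overlap || a.index - b.index)
    .slice(0, keep);

  return ranked
    .sort((a, b) => a.index - b.index)
    .map(r => r.sentence)
    .join(' ');
}
```

A real compressor preserves far more nuance; this sketch only shows the shape of the interface (text in, shorter text out, a ratio knob).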
+### Sliding window (chat history)
+```ts
+// Keeps the system message (if any) plus the most recent messages that fit in maxTokens.
+function trimHistory(messages: Message[], maxTokens: number): Message[] {
+  const system = messages[0]?.role === 'system' ? messages[0] : undefined;
+  let total = system ? countTokens(system.content) : 0;
+
+  // Walk backwards from the newest message and stop at the first one that does not fit.
+  const recent: Message[] = [];
+  for (let i = messages.length - 1; i >= (system ? 1 : 0); i--) {
+    const tokens = countTokens(messages[i].content);
+    if (total + tokens > maxTokens) break;
+    total += tokens;
+    recent.unshift(messages[i]); // unshift keeps chronological order
+  }
+
+  return system ? [system, ...recent] : recent;
+}
+```
+
+### Summarizing older messages
+```ts
+async function condenseHistory(messages: Message[]): Promise<Message[]> {
+  if (messages.length < 20) return messages;
+
+  // Keep the original system prompt out of the summary so it is not lost.
+  const hasSystem = messages[0].role === 'system';
+  const system = hasSystem ? [messages[0]] : [];
+  const old = messages.slice(hasSystem ? 1 : 0, -10);
+  const recent = messages.slice(-10);
+
+  const summary = await llm.complete({
+    system: 'Summarize this conversation in 200 words. Keep key facts.',
+    user: old.map(m => `${m.role}: ${m.content}`).join('\n'),
+  });
+
+  return [
+    ...system,
+    { role: 'system', content: `Earlier conversation summary:\n${summary}` },
+    ...recent,
+  ];
+}
+```
+
+→ Keeps the conversation inside the context window.
+
+### Caching (Anthropic)
+```ts
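+// The same large context is sent repeatedly, so cache it
+const r = await anthropic.messages.create({
+  model: 'claude-opus-4-7',
+  max_tokens: 1024,  // required by the Messages API
+  system: [
+    {
+      type: 'text',
+      text: hugeDoc,  // 200K tokens
+      cache_control: { type: 'ephemeral', ttl: '1h' },
+    },
+  ],
+  messages: [{ role: 'user', content: question }],
+});
+```
+
+→ Cache reads are billed at about 10% of the base input rate, so follow-up calls save up to roughly 90% on the cached portion. The first call pays a cache-write premium.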
+
+→ [[AI_Prompt_Caching]].
+
+### Chunking strategy
+```
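+Fixed size: simple, but cuts through meaning.
+Sentence: natural boundaries.
+Paragraph: a unit of meaning.
+Section (heading): coarse boundaries.
+Semantic: a model (LLM or embeddings) decides the boundary.
+
+→ Pick the most meaningful boundary available.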
+```
+
+```ts
+function smartChunk(doc: string, maxTokens = 1000): string[] {
+  // Split before each level-2 markdown header; the lookahead keeps "## " in the section
+  const sections = doc.split(/\n(?=##\s)/);
+
+  const chunks: string[] = [];
+  for (const section of sections) {
+    if (countTokens(section) <= maxTokens) {
+      chunks.push(section);
+    } else {
+      // Too large: split further by paragraph
+      chunks.push(...splitByParagraph(section, maxTokens));
+    }
+  }
+  return chunks;
+}
+```
+
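`smartChunk` calls `splitByParagraph`, which the note does not define. One possible version, offered as a sketch: greedily pack consecutive paragraphs until the next one would exceed the budget. The token counter is injectable; the default `approxTokens` (about 4 characters per token) is a rough assumption for English text, not a real tokenizer.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Greedily pack paragraphs into chunks of at most maxTokens.
// A single paragraph larger than maxTokens is emitted as its own oversized chunk.
function splitByParagraph(
  section: string,
  maxTokens: number,
  countTokens: (text: string) => number = approxTokens,
): string[] {
  const paragraphs = section
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && countTokens(candidate) > maxTokens) {
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Oversized single paragraphs pass through unchanged here; a sentence-level fallback would be the natural next step if that matters for your data.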
+### Semantic chunking
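+```ts
+// `embed` and `cosine` are assumed helpers (embedding call and cosine similarity).
+async function semanticChunk(text: string, threshold = 0.7): Promise<string[]> {
+  // Match sentences together with their closing punctuation so "?" and "!" survive
+  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? []).map(s => s.trim()).filter(Boolean);
+  if (sentences.length === 0) return [];
+  const embeddings = await Promise.all(sentences.map(s => embed(s)));
+
+  const chunks: string[] = [];
+  let current: string[] = [sentences[0]];
+
+  for (let i = 1; i < sentences.length; i++) {
+    const sim = cosine(embeddings[i - 1], embeddings[i]);
+    if (sim < threshold) {
+      // Similarity drop: close the current chunk and start a new one
+      chunks.push(current.join(' '));
+      current = [sentences[i]];
+    } else {
+      current.push(sentences[i]);
+    }
+  }
+  chunks.push(current.join(' '));
+
+  return chunks;
+}
+```
+
+→ A shift in meaning marks a chunk boundary.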
+
+### Map-reduce (long doc)
+```ts
+// Map: summarize each chunk (the calls run in parallel)
+const summaries = await Promise.all(chunks.map(chunk =>
+  llm.summarize(chunk)
+));
+
+// Reduce: combine the summaries and answer the question
+const final = await llm.complete({
+  user: `Synthesize these summaries:\n${summaries.join('\n')}\n\nQuestion: ${query}`,
+});
+```
+
+→ Parallel processing across chunks.
+
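A bare `Promise.all` over hundreds of chunks starts every request at once, which tends to run into provider rate limits. One way to keep the map step parallel but bounded is a small worker pool. This is a generic sketch; `mapWithConcurrency` is a hypothetical helper, and the right `limit` depends on your account's quota.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results come back in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index until none remain.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

In the map step above, `Promise.all(chunks.map(...))` would become `mapWithConcurrency(chunks, 5, chunk => llm.summarize(chunk))`, with `5` as an arbitrary starting point.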
+### Refine (iterative)
+```ts
+let answer = '';
+for (const chunk of chunks) {
+  answer = await llm.complete({
+    system: `Refine the answer based on new info.\nCurrent: ${answer}`,
+    user: `New info: ${chunk}\nQuestion: ${query}`,
+  });
+}
+```
+
+→ Incremental refinement: each chunk updates the running answer, so the calls are sequential.
+
+### Context window calculation
+```ts
+import { encoding_for_model } from 'tiktoken';
+
+// tiktoken implements OpenAI tokenizers; counts for Claude or Gemini are only approximate
+const enc = encoding_for_model('gpt-4o');
+
+function countTokens(text: string): number {
+  return enc.encode(text).length;
+}
+
+function fitsInContext(text: string, max: number): boolean {
+  return countTokens(text) <= max;
+}
+
+// Input budget per model = context window minus tokens reserved for output
+const BUDGETS = {
+  'gpt-4o': 128_000 - 16_000,  // 16K reserved for output
+  'claude-opus-4-7': 200_000 - 16_000,
+  'gemini-2.5-pro': 1_000_000 - 64_000,
+};
+```
+
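+### Cost estimation
+```ts
+// Rates in $ per 1M tokens as [input, output]; vendor prices change, so treat these as examples
+const RATES: Record<string, [number, number]> = {
+  'gpt-4o': [2.5, 10],
+  'claude-opus-4-7': [15, 75],
+  'gemini-2.5-pro': [2.5, 15],
+};
+
+function estimateCost(inputTokens: number, model: string, outputTokens = 0): number {
+  const rate = RATES[model];
+  if (!rate) throw new Error(`No rate configured for model: ${model}`);
+  const [input, output] = rate;
+  return (inputTokens / 1_000_000) * input + (outputTokens / 1_000_000) * output;
+}
+
+// 1M input tokens × Claude = $15
+// → Caching cuts up to ~90% of that on repeat calls
+```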
+
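The "up to 90%" figure only holds once the cache is reused enough to amortize the first write. A worked sketch of that arithmetic follows; the multipliers (2x base input for a 1-hour cache write, 0.1x for a cache read) are assumptions taken from Anthropic's published pricing model and should be checked against the current price list. `compareCacheCost` is an illustrative name.

```typescript
interface CacheCostInput {
  cachedTokens: number;         // size of the cached prefix
  inputRatePerMillion: number;  // base input price in $ per 1M tokens
  calls: number;                // number of requests sharing the prefix
  writeMultiplier?: number;     // assumption: 2x for a 1h TTL (1.25x for 5 min)
  readMultiplier?: number;      // assumption: 0.1x for cache hits
}

// Compare total input cost for the shared prefix with and without caching.
function compareCacheCost({
  cachedTokens,
  inputRatePerMillion,
  calls,
  writeMultiplier = 2,
  readMultiplier = 0.1,
}: CacheCostInput): { uncached: number; cached: number; savings: number } {
  const base = (cachedTokens / 1_000_000) * inputRatePerMillion;
  const uncached = base * calls;
  // First call writes the cache; every later call reads it.
  const cached = calls === 0 ? 0 : base * writeMultiplier + base * readMultiplier * (calls - 1);
  const savings = uncached === 0 ? 0 : 1 - cached / uncached;
  return { uncached, cached, savings };
}
```

For a 200K-token prefix at $15 per 1M tokens, ten calls cost about $30 uncached versus about $8.70 cached (roughly 71% saved), while a single call costs more with caching than without. The savings approach 90% only as the call count grows.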
+### Long context use cases
+```
+✅ Analyzing one large document (book, codebase, log)
+✅ Code review (whole file)
+✅ Document Q&A (single doc)
+✅ Comparison (multiple docs)
+
+⚠️ High latency (1M tokens can take 30s+)
+⚠️ High cost
+⚠️ Lost in the middle
+```
+
+### Long context vs RAG
+```
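+Long context:
+ Simple: inject everything
+ Precise: nothing is cherry-picked away
- Expensive
- Slow
- Lost in the middle
+
+RAG:
+ Fast
+ Cheap
+ Scales to a large corpus
- Depends on retrieval quality
- A wrong chunk produces a wrong answer
+
+→ Mix the two depending on the situation.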
+```
+
+### Hybrid
+```ts
+async function answer(query: string, document: string) {
+  if (countTokens(document) < 50_000) {
+    // Small enough: send the whole document
+    return await llm.complete({ context: document, query });
+  } else {
+    // Large: retrieve first, then answer over the top 10 chunks
+    const chunks = await chunkAndEmbed(document);
+    const relevant = await semanticSearch(query, chunks, 10);
+    return await llm.complete({ context: relevant.join('\n'), query });
+  }
+}
+```
+
+### Streaming + long context
+```ts
+// Long context means a large input, but the output can still be streamed
+const stream = await openai.chat.completions.create({
+  model: 'gpt-4.1',
+  messages,  // assumed: the long-context messages array built earlier
+  stream: true,
+});
+
+for await (const chunk of stream) {
+  process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
+}
+```
+
+### Eval (long context)
+```
- Needle in a haystack: place one fact at N positions and measure accuracy
- Multi-needle: several facts at once
- Reasoning across chunks: connect facts that live in different chunks
+```
+
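A needle-in-a-haystack run needs two pieces: a way to plant the fact at a chosen relative depth, and a tally of accuracy per depth. A minimal sketch of both, with the model call itself left out; `insertNeedle` and `accuracyByDepth` are hypothetical names, and `depth` is assumed to be a fraction between 0 (start) and 1 (end).

```typescript
// Insert `needle` into the filler sentences at a relative depth in [0, 1].
function insertNeedle(filler: string[], needle: string, depth: number): string[] {
  const clamped = Math.min(1, Math.max(0, depth));
  const position = Math.round(clamped * filler.length);
  return [...filler.slice(0, position), needle, ...filler.slice(position)];
}

// Aggregate graded results into accuracy (0..1) per depth.
function accuracyByDepth(results: { depth: number; correct: boolean }[]): Map<number, number> {
  const tally = new Map<number, { hits: number; total: number }>();
  for (const { depth, correct } of results) {
    const entry = tally.get(depth) ?? { hits: 0, total: 0 };
    entry.total += 1;
    if (correct) entry.hits += 1;
    tally.set(depth, entry);
  }
  const accuracy = new Map<number, number>();
  tally.forEach((entry, depth) => accuracy.set(depth, entry.hits / entry.total));
  return accuracy;
}
```

Running the same question at depths such as 0, 0.25, 0.5, 0.75, and 1 produces a table in the shape of the position-by-accuracy list in the lost-in-the-middle section, but measured on your own model and data.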
+### Token budget allocation
+```ts
+const TOTAL = 128_000;
+const RESPONSE = 16_000;
+const SYSTEM = 2_000;
+const HISTORY = 30_000;
+const CONTEXT = TOTAL - RESPONSE - SYSTEM - HISTORY;
+
+// If the document is larger than CONTEXT (80_000 here), chunk and retrieve
+```
+
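The same allocation can be wrapped in a function so that a misconfigured budget fails loudly instead of producing a negative number. A small sketch; `contextBudget` and `planForDocument` are illustrative names, and the two-way decision mirrors the comment above.

```typescript
interface BudgetParts {
  total: number;     // model context window
  response: number;  // tokens reserved for the output
  system: number;    // system prompt
  history: number;   // chat history
}

// Tokens left for injected documents after the fixed reservations.
function contextBudget({ total, response, system, history }: BudgetParts): number {
  const remaining = total - response - system - history;
  if (remaining <= 0) {
    throw new RangeError(`Reservations exceed the context window by ${-remaining} tokens`);
  }
  return remaining;
}

// Decide whether a document can be injected whole or must be chunked and retrieved.
function planForDocument(docTokens: number, budget: number): 'inject' | 'chunk-and-retrieve' {
  return docTokens <= budget ? 'inject' : 'chunk-and-retrieve';
}
```

With the numbers above, `contextBudget` returns 80,000, so a 50K-token document is injected directly and a 120K-token one goes through retrieval.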
+### Continual chat
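+```ts
+class ChatSession {
+  private messages: Message[] = [];
+  private maxTokens = 100_000;
+
+  // countTokens takes a string, so sum it over the message contents
+  private totalTokens(): number {
+    return this.messages.reduce((sum, m) => sum + countTokens(m.content), 0);
+  }
+
+  async send(userMsg: string) {
+    this.messages.push({ role: 'user', content: userMsg });
+
+    // Condense first; fall back to trimming if the history is still too large
+    if (this.totalTokens() > this.maxTokens) {
+      this.messages = await condenseHistory(this.messages);
+      if (this.totalTokens() > this.maxTokens) {
+        this.messages = trimHistory(this.messages, this.maxTokens);
+      }
+    }
+
+    const r = await llm.complete({ messages: this.messages });
+    this.messages.push({ role: 'assistant', content: r });
+    return r;
+  }
+}
+```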
+
+## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Small doc (< 30K tokens) | Direct |
| Medium (30-200K tokens) | Direct + cache |
| Large (200K+ tokens) | RAG + retrieved chunks |
| Multiple docs | RAG |
| Single doc, in depth | Direct (long context) |
| Long conversation | Sliding window + summarize |
+
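The table can be encoded as a small routing function, which makes the thresholds explicit and easy to tune. This is a sketch of one reading of the table: the "single doc, in depth" row is a judgment call that overlaps with the size rows, so it is left out here. `recommend`, `Situation`, and `Strategy` are illustrative names.

```typescript
type Strategy = 'direct' | 'direct+cache' | 'rag' | 'sliding+summarize';

interface Situation {
  docTokens: number;
  docCount?: number;          // defaults to a single document
  longConversation?: boolean; // chat history is the main context
}

// Thresholds follow the decision table; adjust them to your models and budget.
function recommend({ docTokens, docCount = 1, longConversation = false }: Situation): Strategy {
  if (longConversation) return 'sliding+summarize';
  if (docCount > 1) return 'rag';
  if (docTokens < 30_000) return 'direct';
  if (docTokens < 200_000) return 'direct+cache';
  return 'rag';
}
```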
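+## ❌ Anti-patterns
- **Injecting everything and assuming the model reads it all perfectly**: lost in the middle.
- **Critical info in the middle**: move it to the end.
- **Repeating the same context without a cache**: wasted cost.
- **Unbounded history**: token explosion.
- **Treating RAG vs long context as either/or**: use a hybrid.
- **Chunking that cuts mid-sentence**: loses meaning.
- **Ignoring token counts**: errors or cost shock.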
+
+## 🤖 LLM usage hints
- Lost in the middle: place key content near the end.
- Cache large, reused contexts.
- RAG + long context together usually works best.
- Measure token counts up front with tiktoken.
+
+## 🔗 Related documents
+- [[AI_RAG_Pattern_Basics]]
+- [[AI_Prompt_Caching]]
+- [[AI_RAG_Advanced]]