[G1-Sync] Manual knowledge update
---
id: ai-rag-production
title: RAG Production — chunking / re-rank / eval
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, rag, production, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [RAG production, document chunking, parent document, hybrid search, rerank, RAG eval]
---

# RAG Production

> A demo RAG is simple. **Production RAG = chunking strategy + hybrid search + reranker + eval + monitoring**.

## 📖 Core concepts

- Indexing: document → chunks → embed → vector store.
- Querying: query → retrieve → rerank → context for the LLM.
- Eval: retrieval recall and precision.
- Continuous improvement: keep growing the golden set.

## 💻 Code patterns

### Chunking strategy
```python
import re

# 1. Fixed size (simplest): slide a window of `size` chars, stepping by size - overlap.
def chunk_fixed(text, size=500, overlap=50):
    if overlap >= size:
        raise ValueError('overlap must be smaller than size')
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# 2. Sentence-based: split after ., ! or ? and group max_sentences per chunk.
def chunk_sentences(text, max_sentences=5):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [' '.join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

# 3. Semantic (LLM-driven)
# 4. Markdown headers
# 5. Recursive (LangChain RecursiveCharacterTextSplitter)
```

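Strategy 4 (Markdown headers) has no code above, so here is a minimal sketch of one way to do it. The function name `chunk_markdown` and the `{'heading', 'text'}` chunk shape are assumptions for illustration, not part of any library. It splits at ATX headings and skips `#` lines inside fenced code blocks.

```python
import re

def chunk_markdown(text):
    """Split markdown into one chunk per ATX heading (#, ##, ...)."""
    chunks = []
    heading, body = '', []
    in_fence = False

    def flush():
        # Emit the chunk collected so far, unless it is completely empty.
        if heading or any(line.strip() for line in body):
            chunks.append({'heading': heading, 'text': '\n'.join(body).strip()})

    for line in text.splitlines():
        if line.lstrip().startswith('```'):
            in_fence = not in_fence          # '#' inside code fences is not a heading
        if not in_fence and re.match(r'#{1,6}\s', line):
            flush()                          # a new heading closes the previous chunk
            heading, body = line.lstrip('#').strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading next to the chunk text gives each chunk a natural `section` value for metadata filtering and citations.
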
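### Recursive chunking (best)
```python
# Newer LangChain releases ship splitters in the langchain-text-splitters package;
# older releases expose the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # max characters per chunk
    chunk_overlap=50,    # characters shared between neighbouring chunks
    separators=['\n\n', '\n', '. ', ' ', ''],  # tried in order, coarse to fine
)
chunks = splitter.split_text(text)
```

→ Preserves boundaries: it splits on paragraphs first, then falls back to lines, sentences, and words.
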
### Parent document retriever
```python
# Small chunk = what gets embedded (precision).
# Big chunk (parent) = what gets returned as context (recall).

# Search over the small chunks → return their parent.
```

```python
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=...,         # holds the child-chunk embeddings
    docstore=...,            # holds the parent chunks
    child_splitter=child,    # ~200 chars
    parent_splitter=parent,  # ~2000 chars
)
```

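To make the small-to-big mechanism concrete, here is a library-free sketch. Everything in it is an assumption for illustration: the helper names, the fixed-size splitting, and especially the scorer, which counts matching query words as a stand-in for real vector similarity.

```python
def split(text, size):
    """Fixed-size split, used here for both parents and children."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(doc, parent_size=2000, child_size=200):
    parents = split(doc, parent_size)
    # Each child remembers the index of the parent it was cut from.
    children = [(child, pid)
                for pid, parent in enumerate(parents)
                for child in split(parent, child_size)]
    return parents, children

def retrieve_parents(query, parents, children, k=3):
    words = query.lower().split()
    # Stand-in scorer (assumption): number of query words found in the child.
    ranked = sorted(children, key=lambda c: -sum(w in c[0].lower() for w in words))
    seen, result = set(), []
    for _child, pid in ranked[:k]:
        if pid not in seen:          # several children may share one parent
            seen.add(pid)
            result.append(parents[pid])
    return result
```

The match happens on the 200-char child, but the caller receives the 2000-char parent, so the LLM sees the surrounding context.
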
### Hybrid search
```ts
// BM25 + vector search, fused with Reciprocal Rank Fusion (RRF).
// The two searches are independent, so run them in parallel.
const [bm25Results, vecResults] = await Promise.all([
  bm25Search(query, 50),
  vectorSearch(query, 50),
]);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);
```

→ See [[AI_Hybrid_Search_Patterns]].

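The `rrf` helper is used above but never defined. A minimal Python sketch of Reciprocal Rank Fusion follows, assuming each ranking is a list of doc ids ordered best first. `k=60` is the commonly used smoothing constant; treat it as a tunable default.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids: each list adds 1 / (k + rank) to a doc's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ['a', 'b', 'c']
vec = ['b', 'c', 'a']
print(rrf([bm25, vec]))  # ['b', 'a', 'c']
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be normalized against each other.
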
### Reranker
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

candidates = hybrid_search(query, k=50)        # cheap, recall-oriented first stage
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)               # one relevance score per (query, chunk) pair
# Top 5 (candidate, score) tuples, highest score first.
top = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```

→ Retrieve top-50, rerank down to top-5. The cross-encoder reads query and chunk together, so it is slower than vector search but ranks more accurately.

### Cohere Rerank
```ts
const r = await cohere.rerank({
  query,
  documents: candidates.map(c => c.text),
  topN: 5,
  model: 'rerank-english-v3.0',
});
```

→ Managed alternative: no reranker model to host.

### Query expansion
```python
# The LLM rewrites the query (3 variants).
expanded = llm.complete(f'Generate 3 alternative phrasings of: "{query}"')
variants = [line.strip() for line in expanded.split('\n') if line.strip()]  # drop blank lines
queries = [query, *variants]

# Search with every query, then fuse with RRF.
results = [vector_search(q, 20) for q in queries]
fused = rrf(results)
```

### HyDE (Hypothetical Document Embeddings)
```python
# Generate a hypothetical answer → embed it → search with that embedding.
hypothetical = llm.complete(f'Write a detailed answer to: {query}')
emb = embed(hypothetical)
results = vector_search(emb, 20)  # here the search takes an embedding, not a query string
```

→ A short query embeds poorly; a hypothetical answer looks like a document, so its embedding tends to land closer to the relevant chunks.

### Multi-vector
```python
# Each section of a doc gets its own embedding.
# A hit on any one section returns the whole doc.
```

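A minimal sketch of the collapse step, assuming the vector search returns `(doc_id, section_id, score)` tuples with higher scores being better. The function name and the choice to keep each doc's best section score are assumptions; summing or averaging section scores are equally valid options.

```python
def docs_from_section_hits(hits, k=5):
    """Collapse section-level hits into a ranked list of doc ids."""
    best = {}
    for doc_id, _section_id, score in hits:
        # A doc is as relevant as its best-matching section.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best, key=best.get, reverse=True)[:k]

hits = [('d1', 's1', 0.40), ('d2', 's1', 0.90), ('d1', 's3', 0.95)]
print(docs_from_section_hits(hits))  # ['d1', 'd2']
```
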
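### Metadata filter
```sql
-- pgvector: <=> is cosine distance, so ascending order = most similar first.
SELECT * FROM docs
WHERE category = $1 AND date > $2
ORDER BY embedding <=> $3
LIMIT 20;
```

→ Filtering on metadata in the same query keeps irrelevant rows out of the results and is cheaper than filtering in application code. With an approximate index (HNSW / IVFFlat), pgvector applies the filter after the index scan, so fewer than 20 rows may come back.
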
### Citation
```python
# Keep the source of every chunk.
prompt = f'''
Answer using ONLY:
[1] {chunks[0].text} (source: {chunks[0].source})
[2] {chunks[1].text} (source: {chunks[1].source})

Question: {query}

Cite [1], [2].
'''
```

→ Citations let users verify the answer, which builds trust.

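The prompt above hardcodes two chunks. A sketch that numbers any number of chunks could look like this. The `Chunk` dataclass and `build_cited_prompt` are hypothetical names that mirror the `.text` / `.source` attributes used above.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

def build_cited_prompt(query, chunks):
    """Number every chunk so the model can cite it as [n]."""
    context = '\n'.join(
        f'[{i}] {c.text} (source: {c.source})'
        for i, c in enumerate(chunks, start=1)
    )
    return (
        f'Answer using ONLY:\n{context}\n\n'
        f'Question: {query}\n\n'
        'Cite sources as [n].'
    )
```

Because the numbering is generated, the `[n]` markers in the answer can be mapped back to `chunks[n - 1].source` when rendering citations.
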
### Prompt template
```python
SYSTEM = '''
Answer using ONLY the context. If unsure, say "I don't know".
Cite sources [1], [2].
'''

# Built per request: `context` is the numbered chunk list, `query` the user question.
USER = f'''
Context:
{context}

Question: {query}

Answer:
'''
```

### Eval (recall@K)
```python
def recall_at_k(predicted_ids, gold_ids, k=5):
    """Fraction of the relevant (gold) docs that appear in the top-k predictions."""
    if not gold_ids:
        return 0.0
    return len(set(predicted_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Golden set (curated)
gold = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]
results = [retrieve(q['query']) for q in gold]  # retrieve() returns ranked doc ids
recalls = [recall_at_k(r, q['relevant_docs']) for r, q in zip(results, gold)]
print(f'Avg recall@5: {sum(recalls) / len(recalls):.2f}')
```

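The core concepts list precision alongside recall, so a matching precision@K sketch may help. The function name is an assumption. It divides by the number of results actually returned, so a retriever that returns fewer than `k` docs is not penalized for the missing slots.

```python
def precision_at_k(predicted_ids, gold_ids, k=5):
    """Fraction of the top-k predictions that are relevant (in the gold set)."""
    top = predicted_ids[:k]
    if not top:
        return 0.0
    gold = set(gold_ids)
    return sum(1 for doc_id in top if doc_id in gold) / len(top)

predicted = ['doc1', 'doc2', 'doc5', 'doc9', 'doc7']
print(precision_at_k(predicted, ['doc1', 'doc5']))  # 0.4
```

Here recall@5 would be 1.0 (both gold docs were found) while precision@5 is only 0.4, which is the kind of noise a reranker is meant to cut.
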
### LLM-judge eval
```python
# Promptfoo / RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_dataset = [...]
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
```

→ Faithfulness = the answer is grounded in the retrieved context.

### Monitoring (production)
```python
@trace
def rag(query):
    docs = retrieve(query)
    answer = llm.complete(...)
    log({'query': query, 'doc_count': len(docs), 'tokens': ..., 'latency': ...})
    return answer
```

→ Tools: Helicone / LangSmith.

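Where no tracing tool is wired in yet, the `@trace` decorator can be approximated in a few lines. This is a hypothetical stand-in, not the API of Helicone or LangSmith: it only records the function name and latency of each call in an in-memory list.

```python
import functools
import time

def trace(fn):
    """Record the latency of every call to fn in wrapper.calls."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failed calls are timed too.
            wrapper.calls.append({
                'name': fn.__name__,
                'latency_s': time.perf_counter() - start,
            })
    wrapper.calls = []
    return wrapper
```

In production the `calls` list would be replaced by a call to the logging or tracing backend.
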
### Cache
```python
import hashlib

def cached_rag(query):
    # Same query = cached result.
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached
    return rag(query)  # cache miss: run the pipeline, then store the answer under `key`

# Or use provider-side prompt caching (Anthropic / OpenAI).
```

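A sketch of what the `cache` object could be, under two assumptions that go beyond the snippet above: queries are normalized (case and whitespace) before hashing so trivially different phrasings share an entry, and entries expire after a TTL so answers do not outlive a re-index. `TTLCache` is a hypothetical in-memory class; a shared store such as Redis would play this role across processes.

```python
import hashlib
import time

class TTLCache:
    def __init__(self, ttl_s=3600, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock      # injectable so tests can control time
        self.store = {}

    @staticmethod
    def key(query):
        normalized = ' '.join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query):
        k = self.key(query)
        entry = self.store.get(k)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[k]   # expired: drop it and report a miss
            return None
        return value

    def set(self, query, value):
        self.store[self.key(query)] = (value, self.clock() + self.ttl_s)
```
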
### Continuous improvement
```
1. Log production queries.
2. Manually review bad answers.
3. Add them to the golden set.
4. Re-run the eval → improve.
5. Re-deploy.
```

→ RAG quality improves over time.

### Choosing an embedding model
```
text-embedding-3-small (OpenAI): cheap, good quality.
text-embedding-3-large (OpenAI): more accurate.
voyage-3 / Cohere embed-v3: SoTA.
BGE / e5 (open): self-host.
```

→ Check the MTEB leaderboard for current rankings.

### Re-embedding (when the model changes)
```
A better model comes out → re-embed every doc.
- Cost: scales with corpus size (1M docs × avg tokens per doc × $0.02 / 1M tokens).
- Time (several hours).
```

→ Plan the migration ahead of time.

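The cost line above can be turned into a small calculation. The $0.02 per million tokens comes from the note; the average of 1,000 tokens per doc in the example is an assumption, so plug in a measured value for a real corpus.

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, price_per_million_tokens=0.02):
    """Estimated one-off cost of re-embedding the whole corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1M docs × 1,000 tokens each = 1B tokens → about $20 at $0.02 / 1M tokens.
print(f'${reembed_cost_usd(1_000_000, 1_000):.2f}')  # $20.00
```

The API bill is often the smaller part; the hours of wall-clock time and the dual-index period during the switch are what need planning.
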
### Choosing a vector DB
```
pgvector: simple, Postgres-friendly.
Pinecone: managed, fast.
Qdrant: open source, fast, hybrid search built in.
Weaviate: feature-rich.
Milvus: large scale.
ChromaDB: small projects / dev.
```

→ See [[DB_pgvector_Production]].

### Chunk metadata
```json
{
  "id": "chunk-1",
  "text": "...",
  "embedding": [...],
  "source": "doc.pdf",
  "page": 3,
  "section": "Introduction",
  "category": "engineering",
  "created_at": "2026-05-01"
}
```

→ Makes filtering and citation easy.

### Production architecture
```
Indexing: doc upload → parse → chunk → embed → vector DB.
Query:    query → embed → hybrid search → rerank → LLM → answer + citations.

→ Chunking and ranking are the biggest quality levers.
```

### Multi-modal RAG
```
Docs also contain images and tables.
- Image embeddings (CLIP / Cohere multimodal).
- Table → markdown.
- Combined search.
```

### Long context vs RAG
```
Long context (200k):
- Simple: put everything in the prompt.
- High cost / latency.

RAG:
- Top-K chunks only.
- Low cost / latency.
- Needs tuning.

→ Corpus < 50k tokens = long context.
  Corpus > 50k tokens = RAG.
```

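The 50k rule of thumb can be written as a tiny router. Two assumptions are baked in: the threshold is measured in tokens, and tokens are estimated at roughly 4 characters each, which is a coarse heuristic for English text. Use a real tokenizer when the decision is close.

```python
def estimate_tokens(text):
    """Rough token estimate (assumption: ~4 characters per token)."""
    return len(text) // 4

def choose_strategy(corpus_tokens, threshold=50_000):
    """Small corpora go straight into the prompt; larger ones go through RAG."""
    return 'long_context' if corpus_tokens < threshold else 'rag'

print(choose_strategy(estimate_tokens('x' * 40_000)))   # long_context
print(choose_strategy(estimate_tokens('x' * 400_000)))  # rag
```
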
### Cost / 1k queries
```
Small RAG (10 chunks, GPT-4o-mini): ~$0.50.
Large RAG (50 chunks + rerank, GPT-4o): ~$50.
+ embedding storage: extra $.

→ Each query can involve multiple LLM calls, so per-query cost adds up.
```

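A sketch for estimating this figure for a specific pipeline. All numbers in the example are placeholders, not real price-list values: pass in the current input and output prices of the chosen model and the measured token counts.

```python
def cost_per_1k_queries(input_tokens, output_tokens,
                        in_price_per_m, out_price_per_m, llm_calls=1):
    """USD per 1,000 queries, given per-call token counts and prices per 1M tokens."""
    per_call = (input_tokens * in_price_per_m
                + output_tokens * out_price_per_m) / 1_000_000
    return per_call * llm_calls * 1000

# Placeholder example: 2,000 input + 300 output tokens per call,
# $1.00 / 1M input tokens, $4.00 / 1M output tokens, one call per query.
print(f'${cost_per_1k_queries(2000, 300, 1.0, 4.0):.2f}')  # $3.20
```

Raising `llm_calls` shows how query expansion, HyDE, or iterative retrieval multiply the bill.
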
### Limitations
```
- Lost in the middle (long contexts).
- Multi-hop reasoning (no single chunk holds the answer).
- Negation ("things that are not X").
- Recent data (cutoff).
```

→ Agentic / iterative RAG is the usual answer.

### Iterative RAG
```python
def iterative_rag(query, max_steps=3):
    context = ''
    for _ in range(max_steps):
        # Ask the LLM what is still missing, given what has been retrieved so far.
        new_query = llm.complete(f'Q: {query}\nKnown: {context}\nWhat else is needed?')
        docs = retrieve(new_query)
        # format_docs: your own chunk-to-text helper (the builtin format() would just str() the list).
        context += format_docs(docs)
        verdict = llm.complete(f'Q: {query}\nContext: {context}\nIs this sufficient? Answer Y or N.')
        if verdict.strip().upper().startswith('Y'):  # tolerate 'Y', 'Yes', ' y\n'
            break
    return llm.complete(f'Q: {query}\n{context}')
```

→ An answer to multi-hop questions.

## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Document Q&A | RAG |
| Code search | Hybrid + AST chunk |
| Multi-hop | Agentic RAG |
| Real-time | Cached prompts |
| Production | Hybrid + rerank + eval |
| Small / quick | LangChain default |

## ❌ Anti-patterns
- **Vector only**: weak on exact keywords.
- **Fixed chunk**: breaks sentence and paragraph boundaries.
- **No rerank**: noisy context.
- **No citation**: users cannot trust the answer.
- **No eval**: silent regression.
- **Huge chunk**: noise.
- **Tiny chunk**: loses context.

## 🤖 LLM usage hints
- Recursive chunking + hybrid + rerank is the baseline.
- Citation + eval make it production-ready.
- Iterative RAG handles multi-hop.
- Keep updating the golden set.

## 🔗 Related documents
- [[AI_RAG_Pattern_Basics]]
- [[AI_RAG_Advanced]]
- [[AI_Hybrid_Search_Patterns]]