| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-rag-production | RAG Production — chunking / re-rank / eval | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 |
RAG Production
Demo RAG = simple. Production = chunking strategy + hybrid search + reranker + eval + monitoring.
📖 Core concepts
- Document → chunks → embed → vector store.
- Query → retrieve → rerank → context.
- Eval (recall, precision).
- Continuous improvement (golden set).
💻 Code patterns
Chunking strategy
import re

# 1. Fixed size (simple)
def chunk_fixed(text, size=500, overlap=50):
    if overlap >= size:
        raise ValueError('overlap must be smaller than size')
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# 2. Sentence-based
def chunk_sentences(text, max_sentences=5):
    # Split after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [' '.join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

# 3. Semantic (LLM-driven)
# 4. Markdown headers
# 5. Recursive (LangChain RecursiveCharacterTextSplitter)
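Strategy 4 in the list above has no example. A minimal sketch, assuming ATX-style headings (`#` through `######`); the helper name `chunk_markdown` is hypothetical. It does not detect fenced code blocks, so a `# comment` line inside one would be treated as a heading:

```python
import re

def chunk_markdown(text):
    """Split markdown into (heading, body) chunks at ATX headings."""
    chunks = []
    heading, body = "", []

    def flush():
        # Emit the section collected so far, skipping empty preambles.
        if heading or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        if re.match(r"#{1,6}\s", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading next to the body lets it be stored as `section` metadata or prepended to the chunk text before embedding.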
Recursive chunking (recommended default)
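# Newer LangChain releases also ship this class in the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],
)
chunks = splitter.split_text(text)  # text: the document string to split
→ Preserves boundaries (paragraph → sentence → word).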
Parent document retriever
# Small chunk = embedded for search (precision).
# Big chunk (parent) = returned as context (recall).
# Search small → return parent.
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=...,
    docstore=...,
    child_splitter=child,    # splitter producing ~200-char chunks
    parent_splitter=parent,  # splitter producing ~2000-char chunks
)
Hybrid search
// BM25 + vector (RRF)
const bm25Results = await bm25Search(query, 50);
const vecResults = await vectorSearch(query, 50);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);
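The `rrf` helper is called above but never defined. A minimal Python sketch of Reciprocal Rank Fusion, assuming each result list is already a ranked list of doc ids (best first); `k=60` is a commonly used default, not a value taken from this note:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the docs it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale.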
Reranker
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]
→ Top-50 → top-5. Quality ↑.
Cohere Rerank
const r = await cohere.rerank({
  query,
  documents: candidates.map(c => c.text),
  topN: 5,
  model: 'rerank-english-v3.0',
});
→ Managed.
Query expansion
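# The LLM rewrites the query (3 variants)
expanded = llm.complete(f'Generate 3 alternative phrasings of: "{query}"')
# Drop blank lines so empty strings are not searched.
variants = [line.strip() for line in expanded.split('\n') if line.strip()]
queries = [query, *variants]
# Search with every query, then fuse with RRF
results = [vector_search(q, 20) for q in queries]
fused = rrf(results)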
HyDE (Hypothetical Document Embedding)
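# Generate a hypothetical answer → embed it → search with that embedding
hypothetical = llm.complete(f'Detailed answer for: {query}')
emb = embed(hypothetical)
results = vector_search(emb, 20)
→ Queries are short, so the embedding of a hypothetical answer usually sits closer to the relevant documents than the embedding of the query itself.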
Multi-vector
# Each section of a doc gets its own embedding.
# A hit on any one section returns the whole doc.
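The section-to-doc step can be sketched as below. It assumes the vector search returns `(section_id, score)` pairs where a higher score means more similar, and that a `section_to_doc` mapping was stored at indexing time; both names are hypothetical:

```python
def collapse_to_docs(section_hits, section_to_doc, top_n=5):
    """Map section-level hits to parent docs, keeping each doc's best score."""
    best = {}
    for section_id, score in section_hits:
        doc_id = section_to_doc[section_id]
        # Several sections of one doc may hit; keep only the strongest.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_n]]
```

Taking the maximum section score is one choice; summing or averaging would favour docs with many matching sections instead.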
Metadata filter
SELECT * FROM docs
WHERE category = $1 AND date > $2
ORDER BY embedding <=> $3
LIMIT 20;
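→ Filtering on metadata narrows the candidate set (efficient). With an approximate index, check how the database applies the filter: pgvector applies it after the index scan, so a selective filter can return fewer than LIMIT rows.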
Citation
# Keep the source of every chunk.
context = '\n'.join(
    f'[{i}] {c.text} (source: {c.source})'
    for i, c in enumerate(chunks, start=1)
)
prompt = f'''
Answer using ONLY:
{context}
Question: {query}
Cite sources as [1], [2].
'''
→ User trust ↑.
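Citations only build trust if they point at chunks that were actually supplied. A minimal sketch of a post-generation check, assuming the answer uses the `[n]` marker format from the prompt above; `check_citations` is a hypothetical helper:

```python
import re

def check_citations(answer, num_chunks):
    """Return (cited, invalid): marker numbers used, and those out of range."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    # Valid markers are 1..num_chunks, matching the numbered context.
    invalid = [n for n in cited if not 1 <= n <= num_chunks]
    return cited, invalid
```

An empty `cited` list or a non-empty `invalid` list is a reasonable trigger for a retry or for logging the answer for review.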
Prompt template
SYSTEM = '''
Answer using ONLY the context. If unsure, say "I don't know".
Cite sources [1], [2].
'''
USER = f'''
Context:
{context}
Question: {query}
Answer:
'''
Eval (recall@K)
def recall_at_k(predicted_ids, gold_ids, k=5):
    if not gold_ids:
        return 0.0  # no relevant docs labelled; avoid division by zero
    return len(set(predicted_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Golden set (curated)
gold = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]
# retrieve() is assumed to return a ranked list of doc ids.
results = [retrieve(q['query']) for q in gold]
recalls = [recall_at_k(r, q['relevant_docs']) for r, q in zip(results, gold)]
print(f'Avg recall: {sum(recalls)/len(recalls):.2f}')
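The concepts list also names precision, which the snippet above does not cover. A sketch of precision@K plus reciprocal rank (a common companion metric, added here as an assumption about what is useful), using the same ranked-id inputs:

```python
def precision_at_k(predicted_ids, gold_ids, k=5):
    """Fraction of the top-k slots filled with relevant docs."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(predicted_ids[:k]) & set(gold_ids)) / k

def reciprocal_rank(predicted_ids, gold_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    gold = set(gold_ids)
    for rank, doc_id in enumerate(predicted_ids, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0
```

Recall shows whether the relevant docs were found at all; precision and reciprocal rank show how much noise sits in front of them, which matters when only the top few chunks reach the prompt.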
LLM-judge eval
# Promptfoo / RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_dataset = [...]
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
→ Faithfulness = the answer is supported by the retrieved context.
Monitoring (production)
@trace
def rag(query):
    docs = retrieve(query)
    answer = llm.complete(...)
    log({'query': query, 'doc_count': len(docs), 'tokens': ..., 'latency': ...})
    return answer
→ Helicone / LangSmith.
Cache
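import hashlib
def rag_cached(query):
    # Same query = cached result.
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached
    return rag(query)  # cache miss: run the full pipeline
# Or use provider-side prompt caching (Anthropic / OpenAI).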
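Hashing the raw query misses trivially different spellings and never expires stale answers. A minimal in-memory sketch that normalizes case and whitespace and adds a TTL; the `QueryCache` class and its one-hour default are assumptions, and a production setup would more likely sit on Redis or similar:

```python
import hashlib
import time

class QueryCache:
    """In-memory answer cache keyed by a normalized query hash."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self.store = {}

    @staticmethod
    def key(query):
        # Lowercase and collapse whitespace before hashing.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query):
        k = self.key(query)
        entry = self.store.get(k)
        if entry is None:
            return None
        answer, expires_at = entry
        if self.clock() >= expires_at:
            del self.store[k]  # expired: drop and report a miss
            return None
        return answer

    def set(self, query, answer):
        self.store[self.key(query)] = (answer, self.clock() + self.ttl)
```

The TTL is a blunt tool: cached answers should also be invalidated when the underlying documents are re-indexed.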
Continuous improvement
1. Production query log.
2. Bad answer = manual review.
3. Add to golden set.
4. Re-eval → improve.
5. Re-deploy.
→ RAG quality improves over time.
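Step 4 of the loop above can be gated automatically. A sketch that compares per-query recall between the deployed pipeline and a candidate, assuming both are dicts of `query -> recall`; the function name and the 0.02 tolerance are hypothetical:

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return queries whose recall dropped by more than `tolerance`."""
    regressions = {}
    for query, old_score in baseline.items():
        # A query missing from the candidate run counts as recall 0.
        new_score = candidate.get(query, 0.0)
        if old_score - new_score > tolerance:
            regressions[query] = (old_score, new_score)
    return regressions
```

Checking per query rather than only the average catches the case where one fix improves the mean while breaking a question that used to work.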
Choosing an embedding model
text-embedding-3-small (OpenAI): cheap, good quality.
text-embedding-3-large: more accurate.
voyage-3 / cohere embed-v3: SoTA at the time of writing.
BGE / e5 (open): self-host.
→ See the MTEB leaderboard.
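Re-embedding (model change)
A better model comes out → re-embed every doc (vectors from different models cannot be mixed in one index).
- Cost can be large (1M docs × tokens per doc × $0.02 / 1M tokens).
- Time (several hours).
→ Needs a plan.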
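The cost line above can be turned into a quick estimate. A sketch with a hypothetical helper; the average tokens per doc is an input you have to measure on your own corpus, not a figure from this note:

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Estimate the API cost of re-embedding a whole corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens
```

For example, `reembed_cost_usd(1_000_000, 500, 0.02)` covers 1M docs at an assumed 500 tokens each, which is 500M tokens or about $10 at that rate. The API bill is only part of the plan: throughput limits and rebuilding the vector index add time on top.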
Choosing a vector DB
pgvector: simple, fits an existing Postgres.
Pinecone: managed, fast.
Qdrant: open source, fast, built-in hybrid search.
Weaviate: feature-rich.
Milvus: large scale.
ChromaDB: small projects / dev.
Chunk metadata
{
  "id": "chunk-1",
  "text": "...",
  "embedding": [...],
  "source": "doc.pdf",
  "page": 3,
  "section": "Introduction",
  "category": "engineering",
  "created_at": "2026-05-01"
}
→ Friendly to filtering and citation.
Production architecture
Doc upload → Parse → Chunk → Embed → Vector DB.
Query → Embed → Hybrid search → Rerank → LLM → Answer + Citation.
→ Chunking and ranking are the biggest quality levers.
Multi-modal RAG
Docs also contain images and tables.
- Image embed (CLIP / Cohere multi-modal).
- Table → markdown.
- Combined search.
Long context vs RAG
Long context (200k):
- Simple, everything goes in.
- High cost / latency.
RAG:
- Top-K only.
- Low cost / latency.
- Needs tuning.
→ Under 50k tokens = long context; over 50k tokens = RAG.
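The 50k rule of thumb can be applied as a simple router. A sketch assuming roughly 4 characters per token, which is a coarse heuristic for English text (a real tokenizer gives exact counts); the function names are hypothetical, and a corpus of exactly 50k tokens is sent to RAG here because the rule leaves that boundary open:

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from character count."""
    return len(text) // chars_per_token

def choose_strategy(documents, threshold_tokens=50_000):
    """Pick 'long_context' for small corpora, 'rag' otherwise."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return "long_context" if total < threshold_tokens else "rag"
```

The threshold is worth revisiting per model: it trades the tuning effort of RAG against the per-query cost and latency of sending everything.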
Cost / 1k queries
Small RAG (10 chunks, GPT-4o-mini): $0.50.
Large RAG (50 chunks + rerank, GPT-4o): $50.
+ Embedding storage: $.
→ Each query can trigger multiple LLM calls.
Limitation
- Lost in the middle (long context).
- Multi-hop reasoning (no single chunk holds the answer).
- Negation ('things that are not X').
- Recent data (cutoff).
→ Agentic / iterative RAG is the answer.
Iterative RAG
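def iterative_rag(query, max_steps=3):
    context = ''
    for step in range(max_steps):
        new_query = llm.complete(f'Q: {query}\nKnown: {context}\nWhat else needed?')
        docs = retrieve(new_query)
        # format_docs: hypothetical helper that renders docs as prompt text
        context += format_docs(docs)
        verdict = llm.complete(f'Sufficient? Y/N {context}')
        # Tolerate answers like "Yes" or " y\n" instead of requiring exactly 'Y'.
        if verdict.strip().upper().startswith('Y'):
            break
    return llm.complete(f'Q: {query}\n{context}')
→ Handles multi-hop questions.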
🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Document Q&A | RAG |
| Code search | Hybrid + AST chunk |
| Multi-hop | Agentic RAG |
| Real-time | Cached prompts |
| Production | Hybrid + rerank + eval |
| Small / quick | LangChain default |
❌ Anti-patterns
- Vector only: weak on keywords.
- Fixed chunk: breaks boundaries.
- No rerank: noise.
- No citation: no trust.
- No eval: silent regression.
- Huge chunk: noise.
- Tiny chunk: loses context.
🤖 Hints for LLM use
- Recursive chunking + hybrid + rerank is the baseline.
- Citation + eval make it production-ready.
- Iterative RAG for multi-hop.
- Continuous golden set update.