---
id: ai-rag-production
title: RAG Production — chunking / re-rank / eval
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, rag, production, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [RAG production, document chunking, parent document, hybrid search, rerank, RAG eval]
---

# RAG Production

> Demo RAG = simple. **Production = chunking strategy + hybrid search + reranker + eval + monitoring**.

## 📖 Core Concepts
- Document → chunks → embed → vector store.
- Query → retrieve → rerank → context.
- Eval (recall, precision).
- Continuous improvement (golden set).

## 💻 Code Patterns

### Chunking strategy
```python
# 1. Fixed size (simple)
def chunk_fixed(text, size=500, overlap=50):
    if overlap >= size:
        raise ValueError('overlap must be smaller than size')
    return [text[i:i+size] for i in range(0, len(text), size - overlap)]

# 2. Sentence-based
import re
def chunk_sentences(text, max_sentences=5):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [' '.join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

# 3. Semantic (LLM-driven)
# 4. Markdown headers
# 5. Recursive (LangChain RecursiveCharacterTextSplitter)
```
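Strategy 4 is listed above without code. A minimal sketch of header-based chunking, assuming ATX-style headings (`#` to `######`) and input without fenced code blocks that contain `#` lines. `chunk_markdown` is a hypothetical helper name, not a library function.

```python
import re

def chunk_markdown(text):
    """Split markdown into one chunk per heading, keeping the heading as metadata."""
    chunks = []
    heading, body = '', []

    def flush():
        # Skip empty leading sections (no heading and no non-blank text).
        if heading or any(line.strip() for line in body):
            chunks.append({'heading': heading, 'text': '\n'.join(body).strip()})

    for line in text.splitlines():
        if re.match(r'^#{1,6}\s+\S', line):
            flush()  # close the previous section before starting a new one
            heading, body = line.lstrip('#').strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading next to the text makes it easy to store as `section` metadata later. Long sections would still need a second pass with a size-based splitter.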

### Recursive chunking (recommended default)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],
)
chunks = splitter.split_text(text)
```

→ Preserves boundaries: tries paragraph, then line, then sentence, then word splits.

### Parent document retriever
```python
# Small chunk = embed (precision).
# Big chunk (parent) = context (recall).

# Search small → return parent.
```

```python
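from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter

child = RecursiveCharacterTextSplitter(chunk_size=200)    # embedded for search
parent = RecursiveCharacterTextSplitter(chunk_size=2000)  # returned as context
retriever = ParentDocumentRetriever(
    vectorstore=...,
    docstore=...,
    child_splitter=child,
    parent_splitter=parent,
)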
```
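The "search small → return parent" idea can also be sketched without a framework. The version below is only an illustration: it splits on fixed character offsets, and `toy_score` (shared-word count) stands in for real vector similarity. All function names here are hypothetical.

```python
def build_parent_index(text, parent_size=2000, child_size=200):
    """Return (parents, children); each child records the index of its parent."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    children = []
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({'text': parent[j:j + child_size], 'parent': p_idx})
    return parents, children

def toy_score(query, text):
    # Stand-in for embedding similarity: number of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, parents, children, k=2):
    """Rank the small chunks, then return their parents (deduplicated, best first)."""
    ranked = sorted(children, key=lambda c: toy_score(query, c['text']), reverse=True)
    seen, out = set(), []
    for child in ranked:
        if child['parent'] not in seen:
            seen.add(child['parent'])
            out.append(parents[child['parent']])
        if len(out) == k:
            break
    return out
```

The deduplication step matters: several children of the same parent often match, and returning the parent once keeps the context window from filling with repeats.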

### Hybrid search
```ts
// BM25 + vector (RRF)
const bm25Results = await bm25Search(query, 50);
const vecResults = await vectorSearch(query, 50);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);
```

→ [[AI_Hybrid_Search_Patterns]].
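The `rrf` helper used in the hybrid search and query expansion snippets is not defined in this note. A minimal Python sketch of Reciprocal Rank Fusion, assuming each result list is an ordered list of doc IDs (best first) and using the conventional constant `k=60`:

```python
from collections import defaultdict

def rrf(result_lists, k=60):
    """Fuse ranked lists of doc IDs; returns IDs sorted by fused score, best first."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank); only ranks matter, not raw scores.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

fused = rrf([['a', 'b', 'c'], ['b', 'c', 'd']])
print(fused)  # ['b', 'c', 'a', 'd']
```

Because RRF uses only ranks, BM25 scores and cosine similarities never need to be normalized onto a common scale.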

### Reranker
```python
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]
```

→ Top-50 → top-5. Quality ↑.

### Cohere Rerank
```ts
const r = await cohere.rerank({
  query, documents: candidates.map(c => c.text), topN: 5,
  model: 'rerank-english-v3.0',
});
```

→ Managed.

### Query expansion
```python
# The LLM rewrites the query (3 variants)
expanded = llm.complete(f'Generate 3 alternative phrasings of: "{query}"')
variants = [line.strip() for line in expanded.split('\n') if line.strip()]
queries = [query, *variants]

# Search with every query, then fuse with RRF
results = [vector_search(q, 20) for q in queries]
fused = rrf(results)
```

### HyDE (Hypothetical Document Embedding)
```python
# Generate a hypothetical answer → embed it → search with that embedding
hypothetical = llm.complete(f'Detailed answer for: {query}')
emb = embed(hypothetical)
results = vector_search(emb, 20)
```

→ Queries are short; the embedding of a hypothetical answer tends to sit closer to the relevant documents.

### Multi-vector
```python
# Each section of a doc gets its own embedding.
# A hit on any one section returns the whole doc.
```
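A minimal sketch of the multi-vector idea, assuming section embeddings are already computed and stored as `(doc_id, vector)` pairs. Scoring a doc by its best-matching section is one reasonable aggregation among several, and `search_docs` is a hypothetical name.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_docs(query_vec, section_index, k=3):
    """section_index: list of (doc_id, section_vec). Returns doc IDs, best first."""
    best = {}
    for doc_id, vec in section_index:
        score = cosine(query_vec, vec)
        # A doc is scored by its single best-matching section.
        if score > best.get(doc_id, float('-inf')):
            best[doc_id] = score
    return sorted(best, key=best.get, reverse=True)[:k]

index = [('d1', [1, 0]), ('d1', [0, 1]), ('d2', [0.7, 0.7])]
print(search_docs([0, 1], index))  # ['d1', 'd2']
```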

### Metadata filter
```sql
SELECT * FROM docs
WHERE category = $1 AND date > $2
ORDER BY embedding <=> $3
LIMIT 20;
```

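→ Filter and rank in one query (efficient). With an approximate index (HNSW / IVFFlat), confirm the filter still returns enough rows.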

### Citation
```python
# Keep the source of every chunk.
prompt = f'''
Answer using ONLY:
[1] {chunks[0].text} (source: {chunks[0].source})
[2] {chunks[1].text} (source: {chunks[1].source})

Question: {query}

Cite [1], [2].
'''
```

→ User trust ↑.
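The hard-coded two-chunk prompt generalizes to any number of chunks. A minimal sketch, assuming each chunk is a dict with `text` and `source` keys (the shape of the chunk metadata JSON in this note). `build_cited_prompt` is a hypothetical helper.

```python
def build_cited_prompt(query, chunks):
    """Number every chunk and keep its source so the model can cite [n]."""
    numbered = '\n'.join(
        f"[{i}] {c['text']} (source: {c['source']})"
        for i, c in enumerate(chunks, start=1)
    )
    labels = ', '.join(f'[{i}]' for i in range(1, len(chunks) + 1))
    return (
        f'Answer using ONLY:\n{numbered}\n\n'
        f'Question: {query}\n\n'
        f'Cite {labels}.'
    )

print(build_cited_prompt('What is RRF?', [
    {'text': 'RRF fuses ranked lists.', 'source': 'search.md'},
    {'text': 'It uses ranks, not scores.', 'source': 'notes.pdf'},
]))
```

The `[n]` labels in the answer can then be mapped back to `chunks[n - 1]['source']` when rendering citations to the user.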

### Prompt template
```python
SYSTEM = '''
Answer using ONLY the context. If unsure, say "I don't know".
Cite sources [1], [2].
'''

# Plain template, not an f-string: fill in with USER.format(context=..., query=...)
USER = '''
Context:
{context}

Question: {query}

Answer:
'''
```

### Eval (recall@K)
```python
def recall_at_k(predicted_ids, gold_ids, k=5):
    if not gold_ids:
        return 0.0  # avoid dividing by zero when a query has no labeled docs
    return len(set(predicted_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Golden set (curated)
gold = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]
results = [retrieve(q['query']) for q in gold]  # retrieve() must return doc IDs
recalls = [recall_at_k(r, q['relevant_docs']) for r, q in zip(results, gold)]
print(f'Avg recall: {sum(recalls)/len(recalls):.2f}')
```
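Recall alone does not show how much noise reaches the LLM or how high the first relevant doc ranks. A sketch of two companion metrics over the same golden set, assuming `predicted_ids` is an ordered list of doc IDs. Precision here divides by `k`, which is the usual convention but not the only one.

```python
def precision_at_k(predicted_ids, gold_ids, k=5):
    """Fraction of the top-k slots filled with relevant docs."""
    if k <= 0:
        raise ValueError('k must be positive')
    return len(set(predicted_ids[:k]) & set(gold_ids)) / k

def reciprocal_rank(predicted_ids, gold_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    gold = set(gold_ids)
    for rank, doc_id in enumerate(predicted_ids, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

predicted = ['doc2', 'doc5', 'doc1', 'doc9', 'doc7']
print(precision_at_k(predicted, ['doc1', 'doc5'], k=5))  # 0.4
print(reciprocal_rank(predicted, ['doc1', 'doc5']))      # 0.5
```

Averaging `reciprocal_rank` over the golden set gives MRR, which is useful when only the top few chunks are passed to the model.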

### LLM-judge eval
```python
# Promptfoo / RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_dataset = [...]
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
```

→ Faithfulness = the answer is grounded in the retrieved context.

### Monitoring (production)
```python
@trace
def rag(query):
    docs = retrieve(query)
    answer = llm.complete(...)
    log({'query': query, 'doc_count': len(docs), 'tokens': ..., 'latency': ...})
    return answer
```

→ Helicone / LangSmith.
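The `@trace` decorator above is left undefined. When no tracing SDK is wired in yet, a stand-in can be sketched as below. `TRACES` is a hypothetical in-memory sink for illustration only; a real setup would send these records to a logger or one of the hosted tools.

```python
import functools
import time

TRACES = []  # hypothetical sink; replace with a logger or tracing backend

def trace(fn):
    """Record function name, latency, and success/failure for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        ok = False
        try:
            result = fn(*args, **kwargs)
            ok = True
            return result
        finally:
            # Runs on both success and exception, so failures are traced too.
            TRACES.append({
                'fn': fn.__name__,
                'latency_s': time.perf_counter() - start,
                'ok': ok,
            })
    return wrapper
```

Recording failures matters as much as latency: a retriever that times out silently looks like a low-quality answer downstream.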

### Cache
```python
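import hashlib

def cached_rag(query):
    # Same query = cached result.
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached
    answer = rag(query)
    cache.set(key, answer)
    return answer
# Or rely on provider-side prompt caching (Anthropic / OpenAI).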
```

### Continuous improvement
```
1. Production query log.
2. Bad answer = manual review.
3. Add to golden set.
4. Re-eval → improve.
5. Re-deploy.
```

→ RAG quality improves over time.
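Step 4 of the loop can be turned into a gate before re-deploy. A minimal sketch, assuming per-query recall values from the golden set and a stored baseline from the last release. The `tolerance` default is an arbitrary illustration, not a recommended threshold.

```python
def regression_gate(recalls, baseline, tolerance=0.02):
    """Return (passed, avg). Fails when average recall drops more than
    `tolerance` below the stored baseline."""
    if not recalls:
        raise ValueError('golden set is empty')
    avg = sum(recalls) / len(recalls)
    return avg >= baseline - tolerance, avg

passed, avg = regression_gate([1.0, 0.5], baseline=0.80)
print(passed, avg)  # False 0.75
```

Running this in CI on every chunking or retrieval change is one way to catch the silent regressions that an unevaluated pipeline would miss.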

### Choosing an embedding model
```
text-embedding-3-small (OpenAI): cheap, good quality.
text-embedding-3-large: more accurate.
voyage-3 / cohere embed-v3: SoTA.
BGE / e5 (open): self-host.
```

→ See the MTEB leaderboard.

### Re-embedding (model change)
```
A better model comes out → re-embed every doc.
- Cost can be large (1M docs × tokens per doc × $0.02 / 1M tokens).
- Time (several hours).
```

→ Needs a plan.
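The cost term is simple arithmetic once tokens per doc are known. A sketch, where the 1,000 tokens per doc figure is an assumption for illustration and the price is the $0.02 per 1M tokens quoted above:

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """API cost of re-embedding a corpus, ignoring retries and storage."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs × 1,000 tokens each (assumed) at $0.02 per 1M tokens
cost = reembed_cost_usd(1_000_000, 1_000, 0.02)
print(f'${cost:,.2f}')
```

Under these assumptions the API bill comes to about $20, so the plan usually has to budget more for wall-clock time, rate limits, and running two indexes side by side during the switch than for the embedding calls themselves.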

### Choosing a vector DB
```
pgvector: simple, Postgres-friendly.
Pinecone: managed, fast.
Qdrant: open source, fast, hybrid built-in.
Weaviate: feature-rich.
Milvus: large scale.
ChromaDB: small / dev.
```

→ [[DB_pgvector_Production]].

### Chunk metadata
```json
{
  "id": "chunk-1",
  "text": "...",
  "embedding": [...],
  "source": "doc.pdf",
  "page": 3,
  "section": "Introduction",
  "category": "engineering",
  "created_at": "2026-05-01"
}
```

→ Filter / citation friendly.

### Production architecture
```
Doc upload → Parse → Chunk → Embed → Vector DB.
Query → Embed → Hybrid search → Rerank → LLM → Answer + Citation.

→ Chunking + ranking are the biggest quality levers.
```

### Multi-modal RAG
```
Docs also contain images / tables.
- Image embed (CLIP / Cohere multi-modal).
- Table → markdown.
- Combined search.
```

### Long context vs RAG
```
Long context (200k):
- Simple, all in.
- High cost / latency.

RAG:
- Top-K only.
- Low cost / latency.
- Needs tuning.

→ Corpus < 50k tokens = long context.
→ Corpus > 50k tokens = RAG.
```

### Cost / 1k queries
```
Small RAG (10 chunks, GPT-4o-mini): $0.50.
Large RAG (50 chunks + rerank, GPT-4o): $50.
+ Embedding storage: $.

→ Each query can involve multiple LLM calls.
```
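Figures like these depend heavily on chunk size, output length, and current model prices, so it helps to recompute them from parameters. A rough estimator sketch follows. Every number in the example call is a placeholder, not a real price, and rerank and embedding costs are left out.

```python
def cost_per_1k_queries(chunks, tokens_per_chunk, output_tokens,
                        usd_per_m_input, usd_per_m_output, llm_calls=1):
    """Estimated LLM spend for 1,000 queries (generation only)."""
    input_tokens = chunks * tokens_per_chunk
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return per_call * llm_calls * 1_000

# Placeholder inputs: 10 chunks × 200 tokens, 300 output tokens,
# $1 / 1M input tokens, $2 / 1M output tokens, one LLM call per query.
estimate = cost_per_1k_queries(10, 200, 300, 1.0, 2.0)
print(f'${estimate:.2f} per 1k queries')
```

Setting `llm_calls` above 1 approximates pipelines that add query expansion, HyDE, or iterative steps, each of which multiplies the per-query spend.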

### Limitation
```
- Lost in the middle (long context).
- Multi-hop reasoning (no single chunk holds the answer).
- Negation ('things that are not X').
- Recent data (cutoff).
```

→ Agentic / iterative RAG addresses these.

### Iterative RAG
```python
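def iterative_rag(query, max_steps=3):
    context = ''
    for step in range(max_steps):
        new_query = llm.complete(f'Q: {query}\nKnown: {context}\nWhat else needed?')
        docs = retrieve(new_query)
        context += '\n'.join(str(d) for d in docs) + '\n'
        # Tolerate answers like "Y", "y", or "Yes." instead of an exact 'Y' match
        verdict = llm.complete(f'Sufficient to answer "{query}"? Y/N\n{context}')
        if verdict.strip().upper().startswith('Y'):
            break
    return llm.complete(f'Q: {query}\n{context}')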

→ Handles multi-hop questions.

## 🤔 Decision Criteria
| Task | Recommendation |
|---|---|
| Document Q&A | RAG |
| Code search | Hybrid + AST chunk |
| Multi-hop | Agentic RAG |
| Real-time | Cached prompts |
| Production | Hybrid + rerank + eval |
| Small / quick | LangChain default |

## ❌ Anti-patterns
- **Vector only**: weak on exact keywords.
- **Fixed chunk**: breaks boundaries.
- **No rerank**: noise.
- **No citation**: no trust.
- **No eval**: silent regression.
- **Huge chunk**: noise.
- **Tiny chunk**: loses context.

## 🤖 LLM Usage Hints
- Recursive chunking + hybrid + rerank is the baseline.
- Citation + eval make it production-ready.
- Iterative RAG for multi-hop.
- Continuous golden set update.

## 🔗 Related Documents
- [[AI_RAG_Pattern_Basics]]
- [[AI_RAG_Advanced]]
- [[AI_Hybrid_Search_Patterns]]