---
id: ai-rag-production
title: RAG Production — chunking / re-rank / eval
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, rag, production, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [RAG production, document chunking, parent document, hybrid search, rerank, RAG eval]
---

# RAG Production

> Demo RAG = simple. **Production = chunking strategy + hybrid search + reranker + eval + monitoring**.

## 📖 Core concepts

- Document → chunks → embed → vector store.
- Query → retrieve → rerank → context.
- Eval (recall, precision).
- Continuous improvement (golden set).

## 💻 Code patterns

### Chunking strategy

```python
# 1. Fixed size (simple)
def chunk_fixed(text, size=500, overlap=50):
    # overlap must be smaller than size, otherwise the range step is <= 0
    return [text[i:i+size] for i in range(0, len(text), size - overlap)]

# 2. Sentence-based
import re

def chunk_sentences(text, max_sentences=5):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [' '.join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

# 3. Semantic (LLM-driven)
# 4. Markdown headers
# 5. Recursive (LangChain RecursiveCharacterTextSplitter)
```

### Recursive chunking (recommended default)

```python
# Newer LangChain releases also expose this class from `langchain_text_splitters`.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],
)
chunks = splitter.split_text(text)
```

→ Preserves boundaries (paragraph → sentence → word).

### Parent document retriever

```python
# Small chunk = embed (precision).
# Big chunk (parent) = context (recall).
# Search small → return parent.
```

```python
from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=...,
    docstore=...,
    child_splitter=child,    # ~200-char chunks: embedded and searched
    parent_splitter=parent,  # ~2000-char chunks: returned as context
)
```

### Hybrid search

```ts
// BM25 + vector (RRF)
const bm25Results = await bm25Search(query, 50);
const vecResults = await vectorSearch(query, 50);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);
```

→ [[AI_Hybrid_Search_Patterns]].

### Reranker

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
# Keep the 5 highest-scoring candidates.
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]
```

→ Top-50 → top-5. Quality ↑.

### Cohere Rerank

```ts
const r = await cohere.rerank({
  query,
  documents: candidates.map(c => c.text),
  topN: 5,
  model: 'rerank-english-v3.0',
});
```

→ Managed service.

### Query expansion

```python
# The LLM rewrites the query (3 variants).
expanded = llm.complete(f'Generate 3 alternative phrasings of: "{query}"')
# Drop blank lines so empty strings are never searched.
queries = [query, *[q.strip() for q in expanded.split('\n') if q.strip()]]

# Search with every query, then fuse with RRF.
results = [vector_search(q, 20) for q in queries]
fused = rrf(results)
```

### HyDE (Hypothetical Document Embedding)

```python
# Generate a fake answer → embed → search.
hypothetical = llm.complete(f'Detailed answer for: {query}')
emb = embed(hypothetical)
results = vector_search(emb, 20)
```

→ Queries are short, so the embedding of a hypothetical answer lands closer to the relevant documents than the embedding of the query itself.

### Multi-vector

```python
# Every section of a doc gets its own embedding.
# One section hit → the whole doc is returned.
```

### Metadata filter

```sql
SELECT *
FROM docs
WHERE category = $1 AND date > $2
ORDER BY embedding <=> $3
LIMIT 20;
```

→ Pre-filter (efficient).

### Citation

```python
# Keep the source of every chunk.
prompt = f'''
Answer using ONLY:
[1] {chunks[0].text} (source: {chunks[0].source})
[2] {chunks[1].text}

Question: {query}
Cite [1], [2].
'''
```

→ User trust ↑.

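### RRF helper (sketch)

The hybrid search and query expansion snippets above both call an `rrf` function that this note never defines. A minimal sketch of Reciprocal Rank Fusion follows, assuming each result list is a ranked list of hashable document IDs and using the commonly cited constant `k=60`. The sample IDs (`d1` … `d4`) are made up for illustration and are not from the original note.

```python
from collections import defaultdict


def rrf(result_lists, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not tuned.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked outputs from a keyword search and a vector search.
bm25_ids = ['d1', 'd2', 'd3']
vector_ids = ['d3', 'd1', 'd4']
fused = rrf([bm25_ids, vector_ids])
print(fused)
```

Documents that rank well in both lists (`d1`, `d3`) end up ahead of documents that appear in only one. Because RRF uses ranks rather than raw scores, BM25 and cosine scores never need to be normalized against each other.
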
### Prompt template

```python
SYSTEM = '''
Answer using ONLY the context.
If unsure, say "I don't know".
Cite sources [1], [2].
'''

# Plain template (not an f-string), filled per request:
# USER.format(context=context, query=query)
USER = '''
Context:
{context}

Question: {query}

Answer:
'''
```

### Eval (recall@K)

```python
def recall_at_k(predicted_ids, gold_ids, k=5):
    if not gold_ids:
        return 0.0  # avoid division by zero on an empty gold list
    return len(set(predicted_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Golden set (curated)
gold = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]

# `retrieve` is assumed to return a ranked list of doc IDs.
results = [retrieve(q['query']) for q in gold]
recalls = [recall_at_k(r, q['relevant_docs']) for r, q in zip(results, gold)]
print(f'Avg recall: {sum(recalls)/len(recalls):.2f}')
```

### LLM-judge eval

```python
# Promptfoo / RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_dataset = [...]
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])
```

→ Faithfulness = the answer is derived from the retrieved context.

### Monitoring (production)

```python
@trace
def rag(query):
    docs = retrieve(query)
    answer = llm.complete(...)
    log({'query': query, 'doc_count': len(docs), 'tokens': ..., 'latency': ...})
    return answer
```

→ Helicone / LangSmith.

### Cache

```python
import hashlib

def cached_rag(query):
    # Same query = cached result (exact-match key).
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)  # `cache` is any get/set store
    if cached:
        return cached
    return rag(query)

# Or use provider-side prompt caching (Anthropic / OpenAI).
```

### Continuous improvement

```
1. Production query log.
2. Bad answer = manual review.
3. Add to golden set.
4. Re-eval → improve.
5. Re-deploy.
```

→ RAG quality improves over time.

### Embedding model selection

```
text-embedding-3-small (OpenAI): cheap, good.
text-embedding-3-large: more accurate.
voyage-3 / cohere embed-v3: SoTA.
BGE / e5 (open): self-host.
```

→ See the MTEB leaderboard.

### Re-embedding (model change)

```
New model is better → re-embed every doc.
- Cost: can be large (1M docs × tokens per doc × $0.02 per 1M tokens).
- Time (several hours).
```

→ Needs a plan.

### Vector DB selection

```
pgvector: simple, Postgres-friendly.
Pinecone: managed, fast.
Qdrant: open source, fast, hybrid built-in.
Weaviate: feature-rich.
Milvus: large scale.
ChromaDB: small / dev.
```

→ [[DB_pgvector_Production]].

### Chunk metadata

```json
{
  "id": "chunk-1",
  "text": "...",
  "embedding": [...],
  "source": "doc.pdf",
  "page": 3,
  "section": "Introduction",
  "category": "engineering",
  "created_at": "2026-05-01"
}
```

→ Makes filtering and citation easy.

### Production architecture

```
Doc upload → Parse → Chunk → Embed → Vector DB.
Query → Embed → Hybrid search → Rerank → LLM → Answer + Citation.

→ Chunking + ranking are the biggest quality levers.
```

### Multi-modal RAG

```
Docs also contain images / tables.
- Image embed (CLIP / Cohere multi-modal).
- Table → markdown.
- Combined search.
```

### Long context vs RAG

```
Long context (200k):
- Simple, all in.
- High cost / latency.

RAG:
- Top-K only.
- Low cost / latency.
- Needs tuning.

→ < 50k tokens = long context. > 50k tokens = RAG.
```

### Cost / 1k query

```
Small RAG (10 chunks, GPT-4o-mini): $0.50.
Large RAG (50 chunks + rerank, GPT-4o): $50.
+ Embedding storage: $.

→ Each query triggers multiple LLM calls.
```

### Limitation

```
- Lost in the middle (long context).
- Multi-hop reasoning (no single chunk contains the answer).
- Negation ('things that are not X').
- Recent data (cutoff).
```

→ Agentic RAG / iterative retrieval is the answer.

### Iterative RAG

```python
def iterative_rag(query, max_steps=3):
    context = ''
    for step in range(max_steps):
        new_query = llm.complete(f'Q: {query}\nKnown: {context}\nWhat else needed?')
        docs = retrieve(new_query)
        context += format_docs(docs)  # assumed helper: renders docs as prompt text
        verdict = llm.complete(f'Sufficient? Y/N {context}')
        # Tolerate whitespace and answers like "Yes".
        if verdict.strip().upper().startswith('Y'):
            break
    return llm.complete(f'Q: {query}\n{context}')
```

→ The answer to multi-hop questions.

## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Document Q&A | RAG |
| Code search | Hybrid + AST chunk |
| Multi-hop | Agentic RAG |
| Real-time | Cached prompts |
| Production | Hybrid + rerank + eval |
| Small / quick | LangChain default |

## ❌ Anti-patterns

- **Vector only**: weak on keywords.
- **Fixed chunk**: breaks boundaries.
- **No rerank**: noise.
- **No citation**: no trust.
- **No eval**: silent regression.
- **Huge chunk**: noise.
- **Tiny chunk**: loses context.

## 🤖 LLM usage hints

- Recursive chunking + hybrid + rerank is the baseline.
- Citation + eval make it production-grade.
- Iterative RAG handles multi-hop.
- Update the golden set continuously.

## 🔗 Related documents

- [[AI_RAG_Pattern_Basics]]
- [[AI_RAG_Advanced]]
- [[AI_Hybrid_Search_Patterns]]