2nd/10_Wiki/Topics/Coding/AI_RAG_Production.md
2026-05-10 22:08:15 +09:00

id: ai-rag-production
title: "RAG Production: chunking / re-rank / eval"
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, rag, production, vibe-coding
tech_stack: language: TS / Python; applicable_to: AI
applied_in:
aliases: RAG production, document chunking, parent document, hybrid search, rerank, RAG eval

RAG Production

Demo RAG = simple. Production = chunking strategy + hybrid search + reranker + eval + monitoring.

📖 Core concepts

  • Document → chunks → embed → vector store.
  • Query → retrieve → rerank → context.
  • Eval (recall, precision).
  • Continuous improvement (golden set).

💻 Code patterns

Chunking strategy

# 1. Fixed size (simple)
def chunk_fixed(text, size=500, overlap=50):
    step = size - overlap
    if step <= 0:
        raise ValueError('overlap must be smaller than size')
    return [text[i:i+size] for i in range(0, len(text), step)]

# 2. Sentence-based
import re
def chunk_sentences(text, max_sentences=5):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [' '.join(sentences[i:i+max_sentences]) for i in range(0, len(sentences), max_sentences)]

# 3. Semantic (LLM-driven)
# 4. Markdown headers
# 5. Recursive (LangChain RecursiveCharacterTextSplitter)
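
Strategy 4 (Markdown headers) has no code above, so here is a minimal sketch of it. `chunk_markdown` is a hypothetical helper written for this note, not a library function. It assumes ATX-style headings (`#` to `######`) and does not special-case `#` lines inside fenced code.

```python
import re

def chunk_markdown(text):
    """Split markdown into one chunk per heading section (illustrative helper)."""
    chunks = []
    heading, lines = None, []

    def flush():
        # Emit the section collected so far, skipping empty preambles.
        if heading is not None or any(ln.strip() for ln in lines):
            chunks.append({'heading': heading, 'text': '\n'.join(lines).strip()})

    for line in text.splitlines():
        if re.match(r'#{1,6}\s', line):
            flush()                                  # a new heading closes the previous section
            heading, lines = line.lstrip('#').strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

doc = "# Intro\nRAG basics.\n\n## Chunking\nSplit on headers.\n"
sections = chunk_markdown(doc)
print(sections)
```

Keeping the heading next to the text makes it easy to store the section title as chunk metadata later.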

Recursive chunking (good default)

# Current package name; older releases exposed this as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],
)
chunks = splitter.split_text(text)

→ Preserves boundaries (paragraph → sentence → word).
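
To see why the separator order matters, here is a simplified pure-Python sketch of the fallback idea. It is not LangChain's implementation: it has no overlap handling, and `recursive_split` is a name made up for this note.

```python
def recursive_split(text, chunk_size=500, separators=('\n\n', '\n', '. ', ' ')):
    """Try the coarsest separator first; fall back to finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ''
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate              # keep merging small pieces
            continue
        if current.strip():
            chunks.append(current)
        current = ''
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too big: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current.strip():
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is also short.\n\n" + "word " * 30
parts = recursive_split(sample, chunk_size=40)
print(parts)
```

The two short paragraphs stay whole, and only the long run of words is broken at spaces.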

Parent document retriever

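# Small chunk = embed (precision).
# Big chunk (parent) = context (recall).

# Search small → return parent.
from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter

child = RecursiveCharacterTextSplitter(chunk_size=200)    # 200 char, embedded
parent = RecursiveCharacterTextSplitter(chunk_size=2000)  # 2000 char, returned as context
retriever = ParentDocumentRetriever(
    vectorstore=...,
    docstore=...,
    child_splitter=child,
    parent_splitter=parent,
)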

Hybrid search

// BM25 + vector (RRF)
const bm25Results = await bm25Search(query, 50);
const vecResults = await vectorSearch(query, 50);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);

→ See AI_Hybrid_Search_Patterns.
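
The `rrf` helper used in this note is never defined. A minimal sketch of Reciprocal Rank Fusion follows, assuming each result list is an ordered list of document ids (the snippets in this note may pass richer objects) and using the commonly cited constant k=60.

```python
from collections import defaultdict

def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ['a', 'b', 'c']
vec_ids = ['c', 'a', 'd']
fused_ids = rrf([bm25_ids, vec_ids])
print(fused_ids)  # ['a', 'c', 'b', 'd']
```

RRF only looks at ranks, so BM25 scores and cosine similarities never need to be put on a common scale.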

Reranker

from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

candidates = hybrid_search(query, k=50)
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)
top = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]

→ Top-50 → top-5. Quality ↑.

Cohere Rerank

const r = await cohere.rerank({
  query, documents: candidates.map(c => c.text), topN: 5,
  model: 'rerank-english-v3.0',
});

→ Managed.

Query expansion

# The LLM rewrites the query (3 variants)
expanded = llm.complete(f'Generate 3 alternative phrasings of: "{query}"')
variants = [line.strip() for line in expanded.split('\n') if line.strip()]
queries = [query, *variants]

# Search with every query, then fuse with RRF
results = [vector_search(q, 20) for q in queries]
fused = rrf(results)

HyDE (Hypothetical Document Embedding)

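# Generate a hypothetical answer → embed → search
hypothetical = llm.complete(f'Detailed answer for: {query}')
emb = embed(hypothetical)
results = vector_search(emb, 20)  # here the search takes an embedding, not raw text

→ Queries are short, so the embedding of a hypothetical answer tends to sit closer to the relevant documents than the embedding of the query itself.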

Multi-vector

# Each section of a doc gets its own embedding.
# A hit on any one section returns the whole doc.
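
A toy sketch of the multi-vector idea, assuming an in-memory index of `(doc_id, section_vector)` pairs and plain cosine similarity. `search_docs` and the two-dimensional vectors are illustrative only. A real system would query the vector store instead.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search_docs(query_vec, section_index, top_k=3):
    """Score every section, keep the best section score per doc, return doc ids."""
    best = {}
    for doc_id, vec in section_index:
        score = cosine(query_vec, vec)
        if score > best.get(doc_id, float('-inf')):
            best[doc_id] = score
    return sorted(best, key=best.get, reverse=True)[:top_k]

index = [
    ('doc1', [1.0, 0.0]),   # doc1, section A
    ('doc1', [0.0, 1.0]),   # doc1, section B
    ('doc2', [0.7, 0.7]),   # doc2, single section
]
hits = search_docs([0.0, 1.0], index)
print(hits)  # ['doc1', 'doc2']
```

doc1 wins because one of its sections matches the query exactly, even though its other section is unrelated.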

Metadata filter

SELECT * FROM docs
WHERE category = $1 AND date > $2
ORDER BY embedding <=> $3
LIMIT 20;

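→ Metadata filter and vector ranking run in one query (efficient). With an approximate index, pgvector applies the WHERE clause after the index scan, so the query can return fewer than LIMIT rows.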

Citation

# Keep the source of every chunk.
prompt = f'''
Answer using ONLY:
[1] {chunks[0].text} (source: {chunks[0].source})
[2] {chunks[1].text} (source: {chunks[1].source})

Question: {query}

Cite [1], [2].
'''

→ User trust ↑.
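
The hand-written prompt above only handles two chunks. The sketch below builds the numbered context for any number of chunks and checks which citation markers an answer actually used. `Chunk`, `build_cited_prompt`, and `cited_ids` are hypothetical names, assuming each chunk carries `text` and `source`.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str

def build_cited_prompt(chunks, query):
    """Number every chunk and keep its source next to the text."""
    context = '\n'.join(
        f'[{i}] {c.text} (source: {c.source})' for i, c in enumerate(chunks, start=1)
    )
    markers = ', '.join(f'[{i}]' for i in range(1, len(chunks) + 1))
    return f'Answer using ONLY:\n{context}\n\nQuestion: {query}\n\nCite {markers}.'

def cited_ids(answer, num_chunks):
    """Return the citation numbers in the answer that point at a real chunk."""
    found = {int(n) for n in re.findall(r'\[(\d+)\]', answer)}
    return sorted(n for n in found if 1 <= n <= num_chunks)

chunks = [Chunk('RRF fuses ranked lists.', 'a.md'), Chunk('BM25 is lexical.', 'b.md')]
prompt = build_cited_prompt(chunks, 'What is RRF?')
valid = cited_ids('RRF fuses rankings [1]. See also [7].', len(chunks))
print(prompt)
print(valid)  # [1]
```

Dropping out-of-range markers such as `[7]` is a cheap guard against citations the model invented.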

Prompt template

SYSTEM = '''
Answer using ONLY the context. If unsure, say "I don't know".
Cite sources [1], [2].
'''

USER = f'''
Context:
{context}

Question: {query}

Answer:
'''

Eval (recall@K)

def recall_at_k(predicted_ids, gold_ids, k=5):
    if not gold_ids:
        return 0.0  # nothing labeled relevant for this query
    return len(set(predicted_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Golden set (curated)
gold = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]
results = [retrieve(q['query']) for q in gold]
recalls = [recall_at_k(r, q['relevant_docs']) for r, q in zip(results, gold)]
print(f'Avg recall: {sum(recalls)/len(recalls):.2f}')
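
The core concepts list precision next to recall, so here is a matching precision@K sketch. It uses the same id-list convention as `recall_at_k` and divides by k, which is one common convention. Some teams divide by the number of results actually returned instead.

```python
def precision_at_k(predicted_ids, gold_ids, k=5):
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError('k must be positive')
    gold = set(gold_ids)
    hits = sum(1 for doc_id in predicted_ids[:k] if doc_id in gold)
    return hits / k

predicted = ['doc2', 'doc5', 'doc1', 'doc9', 'doc7']
gold_ids = ['doc1', 'doc5']
p5 = precision_at_k(predicted, gold_ids, k=5)
p2 = precision_at_k(predicted, gold_ids, k=2)
print(p5, p2)  # 0.4 0.5
```

Recall tells you whether the relevant docs were found at all. Precision tells you how much noise came along with them.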

LLM-judge eval

# Promptfoo / RAGAS
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

eval_dataset = [...]  # must be in the dataset format RAGAS expects
result = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision])

→ Faithfulness = the answer is supported by the retrieved context.

Monitoring (production)

@trace
def rag(query):
    docs = retrieve(query)
    answer = llm.complete(...)
    log({'query': query, 'doc_count': len(docs), 'tokens': ..., 'latency': ...})
    return answer

→ Helicone / LangSmith.
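
The `@trace` decorator above is left undefined. A minimal sketch of what such a decorator could do is shown below, assuming an in-memory list as the sink. `TRACE_LOG` and `fake_rag` are stand-ins, and a real setup would forward the record to a tracing backend instead.

```python
import functools
import time

TRACE_LOG = []  # stand-in sink for trace records

def trace(fn):
    """Record name, status, and latency of every call, including failed ones."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = 'ok'
        try:
            return fn(*args, **kwargs)
        except Exception:
            status = 'error'
            raise
        finally:
            TRACE_LOG.append({
                'name': fn.__name__,
                'status': status,
                'latency_ms': (time.perf_counter() - start) * 1000,
            })
    return wrapper

@trace
def fake_rag(query):
    return f'answer to {query}'

result = fake_rag('what is RRF?')
print(TRACE_LOG[-1]['name'], TRACE_LOG[-1]['status'])
```

Logging in `finally` means failed requests are recorded too, which is usually where the interesting latency outliers are.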

Cache

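import hashlib

# Same query = cached result. `cache` is any key-value store with get/set.
def cached_rag(query):
    key = hashlib.sha256(query.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached
    answer = rag(query)
    cache.set(key, answer)
    return answer

# Or use prompt caching (Anthropic / OpenAI).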

Continuous improvement

1. Production query log.
2. Bad answer = manual review.
3. Add to golden set.
4. Re-eval → improve.
5. Re-deploy.

→ RAG quality improves over time.
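
Steps 3 to 5 can be sketched as two small helpers: one that adds a reviewed query to the golden set without duplicating it, and one that blocks a deploy when recall drops. Both function names and the 0.02 tolerance are assumptions for illustration.

```python
def add_to_golden_set(golden, query, relevant_docs):
    """Append a reviewed query unless the same query is already present."""
    normalized = query.strip().lower()
    if any(item['query'].strip().lower() == normalized for item in golden):
        return False
    golden.append({'query': query, 'relevant_docs': list(relevant_docs)})
    return True

def passes_gate(baseline_recall, new_recall, tolerance=0.02):
    """True when the new pipeline is not worse than baseline by more than tolerance."""
    return new_recall >= baseline_recall - tolerance

golden = [{'query': 'X', 'relevant_docs': ['doc1', 'doc5']}]
added = add_to_golden_set(golden, 'How do I rotate API keys?', ['doc9'])
duplicate = add_to_golden_set(golden, '  x ', ['doc1'])
print(added, duplicate, len(golden))  # True False 2
```

Running the gate in CI on every retrieval change is what turns the golden set into protection against silent regressions.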

Choosing an embedding model

text-embedding-3-small (OpenAI): cheap, good quality.
text-embedding-3-large: more accurate.
voyage-3 / cohere embed-v3: SoTA.
BGE / e5 (open): self-host.

→ See the MTEB leaderboard.

Re-embedding (model change)

A better model comes out → re-embed every doc.
- High cost (1M docs × tokens per doc × $0.02 / 1M tokens).
- Time (several hours).

→ Needs a plan.
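
The cost line above only becomes a number once an average document length is fixed. A small worked sketch follows, assuming 1,000 tokens per doc (a made-up figure, so measure your own corpus) at the $0.02 per 1M tokens quoted above.

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Total re-embedding cost = total tokens / 1M * price per 1M tokens."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs x 1,000 tokens (assumed) = 1B tokens
cost = reembed_cost_usd(1_000_000, 1_000, 0.02)
print(f'${cost:.2f}')  # $20.00
```

Under these assumptions the API bill is small. The larger costs are usually wall-clock time, rate limits, and running two indexes side by side during the switch.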

Choosing a vector DB

pgvector: simple, Postgres-friendly.
Pinecone: managed, fast.
Qdrant: open source, fast, hybrid built-in.
Weaviate: feature-rich.
Milvus: large scale.
ChromaDB: small / dev.

→ See DB_pgvector_Production.

Chunk metadata

{
  "id": "chunk-1",
  "text": "...",
  "embedding": [...],
  "source": "doc.pdf",
  "page": 3,
  "section": "Introduction",
  "category": "engineering",
  "created_at": "2026-05-01"
}

→ Makes filtering and citation easy.
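
One way to carry this record through the pipeline is a small typed structure. The sketch below mirrors the JSON fields above, and the `citation` method is a hypothetical convenience added for this note.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    id: str
    text: str
    source: str
    page: int
    section: str
    category: str
    created_at: str                      # ISO date string, as in the JSON above
    embedding: list = field(default_factory=list)

    def citation(self):
        """Human-readable source reference for prompts and UI."""
        return f'{self.source}, p. {self.page} ({self.section})'

chunk = Chunk(
    id='chunk-1', text='...', source='doc.pdf', page=3,
    section='Introduction', category='engineering', created_at='2026-05-01',
)
print(chunk.citation())  # doc.pdf, p. 3 (Introduction)
```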

Production architecture

Doc upload → Parse → Chunk → Embed → Vector DB.
Query → Embed → Hybrid search → Rerank → LLM → Answer + Citation.

→ Chunking + ranking are the biggest quality levers.

Multi-modal RAG

Docs also contain images / tables.
- Image embed (CLIP / Cohere multi-modal).
- Table → markdown.
- Combined search.

Long context vs RAG

Long context (200k):
- Simple, everything goes in.
- High cost / latency.

RAG:
- Top-K only.
- Low cost / latency.
- Needs tuning.

→ Corpus < 50k tokens = long context.
→ Corpus > 50k tokens = RAG.

Cost / 1k queries

Small RAG (10 chunks, GPT-4o-mini): $0.50.
Large RAG (50 chunks + rerank, GPT-4o): $50.
+ Embedding storage: $.

→ Each query can involve multiple LLM calls.

Limitation

- Lost in the middle (long context).
- Multi-hop reasoning (no single chunk holds the answer).
- Negation ("things that are not X").
- Recent data (cutoff).

→ Agentic / iterative RAG is the usual answer.

Iterative RAG

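def iterative_rag(query, max_steps=3):
    context = ''
    for step in range(max_steps):
        # Ask the LLM what is still missing, then retrieve for that sub-query.
        new_query = llm.complete(f'Q: {query}\nKnown: {context}\nWhat else needed?')
        docs = retrieve(new_query)
        context += '\n'.join(str(d) for d in docs) + '\n'
        # Models rarely reply with a bare 'Y', so normalize before comparing.
        verdict = llm.complete(f'Sufficient? Y/N {context}')
        if verdict.strip().upper().startswith('Y'):
            break
    return llm.complete(f'Q: {query}\n{context}')

→ The answer to multi-hop questions.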

🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Document Q&A | RAG |
| Code search | Hybrid + AST chunk |
| Multi-hop | Agentic RAG |
| Real-time | Cached prompts |
| Production | Hybrid + rerank + eval |
| Small / quick | LangChain default |

Anti-patterns

  • Vector only: weak on keywords.
  • Fixed chunk: breaks boundaries.
  • No rerank: noise.
  • No citation: no trust.
  • No eval: silent regression.
  • Huge chunk: noise.
  • Tiny chunk: loses context.

🤖 LLM usage hints

  • Recursive chunking + hybrid + rerank is the baseline.
  • Citation + eval make it production-ready.
  • Iterative RAG handles multi-hop.
  • Continuous golden set update.

🔗 Related documents