---
id: wiki-2026-0508-information-retrieval-ir
title: Information Retrieval (IR)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [IR, information retrieval, search engine, BM25, dense retrieval, hybrid search, RAG]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [search, ir, bm25, dense-retrieval, vector-search, rag, elasticsearch]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / TypeScript
  framework: Elasticsearch / Vespa / FAISS / Pinecone
---

# Information Retrieval (IR)

## One-liner

> **"Given a query, retrieve the relevant documents."** Three families: BM25 (sparse), dense (vector), and hybrid. Modern stacks pair dense retrieval with a cross-encoder rerank and serve as the backbone of RAG.

## Core

### Methods

- **Sparse**: TF-IDF, BM25 (Okapi).
- **Dense**: cosine similarity over embeddings.
- **Hybrid**: BM25 + dense.
- **Cross-encoder**: reranking.
- **Learned sparse**: SPLADE.
- **ColBERT**: late interaction.

### Metrics

- **Precision@k, Recall@k**.
- **MRR** (Mean Reciprocal Rank).
- **NDCG** (graded relevance).
- **MAP** (Mean Average Precision).

### Applications

1. **Search engine**.
2. **RAG**.
3. **E-commerce search**.
4. **Q&A**.
5. **Code search**.
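
### Metric sketch (Precision@k, Recall@k, MAP)

The set-based metrics listed above are named but not defined, so here is a minimal sketch of how they are usually computed. The function names and the example document ids are illustrative assumptions, not part of any library. Precision@k divides by `k` even when fewer than `k` results come back, which is one common convention among several.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, 1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(all_ranked, all_relevant):
    """MAP: average of per-query average precision."""
    aps = [average_precision(r, g) for r, g in zip(all_ranked, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical example: one query, two relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(recall_at_k(ranked, relevant, 4))      # 1.0
print(average_precision(ranked, relevant))   # 0.5
```

These treat relevance as binary; for graded judgments NDCG is the better fit.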
## 💻 Patterns

### BM25 (rank_bm25)

```python
from rank_bm25 import BM25Okapi

# `corpus` is assumed to be a list of document strings.
# Whitespace tokenization keeps the example short; use a real tokenizer in practice.
docs = [d.split() for d in corpus]
bm25 = BM25Okapi(docs)
scores = bm25.get_scores('search query'.split())
top = sorted(zip(corpus, scores), key=lambda x: -x[1])[:5]
```

### Dense retrieval (FAISS)

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

m = SentenceTransformer('all-mpnet-base-v2')

# FAISS expects contiguous float32 arrays.
corpus_emb = np.ascontiguousarray(m.encode(corpus), dtype='float32')
faiss.normalize_L2(corpus_emb)  # unit vectors, so inner product == cosine
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

query_emb = np.ascontiguousarray(m.encode(['my query']), dtype='float32')
faiss.normalize_L2(query_emb)
D, I = index.search(query_emb, 5)  # D: similarity scores, I: corpus row indices
```

### Hybrid (RRF: Reciprocal Rank Fusion)

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids from several retrieval sources."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
```

### Cross-encoder rerank

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, candidates, k=5):
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only, so ties never fall back to comparing candidates.
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked][:k]
```

### Elasticsearch (production)

```python
from elasticsearch import Elasticsearch

# Assumption: a local node; adjust the URL and auth for your cluster.
es = Elasticsearch('http://localhost:9200')

# Hybrid search: a BM25 `query` plus a top-level `knn` clause (Elasticsearch 8.x).
# `query_emb` is the (1, dim) array from the FAISS example above.
res = es.search(
    index='docs',
    query={'multi_match': {'query': 'my query', 'fields': ['title', 'body']}},
    knn={
        'field': 'embedding',
        'query_vector': query_emb[0].tolist(),  # plain list, not a NumPy array
        'k': 10,
        'num_candidates': 100,
    },
    size=10,
)
```

### Vespa (streaming + ML)

```yaml
# Vespa schema (.sd) syntax, not YAML proper.
schema doc {
  document doc {
    field title type string {
      indexing: index | summary
      index: enable-bm25
    }
    field embedding type tensor<float>(x[768]) {
      indexing: attribute | index
    }
  }
  rank-profile hybrid {
    inputs {
      query(q) tensor<float>(x[768])
    }
    first-phase {
      expression: 0.5 * bm25(title) + 0.5 * closeness(field, embedding)
    }
    # cross_encoder_score is a placeholder: define it as a function or model feature.
    second-phase {
      expression: cross_encoder_score
    }
  }
}
```

### ColBERT (late interaction)

```python
# Token-level late interaction
from colbert import Searcher
from colbert.infra import ColBERTConfig

config = ColBERTConfig(nbits=2, root='./experiments')
searcher = Searcher(index='index_name', config=config)
# search() takes the query text positionally and returns (passage ids, ranks, scores).
pids, ranks, scores = searcher.search('my query', k=10)
```

### MMR (diversity)

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr(query_emb, candidates_emb, k=5, lam=0.5):
    """Return indices into candidates_emb, picked by Maximal Marginal Relevance."""
    remaining = list(range(len(candidates_emb)))
    selected = []  # indices into the original candidates_emb

    def score(i):
        rel = cosine(query_emb, candidates_emb[i])
        div = max((cosine(candidates_emb[i], candidates_emb[j]) for j in selected),
                  default=0.0)
        return lam * rel - (1 - lam) * div

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

### Eval (MRR)

```python
import numpy as np

def mrr(predictions, gold_doc_ids):
    """Mean reciprocal rank of the first relevant document per query."""
    reciprocals = []
    for pred, gold in zip(predictions, gold_doc_ids):
        for rank, doc_id in enumerate(pred, 1):
            if doc_id in gold:
                reciprocals.append(1 / rank)
                break
        else:
            # No relevant document was retrieved for this query.
            reciprocals.append(0)
    return np.mean(reciprocals)
```

### NDCG

```python
from sklearn.metrics import ndcg_score

def ndcg_at_k(predictions, relevance, k=10):
    """relevance: graded labels; predictions: model scores. Both shaped (n_queries, n_docs)."""
    return ndcg_score(relevance, predictions, k=k)
```

### Negative mining

```python
def hard_negative_mining(model, query, gold_doc, candidates):
    """Pick hard negatives for training pairs."""
    scores = model.predict([[query, c] for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    # High-scoring candidates that are not the gold document are the hard negatives.
    return [c for _, c in ranked if c != gold_doc][:5]
```

### Index update (incremental)

```python
def upsert_doc(index, doc_id, doc, model):
    # `index.upsert` is a placeholder interface; map it to your vector store's API.
    emb = model.encode(doc)
    index.upsert(doc_id, doc, emb)
```

### LLM-as-judge for relevance

```python
def llm_judge_relevance(query, doc, llm):
    # `llm.generate` is a placeholder for your LLM client.
    prompt = f"""Rate relevance 0-3.
Query: {query}
Doc: {doc}
Output: integer."""
    reply = llm.generate(prompt).strip()
    # Take the first 0-3 digit in the reply; fall back to 0 if there is none.
    return next((int(ch) for ch in reply if ch in '0123'), 0)
```

### Query expansion

```python
def query_expand(query, llm):
    """Expand the query with LLM-generated paraphrases."""
    reply = llm.generate(f"Generate 3 alternative phrasings: {query}")
    return [line.strip() for line in reply.split('\n') if line.strip()]
```

### RAG-fit chunking

```python
def chunk_for_rag(text, chunk_size=500, overlap=100):
    """Character-based sliding-window chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must be in [0, chunk_size)')  # avoids an infinite loop
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

## Decision criteria

| Situation | Approach |
|---|---|
| Keyword | BM25 |
| Semantic | Dense |
| Best quality | Hybrid + cross-encoder |
| Web-scale | Vespa / Elasticsearch |
| Serverless | Pinecone / Weaviate |
| Open-source | FAISS + ES |
| RAG | Hybrid + chunk + rerank |

**Default**: hybrid (BM25 + dense) + cross-encoder rerank + MMR diversity + NDCG eval.

## 🔗 Graph

- Parent: [[Search]] · [[NLP]]
- Variants: [[BM25]] · [[Dense-Retrieval]] · [[Hybrid-Search]]
- Applications: [[RAG]] · [[Search-Engine]]
- Adjacent: [[Elasticsearch]] · [[FAISS]] · [[ColBERT]]

## 🤖 LLM usage

**When**: search, RAG, Q&A.

**When not**: small or static datasets.

## ❌ Anti-patterns

- **Dense-only**: loses exact keyword matches.
- **No rerank**: final result quality drops.
- **Expensive cross-encoder on full corpus**: latency blows up; rerank only a candidate set.
- **No diversity**: results echo one another.
- **Fixed chunk size regardless of content**: breaks sentences mid-way.

## 🧪 Verification / Duplicates

- Verified (Robertson BM25, Karpukhin DPR 2020, Khattab ColBERT 2020).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + BM25 / FAISS / RRF / rerank / MMR / ColBERT code |