| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-information-retrieval-ir | Information Retrieval (IR) | 10_Wiki/Topics | verified | self | | none | A | 0.97 | applied | | | 2026-05-10 | pending | |
Information Retrieval (IR)
One-line summary
"Retrieve the documents relevant to a query." The main families are BM25 (sparse), dense (vector), and hybrid. Modern stacks pair dense retrieval with a cross-encoder reranker and form the backbone of RAG.
Core concepts
Methods
- Sparse: TF-IDF, BM25 (Okapi).
- Dense: embedding cosine.
- Hybrid: BM25 + dense.
- Cross-encoder: reranking.
- Learned sparse: SPLADE.
- ColBERT: late interaction.
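To make the sparse entry above concrete, here is a minimal from-scratch sketch of Okapi BM25 scoring. It assumes whitespace-tokenized documents and the common defaults `k1=1.5`, `b=0.75`; the function name `bm25_scores` and the toy documents are illustrative only. Libraries differ in the exact IDF variant they use, so scores will not necessarily match a given package.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized_docs for term in set(d))
    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF (the "+1 inside the log" variant, never negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical data for illustration).
docs = [
    "the cat sat on the mat".split(),
    "dogs chase cats".split(),
    "sparse retrieval with bm25".split(),
]
scores = bm25_scores("bm25 retrieval".split(), docs)
```

Only the third document shares terms with the query, so it is the only one with a non-zero score.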
Metrics
- Precision@k, Recall@k.
- MRR (Mean Reciprocal Rank).
- NDCG (graded relevance).
- MAP (Mean Average Precision).
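The set-based metrics in the list above can be sketched in a few lines. This is a minimal version assuming binary relevance, a ranked list of doc IDs, and a set of relevant IDs per query; the function names and sample IDs are illustrative. MAP is the mean of `average_precision` over all queries.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, 1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical single-query example.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
p2 = precision_at_k(ranked, relevant, 2)
r2 = recall_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```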
Applications
- Search engine.
- RAG.
- E-commerce search.
- Q&A.
- Code search.
💻 Patterns
BM25 (rank_bm25)
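from rank_bm25 import BM25Okapi
# corpus: list of document strings (assumed to be defined by the caller)
docs = [d.split() for d in corpus]  # whitespace tokenization
bm25 = BM25Okapi(docs)
scores = bm25.get_scores('search query'.split())
# Keep the five highest-scoring documents
top = sorted(zip(corpus, scores), key=lambda x: -x[1])[:5]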
Dense retrieval (FAISS)
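import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
m = SentenceTransformer('all-mpnet-base-v2')
# corpus: list of document strings; encode() returns a NumPy array
corpus_emb = m.encode(corpus)
# Inner product over L2-normalized vectors equals cosine similarity
index = faiss.IndexFlatIP(corpus_emb.shape[1])
faiss.normalize_L2(corpus_emb)
index.add(corpus_emb)
query_emb = m.encode(['my query'])
faiss.normalize_L2(query_emb)
# D: similarity scores, I: row indices into corpus
D, I = index.search(query_emb, k=5)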
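For intuition, the flat inner-product search over normalized vectors is an exhaustive cosine-similarity scan. The dependency-free sketch below shows that computation on toy 2-D vectors; `cosine_top_k` and the sample vectors are illustrative and are not part of FAISS.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cosine_top_k(query_vec, doc_vecs, k=5):
    """Exhaustive search: score every document, return the k best as (index, score)."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: -pair[0])
    return [(i, s) for s, i in scored[:k]]

# Toy embeddings (hypothetical); real ones have hundreds of dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = cosine_top_k([2.0, 0.0], doc_vecs, k=2)
```

The scan is linear in corpus size, which is why approximate indexes are usually preferred once the corpus grows large.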
Hybrid (RRF — Reciprocal Rank Fusion)
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one scored list."""
    scores = {}
    for ranking in rankings:
        # Ranks start at 1; each list contributes 1 / (k + rank) per document
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
Cross-encoder rerank
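from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def rerank(query, candidates, k=5):
    # Score each (query, candidate) pair jointly, then keep the k best
    pairs = [[query, c] for c in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only, so ties never fall back to comparing candidates
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:k]]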
Elasticsearch (production)
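from elasticsearch import Elasticsearch
# Assumed local node; recent clients require an explicit host
es = Elasticsearch('http://localhost:9200')
# Hybrid search (BM25 + kNN); the knn query clause needs a recent Elasticsearch 8.x release
# query_emb is the 2-D array from the encoder above, so send its first row as a plain list
res = es.search(index='docs', body={
    'query': {
        'bool': {
            'should': [
                {'multi_match': {'query': 'my query', 'fields': ['title', 'body']}},
                {'knn': {'field': 'embedding', 'query_vector': query_emb[0].tolist(), 'k': 10, 'num_candidates': 100}},
            ]
        }
    },
    'size': 10,
})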
Vespa (streaming + ML)
schema doc {
    document doc {
        field title type string {
            indexing: index | summary
            # bm25(title) needs BM25 enabled on the field
            index: enable-bm25
        }
        field embedding type tensor<float>(x[768]) { indexing: attribute | index }
    }
    rank-profile hybrid {
        first-phase {
            expression: 0.5 * bm25(title) + 0.5 * closeness(field, embedding)
        }
        second-phase {
            # cross_encoder_score is a placeholder; define it as a function or model output
            expression: cross_encoder_score
        }
    }
}
ColBERT (late interaction)
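# Token-level late interaction between query and document embeddings
from colbert.infra import ColBERTConfig
from colbert import Searcher
config = ColBERTConfig(nbits=2, root='./experiments')
# 'index_name' must refer to an index that was built beforehand
searcher = Searcher(index='index_name', config=config)
# The query text is passed positionally
results = searcher.search('my query', k=10)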
MMR (diversity)
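import numpy as np
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def mmr(query_emb, candidates_emb, k=5, lam=0.5):
    """Greedy MMR; returns indices into the original candidates_emb."""
    remaining = list(range(len(candidates_emb)))
    selected = []
    while remaining and len(selected) < k:
        scores = []
        for i in remaining:
            rel = cosine(query_emb, candidates_emb[i])
            # Penalize similarity to anything already selected
            div = max((cosine(candidates_emb[i], candidates_emb[j]) for j in selected), default=0.0)
            scores.append(lam * rel - (1 - lam) * div)
        # Pop by position so the stored index still refers to the original array
        best = remaining.pop(int(np.argmax(scores)))
        selected.append(best)
    return selected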
Eval (MRR)
import numpy as np
def mrr(predictions, gold_doc_ids):
    """Mean reciprocal rank of the first relevant document per query."""
    reciprocals = []
    for pred, gold in zip(predictions, gold_doc_ids):
        for rank, doc_id in enumerate(pred, 1):
            if doc_id in gold:
                reciprocals.append(1 / rank)
                break
        else:
            # No relevant document was retrieved for this query
            reciprocals.append(0)
    return np.mean(reciprocals)
NDCG
from sklearn.metrics import ndcg_score
def ndcg_at_k(predictions, relevance, k=10):
    # relevance: graded true labels; predictions: model scores; both shaped (n_queries, n_docs)
    return ndcg_score(relevance, predictions, k=k)
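For reference, NDCG with linear gains can also be written out by hand. This sketch assumes the input is the list of graded relevance labels in the order the system ranked them, and it computes the ideal ordering from that same list (a full evaluation would build the ideal from all judged documents for the query). The names `dcg_at_k` and `ndcg_at_k_manual` are illustrative.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at 0-based position i is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k_manual(gains_in_ranked_order, k):
    """DCG of the actual ranking divided by DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(gains_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(gains_in_ranked_order, k) / ideal

perfect = ndcg_at_k_manual([3, 2, 1], k=3)   # already in ideal order
swapped = ndcg_at_k_manual([1, 3], k=2)      # best document ranked second
```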
Negative mining
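def hard_negative_mining(model, query, gold_doc, candidates):
    """Pick hard negatives (high-scoring non-gold docs) for training pairs."""
    # model: a cross-encoder-style scorer exposing predict(pairs)
    scores = model.predict([[query, c] for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    # The highest-scoring candidates that are not the gold document
    return [c for _, c in ranked if c != gold_doc][:5]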
Index update (incremental)
def upsert_doc(index, doc_id, doc, model):
    # index.upsert stands in for the vector store's own API; the call shape varies by store
    emb = model.encode(doc)
    index.upsert(doc_id, doc, emb)
LLM-as-judge for relevance
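def llm_judge_relevance(query, doc, llm):
    # llm.generate is a placeholder for whatever completion client is in use
    prompt = f"""Rate relevance 0-3.
Query: {query}
Doc: {doc}
Output: integer."""
    # int() raises ValueError if the model returns anything but a bare integer
    return int(llm.generate(prompt).strip())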
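Judge models often wrap the number in extra words, so a tolerant parser can be useful. The sketch below is one possible approach, not part of any LLM client: `parse_relevance` is a hypothetical helper that takes the first integer in the reply and rejects values outside the 0-3 scale.

```python
import re

def parse_relevance(raw, lo=0, hi=3, default=None):
    """Return the first integer in `raw` if it lies in [lo, hi], else `default`."""
    match = re.search(r"\d+", raw)
    if match is None:
        return default
    value = int(match.group())
    return value if lo <= value <= hi else default

clean = parse_relevance("2")
wordy = parse_relevance("Rating: 3 (highly relevant)")
missing = parse_relevance("I cannot judge this.")
out_of_range = parse_relevance("7")
```

Taking the first integer is a simplification: a reply that restates the scale before the rating (for example "on a 0-3 scale, 2") would be misread, so constraining the output format in the prompt remains the first line of defense.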
Query expansion
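def query_expand(query, llm):
    """Expand the query with LLM-generated paraphrases."""
    # llm.generate is a placeholder client call; blank lines are dropped
    raw = llm.generate(f"Generate 3 alternative phrasings: {query}")
    return [line.strip() for line in raw.split('\n') if line.strip()]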
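LLM replies to an expansion prompt usually arrive as numbered or bulleted lines, sometimes repeating the original query. A small cleanup step, sketched below with a hypothetical `parse_alternatives` helper, strips list markers, drops blanks, and removes case-insensitive duplicates.

```python
import re

def parse_alternatives(raw, original=None, limit=3):
    """Turn a multi-line LLM reply into a clean list of distinct phrasings."""
    seen = set()
    if original:
        seen.add(original.strip().lower())
    out = []
    for line in raw.splitlines():
        # Remove a leading bullet ("-", "*", "•") or number ("1.", "2)").
        text = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip().strip('"')
        if not text:
            continue
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        out.append(text)
    return out[:limit]

# Hypothetical model reply.
reply = "1. cheap flights to Paris\n\n2) low-cost Paris airfare\n- Cheap flights to Paris\n"
alternatives = parse_alternatives(reply, original="Paris flight deals")
```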
RAG-fit chunking
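def chunk_for_rag(text, chunk_size=500, overlap=100):
    # Fixed-size character windows; consecutive chunks share `overlap` characters
    if overlap >= chunk_size:
        raise ValueError('overlap must be smaller than chunk_size')
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + chunk_size])
        i += chunk_size - overlap
    return chunks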
Decision criteria
| Situation | Approach |
|---|---|
| Keyword | BM25 |
| Semantic | Dense |
| Best quality | Hybrid + cross-encoder |
| Web-scale | Vespa / Elasticsearch |
| Serverless | Pinecone / Weaviate |
| Open-source | FAISS + ES |
| RAG | Hybrid + chunk + rerank |
Default: hybrid (BM25 + dense) + cross-encoder rerank + MMR diversity + NDCG eval.
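The default stack can be wired together as a small orchestration function. The sketch below covers only the retrieve, fuse, and rerank stages (MMR and evaluation are left out for brevity) and assumes the caller supplies three callables: `sparse_search` and `dense_search` returning ranked doc IDs, and `rerank_score` returning a relevance score. All names and the toy stubs are hypothetical.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion; returns doc IDs ordered by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return [d for d, _ in sorted(scores.items(), key=lambda x: -x[1])]

def hybrid_search(query, sparse_search, dense_search, rerank_score,
                  n_candidates=50, k=5):
    """Retrieve with both retrievers, fuse, then rerank only the shortlist."""
    fused = rrf([
        sparse_search(query, n_candidates),
        dense_search(query, n_candidates),
    ])
    shortlist = fused[:n_candidates]
    # The expensive scorer sees at most n_candidates documents, never the full corpus.
    reranked = sorted(shortlist, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:k]

# Toy stand-ins for real retrievers and a real cross-encoder.
sparse = lambda q, n: ["a", "b", "c"][:n]
dense = lambda q, n: ["b", "d", "a"][:n]
judge = {"a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8}
result = hybrid_search("q", sparse, dense, lambda q, d: judge[d], k=2)
```

Passing the stages in as callables keeps the orchestration independent of whichever BM25, vector index, or reranker implementation sits behind them.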
🔗 Graph
- Parent: Search · NLP
- Variants: BM25 · Dense-Retrieval · Hybrid Search
- Applications: RAG · Search-Engine
- Adjacent: Elasticsearch · FAISS · ColBERT
🤖 LLM usage
When to use: search, RAG, Q&A. When not to: small or static datasets.
❌ Anti-patterns
- Dense-only: loses exact keyword matches.
- No rerank: lower final result quality.
- Expensive cross-encoder on the full corpus: too much latency.
- No diversity: results echo each other.
- Fixed chunk size regardless of content: breaks sentences.
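One way around the last anti-pattern is to pack whole sentences into chunks instead of cutting at fixed offsets. The sketch below is a rough illustration: it assumes sentences end with `.`, `!`, or `?` followed by whitespace (a naive rule that mis-splits abbreviations), and `chunk_by_sentence` is a hypothetical name. A single sentence longer than `max_chars` is kept whole as an oversized chunk.

```python
import re

def chunk_by_sentence(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence (plus a space) would overflow.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

sample = "One. Two is here. Three!"
tight = chunk_by_sentence(sample, max_chars=12)
loose = chunk_by_sentence(sample, max_chars=20)
```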
🧪 Verification / duplicates
- Verified (Robertson BM25, Karpukhin DPR 2020, Khattab ColBERT 2020).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + BM25 / FAISS / RRF / rerank / MMR / ColBERT code |