"Retrieval is for recall, reranking is for precision." Reranking takes the top-k candidates returned by first-stage retrieval (BM25/dense) and re-scores them with a more expensive cross-encoder or LLM. As of 2026 it is arguably the single biggest lever on RAG quality (Cohere Rerank 4, BGE-Reranker-v2.5, Voyage rerank-3).
Core ideas
Why it's needed
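Bi-encoder (dense retrieval): query and doc are encoded separately → cosine similarity. Fast, because doc embeddings can be cached, but query and doc interact only shallowly (one vector each).
Cross-encoder: [query, doc] is encoded jointly → one scalar score. Full token-level attention between query and doc → typically +10–30% NDCG over a bi-encoder alone.
Trade-off: scoring all N docs with a cross-encoder costs N forward passes per query (O(N)), far too slow for a whole corpus → retrieve the top-100 in the first stage, then rerank down to the top-5.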
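The retrieve-then-rerank trade-off above can be sketched as a two-stage function. This is a minimal illustration, not a production pipeline: `first_stage` and `reranker` are hypothetical callables standing in for BM25/dense retrieval and a cross-encoder, and the defaults `k_retrieve=100`, `k_final=5` simply mirror the numbers above.

```python
import heapq
from typing import Callable, Sequence

# Hypothetical scorer type: (query, doc) -> relevance score, higher is better.
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    first_stage: Scorer,  # cheap scorer run over the whole corpus (BM25 / dense stand-in)
    reranker: Scorer,     # expensive scorer run on the shortlist only (cross-encoder stand-in)
    k_retrieve: int = 100,
    k_final: int = 5,
) -> list[int]:
    """Return corpus indices of the final top-k_final docs, best first."""
    # Stage 1 (recall): score all N docs cheaply and keep the top k_retrieve.
    shortlist = heapq.nlargest(
        k_retrieve, range(len(corpus)), key=lambda i: first_stage(query, corpus[i])
    )
    # Stage 2 (precision): the expensive scorer sees k_retrieve docs, not N.
    reranked = sorted(shortlist, key=lambda i: reranker(query, corpus[i]), reverse=True)
    return reranked[:k_final]
```

The expensive scorer runs `min(k_retrieve, N)` times per query instead of N times. In a real system stage 1 would be an inverted index or ANN search over cached embeddings rather than the full scan used here for brevity.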
Architectures
Cross-encoder (BERT-based): [CLS] q [SEP] d [SEP] → linear head → score. Examples: BGE-Reranker-v2.5, Cohere Rerank 4, Voyage rerank-3.
ColBERT / late interaction: token-level embeddings for every doc are precomputed → at query time each query token takes its max similarity (MaxSim) over the doc tokens, and these maxima are summed into the score. Roughly 80% of cross-encoder quality at near-retrieval speed.
LLM-as-reranker: prompt GPT-5/Claude to rank the candidates listwise (the RankGPT / RankZephyr paradigm). Highest quality, but also the most expensive.
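The MaxSim scoring from the ColBERT entry above fits in a few lines. This sketch assumes token embeddings already come from some encoder (here they are hand-written toy vectors), and `index_doc` / `maxsim_score` are illustrative names, not the ColBERT library API. Vectors are L2-normalized so the dot product equals cosine similarity.

```python
import math

Vector = list[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def index_doc(doc_token_vecs: list[Vector]) -> list[Vector]:
    """Offline step (hypothetical helper): normalize and store one vector per doc token."""
    return [l2_normalize(t) for t in doc_token_vecs]


def maxsim_score(query_token_vecs: list[Vector], indexed_doc: list[Vector]) -> float:
    """Late-interaction score: for each query token, take the max cosine
    similarity against any token of the pre-indexed doc, then sum over query tokens."""
    if not indexed_doc:
        return 0.0
    score = 0.0
    for q in map(l2_normalize, query_token_vecs):
        score += max(sum(a * b for a, b in zip(q, d)) for d in indexed_doc)
    return score
```

Only `index_doc` touches the document, so it can run once at indexing time. Per query, the cost is one query encoding plus |q| × |d| dot products per candidate, with no joint query-doc forward pass, which is where the speed advantage over a cross-encoder comes from.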
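Reciprocal Rank Fusion (RRF), for merging rankings from several first-stage retrievers (e.g. BM25 + dense) into one candidate list before reranking; k=60 is the conventional smoothing constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: list of ranked doc-id lists from different retrievers."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank is 0-based, so rank + 1 is the 1-based position. Each list adds
            # 1 / (k + position); a larger k flattens the gap between top and lower ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```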
When to use: high-stakes RAG (legal/medical/finance), small candidate sets, listwise ranking.
When not to use: latency budget under 100 ms, large k (cost grows with every candidate), simple FAQ chat (overkill).
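Listwise LLM reranking in the RankGPT style is usually run as a sliding window over the candidates, which is also why cost grows with k. A minimal sketch, assuming a window of 20 and a step of 10 (the values RankGPT popularized). `call_llm`, the prompt wording, and `make_rank_window` are hypothetical; `parse_permutation` shows the defensive parsing an LLM answer like `[2] > [1]` typically needs.

```python
import re
from typing import Callable

# Hypothetical: one listwise LLM call that returns the window's ids reordered best-first.
RankWindow = Callable[[list[str]], list[str]]


def parse_permutation(answer: str, n: int) -> list[int]:
    """Turn an LLM answer like '[3] > [1] > [2]' into 0-based positions.
    Out-of-range and repeated ids are dropped; ids the model omitted are
    appended in their original order so no candidate is lost."""
    order: list[int] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        i = int(match) - 1
        if 0 <= i < n and i not in order:
            order.append(i)
    return order + [i for i in range(n) if i not in order]


def make_rank_window(call_llm: Callable[[str], str], query: str,
                     passages: dict[str, str]) -> RankWindow:
    """Hypothetical glue: prompt format and call_llm are assumptions."""
    def rank_window(window_ids: list[str]) -> list[str]:
        listing = "\n".join(f"[{n}] {passages[d]}" for n, d in enumerate(window_ids, 1))
        prompt = (f"Rank these passages by relevance to: {query}\n{listing}\n"
                  "Answer only with identifiers, e.g. [2] > [1].")
        order = parse_permutation(call_llm(prompt), len(window_ids))
        return [window_ids[i] for i in order]
    return rank_window


def sliding_window_rerank(
    ids: list[str], rank_window: RankWindow, window: int = 20, step: int = 10
) -> tuple[list[str], int]:
    """Rerank back-to-front in overlapping windows so strong candidates bubble
    toward the top. Returns the new order and the number of LLM calls made."""
    if not 0 < step <= window:
        raise ValueError("need 0 < step <= window so every position is covered")
    ranked = list(ids)
    if not ranked:
        return ranked, 0
    calls = 0
    end = len(ranked)
    while True:
        start = max(0, end - window)
        ranked[start:end] = rank_window(ranked[start:end])
        calls += 1
        if start == 0:
            return ranked, calls
        end -= step
```

With these defaults, 100 candidates take 9 LLM calls and 200 take 19, so cost scales roughly linearly with k. With a perfect window ranker, one back-to-front pass surfaces the global top 10.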
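❌ Anti-patterns
Reranking without a first-stage filter: running the reranker over the full corpus is O(N) per query → cost explosion.
Using a cross-encoder for indexing: a cross-encoder produces no standalone doc embeddings to cache, so every query would recompute scores against every doc.
Pointwise LLM reranking: a separate call per doc → more expensive and less consistent than listwise ranking.
Ignoring score calibration: cross-encoder scores are not probabilities, so any relevance threshold needs dataset-specific tuning.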
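The calibration point above can be made concrete: instead of trusting a fixed cutoff on raw cross-encoder scores, pick it on labeled (query, doc) pairs from the target dataset. A minimal sketch, assuming you have such a dev set and that F1 is the metric you care about (both assumptions; `tune_threshold` is a hypothetical helper).

```python
def tune_threshold(scores: list[float], labels: list[bool]) -> tuple[float, float]:
    """Choose the raw-score cutoff that maximizes F1 on a labeled dev set.
    Returns (threshold, f1); docs with score >= threshold count as relevant.
    Candidate thresholds are the observed scores themselves."""
    positives = sum(labels)
    best_t, best_f1 = float("inf"), 0.0  # inf means "keep nothing" if no cutoff helps
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / positives
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Raising the threshold trades recall for precision. Because the search works on raw scores, it does not care whether they are logits or sigmoid outputs, but the chosen value is only valid for the model and dataset it was tuned on.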