"매 user judgment (or top-k assumption) 매 활용하여 query 매 reformulate". 매 1971 Rocchio 의 vector-space RF에서 시작, 매 BM25 시대 PRF (Robertson) 매 standard 매 query expansion 기법, 매 2026 dense retrieval 시대에 매 ANCE/HyDE/LLM-RankFusion 형태로 매 evolve — RAG pipelines 의 매 retrieval quality boost lever.
Key points
Types
Explicit RF: user marks results relevant/not.
Implicit RF: clicks, dwell time, scroll signals.
Pseudo RF (PRF): assume top-k from initial retrieval are relevant.
Rocchio (vector space)
q_new = α·q_old + β·(1/|Dr|)Σ_{d∈Dr} d - γ·(1/|Dnr|)Σ_{d∈Dnr} d.
α, β, γ ≥ 0; typical (1, 0.75, 0.15).
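The Rocchio update above can be sketched over sparse term-weight vectors. This is a minimal illustration, not a reference implementation: the names `rocchio` and `centroid`, the dict-based vector representation, and the choice to drop negative weights are all assumptions made for the example.

```python
from collections import defaultdict

def centroid(docs):
    """Average a list of sparse term-weight vectors (dicts)."""
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight / len(docs)
    return total

def rocchio(q_old, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return q_new = alpha*q_old + beta*centroid(Dr) - gamma*centroid(Dnr)."""
    q_new = defaultdict(float)
    for term, weight in q_old.items():
        q_new[term] += alpha * weight
    for term, weight in centroid(relevant).items():
        q_new[term] += beta * weight
    for term, weight in centroid(non_relevant).items():
        q_new[term] -= gamma * weight
    # Assumption: terms with non-positive weight are dropped, a common
    # practical choice for term-weight spaces (not for dense embeddings).
    return {t: w for t, w in q_new.items() if w > 0}

q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0}]   # Dr: marked relevant
nonrel = [{"car": 1.0}]                                # Dnr: marked non-relevant
q_new = rocchio(q, rel, nonrel)
```

With the typical (1, 0.75, 0.15) weights, the query picks up "cat" from the relevant centroid, while "car" would get a negative weight and is dropped.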
BM25 + RM3 (probabilistic PRF)
Top-k pseudo-relevant set R.
Term weights via RM (relevance model): p(t|R) ∝ Σ_{d∈R} p(d)·p(q|d)·p(t|d), where p(q|d) is the query likelihood of d.
Mix with original query: q' = (1-λ)q + λ·top terms from RM.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/msmarco-passage')
searcher.set_bm25(k1=0.82, b=0.68)
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)
hits = searcher.search('how does covid spread', k=20)
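To make the RM3 steps concrete, here is a toy sketch of the term-weight estimate and the interpolation with the original query. It is not how Pyserini implements RM3. The function `rm3_expand`, its parameters, and the use of `doc_scores` as a stand-in for the per-document weight in the relevance model are assumptions for illustration, and p(t|d) is a plain maximum-likelihood estimate without smoothing.

```python
from collections import Counter, defaultdict

def rm3_expand(query_terms, feedback_docs, doc_scores, fb_terms=3, lam=0.5):
    """Toy RM3: interpolate the original query model with a relevance model.

    feedback_docs: list of token lists (the pseudo-relevant set R).
    doc_scores: per-document weights (assumed stand-in for the document
                weight in the relevance model); normalized below.
    """
    total = sum(doc_scores)
    doc_weights = [s / total for s in doc_scores]

    # Relevance model: p(t|R) proportional to sum_d weight(d) * p(t|d).
    rm = defaultdict(float)
    for tokens, w in zip(feedback_docs, doc_weights):
        for term, count in Counter(tokens).items():
            rm[term] += w * count / len(tokens)

    # Keep only the top fb_terms terms and renormalize them.
    top = sorted(rm.items(), key=lambda kv: kv[1], reverse=True)[:fb_terms]
    norm = sum(p for _, p in top)
    rm_top = {t: p / norm for t, p in top}

    # Original query model: relative frequency of each query term.
    q_model = {t: c / len(query_terms) for t, c in Counter(query_terms).items()}

    # Interpolation: q' = (1 - lam) * q + lam * RM.
    terms = set(q_model) | set(rm_top)
    return {t: (1 - lam) * q_model.get(t, 0.0) + lam * rm_top.get(t, 0.0)
            for t in terms}

docs = [
    ["covid", "spreads", "through", "droplets"],
    ["covid", "droplets", "airborne", "droplets"],
]
expanded = rm3_expand(["covid", "spread"], docs, doc_scores=[0.6, 0.4], fb_terms=2)
```

Here "droplets" enters the expanded query even though it never appears in the original one, and the weights still sum to 1 because both mixed distributions are normalized.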
HyDE (LLM hypothetical doc → embed)
def hyde(query, llm, embedder, retriever, k=10):
    prompt = f"Write a passage that answers: {query}"
    hypo = llm(prompt, max_tokens=200)  # Claude Opus 4.7
    hypo_emb = embedder.encode(hypo)
    return retriever.search(hypo_emb, top_k=k)
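A runnable toy wiring of the HyDE flow, to show which interfaces the function expects. Everything besides `hyde` itself is a hypothetical stand-in: `BagOfWordsEmbedder` replaces a real dense encoder, `InMemoryRetriever` is a brute-force cosine index, and `fake_llm` returns a canned passage instead of calling a model.

```python
import math
from collections import Counter

def hyde(query, llm, embedder, retriever, k=10):
    """Embed an LLM-written hypothetical answer and search with it."""
    hypo = llm(f"Write a passage that answers: {query}", max_tokens=200)
    return retriever.search(embedder.encode(hypo), top_k=k)

class BagOfWordsEmbedder:
    """Hypothetical stand-in for a dense encoder: sparse term counts."""
    def encode(self, text):
        return Counter(text.lower().split())

class InMemoryRetriever:
    """Hypothetical brute-force cosine-similarity index over a small corpus."""
    def __init__(self, docs, embedder):
        self.docs = docs
        self.vectors = [embedder.encode(d) for d in docs]

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def search(self, query_vec, top_k=10):
        scored = [(self._cosine(query_vec, v), d)
                  for v, d in zip(self.vectors, self.docs)]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [d for _, d in ranked[:top_k]]

def fake_llm(prompt, max_tokens=200):
    # Canned output standing in for a real LLM call.
    return "the virus spreads through respiratory droplets and aerosols"

docs = [
    "respiratory droplets and aerosols carry the virus between people",
    "the stock market fell sharply on monday",
]
embedder = BagOfWordsEmbedder()
retriever = InMemoryRetriever(docs, embedder)
hits = hyde("how does covid spread", fake_llm, embedder, retriever, k=1)
```

The query shares no terms with the first document, yet the hypothetical passage does, which is the gap HyDE is meant to bridge.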
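LLM query expansion
import json

def llm_expand(query, llm):
    prompt = f"""Given the search query: "{query}"
Generate 5 alternative phrasings and 5 related technical terms.
Return as JSON: {{"phrasings": [...], "terms": [...]}}"""
    # Assumes the LLM returns a bare JSON object with no surrounding text.
    return json.loads(llm(prompt))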
Click-model implicit RF
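import numpy as np

# Position-based model (PBM): P(click | rank, rel) = examine(rank) * relevance
# Use IPS to debias clicks → train ranker.
def ips_loss(clicks, ranks, propensity):
    # propensity[r] is the examination probability at rank index r;
    # clicks and ranks are arrays of equal length, ranks holding integer indices.
    return -np.mean(clicks / propensity[ranks])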
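A small simulation can show why dividing by the examination propensity debiases clicks under the position-based model. This is a sketch under stated assumptions: the propensity curve (1/rank)^eta, the function names, and the simulated relevance values are all made up for illustration, and the propensities are taken as known rather than estimated.

```python
import random

def examination_propensity(rank, eta=1.0):
    """Assumed examination model: P(examine | rank) = (1 / rank) ** eta, rank from 1."""
    return (1.0 / rank) ** eta

def simulate_clicks(relevance, n_sessions, eta=1.0, seed=0):
    """Simulate PBM clicks: a click needs examination (by rank) and relevance."""
    rng = random.Random(seed)
    clicks = [0] * len(relevance)
    for _ in range(n_sessions):
        for i, rel in enumerate(relevance):
            examined = rng.random() < examination_propensity(i + 1, eta)
            if examined and rng.random() < rel:
                clicks[i] += 1
    return clicks

def ips_relevance(clicks, n_sessions, eta=1.0):
    """IPS estimate of relevance: click rate divided by examination propensity."""
    return [c / n_sessions / examination_propensity(i + 1, eta)
            for i, c in enumerate(clicks)]

true_rel = [0.2, 0.8, 0.5]          # relevance of the items shown at ranks 1..3
n = 20000
clicks = simulate_clicks(true_rel, n)
naive = [c / n for c in clicks]     # raw click-through rate, biased by position
debiased = ips_relevance(clicks, n)
```

The raw click-through rates understate the relevance of lower-ranked items, while the IPS estimates should land close to `true_rel` given enough sessions. In practice the propensities have to be estimated, for example from result randomization, which this sketch skips.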