
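koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)

Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document, reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was excluded because of the risk of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs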

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 12:24:15 +09:00

5.1 KiB


id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Relevance Feedback

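One-line summary

"Use user judgments (or a top-k assumption) to reformulate the query." Relevance feedback (RF) began with Rocchio's 1971 vector-space formulation. In the BM25 era, pseudo-relevance feedback (PRF, Robertson) became the standard query expansion technique. In the 2026 dense retrieval era it has evolved into forms such as ANCE/HyDE/LLM-RankFusion, and it serves as a lever for boosting retrieval quality in RAG pipelines.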

Key points

Types

Explicit RF: user marks results relevant/not.
Implicit RF: clicks, dwell time, scroll signals.
Pseudo RF (PRF): assume top-k from initial retrieval are relevant.

Rocchio (vector space)

q_new = α·q_old + β·(1/|Dr|)Σ_{d∈Dr} d - γ·(1/|Dnr|)Σ_{d∈Dnr} d, where Dr is the set of relevant documents and Dnr the set of non-relevant documents.
α, β, γ ≥ 0; typical values are (1, 0.75, 0.15).

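BM25 + RM3 (probabilistic PRF)

Top-k pseudo-relevant set R.
Term weights via RM (relevance model): p(t|R) ∝ Σ_{d∈R} p(t|d)·p(q|d)·p(d), i.e. each feedback document is weighted by its query likelihood (p(d) is usually taken as uniform).
Mix with original query: q' = (1-λ)q + λ·(top terms from RM).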
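The interpolation above can be sketched in plain Python. This is a minimal illustration, not Anserini's implementation: the function name `rm3_expand`, the `doc_weights` argument (standing in for the per-document query likelihood), and the maximum-likelihood estimate of p(t|d) are all assumptions made for the example, and feedback documents are assumed to be non-empty token lists.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, doc_weights, fb_terms=10, lam=0.5):
    """Interpolate the original query with a relevance model (hypothetical helper).

    query_terms:   tokens of the original query.
    feedback_docs: list of token lists (the pseudo-relevant set R).
    doc_weights:   one non-negative weight per document, e.g. its query likelihood.
    Returns a dict mapping term -> weight in the expanded query q'.
    """
    # Relevance model: p(t|R) is proportional to sum_d weight(d) * p(t|d).
    rm = Counter()
    for doc, w in zip(feedback_docs, doc_weights):
        for term, count in Counter(doc).items():
            rm[term] += w * count / len(doc)  # maximum-likelihood p(t|d)

    # Keep the strongest fb_terms terms and renormalize them to sum to 1.
    top = dict(rm.most_common(fb_terms))
    z = sum(top.values())
    top = {t: v / z for t, v in top.items()}

    # Original query as a term distribution (term frequency / query length).
    q = {t: c / len(query_terms) for t, c in Counter(query_terms).items()}

    # q' = (1 - lam) * q + lam * RM
    return {t: (1 - lam) * q.get(t, 0.0) + lam * top.get(t, 0.0)
            for t in set(q) | set(top)}

expanded = rm3_expand(
    ["covid", "spread"],
    [["covid", "virus", "droplets", "spread"],
     ["virus", "airborne", "droplets", "covid"]],
    doc_weights=[0.6, 0.4],
    fb_terms=3,
)
```

Because both the query and the truncated relevance model are normalized, the expanded weights again sum to 1, which keeps λ interpretable as the share of mass given to feedback terms.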

Dense / neural variants

ANCE / Contriever PRF: average top-k embeddings.
HyDE: LLM generates hypothetical answer → embed → retrieve.
LLM rerank then expand: top-k rerank → use rationale to expand.

Applications

Web search query suggestions.
Enterprise search (legal/medical IR).
RAG: PRF on initial retrieval.
Image search (visual feedback).
Active learning over corpora.

💻 Patterns

Rocchio expansion

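import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # Centroids of the relevant / non-relevant sets; an empty set contributes nothing.
    rel = np.mean(rel_docs, axis=0) if len(rel_docs) else 0.0
    non = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else 0.0
    # Move the query toward the relevant centroid and away from the non-relevant one.
    return alpha * q + beta * rel - gamma * non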
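
With sparse term-weight vectors, the subtraction can push a weight below zero. A common convention (described in the Manning IR book's treatment of Rocchio) is to clip negative weights to 0. The sketch below works through one toy update under that convention; the name `rocchio_clipped` and the three-term vocabulary are made up for illustration.

```python
import numpy as np

def rocchio_clipped(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update followed by clipping negative term weights to zero."""
    rel = np.mean(rel_docs, axis=0) if len(rel_docs) else 0.0
    non = np.mean(nonrel_docs, axis=0) if len(nonrel_docs) else 0.0
    q_new = alpha * q + beta * rel - gamma * non
    return np.maximum(q_new, 0.0)  # negative weights are dropped

q = np.array([1.0, 0.0, 0.2])
rel_docs = np.array([[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]])  # centroid (0, 2, 0)
nonrel_docs = np.array([[2.0, 0.0, 2.0]])                # centroid (2, 0, 2)

# Unclipped: (1 - 0.3, 1.5, 0.2 - 0.3) = (0.7, 1.5, -0.1); the last weight clips to 0.
q_new = rocchio_clipped(q, rel_docs, nonrel_docs)
```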

RM3 with pyserini

from pyserini.search.lucene import LuceneSearcher
searcher = LuceneSearcher('indexes/msmarco-passage')
searcher.set_bm25(k1=0.82, b=0.68)
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)
hits = searcher.search('how does covid spread', k=20)

HyDE (LLM hypothetical doc → embed)

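def hyde(query, llm, embedder, retriever, k=10):
    # llm, embedder and retriever are caller-supplied; their interfaces here are illustrative.
    prompt = f"Write a passage that answers: {query}"
    hypo = llm(prompt, max_tokens=200)        # e.g. Claude Opus 4.7
    hypo_emb = embedder.encode(hypo)          # embed the hypothetical passage, not the query
    return retriever.search(hypo_emb, top_k=k)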

Dense PRF (avg top-k embeddings)

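import numpy as np

def dense_prf(q_emb, retriever, k=10, alpha=0.7):
    # First pass: retrieve with the original query embedding.
    init = retriever.search(q_emb, top_k=k)
    top_embs = np.stack([d.embedding for d in init])
    # Interpolate the query with the centroid of the top-k document embeddings.
    new_q = alpha * q_emb + (1 - alpha) * top_embs.mean(axis=0)
    new_q = new_q / np.linalg.norm(new_q)   # renormalize for cosine / inner-product search
    return retriever.search(new_q, top_k=k)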
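
To try dense PRF without a real vector index, a brute-force retriever over a small array is enough. Everything below is a toy setup under stated assumptions: `Hit` and `BruteForceRetriever` are hypothetical stand-ins for whatever retriever interface is actually in use, and the four 2-D "embeddings" are invented.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: int
    score: float
    embedding: np.ndarray

class BruteForceRetriever:
    """Hypothetical in-memory retriever: exact inner-product search over unit vectors."""
    def __init__(self, doc_embs):
        self.doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)

    def search(self, q_emb, top_k=10):
        scores = self.doc_embs @ q_emb
        order = np.argsort(-scores)[:top_k]  # highest score first
        return [Hit(int(i), float(scores[i]), self.doc_embs[i]) for i in order]

def dense_prf(q_emb, retriever, k=10, alpha=0.7):
    init = retriever.search(q_emb, top_k=k)
    top_embs = np.stack([d.embedding for d in init])
    new_q = alpha * q_emb + (1 - alpha) * top_embs.mean(axis=0)
    new_q = new_q / np.linalg.norm(new_q)
    return retriever.search(new_q, top_k=k)

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]])
retriever = BruteForceRetriever(docs)
hits = dense_prf(np.array([1.0, 0.0]), retriever, k=2)
print([h.doc_id for h in hits])
```

In this toy case the feedback centroid sits close to the query, so the second pass returns the same two documents; the effect of PRF only becomes visible when the top-k pulls the query toward a region the original embedding missed.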

LLM-driven query expansion (2026)

import json

def llm_expand(query, llm):
    prompt = f"""Given the search query: "{query}"
Generate 5 alternative phrasings and 5 related technical terms.
Return as JSON: {{"phrasings": [...], "terms": [...]}}"""
    # Assumes the LLM returns a bare JSON object; json.loads raises ValueError otherwise.
    return json.loads(llm(prompt))

Click-model implicit RF

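# Position-based model (PBM): P(click | rank, rel) = examine(rank) * relevance
# Use inverse propensity scoring (IPS) to debias clicks, then train the ranker.
import numpy as np
def ips_loss(scores, clicks, ranks, propensity):
    # scores: ranker logits per item; without them the loss would not depend on the model.
    weights = clicks / propensity[ranks]                    # 1 / examine(rank) for clicked items
    return np.mean(weights * np.logaddexp(0.0, -scores))    # IPS-weighted log loss on clicks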
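
A small simulation shows why the propensity division matters. It assumes the position-based model stated above holds exactly and that the examination propensities are known; the specific propensity and relevance values are invented for the illustration.

```python
import numpy as np

# Assumed position-based model: P(click) = examine(rank) * relevance.
examine = np.array([1.0, 0.5, 0.25])    # hypothetical examination propensity per rank
relevance = np.array([0.2, 0.8, 0.4])   # true relevance of the document shown at each rank

rng = np.random.default_rng(0)
n = 200_000
examined = rng.random((n, 3)) < examine
attractive = rng.random((n, 3)) < relevance
clicks = (examined & attractive).astype(float)

naive_ctr = clicks.mean(axis=0)                 # biased: confounds relevance with position
ips_estimate = (clicks / examine).mean(axis=0)  # divides out the examination probability

print("naive CTR:   ", naive_ctr.round(3))
print("IPS estimate:", ips_estimate.round(3))
```

The naive click-through rate makes the rank-1 document look better than the rank-3 one even though it is less relevant, while the IPS estimate lands close to the true relevance values. The price is variance: low-propensity ranks get large weights, so in practice propensities are often clipped.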

Decision criteria

Situation	Approach
Sparse / BM25 baseline	RM3 (default PRF)
Vector / dense retrieval	Dense PRF or HyDE
Strong LLM available	HyDE / LLM expansion
Have user clicks	IPS click model
One-shot precision	LLM rerank top-50
Ambiguous query	LLM phrasings + multi-query

Default: BM25 baseline + RM3 (k=10, w=0.5); for dense retrieval, use HyDE for high-stakes RAG.

🔗 Graph

Parent: Information Retrieval · Query-Expansion
Variants: Pseudo-Relevance-Feedback
Applications: RAG
Adjacent: BM25 · Dense-Retrieval · Reranking · ColBERT

🤖 LLM usage

When: HyDE, LLM-driven phrasing/term expansion, rerank-then-expand pipelines, boosting retrieval recall in RAG. When not: latency-critical search, where the extra LLM call exceeds the latency budget.

❌ Anti-patterns

PRF on noisy top-k: if the initial retrieval is junk, PRF amplifies the noise. Filtering the top-k is required.
Too many expansion terms: causes query drift; keep fb_terms ≤ 10.
Naive HyDE for code: HyDE often hurts code search, so check empirically.
Ignoring α/β/γ tuning: the Rocchio defaults are not always best.
Mixing sparse query with dense PRF: the score scales are incompatible.
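
One way to act on the first anti-pattern is to gate the feedback set on retrieval score before running PRF. The sketch below is one possible heuristic, not a method taken from the sources cited here: the function name `filter_feedback`, the relative threshold `min_ratio`, and the assumption of positive, descending-sorted scores are all illustrative choices.

```python
def filter_feedback(hits, min_ratio=0.8, max_docs=10):
    """Keep only hits scoring at least min_ratio of the top score (hypothetical heuristic).

    hits: list of (doc_id, score) pairs sorted by descending score; scores assumed positive.
    Returns at most max_docs pairs to use as the pseudo-relevant set.
    """
    if not hits:
        return []
    top_score = hits[0][1]
    kept = [(doc_id, s) for doc_id, s in hits if s >= min_ratio * top_score]
    return kept[:max_docs]

feedback = filter_feedback([("a", 10.0), ("b", 9.0), ("c", 4.0)])
print(feedback)  # "c" falls below 80% of the top score and is dropped
```

If the filter leaves only one or two documents, skipping PRF for that query may be safer than expanding from so little evidence; the threshold itself needs tuning per collection.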

🧪 Verification / duplicates

Verified (Manning IR book ch.9, Lavrenko & Croft RM, Gao et al "HyDE" 2023).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — Rocchio/RM3/HyDE/dense PRF unified
