---
id: wiki-2026-0508-search-optimization
title: Search Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Search Tuning, Retrieval Optimization, Hybrid Search]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [search, retrieval, bm25, vector, hybrid, rag]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Elasticsearch + pgvector
---

# Search Optimization

## In one line
> **"Search quality comes from a stack: lexical (BM25) + semantic (vector) hybrid retrieval + a reranker, not from any single signal."** The lineage runs from 1970s tf-idf to BM25 in 1994 (Robertson); the modern setup is BM25F + dense vectors (ColBERT/E5/Cohere v3.5) + cross-encoder reranking, which forms the retrieval layer of RAG.

## Core concepts

### Search stack (modern, as of 2026)
- **Lexical**: BM25 (Elasticsearch, OpenSearch, Tantivy). Best for exact terms, rare tokens, and code.
- **Dense vector**: bi-encoder (E5-large, Cohere embed-v3.5, OpenAI text-embedding-3-large). Handles semantic matches.
- **Sparse-learned**: SPLADE. Lexical matching with learned term weights.
- **Hybrid fusion**: RRF (Reciprocal Rank Fusion) or a weighted score sum.
- **Reranker**: cross-encoder (Cohere rerank-v3.5, BGE-reranker-v2). Narrows the top 50 down to the top 10.
- **Query understanding**: LLM rewrite, HyDE, multi-query expansion.

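Of the two fusion options in the list above, the weighted score sum needs one extra step, because BM25 scores and cosine similarities live on different scales. A minimal sketch, assuming each retriever returns a `{doc_id: score}` dict; `min_max`, `weighted_fusion`, and `alpha` are illustrative names, not part of any library:

```python
def min_max(scores):
    """Rescale {doc_id: score} to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(lexical, dense, alpha=0.5):
    """Blend two {doc_id: score} maps; alpha is the weight on the lexical side.

    A document missing from one retriever contributes 0 for that side.
    Returns (doc_id, fused_score) pairs, best first.
    """
    lex, den = min_max(lexical), min_max(dense)
    fused = {
        doc_id: alpha * lex.get(doc_id, 0.0) + (1 - alpha) * den.get(doc_id, 0.0)
        for doc_id in lex.keys() | den.keys()
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores: raw BM25 on the left, cosine similarity on the right.
ranked = weighted_fusion({"a": 12.0, "b": 3.0}, {"b": 0.9, "c": 0.1}, alpha=0.7)
```

A higher `alpha` lets the lexical side dominate, which suits keyword-heavy corpora. RRF sidesteps the normalization step entirely because it only looks at ranks.
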
### Applications
1. Site search (e-commerce, docs).
2. RAG retrieval.
3. Code search (GitHub).
4. Internal knowledge search.

## 💻 Patterns

### BM25 (Elasticsearch 9, explicit parameters)
```json
PUT /products
{
  "settings": {
    "similarity": {
      "default": {
        "type": "BM25",
        "k1": 1.2,
        "b": 0.75
      }
    }
  },
  "mappings": {
    "properties": {
      "title":       { "type": "text" },
      "description": { "type": "text" },
      "tags":        { "type": "keyword" },
      "embedding":   { "type": "dense_vector", "dims": 1024, "similarity": "cosine" }
    }
  }
}
```

`k1: 1.2` and `b: 0.75` are Elasticsearch's defaults, written out here as the starting point for tuning. The title boost is applied at query time (`title^3` in the hybrid query), because the index-time `boost` mapping parameter was removed in Elasticsearch 8.

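For reference, the role of `k1` and `b` is easiest to see in the textbook Okapi BM25 scoring function. This is the standard form from the literature, not a transcription of Lucene's implementation, which differs in constant factors. Here $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the index:

```latex
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{IDF}(t)\,
  \frac{f(t, D)\,(k_1 + 1)}
       {f(t, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

`k1` controls how quickly repeated occurrences of a term saturate. `b` controls length normalization: with `b = 0` the document length drops out of the formula, and with `b = 1` term frequency is fully normalized by length.
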
### Hybrid query (RRF, native in ES 9)
```json
GET /products/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        { "standard": {
            "query": { "multi_match": {
              "query": "wireless earbuds noise cancel",
              "fields": ["title^3", "description"]
            }}
        }},
        { "knn": {
            "field": "embedding",
            "query_vector_builder": {
              "text_embedding": {
                "model_id": "cohere-embed-v3-5",
                "model_text": "wireless earbuds noise cancel"
              }
            },
            "k": 50, "num_candidates": 200
        }}
      ],
      "rank_window_size": 100,
      "rank_constant": 60
    }
  },
  "size": 10
}
```

### BM25 tuning (k1/b per corpus)
```python
# Starting points by corpus type:
#   short texts (titles):  k1=1.2, b=0.5   (weaker length penalty)
#   long docs (articles):  k1=1.5, b=0.75  (the rank_bm25 defaults)
#   code search:           k1=2.0, b=0.0   (length-independent)
# Then grid-search k1/b against NDCG@10 on a judged query set.

from rank_bm25 import BM25Okapi


def grid_search(corpus, queries, judgments):
    """Return ((k1, b), ndcg) for the best-scoring parameter pair.

    corpus is a list of tokenized documents (list[list[str]]).
    evaluate() is a user-supplied helper, not defined here, that returns
    the mean NDCG@10 of a BM25 index over the judged queries.
    """
    best = (None, -1.0)
    for k1 in [0.8, 1.0, 1.2, 1.5, 2.0]:
        for b in [0.0, 0.25, 0.5, 0.75, 1.0]:
            bm25 = BM25Okapi(corpus, k1=k1, b=b)
            ndcg = evaluate(bm25, queries, judgments)
            if ndcg > best[1]:
                best = ((k1, b), ndcg)
    return best
```

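The `evaluate` helper that `grid_search` calls is not defined in this note. A minimal sketch of one way to write it, assuming `queries` maps a query id to its tokens and `judgments` maps a query id to `{doc_index: relevance}`; both shapes, and the `ndcg_at_k` name, are assumptions made for illustration. The scorer only needs a `get_scores(tokens)` method, which is the interface `BM25Okapi` exposes:

```python
import math


def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query; judgments maps doc id -> graded relevance."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


def evaluate(bm25, queries, judgments, k=10):
    """Mean NDCG@k over all queries.

    Assumed shapes (hypothetical):
      queries:   {query_id: [token, ...]}
      judgments: {query_id: {doc_index: relevance}}
    """
    total = 0.0
    for qid, tokens in queries.items():
        scores = list(bm25.get_scores(tokens))
        # Rank document indices by descending score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        total += ndcg_at_k(ranked, judgments.get(qid, {}), k)
    return total / len(queries) if queries else 0.0
```

Queries without any judgments score 0 here and pull the mean down; dropping them from the query set before tuning is a reasonable alternative.
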
### Cross-encoder rerank (Cohere rerank-v3.5)
```python
import cohere

co = cohere.ClientV2()  # API key is read from the environment

# Stage 1: hybrid retrieval, top 50.
# hybrid_search() and query are placeholders for your own retrieval call;
# each candidate is assumed to expose a .text attribute.
candidates = hybrid_search(query, k=50)

# Stage 2: rerank down to the top 10.
resp = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=[c.text for c in candidates],
    top_n=10,
)
# Each result carries the index of its document in the input list.
top10 = [candidates[r.index] for r in resp.results]
```

### HyDE (Hypothetical Document Embeddings)
```python
import anthropic

client = anthropic.Anthropic()


def hyde_query(question: str) -> str:
    """Turn the question into a hypothetical answer, which is then embedded."""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content":
            f"Write a 3-sentence hypothetical answer to: {question}"}],
    )
    return msg.content[0].text


# Embedding an answer-shaped text instead of the raw question eases the
# query/document length asymmetry and tends to improve the query embedding.
# embed() and vector_search() are placeholders for your own embedding and ANN calls.
hypothetical = hyde_query("how does pgvector handle 1024-dim embeddings?")
emb = embed(hypothetical)
results = vector_search(emb)
```

### Multi-query expansion (LLM)
```python
import anthropic

client = anthropic.Anthropic()


def expand_query(q: str) -> list[str]:
    """Return the original query followed by LLM-generated rephrasings."""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content":
            f"Generate 3 alternative phrasings for search:\n{q}\n"
            "Return one per line."}],
    )
    # Drop blank lines so empty strings are never sent to search.
    lines = msg.content[0].text.splitlines()
    return [q] + [line.strip() for line in lines if line.strip()]


# Search with every phrasing, then merge the result lists with RRF.
# search() and rrf_merge() are placeholders, not defined here.
queries = expand_query("how to ship a model fast")
all_hits = [search(q) for q in queries]
final = rrf_merge(all_hits)
```

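`rrf_merge` is called in the multi-query example but not defined in this note. A minimal sketch, assuming each `search(q)` call returns an ordered list of document ids with no duplicates. The constant `k=60` mirrors the `rank_constant` used in the Elasticsearch query, and `top_n` is an illustrative parameter:

```python
from collections import defaultdict


def rrf_merge(result_lists, k=60, top_n=10):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns the top_n doc ids, best first.
    """
    scores = defaultdict(float)
    for hits in result_lists:
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical result lists from three phrasings of the same query.
merged = rrf_merge([["a", "b", "c"], ["b", "a", "d"], ["b", "e"]])
```

Documents that appear in several lists accumulate score, so `b` (present in all three) ends up ahead of `a` (present in two), without any score normalization.
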
### pgvector hybrid (Postgres 17)
```sql
-- BM25 (ParadeDB pg_search extension) + pgvector, fused with RRF (k = 60)
WITH lexical AS (
  SELECT id, paradedb.score(id) AS s
  FROM docs
  WHERE id @@@ 'description:earbuds'
  ORDER BY s DESC LIMIT 50
),
semantic AS (
  SELECT id, 1 - (embedding <=> $1::vector) AS s  -- <=> is cosine distance
  FROM docs
  ORDER BY embedding <=> $1::vector LIMIT 50
)
SELECT id,
       COALESCE(1.0/(60 + l.rk), 0) + COALESCE(1.0/(60 + v.rk), 0) AS rrf_score
FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY s DESC) AS rk FROM lexical) l
FULL OUTER JOIN
     (SELECT id, ROW_NUMBER() OVER (ORDER BY s DESC) AS rk FROM semantic) v
USING (id)
ORDER BY rrf_score DESC LIMIT 10;
```

### Evaluation (NDCG@10 against a judgment list)
```python
import numpy as np


def dcg(rels):
    """Discounted cumulative gain with linear gains; position i is 0-based."""
    return sum(r / np.log2(i + 2) for i, r in enumerate(rels))


def ndcg(predicted_ids, judgments, k=10):
    """NDCG@k; judgments maps doc id -> graded relevance for one query."""
    rels = [judgments.get(pid, 0) for pid in predicted_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(rels) / ideal_dcg if ideal_dcg > 0 else 0.0
```

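As a worked check of the `dcg`/`ndcg` definitions in this section (linear gain, with 1-based rank $i$ in the formula), take hypothetical judgments $\{a: 3,\ b: 2,\ c: 0\}$ and a predicted order $[a, c, b]$. The numbers are illustrative only:

```latex
\begin{aligned}
\mathrm{DCG@}k &= \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k} \\[4pt]
\mathrm{DCG}  &= \frac{3}{\log_2 2} + \frac{0}{\log_2 3} + \frac{2}{\log_2 4} = 3 + 0 + 1 = 4 \\
\mathrm{IDCG} &= \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} \approx 3 + 1.262 + 0 = 4.262 \\
\mathrm{NDCG} &\approx \frac{4}{4.262} \approx 0.939
\end{aligned}
```

Swapping `b` and `c` in the prediction would give the ideal order and an NDCG of 1, so the metric penalizes the relevant document being pushed one rank down.
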
## Decision criteria
| Situation | Approach |
|---|---|
| Keyword-heavy (code, IDs) | BM25 dominant, vector secondary |
| Semantic (natural-language questions) | vector dominant + BM25 floor |
| Mixed (e-commerce) | hybrid RRF + cross-encoder rerank |
| High-precision top-3 | hybrid → cross-encoder rerank |
| Short or ambiguous queries | LLM expansion + HyDE |
| Latency-critical (<50 ms) | BM25 only, or pre-computed embeddings |

**Default**: hybrid (BM25 + dense) + Cohere rerank-v3.5 down to the top 10, with LLM query expansion as an option.

## 🔗 Graph
- Parents: [[Information Retrieval]] · [[RAG]]
- Variants: [[BM25]] · [[Vector Search]] · [[Information-Retrieval-IR|Hybrid Search]] · [[Reranker]]
- Applications: [[Semantic Search]]
- Adjacent: [[Embeddings]] · [[ColBERT]]

## 🤖 LLM usage
**When to use**: query expansion, HyDE, query rewriting; prompt-style reranking; result summarization (RAG).
**When not to use**: retrieval itself. Vector + BM25 search is cheaper and faster, and the latency of using an LLM as the retriever is hard to justify.

## ❌ Anti-patterns
- **Vector-only search**: misses exact terms (UUIDs, error codes).
- **No reranker**: noise in the top-50 retrieval degrades top-10 quality.
- **Default BM25 params**: every corpus is different, so tune them.
- **No eval set**: tuning without judgments is vibe-driven.
- **Embedding drift**: failing to reindex after an embedding model upgrade.

## 🧪 Verification / duplicates
- Verified (Robertson & Zaragoza "The Probabilistic Relevance Framework: BM25 and Beyond" 2009, BEIR benchmark, Cohere/Anthropic 2026 docs, Pinecone "Hybrid Search").
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: BM25 + vector hybrid + RRF + Cohere rerank-v3.5 + HyDE |