[G1-Sync] Manual knowledge update
@@ -0,0 +1,348 @@
---
id: ai-hybrid-search-patterns
title: Hybrid Search — vector + BM25 + rerank
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, search, rag, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["Backend", "AI"] }
applied_in: []
aliases: [hybrid search, BM25, vector search, rerank, RRF, reciprocal rank fusion, sparse, dense]
---

# Hybrid Search

> Vector search alone captures meaning but is weak on exact keyword matches. Combining **vector (dense) + BM25 (sparse) + a reranker** is the most robust setup. Fusion options: RRF, weighted scores, or a cross-encoder rerank.

## 📖 Core concepts
- Sparse (BM25): term matching; precise on exact keywords.
- Dense (vector): semantic matching; handles synonyms.
- Hybrid: both, fused with RRF or a weighted sum.
- Reranker: after top-K retrieval, an LLM or cross-encoder re-sorts the candidates.

## 💻 Code patterns

### BM25 (simple keyword)
```ts
// Options: elasticlunr / lunr / minisearch (TS-native libraries)
import MiniSearch from 'minisearch';

// `documents` is assumed to be an array of { id, title, body } objects.
const ms = new MiniSearch({
  fields: ['title', 'body'], // fields to index
  storeFields: ['id'],       // fields returned with each hit
});

ms.addAll(documents);
const results = ms.search('user authentication');
```

→ Tokenizes and scores hits with a tf-idf / BM25-style ranking. Stemming is not on by default in MiniSearch; it can be added through the `processTerm` option.
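
### Vector (Postgres pgvector)
```sql
-- Requires the pgvector extension: CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE docs (
  id text PRIMARY KEY,
  text text,
  embedding vector(1536)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
```

```ts
// `embed` and the tagged-template `sql` client are assumed helpers;
// `embed` is assumed to return a plain number[].
const queryEmb = await embed(query);
// pgvector accepts the text form '[0.1,0.2,...]', so serialize and cast explicitly.
const queryVec = JSON.stringify(queryEmb);
// <=> is cosine distance, so 1 - distance is cosine similarity.
const r = await sql`
  SELECT id, text, 1 - (embedding <=> ${queryVec}::vector) AS score
  FROM docs
  ORDER BY embedding <=> ${queryVec}::vector
  LIMIT 50
`;
```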

### Hybrid (RRF: Reciprocal Rank Fusion)
```ts
function rrf<T extends { id: string }>(
  ranked: T[][],
  k: number = 60
): T[] {
  const scores = new Map<string, number>();
  const docs = new Map<string, T>();

  for (const list of ranked) {
    list.forEach((doc, rank) => {
      // rank is 0-based, so the top hit contributes 1 / (k + 1)
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
      docs.set(doc.id, doc);
    });
  }

  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => docs.get(id)!);
}

// Usage
const bm25Results = await bm25Search(q, 50);
const vecResults = await vectorSearch(q, 50);
const fused = rrf([bm25Results, vecResults]).slice(0, 20);
```

→ Rank-based, so it works even when the two retrievers score on different scales.
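
To make the rank arithmetic concrete, here is a small worked example. The helper name `rrfScores` and the document ids are made up for illustration; it applies the same `1 / (k + rank + 1)` contribution as the function above, with `k = 60`.

```typescript
// Hypothetical helper: returns the fused RRF score per id instead of sorted docs.
function rrfScores(rankedIds: string[][], k: number = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of rankedIds) {
    list.forEach((id, rank) => {
      // rank is 0-based, so the top hit contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return scores;
}

// Illustrative ids only: "a" and "b" appear in both lists, "c" and "d" in one each.
const bm25Ids = ["a", "b", "c"];
const vecIds = ["b", "d", "a"];

const fusedScores = rrfScores([bm25Ids, vecIds]);
const fusedOrder = [...fusedScores.entries()]
  .sort((x, y) => y[1] - x[1])
  .map(([id]) => id);

// a: 1/61 + 1/63, b: 1/62 + 1/61, c: 1/63, d: 1/62
console.log(fusedOrder); // ["b", "a", "d", "c"]
```

Documents found by both retrievers ("a", "b") outrank those found by only one, even though neither raw score was ever compared.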

### Weighted hybrid (direct score sum)
```ts
// `normalize` is an assumed helper that rescales scores to [0, 1] (e.g. min-max).
type ScoredDoc = { id: string; score: number };

// alpha weights the vector score; (1 - alpha) weights BM25.
function weighted(bm25: ScoredDoc[], vec: ScoredDoc[], alpha: number = 0.5) {
  // Normalize scores to [0, 1] so the two scales are comparable
  const normBM = normalize(bm25);
  const normVec = normalize(vec);

  const merged = new Map<string, number>();
  for (const d of normBM) merged.set(d.id, (merged.get(d.id) ?? 0) + (1 - alpha) * d.score);
  for (const d of normVec) merged.set(d.id, (merged.get(d.id) ?? 0) + alpha * d.score);

  // [id, fusedScore] pairs, best first
  return [...merged.entries()].sort((a, b) => b[1] - a[1]);
}
```

→ Tune alpha against your eval set; 0.5 is the default here.
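
The weighted fusion above relies on a `normalize` function that this note does not define. One common choice is min-max scaling within each result list; the sketch below assumes that choice, and the handling of the all-equal case is likewise an assumption rather than a fixed rule.

```typescript
type ScoredDoc = { id: string; score: number };

// Min-max normalization: rescales scores into [0, 1] within one result list.
function normalize(docs: ScoredDoc[]): ScoredDoc[] {
  if (docs.length === 0) return [];
  const values = docs.map((d) => d.score);
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min;
  // All scores equal (or a single hit): treat every doc as a full match
  // rather than dividing by zero. This choice is an assumption.
  if (range === 0) return docs.map((d) => ({ ...d, score: 1 }));
  return docs.map((d) => ({ ...d, score: (d.score - min) / range }));
}

// Illustrative BM25-style scores
const bm25Hits: ScoredDoc[] = [
  { id: "a", score: 12 },
  { id: "b", score: 8 },
  { id: "c", score: 4 },
];
const normalized = normalize(bm25Hits);
console.log(normalized.map((d) => d.score)); // [1, 0.5, 0]
```

Min-max maps the weakest hit in each list to 0, so that hit contributes nothing to the weighted sum. If that is too harsh for a given corpus, a rank-based method such as RRF avoids the issue.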

### Postgres hybrid
```sql
-- Assumes docs has a tsvector column `tsv`. ts_rank is Postgres full-text
-- ranking, not true BM25; the CTE name is kept for symmetry with the rest.
WITH bm25 AS (
  SELECT id, ts_rank(tsv, query) AS score
  FROM docs, plainto_tsquery('english', $1) query
  WHERE tsv @@ query
  ORDER BY score DESC LIMIT 50
),
vec AS (
  SELECT id, 1 - (embedding <=> $2::vector) AS score
  FROM docs
  ORDER BY embedding <=> $2::vector LIMIT 50
)
-- The two scores are on different scales; normalize them or fuse by rank if that matters.
SELECT id, COALESCE(bm25.score, 0) * 0.4 + COALESCE(vec.score, 0) * 0.6 AS score
FROM bm25 FULL OUTER JOIN vec USING (id)
ORDER BY score DESC LIMIT 20;
```

### Reranker (cross-encoder)
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# `hybrid_search` is an assumed helper returning objects with a `.text` attribute.
candidates = hybrid_search(query, k=50)
pairs = [(query, d.text) for d in candidates]
scores = reranker.predict(pairs)  # one relevance score per (query, doc) pair

# Sort by score, highest first, and keep the top 10
reranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
```

→ A cross-encoder is precise but expensive, so rerank only the top 50 and keep the top 10.

### Cohere rerank API
```ts
import { CohereClient } from 'cohere-ai';
const cohere = new CohereClient({ token }); // `token` is your Cohere API key

const r = await cohere.rerank({
  query,
  documents: candidates.map(c => c.text),
  topN: 10,
  model: 'rerank-english-v3.0',
});
// Each entry in r.results refers back to `candidates` by index.
```

→ A managed reranker, so there is no model to host yourself.

### LLM rerank (small model)
```ts
// `llm.complete` is an assumed client wrapper that returns { text }.
const prompt = `
Rate each document's relevance to the query (0-10).

Query: ${query}

${candidates.map((c, i) => `[${i}] ${c.text}`).join('\n\n')}

Output JSON: {"scores": [...]}
`;

const r = await llm.complete({ prompt, model: 'haiku' });
// JSON.parse throws on malformed output; validate before trusting it in production.
const { scores } = JSON.parse(r.text);
const reranked = candidates.map((c, i) => ({ ...c, score: scores[i] }))
  .sort((a, b) => b.score - a.score);
```

→ A small LLM (haiku, gpt-4o-mini) gives a cheap rerank.
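
LLM output is not guaranteed to be clean JSON or to contain one score per candidate, so the bare `JSON.parse` above can throw or silently misalign scores. A minimal sketch of a defensive parser follows; `parseScores` is a hypothetical helper name, and falling back to `null` (so the caller keeps the pre-rerank order) is an assumed policy.

```typescript
// Hypothetical helper: validates the model's JSON before it is used for sorting.
// Returns null when the reply is unusable so the caller can keep the original order.
function parseScores(raw: string, expectedCount: number): number[] | null {
  // Models often wrap JSON in prose or code fences; grab the outermost {...} span.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }

  const scores = (parsed as { scores?: unknown }).scores;
  if (!Array.isArray(scores) || scores.length !== expectedCount) return null;
  if (!scores.every((s) => typeof s === "number" && Number.isFinite(s))) return null;

  // Clamp to the 0-10 range requested in the prompt.
  return (scores as number[]).map((s) => Math.min(10, Math.max(0, s)));
}

const ok = parseScores('Here you go: {"scores": [7, 2, 11]}', 3);
const bad = parseScores("sorry, I cannot rate these", 3);
console.log(ok);  // [7, 2, 10]
console.log(bad); // null
```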

### Query expansion
```ts
// Let an LLM expand the query (`llm.complete` is an assumed wrapper returning { text })
const expanded = await llm.complete({
  prompt: `Generate 3 alternative phrasings: "${query}"`,
});
// One phrasing per line; drop empty lines
const alternatives = expanded.text.split('\n').map(s => s.trim()).filter(Boolean);
const queries = [query, ...alternatives];

// Search each query, then fuse the ranked lists
const all = await Promise.all(queries.map(q => search(q, 20)));
const fused = rrf(all);
```

→ For example, "user signin" expands to "login" / "auth" / "sign in".

### HyDE (Hypothetical Document Embedding)
```ts
// LLM writes a hypothetical answer → embed it → search with that embedding
const hypothetical = await llm.complete({
  prompt: `Generate a detailed answer for: ${query}`,
});
const emb = await embed(hypothetical.text);
// Assumes a vectorSearch variant that accepts an embedding rather than a query string
const results = await vectorSearch(emb, 20);
```

→ A hypothetical answer sits closer in embedding space to real answers than the short question does, so retrieval improves.

### Multi-vector (1 doc → multiple embeddings)
```ts
// Embed per section (or per sentence); `doc` is assumed to be { id, text }
const sections = doc.text.split(/\n\n/);
const embeds = await Promise.all(sections.map(s => embed(s)));
// Await the inserts so failures surface instead of being dropped
await Promise.all(embeds.map((emb, i) => sql`
  INSERT INTO chunks (doc_id, idx, text, emb)
  VALUES (${doc.id}, ${i}, ${sections[i]}, ${JSON.stringify(emb)}::vector)
`));
```

→ If any one section of a doc hits, that doc is returned.
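
Because one document now produces several chunk rows, a search over `chunks` can return the same document more than once. A minimal sketch of collapsing chunk hits back to documents follows. The `ChunkHit` and `DocHit` shapes are hypothetical, and taking the best chunk score per document is one possible aggregation (summing or averaging are alternatives).

```typescript
type ChunkHit = { docId: string; idx: number; score: number };
type DocHit = { docId: string; score: number; bestChunk: number };

// Collapse chunk-level hits to one entry per document, keeping the best chunk score.
function collapseToDocs(hits: ChunkHit[]): DocHit[] {
  const best = new Map<string, DocHit>();
  for (const h of hits) {
    const current = best.get(h.docId);
    if (!current || h.score > current.score) {
      best.set(h.docId, { docId: h.docId, score: h.score, bestChunk: h.idx });
    }
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Illustrative hits: doc-1 matches twice, doc-2 once.
const chunkHits: ChunkHit[] = [
  { docId: "doc-1", idx: 0, score: 0.62 },
  { docId: "doc-2", idx: 3, score: 0.81 },
  { docId: "doc-1", idx: 4, score: 0.9 },
];
const docHits = collapseToDocs(chunkHits);
console.log(docHits);
// doc-1 (best chunk 4, score 0.9) first, then doc-2 (chunk 3, score 0.81)
```

Keeping `bestChunk` lets the caller pass the matching section, rather than the whole document, into the LLM context.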

### Fusion in RAG pipeline
```
Query
  ├→ BM25 (sparse) top-50
  ├→ Vector (dense) top-50
  ├→ Optional: HyDE → vector top-50
  └→ RRF fuse → top-20
       └→ Reranker → top-5
            └→ LLM context
```
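
The diagram above can be sketched as a single orchestration function. This is an illustration under assumptions: the `Retriever` and `Reranker` types, `retrieveContext`, and the stub stages are all hypothetical names, and the retrievers and reranker are injected so the sketch runs without a database or model.

```typescript
type Doc = { id: string; text: string };
type Retriever = (query: string, k: number) => Promise<Doc[]>;
type Reranker = (query: string, docs: Doc[], topN: number) => Promise<Doc[]>;

// Reciprocal rank fusion over any number of ranked lists.
function fuse(lists: Doc[][], k: number = 60): Doc[] {
  const scores = new Map<string, number>();
  const byId = new Map<string, Doc>();
  for (const list of lists) {
    list.forEach((doc, rank) => {
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank + 1));
      byId.set(doc.id, doc);
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => byId.get(id)!);
}

// Retrieve in parallel (top-50 each) -> fuse to top-20 -> rerank to top-5.
async function retrieveContext(
  query: string,
  retrievers: Retriever[],
  rerank: Reranker,
): Promise<Doc[]> {
  const lists = await Promise.all(retrievers.map((r) => r(query, 50)));
  const fused = fuse(lists).slice(0, 20);
  return rerank(query, fused, 5);
}

// Stub stages standing in for BM25, vector search, and a real reranker.
const corpus: Doc[] = ["a", "b", "c", "d"].map((id) => ({ id, text: `doc ${id}` }));
const sparse: Retriever = async () => [corpus[0], corpus[1], corpus[2]];
const dense: Retriever = async () => [corpus[1], corpus[3], corpus[0]];
const passthrough: Reranker = async (_q, docs, topN) => docs.slice(0, topN);

const contextPromise = retrieveContext("user authentication", [sparse, dense], passthrough);
contextPromise.then((docs) => {
  console.log(docs.map((d) => d.id)); // ["b", "a", "d", "c"]
});
```

An optional HyDE stage fits in as one more `Retriever` in the array, since the fusion step accepts any number of ranked lists.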
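
### Filtering (metadata)
```sql
SELECT * FROM docs
WHERE category = 'engineering'
  AND created_at > '2026-01-01'
ORDER BY embedding <=> $1::vector
LIMIT 20;
```

→ Vector search plus a metadata filter, applied either before (pre-filter) or after (post-filter) the nearest-neighbor step. With an approximate index such as HNSW, pgvector applies the filter after the index scan, so a selective filter can return fewer rows than the LIMIT.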
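
When post-filtering cannot be avoided, a common workaround is to over-fetch and widen the fetch until enough hits survive the filter. The sketch below assumes that approach; `postFilterSearch`, the `Hit` shape, the over-fetch factor of 4, and the in-memory index are all hypothetical.

```typescript
type Hit = { id: string; category: string; score: number };
type VectorSearch = (k: number) => Promise<Hit[]>;

// Post-filter with over-fetch: ask the index for more than k hits, drop the ones
// that fail the metadata predicate, and widen the fetch if too few survive.
async function postFilterSearch(
  search: VectorSearch,
  keep: (h: Hit) => boolean,
  k: number,
  maxFetch: number = 1000,
): Promise<Hit[]> {
  let fetchSize = k * 4; // over-fetch factor of 4 is an arbitrary starting point
  while (true) {
    const hits = await search(fetchSize);
    const kept = hits.filter(keep);
    // Stop when we have enough, the index is exhausted, or the cap is reached.
    if (kept.length >= k || hits.length < fetchSize || fetchSize >= maxFetch) {
      return kept.slice(0, k);
    }
    fetchSize = Math.min(fetchSize * 2, maxFetch);
  }
}

// In-memory stand-in for the vector index: 40 hits, every 10th is "engineering".
const index: Hit[] = Array.from({ length: 40 }, (_, i) => ({
  id: `d${i}`,
  category: i % 10 === 0 ? "engineering" : "other",
  score: 1 - i / 40,
}));
const fakeSearch: VectorSearch = async (k) => index.slice(0, k);

const filteredPromise = postFilterSearch(fakeSearch, (h) => h.category === "engineering", 3);
filteredPromise.then((hits) => {
  console.log(hits.map((h) => h.id)); // ["d0", "d10", "d20"]
});
```

Here the first fetch of 12 yields only two matches, so the fetch doubles to 24 before three matches are found. Each retry repeats the search, which is the cost the pre-filter approach avoids.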

### Date / source weight
```ts
function dateBoost(score: number, daysOld: number): number {
  const decay = Math.exp(-daysOld / 365);
  return score * (0.5 + 0.5 * decay);
}
```

→ Favors recent docs: the multiplier decays from 1.0 for a brand-new doc toward a floor of 0.5.

### A/B test
```ts
// Run the same user query through both systems
const A = await search(q, 10);        // baseline
const B = await searchHybrid(q, 10);  // hybrid candidate

// Compare CTR / dwell time / satisfaction
log({ user, q, A_clicked: ..., B_clicked: ... });
```
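
In practice each user is usually shown only one of the two systems, and the assignment has to be stable so the same user does not flip between variants. A minimal sketch of deterministic bucketing follows; `assignVariant`, the experiment name, and the choice of an FNV-1a style hash over UTF-16 code units are assumptions for illustration.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small, dependency-free,
// and stable across runs.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

type Variant = "A" | "B";

// The same user + experiment name always maps to the same variant.
function assignVariant(userId: string, experiment: string, shareB: number = 0.5): Variant {
  const bucket = fnv1a(`${experiment}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < shareB ? "B" : "A";
}

const variant = assignVariant("user-42", "hybrid-vs-baseline");
console.log(variant); // "A" or "B", but always the same for this user
```

Including the experiment name in the hash input keeps assignments independent across experiments, so a user in "B" for one test is not automatically in "B" for the next.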

### MTEB benchmark
```
Comparing embedding model quality:
- BGE / e5 / Cohere embed-v3 / text-embedding-3 / Voyage

→ See the MTEB leaderboard.
```

### Search-as-a-service
```
- Algolia: managed BM25 + vector hybrid
- Typesense: open source
- Meilisearch: simple
- Vespa: most powerful, also most complex
- Weaviate: vector + hybrid
- Pinecone + reranker
- Elastic: BM25 + dense
```

### LLM-friendly answers
```ts
const prompt = `
Answer based ONLY on context. Cite [1], [2].

Context:
[1] ${docs[0].text}
[2] ${docs[1].text}

Question: ${query}

Answer:
`;
```

→ Hybrid + rerank removes most of the noise before it reaches the prompt.

### Eval
```python
# Recall@K: fraction of the relevant docs found in the top k
def recall_at_k(predicted, relevant, k):
    if not relevant:
        return 0.0  # avoid dividing by zero when nothing is labeled relevant
    return len(set(predicted[:k]) & set(relevant)) / len(relevant)

# Reciprocal rank for one query; average it over all queries to get
# MRR (Mean Reciprocal Rank)
def mrr(predictions, relevant):
    for i, p in enumerate(predictions):
        if p in relevant:
            return 1 / (i + 1)
    return 0

# nDCG (the most standard metric)
```
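
The block above names nDCG without showing it. A minimal sketch follows, written in TypeScript to match the other added examples. It assumes graded relevance labels, linear gain, and the common `log2` discount; some implementations use `2^rel - 1` as the gain instead. The function names and labels are illustrative.

```typescript
// DCG@k with the common log2 discount: sum of rel_i / log2(i + 2) for 0-based rank i.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

// nDCG@k = DCG of the predicted order / DCG of the ideal (descending) order.
function ndcgAtK(predicted: string[], relevance: Map<string, number>, k: number): number {
  const gains = predicted.map((id) => relevance.get(id) ?? 0);
  const ideal = [...relevance.values()].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Graded labels are made up for illustration: 2 = highly relevant, 1 = somewhat.
const labels = new Map<string, number>([["a", 2], ["b", 1]]);
const perfect = ndcgAtK(["a", "b", "x"], labels, 3);
const shuffled = ndcgAtK(["x", "b", "a"], labels, 3);
console.log(perfect);  // 1 (ideal order)
console.log(shuffled); // below 1: relevant docs are ranked lower
```

Unlike Recall@K, nDCG rewards putting the most relevant documents first, which matters when only the top few results reach the LLM context.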

### Cost
```
BM25: cheap (in-DB).
Vector: $$ (embedding + index).
Reranker: $$$ per call.

→ Keep the candidate set small (e.g. top-10) before reranking.
```

## 🤔 Decision criteria
| Situation | Recommendation |
|---|---|
| Small / simple search | BM25 only |
| Meaning / synonyms matter | Vector |
| General production | Hybrid (RRF) |
| Accuracy first | Hybrid + rerank |
| Long-form Q&A | HyDE + hybrid + rerank |
| Real-time | BM25 + cache |
| Code search | BM25 + vector + filter (lang) |

## ❌ Anti-patterns
- **Vector only**: weak on exact keywords (UUIDs, code).
- **BM25 only**: loses meaning (login = signin).
- **Reranking everything**: cost explodes; rerank only the top 50.
- **No score normalization**: weighted fusion becomes meaningless.
- **Large docs without chunking**: weak retrieval.
- **Filtering as post-processing**: inefficient.
- **No eval**: you cannot tune.

## 🤖 LLM usage hints
- RRF is simple and independent of score scale.
- A reranker (cross-encoder / Cohere) gives a large quality jump.
- HyDE is a cheap way to close the question-to-answer gap.
- BM25 + vector + rerank is the canonical stack.

## 🔗 Related documents
- [[AI_RAG_Advanced]]
- [[DB_pgvector_Production]]
- [[DB_Full_Text_Search]]