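koriweb 95cd8bb891 feat(wiki): code grounding for 23 documents + 39 MOC learning maps
- Code grounding: automatically injects the actual repo implementation location
  (file:line) + commit into the 'Applied cases' section of technical topic documents (e.g., document chunking strategy→connectai/src/retrieval/chunker.ts).
  An idempotent marker (CODE-GROUNDING) lets re-runs refresh the block.
- MOC: generates an _MOC.md learning map (entry points + insight notes) in each of 39 cluster folders.
Tools: Datacollect/scripts/{code_grounding,moc_generator}.mjs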

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 18:56:11 +09:00

---
id: wiki-2026-0508-semantic-search
title: Semantic Search
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Vector Search, Dense Retrieval, Neural Search, Semantic Search with AI]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [search, retrieval, embeddings, vector-db, rag]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: faiss
---
# Semantic Search
## One-line summary
> **"query → embedding → ANN nearest neighbors in vector space"**. Dense retrieval (DPR, ColBERT) overcomes the lexical limits of BM25. In 2026 production the norm is hybrid (BM25 + dense + reranker); representative embedding models: OpenAI text-embedding-3-large, Cohere v3, Voyage-3, BGE-M3, Jina-v3.
## Core concepts
### Pipeline
1. **Index time**: doc → chunk → embed → vector DB.
2. **Query time**: query → embed → ANN search → (rerank) → results.
3. **Hybrid**: BM25 score + dense score → RRF or weighted.
4. **Rerank**: cross-encoder on top-100 → top-10 (Cohere Rerank, BGE-Reranker).
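
The index-time and query-time steps above can be sketched end to end with a toy embedding. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and brute-force NumPy search stands in for the ANN index.

```python
import zlib
import numpy as np

DIM = 256  # toy dimensionality (assumption; real models use hundreds to thousands)

def toy_embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = np.zeros(DIM, dtype="float32")
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Index time: doc -> chunk -> embed -> "vector DB" (here, a plain matrix).
docs = [
    "faiss builds approximate nearest neighbor indexes",
    "bm25 is a lexical ranking function",
    "cross encoders rerank retrieved candidates",
]
chunks = docs  # each doc is short enough to be a single chunk
index_matrix = np.stack([toy_embed(c) for c in chunks])

# Query time: query -> embed -> nearest neighbors -> results.
def search(query: str, k: int = 2) -> list[str]:
    scores = index_matrix @ toy_embed(query)  # cosine, since all vectors are unit length
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

results = search("lexical ranking with bm25")
```

Swapping `toy_embed` for a real model and `index_matrix` for an ANN index keeps the same shape of code; a reranker would then reorder `results`.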
### Embedding models (2026)
- **OpenAI text-embedding-3-large** (3072d, MRL truncatable).
- **Cohere embed-v3** (multilingual, dot-product).
- **Voyage-3** (state-of-the-art retrieval quality at release).
- **BGE-M3** (open, multi-vector, sparse+dense).
- **Jina-v3** (8k context, MRL).
- **NV-Embed-v2** (NVIDIA; topped the MTEB leaderboard at release).
### ANN algorithms
- **HNSW** (graph): the usual default; fast, high recall.
- **IVF-PQ** (Faiss): very large scale, compressed vectors.
- **DiskANN**: on-disk, billion-scale.
- **ScaNN** (Google): strong recall at a fixed memory budget.
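
ANN indexes trade recall for speed, so it helps to measure recall against exact search before committing to one. A minimal sketch, assuming NumPy only: `exact_top_k` and `recall_at_k` are hypothetical helper names, and the "approximate" result is simulated by scanning half of the vectors, as a stand-in for a real HNSW or IVF-PQ index.

```python
import numpy as np

def exact_top_k(queries, vectors, k):
    """Brute-force inner-product search; returns ids with shape [num_queries, k]."""
    scores = queries @ vectors.T
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 64)).astype("float32")
queries = rng.standard_normal((20, 64)).astype("float32")

exact = exact_top_k(queries, vectors, k=10)
# Simulated "approximate" search: only the first 500 vectors are scanned.
approx = exact_top_k(queries, vectors[:500], k=10)
recall = recall_at_k(approx, exact)
```

With a real index, `approx` would come from the index's own search call, and the same `recall_at_k` would be tracked while tuning its parameters.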
### Vector DBs
- **Pinecone** (managed).
- **Weaviate** (open + hybrid built-in).
- **Qdrant** (Rust, fast).
- **Milvus** (large-scale).
- **pgvector** (Postgres).
- **LanceDB** (embedded, columnar).
- **Turbopuffer** (serverless 2024+).
### Applications
1. RAG knowledge retrieval.
2. Code search (Cursor, Sourcegraph).
3. E-commerce / product search.
4. Multimodal (CLIP image+text).
## 💻 Patterns
### Basic dense retrieval
```python
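from openai import OpenAI
import numpy as np
import faiss

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Embed a list of strings; returns a float32 array of shape [n, 3072]."""
    r = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in r.data], dtype="float32")

docs = ["Doc 1 text...", "Doc 2 text...", "..."]
doc_vecs = embed(docs)
faiss.normalize_L2(doc_vecs)           # unit vectors: L2 ranking matches cosine ranking
index = faiss.IndexHNSWFlat(3072, 32)  # 3072 dims, 32 graph neighbors per node (M)
index.add(doc_vecs)

q_vec = embed(["What is X?"])
faiss.normalize_L2(q_vec)
D, I = index.search(q_vec, 10)
# Faiss pads the result with -1 when fewer than k neighbors are found.
print([docs[i] for i in I[0] if i != -1])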
```
### Hybrid (BM25 + dense) with RRF
```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi

# Reuses docs, embed, and index from the basic dense retrieval example.
bm25 = BM25Okapi([d.split() for d in docs])

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over all rankings, with rank starting at 1."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

def hybrid_search(query, k=10):
    bm25_top = np.argsort(-bm25.get_scores(query.split()))[:50]
    q_vec = embed([query])
    faiss.normalize_L2(q_vec)
    _, dense_top = index.search(q_vec, 50)
    dense_ids = [i for i in dense_top[0].tolist() if i != -1]  # drop Faiss padding
    fused = rrf([bm25_top.tolist(), dense_ids])
    return [docs[i] for i, _ in fused[:k]]
```
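
RRF uses only ranks. The weighted alternative listed in the pipeline combines scores directly, which means BM25 and dense scores first have to be put on a common scale. A minimal sketch, assuming min-max normalization per result list; `min_max`, `weighted_fusion`, `alpha`, and the sample scores are hypothetical and not part of the code above.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; a list with identical scores maps to all 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(bm25_scores: dict, dense_scores: dict, alpha: float = 0.7):
    """alpha weights the dense score; a document missing from one list scores 0 there."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    doc_ids = set(bm25_n) | set(dense_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: -item[1])

# Made-up scores for three documents (ids 0 to 2).
bm25_scores = {0: 12.0, 1: 3.0, 2: 0.5}
dense_scores = {1: 0.82, 2: 0.80, 0: 0.40}
ranking = weighted_fusion(bm25_scores, dense_scores)
```

Unlike RRF, this variant is sensitive to outlier scores and to the choice of `alpha`, so the weight usually needs tuning on held-out queries.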
### Cross-encoder reranking
```python
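import cohere

co = cohere.Client()  # assumes the API key is available in the environment (CO_API_KEY)

def rerank(query, candidates, top_n=10):
    """Re-score candidates with a cross-encoder and return the best top_n, best first."""
    r = co.rerank(query=query, documents=candidates,
                  model="rerank-english-v3.0", top_n=top_n)
    return [candidates[res.index] for res in r.results]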
```
### Chunking with overlap
```python
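def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Better: a semantic chunker that respects paragraphs and headings.
# Note: chunk_size and chunk_overlap here count characters, not words.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50,
                                          separators=["\n\n", "\n", ". ", " "])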
```
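
For Markdown sources, a heading-aware splitter keeps each chunk inside one section. A minimal sketch under stated assumptions: `chunk_markdown` is a hypothetical helper (not a LangChain API) that splits on ATX headings and then packs paragraphs up to a word budget.

```python
import re

def chunk_markdown(text: str, max_words: int = 200) -> list[dict]:
    """Split on '#' headings, then pack paragraphs into chunks of at most max_words.

    A single paragraph longer than max_words is kept whole rather than cut mid-sentence.
    """
    chunks = []
    heading = ""
    buffer, count = [], 0

    def flush():
        nonlocal buffer, count
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
        buffer, count = [], 0

    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if re.match(r"#{1,6}\s", block):
            flush()                       # never mix two sections in one chunk
            lines = block.split("\n", 1)
            heading = lines[0].lstrip("#").strip()
            if len(lines) == 1:
                continue
            block = lines[1].strip()      # text directly under the heading line
        n = len(block.split())
        if count + n > max_words:
            flush()
        buffer.append(block)
        count += n
    flush()
    return chunks

sample = "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n## Details\n\nThird paragraph."
sections = chunk_markdown(sample, max_words=50)
```

Storing the heading next to each chunk also makes it available as metadata for filtering or for prepending to the text before embedding.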
### MRL truncation (Matryoshka)
```python
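# text-embedding-3-large: 3072d, truncatable to 256/512/1024
def embed_mrl(text, dim=512):
    full = embed([text])[0]   # embed() from the basic dense retrieval example
    truncated = full[:dim]    # keep only the leading dimensions
    return truncated / np.linalg.norm(truncated)  # re-normalize after truncation
# 3072 → 512 dims: 6× memory savings at roughly 95% recall (approximate; measure on your own data).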
```
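
The memory saving follows directly from the dimension ratio. A small worked calculation, assuming float32 storage and ignoring index overhead; the corpus size of one million vectors is a made-up figure.

```python
BYTES_PER_FLOAT32 = 4

def index_bytes(num_vectors: int, dim: int) -> int:
    """Raw vector storage only; graph links or codebooks are not counted."""
    return num_vectors * dim * BYTES_PER_FLOAT32

full = index_bytes(1_000_000, 3072)      # 12_288_000_000 bytes, about 12.3 GB
truncated = index_bytes(1_000_000, 512)  # 2_048_000_000 bytes, about 2.0 GB
savings = full / truncated               # 6.0
```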
### ColBERT (multi-vector late interaction)
```python
import numpy as np

# Token-level vectors for both query and doc: take the max similarity per query token, then sum.
# Plain NumPy sketch of the late-interaction (MaxSim) score; the ColBERT library itself is not needed for it.
def colbert_score(query_vecs, doc_vecs):
    # query_vecs: [Q, d], doc_vecs: [D, d]
    sim = query_vecs @ doc_vecs.T  # [Q, D]
    return sim.max(axis=1).sum()   # sum of per-query-token max similarity
```
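
A tiny worked example makes the MaxSim scoring concrete. The vectors below are made-up 2-d unit vectors, not real ColBERT outputs, and the scoring function is repeated so the block runs on its own.

```python
import numpy as np

def colbert_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    sim = query_vecs @ doc_vecs.T          # [Q, D] token-to-token similarities
    return float(sim.max(axis=1).sum())    # best doc token per query token, summed

# Two query tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])  # has a match for both query tokens
doc_b = np.array([[1.0, 0.0], [0.8, 0.6]])              # matches only the first token well

score_a = colbert_score(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = colbert_score(query, doc_b)  # 1.0 + 0.6 = 1.6
```

Because each query token picks its own best document token, `doc_a` wins even though it is longer; a single pooled vector per document could not express this.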
### pgvector hybrid (production)
```sql
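-- 1536 dims stays within pgvector's 2,000-dimension limit for HNSW on the vector type
-- (e.g. text-embedding-3-small, or a truncated larger model).
CREATE TABLE docs (
  id bigserial PRIMARY KEY,
  content text,
  embedding vector(1536),
  tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON docs USING gin (tsv);

-- Hybrid query: $1 = query embedding, $2 = query text
WITH dense AS (
  SELECT id, 1 - (embedding <=> $1) AS score
  FROM docs ORDER BY embedding <=> $1 LIMIT 50
), sparse AS (
  SELECT id, ts_rank_cd(tsv, websearch_to_tsquery('english', $2)) AS score
  FROM docs WHERE tsv @@ websearch_to_tsquery('english', $2)
  ORDER BY score DESC LIMIT 50
)
-- The two scores are on different scales; the 0.7 / 0.3 weights are a starting point to tune.
SELECT id, COALESCE(d.score, 0) * 0.7 + COALESCE(s.score, 0) * 0.3 AS score
FROM dense d FULL OUTER JOIN sparse s USING (id)
ORDER BY score DESC LIMIT 10;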
```
### Multimodal CLIP search
```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(img):
    with torch.no_grad():
        return model.get_image_features(**proc(images=img, return_tensors="pt"))

def embed_text(t):
    with torch.no_grad():
        # padding=True lets a list of texts with different lengths share one batch
        return model.get_text_features(**proc(text=t, return_tensors="pt", padding=True))

# Images and text share one vector space, which enables cross-modal search.
# L2-normalize both sides before comparing with cosine / inner product.
```
## Decision criteria
| Situation | Approach |
|---|---|
| Quick prototype | OpenAI embeddings + Faiss/LanceDB |
| Production RAG | hybrid (BM25 + dense) + Cohere rerank |
| Self-host open | BGE-M3 + Qdrant + BGE-reranker |
| Multilingual | BGE-M3, Cohere multilingual, embed-v4 |
| Code search | Voyage-code-3 or jina-code-v2 |
| Multimodal | CLIP / SigLIP / Jina-CLIP |
**Default**: production RAG → hybrid (BM25 + dense) + cross-encoder rerank.
## 🔗 Graph
- Parent: [[Information Retrieval]] · [[Embeddings]]
- Variants: [[Dense Retrieval]] · [[Sparse Retrieval]] · [[Information-Retrieval-IR|Hybrid Search]] · [[ColBERT]]
- Applications: [[RAG]] · [[Recommender Systems]]
- Adjacent: [[BM25]] · [[Cross-Encoder Reranking]] · [[CLIP]]
## 🤖 LLM usage
**When to use**: RAG retrieval, semantic deduplication, cross-lingual search, recommendation.
**When not to use**: exact-match lookups (use BM25), small corpora (<1k docs; passing the documents directly to the LLM is simpler), high-precision regex needs.
## ❌ Anti-patterns
- **Dense-only**: BM25 still wins on rare terms and proper nouns, so use hybrid.
- **No reranker**: skipping the reranker can leave a large share of top-10 quality on the table (roughly 30% by this page's estimate).
- **Bad chunking**: fixed-size chunks cut mid-sentence; use semantic or heading-aware chunking.
- **No metadata filter**: apply metadata filters (date, source) before or together with vector search.
- **Cosine without normalize**: a silent bug when inner product is used as cosine; always L2-normalize.
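
The normalization bug in the last item is easy to reproduce. A minimal sketch with made-up 2-d vectors, showing that raw inner product and cosine similarity can pick different winners:

```python
import numpy as np

docs = np.array([
    [10.0, 0.0],   # long vector, 45 degrees away from the query
    [0.7, 0.7],    # short vector pointing the same way as the query
])
query = np.array([1.0, 1.0])

# Raw inner product rewards vector length, not direction.
raw_best = int(np.argmax(docs @ query))        # 0

# After L2 normalization, inner product equals cosine similarity.
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)
cos_best = int(np.argmax(docs_n @ query_n))    # 1
```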
## 🧪 Verification / duplicates
- Verified (Karpukhin DPR 2020, Khattab ColBERT 2020, MTEB benchmark, Cohere Rerank docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — hybrid, MRL, ColBERT, pgvector, multimodal |
## 🛠️ Applied cases (Applied in summary)
<!-- CODE-GROUNDING:START -->
### 🔎 Codebase evidence (auto-extracted from the E:\Wiki repo)
**Actual implementation/usage locations:**
- `connectai/src/features/projectChronicle/guardPrompt.ts:57` — [Omitted long matching line]
_Auto-generated by code_grounding.mjs · refreshed on re-run_
<!-- CODE-GROUNDING:END -->