| id | title | category | status | source_trust_level | verification_status | created_at | updated_at | tags | tech_stack | applied_in | aliases |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ai-embedding-strategy-deep | Embedding Strategy: model / chunk / multi-vector | Coding | draft | B | conceptual | 2026-05-09 | 2026-05-09 | | | | |
Embedding Strategy
Embedding choices are a major lever on RAG quality. This note covers model selection, chunking strategy, multi-vector retrieval, late chunking, and ColBERT.
📖 Key concepts
- Each model differs in dimension and quality.
- Chunk size is a trade-off.
- Multi-vector = more accurate.
- Late chunking = preserves context.
💻 Code patterns
Model 선택
OpenAI:
- text-embedding-3-small: 1536 dim, $0.02/M tokens, good baseline.
- text-embedding-3-large: 3072 dim, $0.13/M tokens, more accurate.
Voyage:
- voyage-3: 1024 dim, $0.06/M, SoTA quality.
- voyage-code-3: code-specific.
Cohere:
- embed-english-v3 / embed-multilingual-v3.
Open:
- BGE / e5 / nomic / mxbai.
- Self-host = no per-token fee (you pay for compute instead).
→ See the MTEB leaderboard.
Voyage (highest quality)
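import { VoyageAIClient } from 'voyageai';
// Read the key from the environment (the variable name here is an assumption).
const client = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });
const r = await client.embed({
  input: ['Hello world'],
  model: 'voyage-3',
});
const embedding = r.data?.[0]?.embedding;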
OpenAI
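import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const r = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Hello world'],
  dimensions: 256, // optional: shorter vectors (storage cost ↓)
});
→ Matryoshka-style: truncating dimensions is OK.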
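Truncation can also be done client-side on vectors that are already stored. A minimal sketch, assuming the model was trained so that the leading dimensions carry most of the signal; `truncate_embedding` is a hypothetical helper, not part of any SDK. The vector is rescaled after truncation so cosine and dot-product scores stay comparable.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0, 0.0]
short = truncate_embedding(full, 2)
print(short)
```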
Self-host (BGE)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(['Hello world'])
→ Cost = compute. Quality is good.
Chunk size
50 token: small, precise, lose context.
500 token: balanced.
2000 token: more context, less precision.
→ 200-500 tokens is the usual sweet spot.
Domains such as code or legal text differ.
Recursive chunking
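# Older LangChain releases: from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # counted in characters by default, not tokens
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' '],
)
→ Preserves natural boundaries.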
Semantic chunking
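# A model decides the chunk boundaries.
# Sentences that are close in meaning stay in the same chunk.
# e.g. LangChain's SemanticChunker (splits on embedding distance between sentences)
→ More accurate, but more expensive.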
Parent document retriever
Small chunk (200 token) → embed.
Big chunk (2000 token) = parent.
Search small → return parent.
→ Precision (small) + context (big).
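The search-small, return-parent flow can be sketched as below. This is an illustration only: `overlap_score` is a word-overlap stand-in for real vector similarity, and all function names are hypothetical.

```python
def split_words(text, size):
    """Split text into child chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(parents, child_size):
    """Return (child_text, parent_id) pairs; only children get searched."""
    index = []
    for parent_id, parent in enumerate(parents):
        for child in split_words(parent, child_size):
            index.append((child, parent_id))
    return index

def overlap_score(query, text):
    """Stand-in for embedding similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, index, parents, k=1):
    """Rank children, then return the distinct parents of the best ones."""
    ranked = sorted(index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results

parents = [
    "Cats sleep most of the day. They hunt at night.",
    "Postgres stores rows in pages. Indexes speed up lookups.",
]
index = build_index(parents, child_size=5)
hits = retrieve_parents("postgres indexes", index, parents, k=1)
print(hits)
```

The match happens on a five-word child, but the caller receives the full parent text, which is what gets passed to the LLM.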
Late chunking (modern)
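# 1. Encode the whole document in a single pass.
# 2. Keep the token-level embeddings instead of pooling once per document.
# 3. Chunk vector = average of the token embeddings in that chunk's token range.
# → Each chunk keeps document-wide context, because every token saw the whole document.
→ Available in recent Jina / Voyage releases.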
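The pooling step (step 3) can be sketched in a few lines. This assumes you already have one embedding per token from a single full-document forward pass; the tiny 2-dimensional vectors and the `late_chunk` name are illustrative only.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

def late_chunk(token_embeddings, spans):
    """
    token_embeddings: one vector per token, produced by encoding the
                      whole document at once.
    spans: (start, end) token ranges, end exclusive, one per chunk.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk(token_embeddings, [(0, 2), (2, 4)])
print(chunk_vectors)  # → [[2.0, 0.0], [0.0, 3.0]]
```

The difference from ordinary chunking is only in the order of operations: encode first, split second.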
Multi-vector (ColBERT)
1 doc = N vectors (one per token).
Search matches each query token to its closest doc token.
- More accurate.
- More storage.
→ ColBERTv2, RAGatouille.
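The scoring rule described above is usually called MaxSim: for each query token take the best-matching document token, then sum over query tokens. A minimal sketch with hand-made 2-dimensional token vectors (real models use learned, normalized vectors):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product against any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first query token

print(maxsim_score(query, doc_a))  # → 2.0
print(maxsim_score(query, doc_b))  # → 1.0
```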
Hybrid (sparse + dense)
BM25 (keyword) + embedding (semantic) → RRF.
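Reciprocal Rank Fusion only needs the rank of each document in each list, so BM25 scores and cosine scores never have to be put on the same scale. A minimal sketch; `k=60` is the commonly used constant and is an assumption here, as are the document ids.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]    # keyword search order
dense_ranking = ["d3", "d1", "d4"]   # embedding search order
fused = rrf([bm25_ranking, dense_ranking])
print(fused)  # → ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1, d3) rise above documents found by only one retriever.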
Quantization
# float32 → int8 (4x storage ↓)
import numpy as np
def quantize(emb, scale):
    # Round before casting; a bare astype() would truncate toward zero.
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)
→ Storage / cost ↓. Quality drops slightly.
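How much is lost depends on how the scale is chosen. A pure-Python sketch, assuming a symmetric per-vector scale where the largest magnitude maps to 127 (one common choice, not the only one); the function names are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric per-vector quantization: largest magnitude maps to 127."""
    peak = max(abs(x) for x in vec)
    scale = 127.0 / peak if peak > 0 else 1.0
    codes = [max(-127, min(127, round(x * scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c / scale for c in codes]

vec = [0.2, -1.0, 0.6]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
print(codes)      # → [25, -127, 76]
print(max_error)  # bounded by half a quantization step: 0.5 / scale
```

The per-component error bound is small, but its effect on ranking still has to be measured on a retrieval eval set.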
Binary quantization
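# float32 → 1 bit per dimension (32x ↓)
# packbits stores 8 dimensions per byte; a plain uint8 array would only be 4x smaller.
binary_emb = np.packbits(emb > 0, axis=-1)
→ Hamming distance (fast). Quality is worse, but acceptable when storage would otherwise explode.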
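Hamming distance on sign bits is an XOR followed by a bit count. A dependency-free sketch that packs the bits into a Python integer; the helper names are illustrative.

```python
def binarize(vec):
    """Pack the sign of each dimension into one integer, first dim = highest bit."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")

a = binarize([0.3, -0.1, 0.7, -0.9])   # bits 1010
b = binarize([0.2, 0.4, -0.5, -0.1])   # bits 1100
print(hamming(a, b))  # → 2
```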
Rerank (after retrieve)
Embedding search retrieves the top 50.
A cross-encoder reranks them down to the top 5.
→ Compensates for the weaknesses of embedding-only retrieval.
Cohere Rerank, BAAI bge-reranker.
Multilingual embedding
text-embedding-3 is multilingual.
voyage-multilingual-2.
BGE-m3.
→ One model for all languages,
or one model per language.
Code embedding
voyage-code-3.
Jina code embedding.
codesage.
→ Code-specific models are more accurate than generic ones.
Cost comparison
OpenAI 3-small: $0.02 / M tokens.
OpenAI 3-large: $0.13 / M tokens.
Voyage 3: $0.06 / M tokens.
Cohere v3: $0.10 / M tokens.
Self-host: $0 per token + GPU rental.
→ High volume = self-host.
Low volume = API.
Embedding cache
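import { createHash } from 'node:crypto';
// `cache` and `embed` are placeholders for your cache client and embedding call.
async function embedCached(text: string): Promise<number[]> {
  const key = createHash('sha256').update(text).digest('hex');
  const cached = await cache.get(key);
  if (cached) return cached;
  const emb = await embed(text);
  await cache.set(key, emb);
  return emb;
}
→ The same text is embedded only once.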
Re-embed (model upgrade)
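A newer model is better, so:
- Re-embed every doc (vectors from different models are not comparable).
- Cost: number of docs × tokens per doc × price (e.g. 1M docs at $0.02 / 1M tokens).
- Time (several hours).
→ Plan + budget.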
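The cost line can be turned into a quick estimate once an average document length is fixed. The 500 tokens per doc below is an assumed figure for illustration, not a number from this note; the price is the text-embedding-3-small rate listed earlier.

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Estimated API cost of re-embedding a whole corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs, assumed 500 tokens each, at $0.02 per 1M tokens.
cost = reembed_cost_usd(1_000_000, 500, 0.02)
print(f"${cost:.2f}")  # → $10.00
```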
Eval
# MTEB-style
# `retrieve` and `compute_recall` are placeholders for your retriever and metric.
queries = [{'q': '...', 'relevant': ['doc1', 'doc5']}]
for q in queries:
    results = retrieve(q['q'])
    recall = compute_recall(results, q['relevant'])
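One possible definition of the recall metric used in the loop above is recall@k: the fraction of relevant documents that appear in the top k results. A minimal sketch; the name `recall_at_k` and the sample ids are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids found in the first k retrieved ids."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant_set)
    return hits / len(relevant_set)

retrieved = ["doc1", "doc3", "doc5", "doc9"]
relevant = ["doc1", "doc5"]
print(recall_at_k(retrieved, relevant, k=2))  # → 0.5
print(recall_at_k(retrieved, relevant, k=3))  # → 1.0
```

Averaging this over a fixed golden set of queries gives a single number to compare before and after any change to model, chunking, or quantization.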
Domain fine-tune
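# Fine-tuning with sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Base model reused from the self-host example; swap in your own.
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
train = [
    InputExample(texts=['query1', 'doc1'], label=1.0),
    InputExample(texts=['query1', 'doc2'], label=0.0),
]
dataloader = DataLoader(train, shuffle=True, batch_size=16)
# fit() takes (dataloader, loss) pairs, not a bare dataloader.
model.fit(train_objectives=[(dataloader, losses.CosineSimilarityLoss(model))], epochs=3)
→ A domain-specific model is more accurate than a generic one.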
Vector DB choice
pgvector: simple, Postgres-friendly.
Pinecone: managed.
Qdrant: open source + fast.
Weaviate: feature-rich.
Chroma: small / dev use.
Milvus: large scale.
LanceDB: serverless friendly.
Multi-tenant embedding
SELECT * FROM docs
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;
→ Per-tenant isolation.
Visualization
# Project to 2D with UMAP / t-SNE
import umap
proj = umap.UMAP().fit_transform(embeddings)
# Plot.
→ Clusters become visible.
Production tips
1. Latest model (Voyage 3, OpenAI 3-large).
2. Recursive / late chunking.
3. Hybrid search.
4. Rerank top-5.
5. Cache aggressively.
6. Eval (golden set).
7. Plan re-embed (model upgrade).
LLM-friendly format
Code:
- Chunk per function.
- Include comments.
- File path metadata.
Docs:
- Chunk per Markdown header.
- Section path metadata.
Data:
- Row group (table).
- Column metadata.
Pitfalls
- Assuming generic chunking is best: it depends on the domain.
- Re-embedding on every query: use a cache.
- Ignoring model upgrades: quality goes stale.
- Ignoring storage: 1B vectors × 1536 dim × 4 bytes ≈ 6.1 TB.
- Quantizing without eval: silent quality ↓.
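The storage figure above is plain multiplication, and the same arithmetic shows what quantization buys. A small sketch covering raw vector bytes only; index structures and metadata add overhead that is not modeled here.

```python
def index_size_bytes(num_vectors, dims, bits_per_dim):
    """Raw vector storage, excluding index overhead and metadata."""
    return num_vectors * dims * bits_per_dim // 8

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size_tb = index_size_bytes(1_000_000_000, 1536, bits) / 1e12
    print(f"{name}: {size_tb:.3f} TB")
# → float32: 6.144 TB
# → int8: 1.536 TB
# → binary: 0.192 TB
```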
🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Generic English | OpenAI 3-small |
| Quality first | Voyage 3 |
| Multilingual | OpenAI 3 / BGE-m3 |
| Code | voyage-code-3 |
| Self-host | BGE / e5 |
| Cost-sensitive | OpenAI dim=256 (truncate) |
| Multi-vector | ColBERT / RAGatouille |
❌ Anti-patterns
- Large model for everything: cost.
- No chunking strategy: bad recall.
- No cache: repeat cost.
- Never upgrading the model: stale quality.
- No eval: silent regression.
- Quantize without eval: quality cliff.
🤖 Hints for LLM use
- Voyage 3 / OpenAI 3 is the sweet spot.
- Recursive chunking is the baseline.
- Late chunking + multi-vector is the modern approach.
- Hybrid + rerank gives a quality jump.