[G1-Sync] Manual knowledge update
---
id: ai-embedding-strategy-deep
title: Embedding Strategy — model / chunk / multi-vector
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, embedding, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [embedding model, OpenAI embedding, voyage, cohere, multi-vector, late chunking, ColBERT]
---

# Embedding Strategy

> Embedding is a major lever for RAG quality. Covers **model choice, chunk strategy, multi-vector, late chunking, ColBERT**.

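## 📖 Core concepts

- Each model differs in dimension, cost, and quality.
- Chunk size is a trade-off between precision (small) and context (large).
- Multi-vector (one vector per token) = more accurate, more storage.
- Late chunking = each chunk keeps whole-document context.
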
## 💻 Code patterns

### Model selection

```
OpenAI:
- text-embedding-3-small: 1536 dim, $0.02/M tokens, good baseline.
- text-embedding-3-large: 3072 dim, $0.13/M tokens, more accurate.

Voyage:
- voyage-3: 1024 dim, $0.06/M tokens, strong retrieval quality.
- voyage-code-3: code-specific.

Cohere:
- embed-english-v3.0 / embed-multilingual-v3.0.

Open:
- BGE / e5 / nomic / mxbai.
- Self-host = no per-token fee (you pay for compute).

→ Check the MTEB leaderboard. List prices change, so confirm on each provider's pricing page.
```

### Voyage (quality-first)

```ts
import { VoyageAIClient } from 'voyageai';

const apiKey = process.env.VOYAGE_API_KEY; // API key from the environment
const client = new VoyageAIClient({ apiKey });

const r = await client.embed({
  input: ['Hello world'],
  model: 'voyage-3',
});
const embedding = r.data?.[0]?.embedding; // number[] for the first input
```

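### OpenAI

```ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const r = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Hello world'],
  dimensions: 256, // optional: shorter vector (storage / search cost ↓)
});
const embedding = r.data[0].embedding;
```

→ The text-embedding-3 models are trained Matryoshka-style, so shortening via `dimensions` is supported. The per-token API price stays the same; the saving is in storage and search.
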
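If vectors are already stored at full size, the same Matryoshka property allows truncating them offline. A minimal sketch, assuming the embeddings are NumPy arrays; `truncate_embedding` is a hypothetical helper name, and the re-normalization step is there because a truncated vector is no longer unit length:

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components, then rescale to unit length."""
    head = np.asarray(emb, dtype=np.float32)[..., :dim]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.clip(norm, 1e-12, None)  # guard against zero vectors

# Random stand-in for two 1536-dim embeddings.
full = np.random.default_rng(0).normal(size=(2, 1536)).astype(np.float32)
short = truncate_embedding(full, 256)
print(short.shape)
```

This only makes sense for models trained to tolerate truncation; for other models, measure recall before and after.
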
### Self-host (BGE)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# normalize_embeddings=True returns unit vectors, so dot product == cosine.
embeddings = model.encode(['Hello world'], normalize_embeddings=True)  # shape (1, 384)
```

→ Cost = compute only. Quality is good for the model size.

### Chunk size

```
50 tokens: small, precise, loses context.
500 tokens: balanced.
2000 tokens: more context, less precision.

→ 200-500 tokens is the usual sweet spot.
The best size varies by domain (code, legal).
```

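The size trade-off is easiest to experiment with using a fixed-size window and a small overlap. The sketch below is illustrative only: `chunk_tokens` is a hypothetical helper, and it assumes the text has already been tokenized (whitespace splitting here stands in for the embedding model's real tokenizer).

```python
def chunk_tokens(tokens, size=300, overlap=30):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

tokens = "a long document split on whitespace as a stand-in tokenizer".split()
for chunk in chunk_tokens(tokens, size=4, overlap=1):
    print(chunk)
```

Running the same eval set over a few `size` values (for example 200, 300, 500) is a cheap way to find the sweet spot for a given domain.
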
### Recursive chunking

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # counted in characters by default, not tokens
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', ''],  # tried in order, coarse to fine
)
chunks = splitter.split_text(text)  # `text` is the document string
```

→ Preserves paragraph and sentence boundaries where possible.

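### Semantic chunking

```python
# An embedding model (not an LLM) decides the chunk boundaries:
# split where the similarity between adjacent sentences drops.
# Semantically close sentences end up in the same chunk.

# See LangChain's SemanticChunker (langchain_experimental.text_splitter).
```

→ Often more accurate, but more expensive (every sentence is embedded at index time).
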
### Parent document retriever

```
Small chunk (200 tokens) → embed.
Big chunk (2000 tokens) = parent.

Search the small chunks → return their parents.

→ Precision (small) + context (big).
```

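The search-small, return-parent flow can be sketched without any vector store. Everything here is hypothetical scaffolding (`Child`, `build_children`, `retrieve_parents`); the ranked child list is assumed to come from whatever vector search is in use, and word counts stand in for token counts.

```python
from dataclasses import dataclass

@dataclass
class Child:
    text: str
    parent_id: str

def build_children(parents: dict[str, str], child_size: int = 200) -> list[Child]:
    """Split each parent into small word windows that remember their parent."""
    children = []
    for pid, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append(Child(" ".join(words[i:i + child_size]), pid))
    return children

def retrieve_parents(ranked_children: list[Child], parents: dict[str, str], top_k: int = 3) -> list[str]:
    """Map ranked child hits back to unique parents, keeping rank order."""
    seen, out = set(), []
    for child in ranked_children:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            out.append(parents[child.parent_id])
        if len(out) == top_k:
            break
    return out

parents = {"a": "x x x x x", "b": "y y y"}
children = build_children(parents, child_size=2)
# Pretend vector search ranked these child chunks best-first.
hits = [children[3], children[0], children[4]]
print(retrieve_parents(hits, parents))
```

De-duplicating parents matters: several top child hits often belong to the same parent, and returning it repeatedly wastes context window.
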
### Late chunking (modern)

```python
# 1. Run the whole document through a long-context model → token-level embeddings.
# 2. Decide chunk boundaries as token ranges.
# 3. Chunk vector = mean of the token embeddings in that range.

# → Each chunk carries context, because every token attended to the whole document.
```

→ Introduced by Jina; Voyage offers a related contextualized chunk embedding approach.

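The pooling step (step 3) is small enough to show directly. This is a sketch under stated assumptions: `token_embs` is assumed to be the per-token output of a single forward pass over the whole document, and `late_chunk` and the span list are hypothetical names, not a library API.

```python
import numpy as np

def late_chunk(token_embs: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """Mean-pool token embeddings over each [start, end) token span."""
    return np.stack([token_embs[start:end].mean(axis=0) for start, end in spans])

# Toy stand-in: 6 tokens, 2-dim embeddings.
token_embs = np.arange(12, dtype=np.float32).reshape(6, 2)
chunk_vecs = late_chunk(token_embs, [(0, 3), (3, 6)])
print(chunk_vecs)
```

The difference from ordinary chunking is only the order of operations: here the model sees the full document first and the split happens afterwards, on its outputs.
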
### Multi-vector (ColBERT)

```
1 doc = N vectors (one per token).
Scoring: for each query token, take the similarity to its closest doc token, then sum (MaxSim).
- More accurate.
- Much larger storage.

→ ColBERTv2, RAGatouille.
```

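The MaxSim scoring rule fits in a few lines of NumPy. A minimal sketch, assuming both sides are already L2-normalized token embeddings; `maxsim` is an illustrative name, and real ColBERT implementations add compression and candidate pruning on top.

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum over query tokens of the best dot product against any doc token."""
    sim = query_vecs @ doc_vecs.T          # shape (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])    # 2 query tokens
d = np.array([[1.0, 0.0], [0.6, 0.8]])    # 2 doc tokens
print(maxsim(q, d))
```

The storage cost follows from the shapes: a document contributes one row per token instead of one vector in total.
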
### Hybrid (sparse + dense)

```
BM25 (keyword) + embedding (semantic) → fuse with RRF (Reciprocal Rank Fusion).
```

→ [[AI_Hybrid_Search_Patterns]].

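RRF needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity. A minimal sketch; `rrf` is a hypothetical helper, and `k=60` is the commonly used constant rather than a tuned value.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d1", "d2", "d3"]    # keyword ranking, best first
dense_hits = ["d3", "d1", "d4"]   # embedding ranking, best first
print(rrf([bm25_hits, dense_hits]))
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still survives in the fused list.
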
### Quantization

```python
# float32 → int8 (4x storage ↓)
import numpy as np

def quantize(emb, scale):
    # Round before casting; astype alone truncates toward zero.
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)

# Pick scale from the data, e.g. scale = 127 / np.abs(embeddings).max()
```

→ Storage / cost ↓. Quality drops slightly.

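### Binary quantization

```python
# float32 → 1 bit per dimension (32x ↓)
import numpy as np

# packbits stores 8 dimensions per byte; a plain uint8 cast would only give 4x.
binary_emb = np.packbits(emb > 0, axis=-1)
```

→ Search with Hamming distance (fast).
Quality is noticeably worse, but acceptable when storage would otherwise explode.
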
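Hamming distance on packed codes is an XOR followed by a bit count. A small sketch, assuming embeddings are binarized by sign as above; `to_binary` and `hamming` are illustrative names, and production systems usually rely on the vector database's built-in binary index instead.

```python
import numpy as np

def to_binary(embs: np.ndarray) -> np.ndarray:
    """Sign-binarize and pack 8 dimensions into each uint8 byte."""
    return np.packbits(embs > 0, axis=-1)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

embs = np.array([
    [0.5, -1.0, 2.0, -3.0, 1.0, 1.0, -1.0, -1.0],
    [0.5,  1.0, 2.0, -3.0, 1.0, 1.0, -1.0,  1.0],
])
codes = to_binary(embs)
print(codes.shape, hamming(codes[0], codes[1]))
```

A common pattern is to use the binary codes for a fast first pass and rescore the shortlist with the original float vectors.
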
### Rerank (after retrieve)

```
Embedding search returns the top-50.
A cross-encoder reranks them down to the top-5.

→ Compensates for the weaknesses of embedding-only retrieval.
Cohere Rerank, BAAI bge-reranker.
```

### Multilingual embedding

```
text-embedding-3 is multilingual.
voyage-multilingual-2.
BGE-m3.

→ One model for all languages,
or a separate model per language.
```

### Code embedding

```
voyage-code-3.
Jina code embeddings.
CodeSage.

→ Code-specific models are usually more accurate on code than generic ones.
```

### Cost comparison

```
OpenAI 3-small: $0.02 / M tokens.
OpenAI 3-large: $0.13 / M tokens.
Voyage 3: $0.06 / M tokens.
Cohere v3: $0.10 / M tokens.
Self-host: $0 API fee + GPU rental.

→ Large volume = self-host.
Small volume = API.
```

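### Embedding cache

```ts
import { createHash } from 'node:crypto';

// `cache` and `embed` are app-provided (e.g. a Redis client and a provider wrapper).
async function cachedEmbed(text: string): Promise<number[]> {
  const key = createHash('sha256').update(text).digest('hex');
  const cached = await cache.get(key);
  if (cached) return cached;

  const emb = await embed(text);
  await cache.set(key, emb);
  return emb;
}
```

→ The same text is embedded only once. Include the model name in the key if more than one model is in use.
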
### Re-embed (model upgrade)

```
A better model comes out.
- Re-embed every doc (vectors from different models are not comparable).
- Cost = total tokens × price, e.g. 1M docs × 500 tokens × $0.02 / M tokens = $10.
- Time: several hours (rate limits, index rebuild).

→ Plan + budget.
```

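The budget side is plain arithmetic on total tokens, not on document count. A rough sketch; `reembed_cost_usd` is a hypothetical helper, and the 500-token average and the price are example inputs to replace with measured values.

```python
def reembed_cost_usd(num_docs: int, avg_tokens_per_doc: float, usd_per_million_tokens: float) -> float:
    """Estimated API cost of embedding the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs × 500 tokens at $0.02 per million tokens.
cost = reembed_cost_usd(1_000_000, 500, 0.02)
print(f"${cost:.2f}")
```

For API models the dollar cost is often small; the hours of wall-clock time and the index rebuild are usually the real constraint.
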
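### Eval

```python
# MTEB-style retrieval eval: mean recall over a labelled query set.
queries = [{'q': '...', 'relevant': ['doc1', 'doc5']}]

def compute_recall(results, relevant):
    # Fraction of the relevant doc ids that were retrieved.
    return len(set(results) & set(relevant)) / len(relevant)

recalls = []
for q in queries:
    results = retrieve(q['q'])  # app-provided: returns the top-k doc ids
    recalls.append(compute_recall(results, q['relevant']))
print(sum(recalls) / len(recalls))
```
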
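### Domain fine-tune

```python
# Fine-tuning with sentence-transformers (classic fit API)
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
train = [
    InputExample(texts=['query1', 'doc1'], label=1.0),  # relevant pair
    InputExample(texts=['query1', 'doc2'], label=0.0),  # irrelevant pair
]
dataloader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

# fit takes (dataloader, loss) pairs, not a bare dataloader.
model.fit(train_objectives=[(dataloader, loss)], epochs=3)
```

→ A domain-tuned model is usually more accurate than a generic one on that domain.
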
### Vector DB choice

```
pgvector: simple, Postgres-friendly.
Pinecone: managed.
Qdrant: open source + fast.
Weaviate: feature-rich.
Chroma: small / dev.
Milvus: large scale.
LanceDB: serverless-friendly.
```

→ [[DB_pgvector_Production]].

### Multi-tenant embedding

```sql
-- <=> is pgvector's cosine distance operator.
SELECT * FROM docs
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;
```

→ Per-tenant isolation: every query filters on tenant_id.

### Visualization

```python
# Project to 2D with UMAP (or t-SNE)
import umap  # pip install umap-learn

proj = umap.UMAP(n_components=2).fit_transform(embeddings)  # shape (n, 2)

# Plot proj[:, 0] against proj[:, 1], colored by label or source.
```

→ Clusters become visible.

### Production tips

```
1. Use a current model (Voyage 3, OpenAI 3-large).
2. Recursive / late chunking.
3. Hybrid search.
4. Rerank down to the top-5.
5. Cache aggressively.
6. Eval against a golden set.
7. Plan for re-embedding (model upgrades).
```

### LLM-friendly format

```
Code:
- Chunk per function.
- Include comments.
- File path as metadata.

Docs:
- Chunk per Markdown header.
- Section path as metadata.

Data:
- Row group (table).
- Column metadata.
```

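For the docs case, header-based chunking with a section path can be sketched with the standard library alone. This is a simplified illustration: `chunk_markdown` is a hypothetical helper, it only recognizes ATX headers (`#` to `######`), and it does not skip `#` lines inside fenced code blocks.

```python
import re

def chunk_markdown(md: str) -> list[dict]:
    """One chunk per header section, tagged with its header path."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"section_path": " > ".join(path), "text": text})
        body.clear()

    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            flush()                      # close the previous section
            level = len(m.group(1))
            del path[level - 1:]         # drop headers at this level or deeper
            path.append(m.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

doc = "# Guide\nintro\n## Install\npip install x\n## Usage\nrun x\n"
for c in chunk_markdown(doc):
    print(c)
```

Storing `section_path` alongside the vector lets the retriever show the model where a chunk came from, and it can also be prepended to the chunk text before embedding.
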
### Pitfalls

```
- Assuming one generic chunk size is best: tune per domain.
- Embedding the same text again on every query: cache.
- Ignoring model upgrades: stale quality.
- Ignoring storage: 1B vectors × 1536 dim × 4 bytes ≈ 6 TB.
- Quantizing without eval: silent quality ↓.
```

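The storage figure is worth recomputing for each precision before choosing an index. A back-of-the-envelope sketch that counts raw vector bytes only; `index_size_bytes` is a hypothetical helper, and real indexes add overhead for graph links, ids, and metadata.

```python
def index_size_bytes(num_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw vector storage, ignoring index and metadata overhead."""
    return num_vectors * dim * bits_per_dim // 8

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    size = index_size_bytes(1_000_000_000, 1536, bits)
    print(f"{name}: {size / 1e12:.3f} TB")
```

Dimension and precision multiply, so a shorter vector combined with int8 shrinks the footprint far more than either change alone.
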
## 🤔 Decision criteria

| Task | Recommendation |
|---|---|
| Generic English | OpenAI 3-small |
| Quality first | Voyage 3 |
| Multilingual | OpenAI 3 / BGE-m3 |
| Code | voyage-code-3 |
| Self-host | BGE / e5 |
| Cost-sensitive | OpenAI dim=256 (truncate) |
| Multi-vector | ColBERT / RAGatouille |

## ❌ Anti-patterns

- **Large model for everything**: cost.
- **No chunking strategy**: bad recall.
- **No cache**: repeat cost.
- **Never upgrading the model**: stale quality.
- **No eval**: silent regression.
- **Quantize without eval**: quality cliff.

## 🤖 LLM usage hints

- Voyage 3 / OpenAI 3 are the sweet spot.
- Recursive chunking is the baseline.
- Late chunking + multi-vector are the modern options.
- Hybrid + rerank gives the biggest quality jump.

## 🔗 Related documents

- [[AI_Embeddings_Comparison]]
- [[AI_Custom_Embeddings]]
- [[AI_RAG_Production]]