[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -0,0 +1,357 @@
---
id: ai-embedding-strategy-deep
title: Embedding Strategy — model / chunk / multi-vector
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, embedding, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [embedding model, OpenAI embedding, voyage, cohere, multi-vector, late chunking, ColBERT]
---
# Embedding Strategy
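> Embeddings are one of the biggest levers on RAG quality. This note covers **model selection, chunking strategy, multi-vector retrieval, late chunking, and ColBERT**.
## 📖 Key concepts
- Each model differs in dimensionality and quality.
- Chunk size is a trade-off between precision and context.
- Multi-vector representations are more accurate but cost more storage.
- Late chunking preserves document-level context in each chunk.
## 💻 Code patterns
### Model selection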
```
OpenAI:
- text-embedding-3-small: 1536 dim, $0.02/M tokens, a good baseline.
- text-embedding-3-large: 3072 dim, $0.13/M tokens, more accurate.
Voyage:
- voyage-3: 1024 dim, $0.06/M tokens, strong retrieval quality.
- voyage-code-3: code-specific.
Cohere:
- embed-english-v3 / embed-multilingual-v3.
Open:
- BGE / e5 / nomic / mxbai.
- Self-host = no per-token fee (you pay for compute).
→ Check the MTEB leaderboard; prices change, so confirm on each vendor's pricing page.
```
### Voyage (quality-first)
```ts
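import { VoyageAIClient } from 'voyageai';

// Read the key from the environment rather than hard-coding it.
const client = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });
const r = await client.embed({
  input: ['Hello world'],
  model: 'voyage-3',
});
// One entry in `data` per input string; fields are optional in the SDK types.
const embedding = r.data?.[0]?.embedding;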
```
### OpenAI
```ts
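import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const r = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Hello world'],
  dimensions: 256, // optional: shorter vectors cut storage and search cost
});
const embedding = r.data[0].embedding;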
```
→ text-embedding-3 models are trained Matryoshka-style, so truncating dimensions keeps most of the quality.
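As a minimal sketch of what dimension truncation amounts to on the client side, assuming you already hold a full-length vector from a Matryoshka-trained model: keep the leading components and re-normalize, so cosine and dot-product scores stay comparable. The helper name `truncate_and_normalize` and the toy vector are illustrative, not part of any SDK.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

# Toy 3-dim vector standing in for a real embedding.
full = [3.0, 4.0, 12.0]
short = truncate_and_normalize(full, 2)
```

Truncation only makes sense for models trained for it; for other models, evaluate before shrinking.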
### Self-host (BGE)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(['Hello world'])
```
→ Cost is compute only. Quality is good.
### Chunk size
```
50 tokens: small and precise, but loses context.
500 tokens: balanced.
2000 tokens: more context, less precision.
→ 200-500 tokens is the usual sweet spot.
The best size differs by domain (code, legal).
```
### Recursive chunking
```python
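# Newer LangChain releases ship this in the langchain-text-splitters package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default, not tokens
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' '],  # tried in order, coarsest first
)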
```
→ Preserves natural boundaries (paragraph, line, sentence).
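To make the roles of chunk size and overlap concrete, here is a simplified sliding-window sketch over a word list. It is not LangChain's recursive algorithm (it ignores separators entirely), and `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens, size, overlap):
    """Split `tokens` into windows of `size` that share `overlap` items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not tokens:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]

words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, size=4, overlap=1)
```

Each window repeats the last item of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.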
### Semantic chunking
```python
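# Boundaries are chosen by meaning, e.g. where embedding similarity between adjacent sentences drops.
# Semantically close sentences end up in the same chunk.
# See LangChain's SemanticChunker (langchain_experimental).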
```
→ More accurate, but more expensive.
### Parent document retriever
```
Small chunk (200 tokens) → embed.
Big chunk (2000 tokens) = parent.
Search over small chunks → return the parent.
→ Precision (small) + context (big).
```
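The small-to-parent lookup can be sketched as follows. This is a toy under stated assumptions: chunks are split by word count, and `overlap_score` is a word-overlap stand-in for real embedding similarity. All function names are hypothetical.

```python
def split_words(text, size):
    """Split text into child chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(parents, child_size):
    """Return (child_chunk, parent_id) pairs for every parent document."""
    index = []
    for pid, text in parents.items():
        for child in split_words(text, child_size):
            index.append((child, pid))
    return index

def overlap_score(query, chunk):
    # Stand-in for cosine similarity between query and chunk embeddings.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_parents(query, index, parents, k=2):
    """Rank child chunks, then return their parents without duplicates."""
    ranked = sorted(index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    seen, out = set(), []
    for _child, pid in ranked[:k]:
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out

parents = {
    "a": "cats purr and sleep all day long",
    "b": "vector databases store embeddings for search",
}
index = build_index(parents, child_size=3)
hits = retrieve_parents("store embeddings", index, parents, k=1)
```

Deduplicating by parent id matters because several top-ranked children often belong to the same parent.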
### Late chunking (modern)
```python
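# 1. Run the whole document through the embedding model in one pass.
# 2. Keep the token-level output vectors (no pooling yet).
# 3. Chunk vector = average of the token vectors in that chunk's token range.
# → Each chunk keeps context, because every token attended to the full document.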
```
→ Available in recent Jina and Voyage offerings.
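The pooling step can be sketched in a few lines, assuming the contextual token vectors for the whole document are already available from a long-context embedding model. The numbers are toy values and `late_chunk` is a hypothetical helper name.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_vectors, spans):
    """Pool document-level token vectors over half-open (start, end) spans."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]

# Four token vectors produced by ONE forward pass over the full document.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk(token_vectors, [(0, 2), (2, 4)])
```

The difference from ordinary chunking is only the order of operations: the model sees the full document first, and chunk boundaries are applied afterwards.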
### Multi-vector (ColBERT)
```
1 doc = N vectors (one per token).
Search matches each query token to its closest doc token and sums the scores (MaxSim).
- More accurate.
- Much larger storage.
→ ColBERTv2, RAGatouille.
```
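The late-interaction score used by ColBERT-style models (MaxSim) can be written directly from the description above. This sketch assumes the token vectors are already normalized, and uses toy 2-dim vectors.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """For each query token take its best-matching doc token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first one
score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
```

The storage cost follows from the shape: every document keeps one vector per token instead of one vector in total.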
### Hybrid (sparse + dense)
```
BM25 (keyword) + embedding (semantic) → fuse the two rankings with RRF (Reciprocal Rank Fusion).
```
→ [[AI_Hybrid_Search_Patterns]].
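Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch, assuming each ranking is a list of document ids ordered best-first, and using the conventional constant `k=60`:

```python
def rrf(rankings, k=60):
    """Fuse several best-first rankings into one list of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = rrf([bm25_ranking, dense_ranking])
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale.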
### Quantization
```python
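# float32 → int8 (4x less storage)
import numpy as np

def quantize(emb: np.ndarray, scale: float) -> np.ndarray:
    # Round before casting; astype alone truncates toward zero.
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)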
```
→ Storage and cost go down. Quality drops slightly.
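One way to pick the scale and check the damage is a quantize/dequantize round trip. This is a pure-Python sketch under the assumption of a single per-vector scale derived from the largest absolute component; the helper names are hypothetical.

```python
def int8_scale(vec):
    """Scale that maps the largest absolute component to 127."""
    peak = max(abs(x) for x in vec)
    return 127.0 / peak if peak > 0 else 1.0

def quantize_int8(vec, scale):
    return [max(-127, min(127, round(x * scale))) for x in vec]

def dequantize_int8(codes, scale):
    return [c / scale for c in codes]

vec = [1.0, -0.2, 0.0]
scale = int8_scale(vec)
codes = quantize_int8(vec, scale)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
```

The per-component error is bounded by half a quantization step (`0.5 / scale`), but the effect on retrieval quality still has to be measured on a golden set.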
### Binary quantization
```python
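# float32 → 1 bit per dimension (32x less storage)
import numpy as np
# packbits stores 8 sign bits per byte; uint8 alone would only give 4x.
binary_emb = np.packbits(emb > 0, axis=-1)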
```
→ Compare with Hamming distance (fast).
Quality is lower, but acceptable when storage would otherwise explode.
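Hamming distance on sign-binarized vectors reduces to an XOR and a bit count. A minimal pure-Python sketch that packs the bits into an integer (the helper names are illustrative):

```python
def pack_bits(vec):
    """Sign-binarize a vector and pack it into one integer bitmask."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")

code_a = pack_bits([0.3, -0.1, 0.9, -0.5])   # bits 1010
code_b = pack_bits([0.2, -0.4, -0.7, 0.1])   # bits 1001
distance = hamming(code_a, code_b)
```

A common pattern is to use this cheap distance for a first pass and rescore the survivors with the full-precision vectors.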
### Rerank (after retrieve)
```
Embedding search returns the top-50.
A cross-encoder reranks them down to the top-5.
→ Compensates for the weaknesses of embedding-only retrieval.
Cohere Rerank, BAAI bge-reranker.
```
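The two-stage flow can be sketched independently of any vendor. The scorers here are toy stand-ins (word overlap for the embedding stage, exact phrase match for the cross-encoder), and every name is hypothetical.

```python
def retrieve_then_rerank(query, docs, fast_score, slow_score, first_k=50, final_k=5):
    """Stage 1: cheap scorer over all docs. Stage 2: costly scorer over survivors."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:first_k]
    return sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)[:final_k]

def word_overlap(query, doc):
    # Stand-in for embedding similarity: ignores word order.
    return len(set(query.split()) & set(doc.split()))

def phrase_match(query, doc):
    # Stand-in for a cross-encoder: looks at query and doc together.
    return 1.0 if query in doc else 0.0

docs = ["pie with apple filling", "apple pie recipe", "banana bread"]
top = retrieve_then_rerank("apple pie", docs, word_overlap, phrase_match,
                           first_k=2, final_k=1)
```

The first stage cannot tell the two apple documents apart; the second stage can, which is the gap reranking is meant to close.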
### Multilingual embedding
```
text-embedding-3 is multilingual.
voyage-multilingual-2.
BGE-m3.
→ One model for all languages,
or one model per language.
```
### Code embedding
```
voyage-code-3.
Jina code embeddings.
codesage.
→ Code-specific models are more accurate on code than generic ones.
```
### Cost comparison
```
OpenAI 3-small: $0.02 / M tokens.
OpenAI 3-large: $0.13 / M tokens.
Voyage 3: $0.06 / M tokens.
Cohere v3: $0.10 / M tokens.
Self-host: $0 API fee + GPU rental.
→ Large volume = self-host.
Small volume = API.
```
### Embedding cache
```ts
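import { createHash } from 'node:crypto';

// `cache` and `embed` are placeholders for your KV store and embedding call.
async function cachedEmbed(text: string): Promise<number[]> {
  // Put the model name in the key so a model upgrade never reuses old vectors.
  const key = createHash('sha256').update(`voyage-3:${text}`).digest('hex');
  const cached = await cache.get(key);
  if (cached) return cached;
  const emb = await embed(text);
  await cache.set(key, emb);
  return emb;
}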
```
→ Each distinct text is embedded only once.
### Re-embed (model upgrade)
```
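A new model is better.
- Re-embed every doc (vectors from different models are not comparable).
- Cost = total tokens × price, e.g. 1M docs × 1,000 tokens (assumed average) × $0.02 / 1M tokens = $20.
- Time (several hours).
→ Plan + budget.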
```
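The budget arithmetic is worth writing down once so it can be rerun per model. A small sketch, where the average of 1,000 tokens per document is an assumption for illustration and the price is the text-embedding-3-small figure quoted in this note:

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """API cost of embedding the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs, assumed 1,000 tokens each, at $0.02 per 1M tokens.
cost_small = reembed_cost_usd(1_000_000, 1_000, 0.02)
# Same corpus at $0.13 per 1M tokens.
cost_large = reembed_cost_usd(1_000_000, 1_000, 0.13)
```

This covers API fees only; wall-clock time depends on rate limits and batch size, and the vector store has to be rewritten as well.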
### Eval
```python
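# MTEB-style retrieval eval; `retrieve` is a placeholder for your search call.
def compute_recall(results, relevant):
    return len(set(results) & set(relevant)) / len(relevant)

queries = [{'q': '...', 'relevant': ['doc1', 'doc5']}]
recalls = []
for q in queries:
    results = retrieve(q['q'])  # expected to return a list of doc ids
    recalls.append(compute_recall(results, q['relevant']))
mean_recall = sum(recalls) / len(recalls)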
```
### Domain fine-tune
```python
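# Fine-tuning with sentence-transformers (classic `fit` API)
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
train = [
    InputExample(texts=['query1', 'doc1'], label=1.0),
    InputExample(texts=['query1', 'doc2'], label=0.0),
]
dataloader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # matches float similarity labels
model.fit(train_objectives=[(dataloader, loss)], epochs=3)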
```
→ A domain-tuned model is more accurate on that domain than a generic one.
### Vector DB choice
```
pgvector: simple, Postgres-friendly.
Pinecone: managed.
Qdrant: open source + fast.
Weaviate: feature-rich.
Chroma: small / dev.
Milvus: large scale.
LanceDB: serverless-friendly.
```
→ [[DB_pgvector_Production]].
### Multi-tenant embedding
```sql
-- <=> is pgvector's cosine-distance operator; $2 is the query embedding.
SELECT * FROM docs
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;
```
→ Per-tenant isolation.
### Visualization
```python
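# Project to 2D with UMAP (or t-SNE)
import umap  # provided by the umap-learn package
proj = umap.UMAP().fit_transform(embeddings)  # embeddings: (n, dim) array
# Plot proj[:, 0] against proj[:, 1].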
```
→ Clusters become visible.
### Production tips
```
1. Latest model (Voyage 3, OpenAI 3-large).
2. Recursive / late chunking.
3. Hybrid search.
4. Rerank top-5.
5. Cache aggressively.
6. Eval (golden set).
7. Plan re-embed (model upgrade).
```
### LLM-friendly format
```
Code:
- Chunk per function.
- Include comments.
- File path as metadata.
Docs:
- Chunk per Markdown header.
- Section path as metadata.
Data:
- Row group (table).
- Column metadata.
```
### Pitfalls
```
- Assuming one generic chunking setup is best: it depends on the domain.
- Re-embedding the same text on every query: cache.
- Ignoring model upgrades: stale quality.
- Ignoring storage: 1B vectors × 1536 dim × 4 bytes ≈ 6 TB.
- Quantizing without eval: silent quality loss.
```
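The storage figure above, and what quantization buys, can be checked with a few lines of arithmetic. This sketch counts raw vector bytes only; index structures and metadata add overhead that is not modeled here.

```python
def index_size_bytes(num_vectors, dims, bits_per_dim):
    """Raw size of the vectors alone, without index overhead."""
    return num_vectors * dims * bits_per_dim // 8

n, d = 1_000_000_000, 1536
float32_bytes = index_size_bytes(n, d, 32)  # about 6.1 TB
int8_bytes = index_size_bytes(n, d, 8)      # about 1.5 TB
binary_bytes = index_size_bytes(n, d, 1)    # about 0.19 TB
```

The ratios match the 4x and 32x reductions quoted for int8 and binary quantization.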
## 🤔 Decision criteria
| Task | Recommendation |
|---|---|
| Generic English | OpenAI 3-small |
| Quality first | Voyage 3 |
| Multilingual | OpenAI 3 / BGE-m3 |
| Code | voyage-code-3 |
| Self-host | BGE / e5 |
| Cost-sensitive | OpenAI dim=256 (truncate) |
| Multi-vector | ColBERT / RAGatouille |
## ❌ Anti-patterns
- **Large model for everything**: cost.
- **No chunking strategy**: bad recall.
- **No cache**: repeat cost.
- **Never upgrading the model**: stale quality.
- **No eval**: silent regression.
- **Quantize without eval**: quality cliff.
## 🤖 LLM usage hints
- Voyage 3 / OpenAI 3 are the sweet spot.
- Recursive chunking is the baseline.
- Late chunking + multi-vector are the modern options.
- Hybrid + rerank gives a quality jump.
## 🔗 Related documents
- [[AI_Embeddings_Comparison]]
- [[AI_Custom_Embeddings]]
- [[AI_RAG_Production]]