2nd/10_Wiki/Topics/Coding/AI_Embedding_Strategy_Deep.md
2026-05-10 22:08:15 +09:00


id: ai-embedding-strategy-deep
title: Embedding Strategy — model / chunk / multi-vector
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, embedding, vibe-coding
tech_stack: language: TS / Python; applicable_to: AI
applied_in:
aliases: embedding model, OpenAI embedding, voyage, cohere, multi-vector, late chunking, ColBERT

Embedding Strategy

Embedding choices are one of the biggest levers on RAG quality. This note covers model selection, chunking strategy, multi-vector retrieval, late chunking, and ColBERT.

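📖 Key concepts

  • Each model has a different dimension and quality.
  • Chunk size is a trade-off between precision and context.
  • Multi-vector = more accurate, at a higher storage cost.
  • Late chunking = preserves document context.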

💻 Code patterns

Model selection

OpenAI:
- text-embedding-3-small: 1536 dim, $0.02/M tokens, a good baseline.
- text-embedding-3-large: 3072 dim, $0.13/M tokens, more accurate.

Voyage:
- voyage-3: 1024 dim, $0.06/M, SoTA quality.
- voyage-code-3: code-specific.

Cohere:
- embed-english-v3 / embed-multilingual-v3.

Open:
- BGE / e5 / nomic / mxbai.
- Self-host = no per-token fee; you pay for compute instead.

→ See the MTEB leaderboard.

Voyage (highest quality)

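import { VoyageAIClient } from 'voyageai';

// Assumes the API key is provided via the VOYAGE_API_KEY environment variable.
const client = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });

const r = await client.embed({
  input: ['Hello world'],
  model: 'voyage-3',
});
// One embedding is returned per input string, in input order.
const embedding = r.data?.[0]?.embedding;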

OpenAI

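import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const r = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Hello world'],
  dimensions: 256,   // optional: shorter vectors (lower storage and search cost)
});

→ The text-embedding-3 models are Matryoshka-style, so truncating dimensions is OK.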
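
If a full-length vector is truncated locally instead of through the `dimensions` parameter, the shortened vector is no longer unit length, so it is usually re-normalized before cosine or dot-product search. A minimal sketch, assuming NumPy; the helper name `truncate_embedding` is hypothetical and not part of any SDK:

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    short = np.asarray(emb, dtype=np.float32)[..., :dim]
    norm = np.linalg.norm(short, axis=-1, keepdims=True)
    # Guard against an all-zero prefix to avoid division by zero.
    return short / np.maximum(norm, 1e-12)
```

This only makes sense for models trained so that leading dimensions carry most of the signal; for other models, evaluate before truncating.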

Self-host (BGE)

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(['Hello world'])

→ Cost = compute only. Quality is good.

Chunk size

50 tokens: small and precise, but loses context.
500 tokens: balanced.
2000 tokens: more context, less precision.

→ 200-500 tokens is the sweet spot.
The best size differs by domain (code, legal).

Recursive chunking

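# Newer LangChain releases also expose this class from langchain_text_splitters.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # measured in characters by default, not tokens
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' '],  # tried in order, coarsest first
)
# chunks = splitter.split_text(text)

→ Preserves natural boundaries (paragraph, line, sentence).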
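
To make the "coarsest separator first" idea concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: it omits overlap, counts characters rather than tokens, and the function name `recursive_split` is an assumption for illustration.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing parts into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # This part alone is too long: retry with the finer separators.
                chunks.extend(recursive_split(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

For example, `recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=8)` splits at the paragraph break first and only falls back to spaces inside the second paragraph.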

Semantic chunking

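# Chunk boundaries are chosen by meaning rather than by length.
# Semantically close sentences end up in the same chunk.

# See LangChain's SemanticChunker (it compares embeddings of adjacent sentences).

→ More accurate, but more expensive.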
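
One common way to do this is to embed each sentence and start a new chunk wherever the similarity between neighbouring sentences drops. The sketch below assumes the sentence embeddings are already computed; the function name and the `threshold` default are illustrative assumptions, and it is not the SemanticChunker implementation.

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; break when cosine similarity to the previous one is below threshold."""
    if not sentences:
        return []
    emb = np.asarray(embeddings, dtype=np.float32)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors, so dot = cosine
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(emb[i - 1] @ emb[i])
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The extra cost comes from embedding every sentence once just to find the boundaries, before the final chunks are embedded again.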

Parent document retriever

Small chunk (200 token) → embed.
Big chunk (2000 token) = parent.

Search small → return parent.

→ Precision (small) + context (big).
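
The bookkeeping behind this pattern is small: each child chunk remembers its parent id, and ranked child hits are mapped back to de-duplicated parents. A sketch under stated assumptions: sizes are counted in characters as a stand-in for tokens, the vector search that produces `hit_indices` is not shown, and both function names are hypothetical.

```python
def build_parent_index(parents, child_size=200):
    """Split each parent text into fixed-size children, keeping the parent id with each child."""
    children = []  # list of (child_text, parent_id)
    for parent_id, text in parents.items():
        for start in range(0, len(text), child_size):
            children.append((text[start:start + child_size], parent_id))
    return children

def parents_for_hits(hit_indices, children, parents):
    """Map ranked child hits to parent texts, de-duplicated and in rank order."""
    seen, result = set(), []
    for idx in hit_indices:
        parent_id = children[idx][1]
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result
```

Only the child texts are embedded and indexed; the parent texts are stored separately and returned to the LLM.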

Late chunking (modern)

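# 1. Run the whole document through the embedding model.
# 2. Keep the token-level embeddings.
# 3. Chunk embedding = average over the chunk's token range.

# → Each chunk keeps document context, because every token has seen the whole document.

→ Supported in the latest Jina / Voyage models.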
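
The pooling step can be sketched in a few lines, assuming the token-level embeddings for the whole document are already available from a long-context embedding model (obtaining them is not shown). The function name and the `(start, end)` span format are assumptions for illustration.

```python
import numpy as np

def late_chunk_embeddings(token_embeddings, spans):
    """Mean-pool token vectors over each (start, end) token span; end is exclusive."""
    tokens = np.asarray(token_embeddings, dtype=np.float32)
    return np.stack([tokens[start:end].mean(axis=0) for start, end in spans])
```

The difference from ordinary chunking is only the order of operations: here the model sees the full document first and the chunk boundaries are applied afterwards.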

Multi-vector (ColBERT)

1 doc = N vectors (one per token).
Search matches each query token to its closest doc token and sums the scores.
- More accurate.
- More storage.

→ ColBERTv2, RAGatouille.
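
The scoring rule described above is often called MaxSim. A minimal sketch, assuming both token matrices are already L2-normalized so that the dot product equals cosine similarity; the function name is hypothetical and this is not the RAGatouille API.

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best similarity to any document token."""
    q = np.asarray(query_tokens, dtype=np.float32)
    d = np.asarray(doc_tokens, dtype=np.float32)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())
```

Because every document stores one vector per token, index size grows with document length, which is the storage cost noted above.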

Hybrid (sparse + dense)

BM25 (keyword) + embedding (semantic) → fused with RRF (reciprocal rank fusion).

AI_Hybrid_Search_Patterns.
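
RRF needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity. A minimal sketch; `k=60` is the commonly used constant, and the function name is an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list adds 1 / (k + rank) to a doc's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Usage: `reciprocal_rank_fusion([bm25_ids, dense_ids])`, where each argument is a list of doc ids ordered best first.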

Quantization

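# float32 → int8 (4x storage ↓)
import numpy as np

def quantize(emb, scale):
    # scale maps the float range onto int8, e.g. 127 / max(abs(emb)) over a sample
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)

→ Storage and cost go down; quality drops slightly.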

Binary quantization

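# float32 → 1 bit (32x ↓)
import numpy as np

# packbits stores 8 dimensions per byte; without it each bit still takes a full byte
binary_emb = np.packbits(emb > 0, axis=-1)

→ Compare with Hamming distance (fast). Quality is worse, but acceptable when storage would otherwise explode.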
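
A sketch of the full round trip, assuming NumPy: sign bits are packed eight per byte (this is what actually yields the 32x reduction), and Hamming distance is the number of differing bits. The helper names are hypothetical.

```python
import numpy as np

def binarize(emb):
    """Keep only the sign of each dimension and pack 8 dimensions into one byte."""
    return np.packbits(np.asarray(emb) > 0, axis=-1)

def hamming_distance(a, b):
    """Count the bits that differ between two packed binary embeddings."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```

A common follow-up is to use the binary index only for a coarse first pass and rescore the survivors with the original float vectors.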

Rerank (after retrieve)

Embedding search returns the top 50.
A cross-encoder reranks them down to the top 5.

→ Compensates for the weaknesses of embedding retrieval.
Cohere Rerank, BAAI bge-reranker.

Multilingual embedding

text-embedding-3 is multilingual.
voyage-multilingual-2.
BGE-m3.

→ Either one model for every language,
or one model per language.

Code embedding

voyage-code-3.
Jina code embedding.
codesage.

→ Code-specific models are more accurate on code than generic ones.

Cost comparison

OpenAI 3-small: $0.02 / M token.
OpenAI 3-large: $0.13.
Voyage 3: $0.06.
Cohere v3: $0.10.
Self-host: $0 per token + GPU rental.

→ High volume = self-host.
Low volume = API.

Embedding cache

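import { createHash } from 'node:crypto';

// `cache` and `embed` are placeholders for your own cache client and embedding call.
async function embedCached(text: string): Promise<number[]> {
  // Hash the text so the key has a fixed length; add the model name if you use several models.
  const key = createHash('sha256').update(text).digest('hex');
  const cached = await cache.get(key);
  if (cached) return cached;

  const emb = await embed(text);
  await cache.set(key, emb);
  return emb;
}

→ Each distinct text is embedded only once.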

Re-embed (model upgrade)

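When a better model is released:
- Re-embed every document (vectors from different models are not comparable).
- Cost: 1M docs × tokens per doc × $0.02 / 1M tokens.
- Time: several hours.

→ Plan and budget for it.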
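
The budget is simple arithmetic once an average document length is fixed. The sketch below uses 500 tokens per document purely as an illustrative assumption (the note does not state a figure); the function name is hypothetical.

```python
def reembed_cost_usd(num_docs, avg_tokens_per_doc, price_per_million_tokens):
    """Estimated API cost of embedding every document once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

# 1M docs at an assumed 500 tokens each, priced at $0.02 per 1M tokens.
cost = reembed_cost_usd(1_000_000, 500, 0.02)
```

The same function can be rerun with each provider's price to compare options before committing to a re-embed.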

Eval

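# MTEB-style retrieval eval over a golden set
queries = [{'q': '...', 'relevant': ['doc1', 'doc5']}]

def compute_recall(results, relevant):
    # Fraction of the relevant doc ids that were retrieved.
    return len(set(results) & set(relevant)) / len(relevant)

recalls = []
for q in queries:
    results = retrieve(q['q'])  # retrieve() is your search function, returning doc ids
    recalls.append(compute_recall(results, q['relevant']))

mean_recall = sum(recalls) / len(recalls)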

Domain fine-tune

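# Fine-tuning with sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

train = [
    InputExample(texts=['query1', 'doc1'], label=1.0),
    InputExample(texts=['query1', 'doc2'], label=0.0),
]
dataloader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # fits the float similarity labels above

model.fit(train_objectives=[(dataloader, loss)], epochs=3)

→ A domain-tuned model is more accurate than a generic one.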

Vector DB choice

pgvector: simple, Postgres-friendly.
Pinecone: managed.
Qdrant: open source and fast.
Weaviate: feature-rich.
Chroma: small / dev.
Milvus: large scale.
LanceDB: serverless friendly.

DB_pgvector_Production.

Multi-tenant embedding

SELECT * FROM docs
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;

→ Per-tenant isolation.

Visualization

# Project to 2D with UMAP / t-SNE
import umap
proj = umap.UMAP().fit_transform(embeddings)

# Plot proj[:, 0] against proj[:, 1].

→ Clusters become visible.

Production tips

1. Latest model (Voyage 3, OpenAI 3-large).
2. Recursive / late chunking.
3. Hybrid search.
4. Rerank top-5.
5. Cache aggressively.
6. Eval (golden set).
7. Plan re-embed (model upgrade).

LLM-friendly format

Code:
- Chunk per function.
- Include comments.
- File path metadata.

Docs:
- Chunk per Markdown header.
- Section path metadata.

Data:
- Row group (table).
- Column metadata.
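
For the Docs case, header-based chunking with section path metadata can be sketched as follows. This is a simplified illustration: it handles only `#`-style headers, does not skip `#` lines inside code fences, and the function name and output keys are assumptions.

```python
import re

def chunk_markdown_by_header(markdown_text):
    """Split Markdown on # headers and attach the section path to each chunk."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"section_path": " > ".join(path), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            del path[level - 1:]  # drop headers at the same or deeper level
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing `section_path` next to each vector lets the retriever show the LLM where a chunk sits in the document, for example `Guide > Install`.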

Pitfalls

- Assuming generic chunking is best: tune per domain.
- Embedding the same text again on every query: cache.
- Ignoring model upgrades: stale quality.
- Ignoring storage: 1B vectors × 1536 dim × 4 bytes ≈ 6 TB.
- Quantizing without eval: silent quality loss.
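
The storage figure above is plain multiplication, and the same arithmetic shows what int8 and binary quantization buy. A small sketch covering raw vector bytes only (index structures and metadata add overhead that is not modeled here); the function name is hypothetical.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage: vectors × dimensions × bytes per dimension."""
    return num_vectors * dim * bytes_per_value

float32_size = index_size_bytes(1_000_000_000, 1536, 4)      # about 6.1 TB
int8_size = index_size_bytes(1_000_000_000, 1536, 1)         # 4x smaller
binary_size = index_size_bytes(1_000_000_000, 1536, 1 / 8)   # 32x smaller
```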

🤔 Decision criteria

| Task | Recommendation |
| --- | --- |
| Generic English | OpenAI 3-small |
| Quality first | Voyage 3 |
| Multilingual | OpenAI 3 / BGE-m3 |
| Code | voyage-code-3 |
| Self-host | BGE / e5 |
| Cost-sensitive | OpenAI dim=256 (truncate) |
| Multi-vector | ColBERT / RAGatouille |

Anti-patterns

  • Large model for everything: cost.
  • No chunking strategy: bad recall.
  • No cache: repeated cost.
  • Never upgrading the model: stale quality.
  • No eval: silent regression.
  • Quantize without eval: quality cliff.

🤖 Hints for LLM use

  • Voyage 3 / OpenAI 3 are the sweet spot.
  • Recursive chunking is the baseline.
  • Late chunking + multi-vector is the modern approach.
  • Hybrid + rerank gives a jump in quality.

🔗 Related documents