---
id: ai-embedding-strategy-deep
title: Embedding Strategy — model / chunk / multi-vector
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, embedding, vibe-coding]
tech_stack: { language: "TS / Python", applicable_to: ["AI"] }
applied_in: []
aliases: [embedding model, OpenAI embedding, voyage, cohere, multi-vector, late chunking, ColBERT]
---

# Embedding Strategy

> Embeddings are one of the biggest levers on RAG quality. The main choices are **model selection, chunking strategy, multi-vector, late chunking, and ColBERT**.

## 📖 Core Concepts

- Each model differs in dimension and quality.
- Chunk size trades precision against context.
- Multi-vector representations are more accurate, at a storage cost.
- Late chunking preserves document-level context in each chunk.

## 💻 Code Patterns

### Model selection

```
OpenAI:
- text-embedding-3-small: 1536 dims, $0.02 / M tokens, good baseline.
- text-embedding-3-large: 3072 dims, $0.13 / M tokens, more accurate.

Voyage:
- voyage-3: 1024 dims, $0.06 / M tokens, SoTA-level quality.
- voyage-code-3: code-specific.

Cohere:
- embed-english-v3.0 / embed-multilingual-v3.0.

Open models:
- BGE / e5 / nomic / mxbai.
- Self-host: no per-token fee (you pay for the compute).

→ Compare candidates on the MTEB leaderboard.
```

### Voyage (top quality)

```ts
import { VoyageAIClient } from 'voyageai';

const client = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });
const r = await client.embed({
  input: ['Hello world'],
  model: 'voyage-3',
});
const embedding = r.data?.[0]?.embedding;
```

### OpenAI

```ts
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const r = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: ['Hello world'],
  dimensions: 256, // optional: shorter vectors cut storage and search cost (per-token price is unchanged)
});
const embedding = r.data[0].embedding;
```

→ The text-embedding-3 models use Matryoshka-style training, so truncating dimensions is OK. If you truncate manually instead of passing `dimensions`, re-normalize the vector.

### Self-host (BGE)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(['Hello world'], normalize_embeddings=True)
```

→ The only cost is compute. Quality is good.

### Chunk size

```
50 tokens: small and precise, but loses context.
500 tokens: balanced.
2000 tokens: more context, less precision.

→ 200-500 tokens is the usual sweet spot; domains such as code or legal text differ.
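
Worked example (hypothetical 100,000-token document, no overlap):
  50-token chunks   → 2,000 vectors
  500-token chunks  → 200 vectors
  2000-token chunks → 50 vectors
With a 50-token overlap, each 500-token chunk starts 450 tokens after the previous one:
  ceil((100,000 - 500) / 450) + 1 = 223 vectors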
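```

### Recursive chunking

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Measure chunk_size in tokens (the default length_function counts characters).
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='cl100k_base',
    chunk_size=500,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' '],  # paragraph, then line, sentence, word
)
chunks = splitter.split_text(document_text)  # document_text: your raw string
```

→ Preserves natural boundaries. The splitter tries paragraph breaks first, then lines, sentences, and words.

### Semantic chunking

```python
# Split where embedding similarity between adjacent sentences drops.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(OpenAIEmbeddings(model='text-embedding-3-small'))
chunks = chunker.split_text(document_text)  # document_text: your raw string
```

→ Semantically close sentences end up in the same chunk. This is more accurate but more expensive, because every sentence gets embedded.

### Parent document retriever

```
Small chunks (200 tokens) → embed.
Big chunks (2000 tokens) = parents.
Search the small chunks → return their parents.

→ Precision (small) + context (big).
```

### Late chunking (modern)

```python
# 1. Run the whole document through a long-context embedding model.
# 2. Keep the token-level embeddings instead of pooling the whole document.
# 3. Chunk vector = mean of the token embeddings in that chunk's token range.
# → Every token was encoded with the full document in view, so each chunk carries that context.
```

→ Supported in recent Jina and Voyage models.

### Multi-vector (ColBERT)

```
1 doc = N vectors (one per token).
Scoring: each query token is matched to its most similar doc token, and those maxima are summed.
- More accurate.
- Much larger storage.

→ ColBERTv2, RAGatouille.
```

### Hybrid (sparse + dense)

```
BM25 (keyword) + embedding (semantic) → fuse the two ranked lists with RRF (Reciprocal Rank Fusion).
```

→ [[AI_Hybrid_Search_Patterns]].

### Quantization

```python
import numpy as np

# float32 → int8: 4x less storage.
def quantize(emb: np.ndarray, scale: float) -> np.ndarray:
    # scale maps the float range onto [-127, 127], e.g. scale = 127 / np.abs(emb).max()
    return np.clip(np.round(emb * scale), -127, 127).astype(np.int8)
```

→ Lower storage and cost; slightly lower quality.

### Binary quantization

```python
import numpy as np

# float32 → 1 bit per dimension (32x less storage): keep only the sign.
binary_emb = np.packbits(emb > 0, axis=-1)  # packs 8 dimensions into each uint8 byte
```

→ Compare vectors with Hamming distance, which is fast. Quality drops noticeably, but this is acceptable when storage would otherwise explode.

### Rerank (after retrieval)

```
Embedding search returns the top 50.
A cross-encoder reranks them down to the top 5.
→ Compensates for the weaknesses of embedding similarity.

Cohere Rerank, BAAI bge-reranker.
```

### Multilingual embedding

```
text-embedding-3 is multilingual.
voyage-multilingual-2.
BGE-M3.

→ Either one model for all languages, or one model per language.
```

### Code embedding

```
voyage-code-3.
Jina code embeddings.
CodeSage.

→ Code-specific models are more accurate on code than generic ones.
```

### Cost comparison

```
OpenAI 3-small: $0.02 / M tokens.
OpenAI 3-large: $0.13 / M tokens.
Voyage 3: $0.06 / M tokens.
Cohere v3: $0.10 / M tokens.
Self-host: $0 per token + GPU rental.

→ High volume: self-host. Low volume: API.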
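
Worked example (hypothetical corpus: 1M docs × 500 tokens = 500M tokens, one full embedding pass at the list prices above):
  OpenAI 3-small: 500 × $0.02 = $10
  Voyage 3:       500 × $0.06 = $30
  Cohere v3:      500 × $0.10 = $50
  OpenAI 3-large: 500 × $0.13 = $65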
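```

### Embedding cache

```ts
import { createHash } from 'node:crypto';

// `cache` and `embed` stand in for your own cache client and embedding call.
async function cachedEmbed(text: string, model: string): Promise<number[]> {
  // Key on model + text so a model upgrade never serves stale vectors.
  const key = `${model}:${createHash('sha256').update(text).digest('hex')}`;
  const cached = await cache.get(key);
  if (cached) return cached;
  const emb = await embed(text, model);
  await cache.set(key, emb);
  return emb;
}
```

→ Each distinct text is embedded only once.

### Re-embed (model upgrade)

```
A new model performs better:
- Re-embed every doc (vectors from different models can't be mixed in one index).
- Cost: 1M docs × average tokens per doc × $0.02 per 1M tokens (3-small).
- Time: several hours.

→ Plan it and budget for it.
```

### Eval

```python
# MTEB-style retrieval eval on a golden set.
# `retrieve(query)` is your own pipeline; it returns ranked doc IDs.
queries = [{'q': '...', 'relevant': ['doc1', 'doc5']}]

def recall_at_k(results: list[str], relevant: list[str], k: int = 10) -> float:
    return len(set(results[:k]) & set(relevant)) / len(relevant)

scores = [recall_at_k(retrieve(q['q']), q['relevant']) for q in queries]
print(f'mean recall@10: {sum(scores) / len(scores):.3f}')
```

### Domain fine-tune

```python
# Fine-tuning with sentence-transformers (classic fit API).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
train = [
    InputExample(texts=['query1', 'doc1'], label=1.0),  # relevant pair
    InputExample(texts=['query1', 'doc2'], label=0.0),  # irrelevant pair
]
dataloader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # pushes cosine similarity toward the label
model.fit(train_objectives=[(dataloader, loss)], epochs=3)
```

→ A domain-tuned model beats a generic one on in-domain queries.

### Vector DB choice

```
pgvector: simple, Postgres-friendly.
Pinecone: managed.
Qdrant: open source + fast.
Weaviate: rich feature set.
Chroma: small scale / dev.
Milvus: large scale.
LanceDB: serverless-friendly.
```

→ [[DB_pgvector_Production]].

### Multi-tenant embedding

```sql
-- $1 = tenant id, $2 = query embedding; <=> is pgvector's cosine distance operator.
SELECT *
FROM docs
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;
```

→ Per-tenant isolation.

### Visualization

```python
# Project embeddings to 2D with UMAP (t-SNE also works), then plot.
import matplotlib.pyplot as plt
import umap  # pip install umap-learn

proj = umap.UMAP(n_components=2).fit_transform(embeddings)  # embeddings: (n, dim) array
plt.scatter(proj[:, 0], proj[:, 1], s=2)
plt.show()
```

→ Clusters become visible.

### Production tips

```
1. Latest model (Voyage 3, OpenAI 3-large).
2. Recursive / late chunking.
3. Hybrid search.
4. Rerank down to the top 5.
5. Cache aggressively.
6. Eval on a golden set.
7. Plan re-embedding for model upgrades.
```

### LLM-friendly format

```
Code:
- One chunk per function.
- Include comments.
- File path as metadata.

Docs:
- One chunk per Markdown header.
- Section path as metadata.

Data:
- Row groups (tables).
- Column metadata.
```

### Pitfalls

```
- Assuming a generic chunking setup is best: tune it per domain.
- Re-embedding the same query every time: cache it.
- Ignoring model upgrades: stale quality.
- Ignoring storage: 1B vectors × 1536 dims × 4 bytes ≈ 6.1 TB.
- Quantizing without eval: silent quality loss.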
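
Storage worked example (raw vectors only; index overhead such as HNSW graphs is not included):
  float32:                 1B × 1536 × 4 bytes ≈ 6.1 TB
  int8 (4x smaller):       1B × 1536 × 1 byte  ≈ 1.5 TB
  binary (32x smaller):    1B × 1536 / 8 bytes ≈ 192 GB
  float32, dimensions=256: 1B × 256 × 4 bytes  ≈ 1.0 TB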
```

## 🤔 Decision Criteria

| Task | Recommendation |
|---|---|
| Generic English | OpenAI 3-small |
| Quality first | Voyage 3 |
| Multilingual | OpenAI 3 / BGE-M3 |
| Code | voyage-code-3 |
| Self-host | BGE / e5 |
| Cost-sensitive | OpenAI with dimensions=256 (truncated) |
| Multi-vector | ColBERT / RAGatouille |

## ❌ Anti-patterns

- **Large model for everything**: cost.
- **No chunking strategy**: poor recall.
- **No cache**: repeated cost.
- **Never upgrading the model**: stale quality.
- **No eval**: silent regressions.
- **Quantizing without eval**: quality cliff.

## 🤖 LLM Usage Hints

- Voyage 3 / OpenAI 3 are the sweet spot.
- Recursive chunking is the baseline.
- Late chunking + multi-vector are the modern approach.
- Hybrid + rerank gives a quality jump.

## 🔗 Related Documents

- [[AI_Embeddings_Comparison]]
- [[AI_Custom_Embeddings]]
- [[AI_RAG_Production]]