---
id: ai-custom-embeddings
title: Custom Embeddings - Fine-tune / Domain-specific
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, embeddings, fine-tune, vibe-coding]
tech_stack: { language: "Python / TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [embedding fine-tune, domain embeddings, sentence transformers, BGE, contrastive learning]
---

# Custom Embeddings

> General-purpose embeddings are weak on specialized domains (legal, medical, code). Use a **domain-specific fine-tune or a dedicated model**: Sentence Transformers, BGE, Voyage, Cohere.

## 📖 Core concepts
- General: trained on general web text, so weak on specialized domains.
- Domain: legal, code, medical, etc.
- Fine-tune: pair-based contrastive learning.
- Reranker: a separate task that refines results after embedding retrieval.

## 💻 Code patterns

### When to fine-tune
```
General embeddings are fine for:
- Web content
- General Q&A
- General search

Custom embeddings pay off for:
- Legal documents
- Medical records
- Code retrieval
- Company jargon / abbreviations
- Multi-language (a specific language)
- Domain (e-commerce, real estate)
```

### Sentence Transformers (fine-tune)
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Training data: text pairs with a target similarity in [0, 1]
train_examples = [
    InputExample(texts=['Q: refund policy', 'A: We offer 30 day refunds for...'], label=0.9),
    InputExample(texts=['Q: refund', 'A: We offer 30 day refunds for...'], label=0.8),
    InputExample(texts=['Q: refund', 'A: Today is sunny'], label=0.0),  # negative
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# model.fit is the classic training API; newer releases also offer SentenceTransformerTrainer
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./domain-embeddings',
)
```

### Triplet loss (positive / negative)
```python
from sentence_transformers import InputExample, losses

train_examples = [
    InputExample(texts=[
        'How to refund?',               # anchor
        'Refund policy: 30 days...',    # positive
        'Today is sunny',               # negative
    ]),
]
# `model` is the SentenceTransformer loaded in the previous block
train_loss = losses.TripletLoss(model=model)
```

### Pair generation (with an LLM)
```python
import random

async def generate_pairs(documents):
    # `llm` is a placeholder async client; generate() is assumed to return a list of query strings
    pairs = []
    for doc in documents:
        # Ask the LLM for queries that this document answers
        queries = await llm.generate(f"Generate 3 user queries that this answers:\n{doc}")
        for q in queries:
            pairs.append((q, doc, 1.0))  # positive

        # Random negative: sample from the other documents so it is never `doc` itself
        others = [d for d in documents if d is not doc]
        if others:
            # Still noisy: a random document can occasionally be relevant
            pairs.append((queries[0], random.choice(others), 0.0))
    return pairs
```
→ Synthetic training data.

### Hard negative mining
```python
# Random negatives are easy.
# Better: similar-but-wrong documents (hard negatives).
pairs = []
for query, positive_doc in queries:
    # Retrieve the top 10 with the general embedding model
    top_10 = embed_search(query, k=10)

    # Retrieved docs other than the positive are treated as hard negatives
    # (spot-check them: some may actually be relevant)
    for doc in top_10:
        if doc != positive_doc:
            pairs.append((query, doc, 0.0))
```
→ A better fine-tune.

### Evaluation
```python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    test_examples,
    name='domain-test',
)

# Apply the evaluator to the model
# (returns a float in older releases, a dict of metrics in v3+)
score = evaluator(model, output_path='./eval')
print(f'Similarity score: {score}')
```

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Top-K accuracy
def evaluate(model, queries, docs, ground_truth, k=10):
    doc_embs = model.encode(docs)  # encode the corpus once, not per query
    correct = 0
    for q, true_doc in zip(queries, ground_truth):
        q_emb = model.encode([q])  # shape (1, dim), as cosine_similarity expects 2D input
        scores = cosine_similarity(q_emb, doc_embs)[0]
        top_k = np.argsort(scores)[-k:]
        if true_doc in [docs[i] for i in top_k]:
            correct += 1
    return correct / len(queries)
```

### Domain-specific models (off-the-shelf)
```
Code:
- microsoft/codebert-base
- jinaai/jina-embeddings-v2-base-code

Legal:
- nlpaueb/legal-bert-base-uncased

Medical:
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT

Multi-language:
- BAAI/bge-m3
- intfloat/multilingual-e5-large
```
→ Try an off-the-shelf domain model before fine-tuning. The BERT-style checkpoints in this list are pretrained encoders rather than retrieval-tuned embedding models, so expect to add pooling and some fine-tuning before using them for search.

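
One way to act on this advice is to score each candidate model on the same small labeled set before committing to a fine-tune. The sketch below assumes the embeddings have already been computed and are plain lists of floats; `cosine`, `recall_at_k`, and the toy vectors are illustrative names and data, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, doc_vecs, gold, k=1):
    """Fraction of queries whose gold document index appears in the top k."""
    hits = 0
    for q_vec, gold_idx in zip(query_vecs, gold):
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda i: cosine(q_vec, doc_vecs[i]),
            reverse=True,
        )
        if gold_idx in ranked[:k]:
            hits += 1
    return hits / len(gold)


# Toy vectors standing in for the output of two candidate models (made-up data).
docs_general = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
queries_general = [[1.0, 0.05], [0.95, 0.0]]
docs_domain = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
queries_domain = [[0.9, 0.1], [0.1, 0.9]]
gold = [0, 1]  # index of the correct document for each query

general_score = recall_at_k(queries_general, docs_general, gold, k=1)
domain_score = recall_at_k(queries_domain, docs_domain, gold, k=1)
print(f"general={general_score:.2f} domain={domain_score:.2f}")
```

With real data the vectors would come from each candidate's `encode` call on the same queries and documents. If an off-the-shelf domain model already clears the target recall, fine-tuning may not be needed.
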
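### Voyage AI (strong general-purpose option)
```ts
import { VoyageAIClient } from 'voyageai';

// The environment variable name is an assumption
const apiKey = process.env.VOYAGE_API_KEY;
const voyage = new VoyageAIClient({ apiKey });

// General
const general = await voyage.embed({
  model: 'voyage-3.5',
  input: ['text1', 'text2'],
});

// Code
const code = await voyage.embed({
  model: 'voyage-code-3', // code-specific
  input: ['function ...', 'class ...'],
});
```
→ General and domain-specific options.

### Cohere (multilingual)
```ts
// `cohere` is a CohereClient from the 'cohere-ai' package
const r = await cohere.v2.embed({
  model: 'embed-multilingual-v3.0',
  inputType: 'search_document', // or 'search_query'
  embeddingTypes: ['float'],    // the v2 API expects the output type to be stated
  texts: ['안녕'],               // Korean: "hello"
});
```
→ Supports 100+ languages.

### Asymmetric (query vs document)
```ts
// Some models expect a different instruction for queries than for documents
// (e.g. BGE v1.5 English models prefix the query only)
const queryEmb = await embed('Represent this sentence for searching relevant passages: ' + query);
const docEmb = await embed(doc);

// Or use the built-in input type (Voyage shown; Cohere uses 'search_query' / 'search_document')
const voyageQueryEmb = await voyage.embed({ model: 'voyage-3.5', input: [query], inputType: 'query' });
const voyageDocEmb = await voyage.embed({ model: 'voyage-3.5', input: [doc], inputType: 'document' });
```

### Matryoshka (variable dimensions)
```ts
// OpenAI text-embedding-3-large, Voyage
const r = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: text,
  dimensions: 256, // instead of the default 3072
});
```
→ Fewer dimensions = lower cost, typically while keeping 90%+ of the accuracy (verify on your own data).

### Rerank (refine after embedding search)
```ts
// 1. Embedding search → top 50
const candidates = await embeddingSearch(query, 50);

// 2. Rerank → top 5
const reranked = await cohere.v2.rerank({
  model: 'rerank-v3.5',
  query,
  documents: candidates.map(c => c.text),
  topN: 5,
});

return reranked.results.map(r => candidates[r.index]);
```
→ Large improvement. The reranker is a cross-encoder.

### Quantization (storage savings)
```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# embeddings_float32: np.ndarray of shape (n, dim), e.g. from model.encode(...)
# float32 → int8 (4x smaller, accuracy mostly preserved)
embeddings_int8 = quantize_embeddings(embeddings_float32, precision='int8')

# Or binary: keep only the sign and pack 8 dims per byte (32x smaller)
embeddings_binary = np.packbits(embeddings_float32 > 0, axis=-1)
```
→ Saves memory and cost, and speeds up search.

### MTEB benchmark
```
Massive Text Embedding Benchmark.
Rankings by domain / task.
→ A guide for choosing a starting model.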
```

### Code embeddings
```
- voyage-code-3 (strong as of late 2024)
- jinaai/jina-embeddings-v2-base-code
- microsoft/codebert-base
- togethercomputer/m2-bert-80M-32k-retrieval

Use cases:
- Code search (find a function by query)
- Code completion ranking
- Bug similarity
```

### Multi-modal embedding
```python
# CLIP: text and images share one vector space
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

text_emb = model.encode(['a cat'])
image_emb = model.encode(Image.open('cat.jpg'))

similarity = util.cos_sim(text_emb, image_emb)
```
→ Image search by text.

### Inference optimization
```python
# ONNX export (often faster on CPU; the gain varies by model and hardware, so benchmark)
from optimum.onnxruntime import ORTModelForFeatureExtraction

model = ORTModelForFeatureExtraction.from_pretrained(
    'BAAI/bge-base-en-v1.5',
    export=True,
)
```

```python
# Sentence Transformers with the ONNX backend
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5', backend='onnx')
```

### Self-host inference (Triton, vLLM)
```bash
# vLLM (serves embedding models as well as LLMs)
# The --task flag is version-dependent; check `vllm serve --help`
vllm serve BAAI/bge-large-en-v1.5 --task=embed

# Or Sentence Transformers + Flask / FastAPI
```

### CDC + embedding (auto re-index)
```ts
// Document changed → re-embed it
on('document.updated', async (doc) => {
  const newEmb = await embed(doc.content);
  await vectorDB.upsert(doc.id, newEmb);
});
```

### Cost (approximate)
```
OpenAI text-embedding-3-small: $0.02/1M tok
Voyage 3.5: $0.06/1M tok
Cohere embed-v3: $0.10/1M tok
Self-host: GPU cost only

→ High volume = self-host (BGE or another open-weights model).
  Strict quality requirements = Voyage 3 / Cohere v3.
```

### Embedding cache
```ts
import { createHash } from 'node:crypto';

const cache = new Map<string, number[]>();

async function embed(text: string) {
  const hash = createHash('sha256').update(text).digest('hex');
  if (cache.has(hash)) return cache.get(hash)!;

  // `api.embed` stands in for whichever embedding client is in use
  const emb = await api.embed(text);
  cache.set(hash, emb);
  return emb;
}
```

### Drift / refresh
```
Domain shift / new language / new abbreviations:
- Re-evaluate on a regular schedule
- Model update → re-embed every document
- High cost, so plan ahead
```

### Hyperparameter
```python
# Batch size: depends on GPU memory (32-128)
# Learning rate: 1e-5 ~ 5e-5
# Epochs: 1-5 (watch for overfitting)
# Margin (triplet): 0.5
# Temperature (contrastive): 0.05-0.1
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| General web | OpenAI 3-small / Voyage |
| Code | Voyage code-3 |
| Legal / medical | Domain-specific BERT + fine-tune |
| Multi-language | Cohere multilingual / BGE-M3 |
| Self-host / privacy | BGE / Sentence Transformers |
| Very lightweight | Quantized BGE |

## ❌ Anti-patterns

- **Assuming a general embedding works for your domain**: it is weak there; fine-tune.
- **No hard negatives**: the fine-tune comes out weak.
- **No testing or eval**: you cannot tell whether anything improved.
- **Overfitting (little data + many epochs)**: validate on held-out data.
- **Using an asymmetric model symmetrically**: queries and documents need different prompts.
- **Quantizing without an accuracy check**: verify before shipping.

## 🤖 LLM usage hints

- General = OpenAI / Voyage. Domain = fine-tune.
- Pair generation is fast with an LLM.
- Hard negatives + reranker = large improvement.
- MTEB is the starting guide.

## 🔗 Related docs

- [[AI_Embeddings_Comparison]]
- [[AI_RAG_Advanced]]
- [[AI_Fine_Tuning_vs_Prompting]]
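
Returning to the cache and drift notes above: keying stored vectors on both the model name and the content hash makes it cheap to list what needs re-embedding after a document edit or a model upgrade. This is a minimal sketch, assuming an in-memory `index` that maps document ids to cache keys; `cache_key`, `stale_doc_ids`, and the model names are hypothetical.

```python
import hashlib


def cache_key(model_name: str, text: str) -> str:
    """Key on model name and content so a model change never reuses old vectors."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_name}:{digest}"


def stale_doc_ids(docs: dict, index: dict, model_name: str) -> list:
    """Return ids whose stored key no longer matches the current model and text."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if index.get(doc_id) != cache_key(model_name, text)
    )


docs = {"a": "refund policy", "b": "shipping times"}
index = {doc_id: cache_key("model-v1", text) for doc_id, text in docs.items()}

docs["b"] = "shipping times (updated)"              # content change
changed = stale_doc_ids(docs, index, "model-v1")    # only "b" needs re-embedding
upgraded = stale_doc_ids(docs, index, "model-v2")   # model change: every doc is stale
print(changed, upgraded)
```

The size of the `upgraded` list is a direct input to the re-embedding cost estimate, since every id in it means one more embedding call.
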