[G1-Sync] Manual knowledge update
@@ -0,0 +1,391 @@
---
id: ai-custom-embeddings
title: Custom Embeddings — Fine-tune / Domain-specific
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, embeddings, fine-tune, vibe-coding]
tech_stack: { language: "Python / TS", applicable_to: ["Backend"] }
applied_in: []
aliases: [embedding fine-tune, domain embeddings, sentence transformers, BGE, contrastive learning]
---

# Custom Embeddings

> General-purpose embeddings are weak on specialized domains (legal, medical, code). Use a **domain-specific fine-tune or a dedicated model**: Sentence Transformers, BGE, Voyage, Cohere.

## 📖 Key Concepts
- General: trained on general web text; weak on specialized domains.
- Domain: legal, code, medical, etc.
- Fine-tune: pair-based contrastive learning.
- Reranker: a separate task; refines results after embedding retrieval.
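
The contrastive objective behind pair-based fine-tuning can be illustrated with a small worked example. This is a minimal sketch of an InfoNCE-style loss with in-batch negatives, written in plain Python for readability; the function names and the toy 2-d vectors are assumptions for illustration, not part of any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(queries, docs, temperature=0.05):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other doc in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(queries)

# Toy batch: each query matches its own doc (aligned) or the other one (swapped).
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each query sits next to its positive and grows when the positive is farther away than the in-batch negatives. A lower temperature sharpens that difference.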

## 💻 Code Patterns

### When to fine-tune
```
General embeddings are fine for:
- Web content
- General Q&A
- General search

Custom embeddings pay off for:
- Legal documents
- Medical records
- Code retrieval
- Company jargon / abbreviations
- Multi-language (a specific language)
- Vertical domains (e-commerce, real estate)
```

### Sentence Transformers (fine-tune)
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Training data: sentence pairs with a target similarity in [0, 1]
train_examples = [
    InputExample(texts=['Q: refund policy', 'A: We offer 30 day refunds for...'], label=0.9),
    InputExample(texts=['Q: refund', 'A: We offer 30 day refunds for...'], label=0.8),
    InputExample(texts=['Q: refund', 'A: Today is sunny'], label=0.0),  # negative
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# model.fit is the classic training API; newer releases also offer SentenceTransformerTrainer
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./domain-embeddings',
)
```

### Triplet loss (positive / negative)
```python
from sentence_transformers import InputExample, losses

# Each example is (anchor, positive, negative)
train_examples = [
    InputExample(texts=[
        'How to refund?',             # anchor
        'Refund policy: 30 days...',  # positive
        'Today is sunny',             # negative
    ]),
]

# `model` is the SentenceTransformer loaded in the previous section
train_loss = losses.TripletLoss(model=model)
```
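
What the triplet objective computes can be shown on toy vectors. The sketch below assumes Euclidean distance and a margin of 0.5 (the value listed in the hyperparameter notes of this document); the helper names are hypothetical and the snippet does not call the library.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Easy triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 4.0])

# Hard triplet: the negative is almost as close as the positive, so the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.3, 0.4], [0.0, 0.6])
```

Only triplets that violate the margin contribute a gradient, which is why easy negatives teach the model very little.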

### Pair generation (with an LLM)
```python
import random

async def generate_pairs(documents):
    # `llm` is a placeholder async client; generate() is assumed to return a list of query strings
    pairs = []
    for doc in documents:
        # Ask the LLM for queries that this doc answers
        queries = await llm.generate(f"Generate 3 user queries that this answers:\n{doc}")
        for q in queries:
            pairs.append((q, doc, 1.0))  # positive

        # Random negative: exclude the positive doc itself.
        # A random doc can still happen to answer the query (false negative).
        others = [d for d in documents if d is not doc]
        if queries and others:
            pairs.append((queries[0], random.choice(others), 0.0))  # negative

    return pairs
```

→ Synthetic training data.

### Hard negative mining
```python
# Random negatives are easy.
# Better: similar-but-wrong docs, i.e. hard negatives.
# `queries` is a list of (query, positive_doc); `embed_search` is a placeholder retrieval helper.
pairs = []
for query, positive_doc in queries:
    # Retrieve the top 10 with the general (base) embedding
    top_10 = embed_search(query, k=10)

    # The positive may appear in the top 10; every other doc is a hard-negative candidate.
    # Unlabeled relevant docs can slip in here, so spot-check or filter them.
    for doc in top_10:
        if doc != positive_doc:
            pairs.append((query, doc, 0.0))
```

→ Better fine-tuning.
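
The mining step can be sketched end to end with toy vectors standing in for real embeddings. The function name, the document ids, and the 3-d vectors below are assumptions made up for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mine_hard_negatives(query_vec, doc_vecs, positive_id, k=2):
    """Return ids of the k docs most similar to the query, excluding the positive."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked if doc_id != positive_id][:k]

# Toy corpus: three refund-adjacent docs and one unrelated doc.
doc_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "return_shipping": [0.8, 0.3, 0.0],
    "warranty": [0.6, 0.6, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
hard_negatives = mine_hard_negatives([1.0, 0.0, 0.0], doc_vecs, "refund_policy", k=2)
```

The unrelated `weather` doc never makes the cut. The negatives that remain are the ones the base model already confuses with the positive.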

### Evaluation
```python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# test_examples: held-out InputExample pairs with similarity labels
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    test_examples,
    name='domain-test',
)

# Apply the evaluator to the model
score = evaluator(model, output_path='./eval')
print(f'Similarity score: {score}')
```

```python
import numpy as np

# Top-K accuracy (recall@k with one relevant doc per query)
def evaluate(model, queries, docs, ground_truth, k=10):
    # Encode docs once; with normalized vectors the dot product equals cosine similarity
    doc_emb = model.encode(docs, normalize_embeddings=True)
    query_emb = model.encode(queries, normalize_embeddings=True)
    scores = query_emb @ doc_emb.T  # shape: (num_queries, num_docs)

    correct = 0
    for row, true_doc in zip(scores, ground_truth):
        top_k = np.argsort(row)[-k:]
        if true_doc in [docs[i] for i in top_k]:
            correct += 1
    return correct / len(queries)
```
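
Top-K accuracy ignores where in the top K the right document lands. A rank-aware metric such as MRR is a common companion. The sketch below computes both from ranked id lists that have already been retrieved; the function names and the toy rankings are illustrative assumptions.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant doc appears in the first k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant doc, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate_rankings(rankings, ground_truth, k=10):
    """Average recall@k and MRR over all queries."""
    n = len(rankings)
    recall = sum(recall_at_k(r, t, k) for r, t in zip(rankings, ground_truth)) / n
    mrr = sum(reciprocal_rank(r, t) for r, t in zip(rankings, ground_truth)) / n
    return {"recall@k": recall, "mrr": mrr}

# Three queries: hit at rank 1, hit at rank 2, and a miss.
rankings = [["d1", "d2", "d3"], ["d2", "d3", "d1"], ["d3", "d1", "d2"]]
ground_truth = ["d1", "d3", "d9"]
metrics = evaluate_rankings(rankings, ground_truth, k=2)
```

Running the same eval set against the base model and the fine-tuned model gives a before/after comparison on identical queries.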

### Domain-specific models (off-the-shelf)
```
Code:
- microsoft/codebert-base
- jinaai/jina-embeddings-v2-base-code

Legal:
- nlpaueb/legal-bert-base-uncased

Medical:
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT

Multi-language:
- BAAI/bge-m3
- intfloat/multilingual-e5-large
```

→ Try a domain model before fine-tuning. Plain BERT checkpoints (CodeBERT, Legal-BERT, Bio_ClinicalBERT) are not trained for sentence similarity, so they usually still need pooling plus fine-tuning before they work well for retrieval.

### Voyage AI (best general)
```ts
import { VoyageAIClient } from 'voyageai';

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });

// General
const general = await voyage.embed({
  model: 'voyage-3.5',
  input: ['text1', 'text2'],
});

// Code
const code = await voyage.embed({
  model: 'voyage-code-3', // code-specific
  input: ['function ...', 'class ...'],
});
```

→ General + domain options.
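
### Cohere (multilingual)
```ts
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

const r = await cohere.v2.embed({
  model: 'embed-multilingual-v3.0',
  inputType: 'search_document', // or 'search_query'
  embeddingTypes: ['float'],    // the v2 embed endpoint expects this field
  texts: ['안녕'],               // Korean: "hello"
});
```

→ Supports 100+ languages.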

### Asymmetric (query vs document)
```ts
// Some models expect different instructions for queries and documents.
// `embed` is a placeholder for a self-hosted BGE-style embedding call.
const bgeQueryEmb = await embed('Represent this sentence for searching relevant passages: ' + query);
const bgeDocEmb = await embed(doc);

// Or use the built-in input types (Voyage, Cohere)
const queryEmb = await voyage.embed({ model: 'voyage-3.5', input: [query], inputType: 'query' });
const docEmb = await voyage.embed({ model: 'voyage-3.5', input: [doc], inputType: 'document' });
```
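
For self-hosted models it helps to keep the query/document distinction in one helper, so the instruction is never applied to documents by accident. A minimal sketch, assuming the BGE v1.5 English query instruction; `prepare_inputs` is a hypothetical helper, not a library function.

```python
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def prepare_inputs(texts, role, query_instruction=BGE_QUERY_INSTRUCTION):
    """Prefix queries with the instruction; leave documents untouched."""
    if role == "query":
        return [query_instruction + text for text in texts]
    if role == "document":
        return list(texts)
    raise ValueError(f"unknown role: {role!r}")

query_inputs = prepare_inputs(["how do refunds work?"], "query")
doc_inputs = prepare_inputs(["Refund policy: 30 days..."], "document")
```

The returned strings are what would be passed to the model's encode call. Other model families use different prefixes, so check the model card before reusing this constant.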

### Matryoshka (variable dimensions)
```ts
// OpenAI 3-large, Voyage
const r = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: text,
  dimensions: 256, // instead of the default 3072
});
```

→ Fewer dimensions = lower storage and search cost, often while retaining 90%+ of accuracy (verify on your own eval set).
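
For vectors that were stored at full size, the same reduction can be applied locally by truncating and re-normalizing. This is a sketch that only makes sense for models trained with Matryoshka-style objectives; `truncate_and_normalize` is a hypothetical helper and the 4-d vector is a toy input.

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` values, then rescale to unit length."""
    head = list(embedding[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]

full = [0.6, 0.0, 0.8, 0.0]   # unit-length toy embedding
short = truncate_and_normalize(full, 2)
```

Re-normalizing matters because a truncated vector is no longer unit length, which would skew dot-product scores.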

### Rerank (precision pass after embedding search)
```ts
// `embeddingSearch` is a placeholder first-stage retriever returning { text, ... }[]
async function searchWithRerank(query: string) {
  // 1. Embedding search → top 50
  const candidates = await embeddingSearch(query, 50);

  // 2. Rerank → top 5
  const reranked = await cohere.rerank({
    model: 'rerank-v3.5',
    query,
    documents: candidates.map(c => c.text),
    topN: 5,
  });

  return reranked.results.map(r => candidates[r.index]);
}
```

→ Large quality gain. The reranker is a cross-encoder.
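
The two-stage shape is independent of any provider. The sketch below shows it with pluggable scoring functions; the word-overlap scorers are toy stand-ins for an embedding search and a cross-encoder, and all names are assumptions for illustration.

```python
def retrieve_then_rerank(query, docs, fast_score, precise_score, first_k=50, final_k=5):
    """Stage 1 keeps first_k docs by a cheap score; stage 2 reorders them by a costlier score."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:first_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:final_k]

# Toy stand-ins: shared-word count for the first stage, Jaccard overlap for the reranker.
def shared_words(query, doc):
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

docs = [
    "refund policy for orders",
    "refund",
    "shipping times",
    "how to request a refund for orders",
]
top = retrieve_then_rerank("refund for orders", docs, shared_words, jaccard, first_k=3, final_k=2)
```

The expensive scorer only ever sees `first_k` candidates, which is what keeps a cross-encoder affordable at query time.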

### Quantization (storage savings)
```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

# float32 → int8 (4x smaller, small accuracy loss)
embeddings_int8 = quantize_embeddings(embeddings_float32, precision='int8')

# Or binary: keep the sign bit and pack 8 dims per byte (32x smaller)
embeddings_binary = np.packbits(embeddings_float32 > 0, axis=-1)
```

→ Saves memory and cost, and speeds up search.
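
What the two schemes do to a single vector can be written out in plain Python. This is a sketch under simple assumptions (a fixed value range for int8, sign bits packed most-significant-bit first for binary); the function names are hypothetical.

```python
def quantize_int8(vec, lo, hi):
    """Map floats in [lo, hi] linearly onto integers in [-128, 127]."""
    scale = 255.0 / (hi - lo)
    return [max(-128, min(127, round((x - lo) * scale) - 128)) for x in vec]

def quantize_binary(vec):
    """Pack sign bits, 8 dimensions per byte (most significant bit first)."""
    out = bytearray()
    for start in range(0, len(vec), 8):
        chunk = vec[start:start + 8]
        byte = 0
        for x in chunk:
            byte = (byte << 1) | (1 if x > 0 else 0)
        out.append(byte << (8 - len(chunk)))  # left-align a short final chunk
    return bytes(out)

def hamming(a, b):
    """Number of differing bits between two packed vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

int8_vec = quantize_int8([-1.0, 0.0, 1.0], lo=-1.0, hi=1.0)
bits_a = quantize_binary([0.5, -0.2, 0.1, -0.9, 0.3, 0.3, -0.1, 0.7])
bits_b = quantize_binary([-0.5, -0.2, 0.1, -0.9, 0.3, 0.3, -0.1, -0.7])
distance = hamming(bits_a, bits_b)
```

Binary vectors are compared with Hamming distance, which is why search gets faster as well as smaller. In practice the `[lo, hi]` range would be calibrated per dimension on a sample of real embeddings.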

### MTEB benchmark
```
Massive Text Embedding Benchmark.
Rankings per domain / task.

→ A guide for choosing the starting model.
```

### Code embeddings
```
- voyage-code-3 (top pick as of 2024)
- jinaai/jina-embeddings-v2-base-code
- microsoft/codebert-base
- togethercomputer/m2-bert-80M-32k-retrieval

Use cases:
- Code search (find a function by query)
- Code completion ranking
- Bug similarity
```

### Multi-modal embedding
```python
# CLIP: text and images share one vector space
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

text_emb = model.encode(['a cat'])
image_emb = model.encode(Image.open('cat.jpg'))

similarity = util.cos_sim(text_emb, image_emb)
```

→ Image search by text.

### Inference optimization
```python
# ONNX export (faster inference; the actual speedup varies by model and hardware)
from optimum.onnxruntime import ORTModelForFeatureExtraction

model = ORTModelForFeatureExtraction.from_pretrained(
    'BAAI/bge-base-en-v1.5',
    export=True,
)

# Especially useful for CPU inference
```

```python
# Sentence Transformers with the ONNX backend
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5', backend='onnx')
```

### Self-host inference (Triton, vLLM)
```bash
# vLLM serves embedding models as well as LLMs
vllm serve BAAI/bge-large-en-v1.5 --task=embed

# Or Sentence Transformers + Flask / FastAPI
```

### CDC + embedding (auto re-index)
```ts
// Document changed → re-embed
on('document.updated', async (doc) => {
  const newEmb = await embed(doc.content);
  await vectorDB.upsert(doc.id, newEmb);
});
```
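
Update events often fire when only metadata changed. Comparing a content hash before re-embedding avoids paying for identical text. A minimal sketch: `reindex_if_changed`, the in-memory `stored_hashes` dict, and the toy embed/upsert callables are all assumptions standing in for a real store and provider.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reindex_if_changed(doc_id, content, stored_hashes, embed_fn, upsert_fn):
    """Re-embed and upsert only when the content differs from the last indexed version."""
    new_hash = content_hash(content)
    if stored_hashes.get(doc_id) == new_hash:
        return False  # unchanged: skip the embedding call
    upsert_fn(doc_id, embed_fn(content))
    stored_hashes[doc_id] = new_hash
    return True

# Toy wiring: record upserts instead of calling a real vector DB.
stored_hashes = {}
upserted = []

def toy_embed(text):
    return [float(len(text))]

def toy_upsert(doc_id, vector):
    upserted.append(doc_id)

first = reindex_if_changed("doc-1", "hello", stored_hashes, toy_embed, toy_upsert)
second = reindex_if_changed("doc-1", "hello", stored_hashes, toy_embed, toy_upsert)
third = reindex_if_changed("doc-1", "hello world", stored_hashes, toy_embed, toy_upsert)
```

In a real system the hash would live next to the vector (for example as metadata) so that it survives restarts.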

### Cost (approximate)
```
OpenAI text-embedding-3-small: $0.02/1M tok
Voyage 3.5: $0.06/1M tok
Cohere embed-v3: $0.10/1M tok
Self-host: GPU cost only

→ High volume = self-host (BGE or another open-weight model).
Strict quality requirements = Voyage 3 / Cohere v3.
```
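
The per-token prices translate into corpus-level cost with simple arithmetic. The sketch below uses the approximate prices listed above; the corpus size (1M documents at 500 tokens each) is an assumed example, and current provider pricing should be checked before budgeting.

```python
# Approximate list prices from the table above, in USD per 1M tokens.
PRICE_PER_MILLION_TOKENS = {
    "openai/text-embedding-3-small": 0.02,
    "voyage/voyage-3.5": 0.06,
    "cohere/embed-v3": 0.10,
}

def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million):
    """Total cost of embedding a corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# Assumed corpus: 1M documents, 500 tokens each (500M tokens in total).
costs = {
    name: embedding_cost_usd(1_000_000, 500, price)
    for name, price in PRICE_PER_MILLION_TOKENS.items()
}
```

For this corpus the one-time cost comes to roughly $10, $30, and $50 respectively. A full re-embed after a model change costs the same amount again.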

### Embedding cache
```ts
import { createHash } from 'node:crypto';

const cache = new Map<string, Float32Array>();

// `api.embed` is a placeholder for the embedding provider call
async function embed(text: string): Promise<Float32Array> {
  const hash = createHash('sha256').update(text).digest('hex');
  if (cache.has(hash)) return cache.get(hash)!;

  const emb = await api.embed(text);
  cache.set(hash, emb);
  return emb;
}
```

### Drift / refresh
```
When the domain shifts, or new languages / abbreviations appear:
- Re-evaluate on a schedule
- Model update → re-embed every document
- Expensive, so plan for it
```

### Hyperparameter
```python
# Batch size: depends on GPU memory (32-128)
# Learning rate: 1e-5 ~ 5e-5
# Epochs: 1-5 (watch for overfitting)
# Margin (triplet): 0.5
# Temperature (contrastive): 0.05-0.1
```

## 🤔 Decision Criteria
| Situation | Recommendation |
|---|---|
| General web | OpenAI 3-small / Voyage |
| Code | Voyage code-3 |
| Legal / medical | Domain-specific BERT + fine-tune |
| Multi-language | Cohere multilingual / BGE-M3 |
| Self-host privacy | BGE / Sentence Transformers |
| Very lightweight | Quantized BGE |

## ❌ Anti-patterns
- **Assuming a general embedding handles your domain**: it is weak there; fine-tune.
- **No hard negatives**: the fine-tune ends up weak.
- **No testing or evaluation**: you cannot tell whether anything improved.
- **Overfitting (little data + many epochs)**: validate on held-out data.
- **Using an asymmetric model symmetrically**: queries and documents need different prompts.
- **Quantizing without an accuracy check**: verify before shipping.

## 🤖 LLM Usage Hints
- General = OpenAI / Voyage. Domain = fine-tune.
- Pair generation is fast with an LLM.
- Hard negatives + reranker = large gains.
- MTEB is the starting guide.

## 🔗 Related Documents
- [[AI_Embeddings_Comparison]]
- [[AI_RAG_Advanced]]
- [[AI_Fine_Tuning_vs_Prompting]]