id: ai-custom-embeddings
title: Custom Embeddings - Fine-tune / Domain-specific
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, embeddings, fine-tune, vibe-coding
tech_stack: language = Python / TS; applicable_to = Backend
applied_in:
aliases: embedding fine-tune, domain embeddings, sentence transformers, BGE, contrastive learning

Custom Embeddings

General-purpose embeddings are weak on specialized domains (legal, medical, code). The options are domain-specific fine-tuning or a dedicated model: Sentence Transformers, BGE, Voyage, Cohere.

📖 Key Concepts

  • General: trained on general web text, so weak on specialized domains.
  • Domain: legal / code / medical, etc.
  • Fine-tune: pair-based contrastive learning.
  • Reranker: a separate task that refines the ranking after embedding retrieval.
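
To make "pair-based contrastive learning" concrete, here is a minimal pure-Python sketch of an in-batch contrastive (InfoNCE-style) loss. It is an illustration under assumptions, not the loss any particular library ships: the function names, the toy 2-D vectors, and the default temperature of 0.05 are chosen for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean in-batch contrastive loss.

    doc_vecs[i] is the positive for query_vecs[i]; every other document
    in the batch acts as a negative for that query.
    """
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [cosine(q, d) / temperature for d in doc_vecs]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # equals -log softmax(positive)
    return total / len(query_vecs)

# Toy 2-D "embeddings" (hypothetical values)
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive sits next to its query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive sits next to the wrong query

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

The loss is near zero when each query is closest to its own positive and grows when another document in the batch is closer, which is the signal fine-tuning uses to pull pairs together.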

💻 Code Patterns

When to fine-tune

A general embedding is fine for:
- Web content
- General Q&A
- General search

Custom embeddings pay off for:
- Legal documents
- Medical records
- Code retrieval
- Company jargon / abbreviations
- Multi-language (a specific language)
- Domain-specific corpora (e-commerce, real estate)

Sentence Transformers (fine-tune)

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Base model
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

# Training data: similar pairs
train_examples = [
    InputExample(texts=['Q: refund policy', 'A: We offer 30 day refunds for...'], label=0.9),
    InputExample(texts=['Q: refund', 'A: We offer 30 day refunds for...'], label=0.8),
    InputExample(texts=['Q: refund', 'A: Today is sunny'], label=0.0),  # negative
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./domain-embeddings',
)

Triplet loss (positive / negative)

from sentence_transformers import InputExample, losses

train_examples = [
    InputExample(texts=[
        'How to refund?',           # anchor
        'Refund policy: 30 days...',  # positive
        'Today is sunny',             # negative
    ]),
]

train_loss = losses.TripletLoss(model=model)
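
What a triplet objective computes can be sketched in a few lines of plain Python. This assumes Euclidean distance and the margin of 0.5 listed in the Hyperparameter section of this note; the library's own defaults (distance metric and margin) may differ, and the vectors are made up for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # far from the anchor
hard_negative = [0.8, 0.3]    # close to the anchor, but the wrong answer

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

The easy negative already satisfies the margin, so it contributes zero loss and no gradient. Only the hard negative produces a training signal.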

Pair generation (with an LLM)

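import random

async def generate_pairs(documents):
    # `llm` is a placeholder async client; generate() is assumed to return a list of query strings
    pairs = []
    for doc in documents:
        # Ask the LLM to write queries that this doc answers
        queries = await llm.generate(f"Generate 3 user queries that this answers:\n{doc}")
        for q in queries:
            pairs.append((q, doc, 1.0))  # positive

        # Random negative: sample from the other docs so the positive itself is never picked
        others = [d for d in documents if d is not doc]
        if queries and others:
            # The sampled doc can still be relevant by chance (a false negative)
            pairs.append((queries[0], random.choice(others), 0.0))

    return pairs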

→ Synthetic training data.

Hard negative mining

# Random negatives are easy.
# Better: similar but wrong = hard negatives.

pairs = []
for query, positive_doc in queries:
    # Retrieve the top 10 with the general (base) embedding model
    top_10 = embed_search(query, k=10)

    # Retrieved docs other than the labeled positive are hard negatives
    # (some may be unlabeled positives, so spot-check them)
    for doc in top_10:
        if doc != positive_doc:
            pairs.append((query, doc, 0.0))

→ Better fine-tuning.
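
The mining step can be tried end to end without a vector database. The sketch below is a self-contained toy: the corpus, its 3-D vectors, and the function name `mine_hard_negatives` are all hypothetical stand-ins for a real embedding search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def mine_hard_negatives(query_vec, positive_id, corpus, k=3):
    """Return the ids of the k most similar docs that are not the labeled positive."""
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, corpus[doc_id]), reverse=True)
    return [doc_id for doc_id in ranked if doc_id != positive_id][:k]

# Toy corpus: doc id -> embedding (made-up values)
corpus = {
    "refund_policy": [0.9, 0.1, 0.0],
    "return_shipping": [0.8, 0.3, 0.0],
    "warranty": [0.6, 0.6, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "How to refund?"

hard_negatives = mine_hard_negatives(query_vec, "refund_policy", corpus, k=2)
```

The unrelated "weather" document is never selected: it would be an easy negative, and the model already separates it from the query.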

Evaluation

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    test_examples,
    name='domain-test',
)

# Apply the evaluator to the model
score = evaluator(model, output_path='./eval')
print(f'Similarity score: {score}')

# Top-K accuracy
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity  # assumed source of cosine_similarity

def evaluate(model, queries, docs, ground_truth, k=10):
    doc_emb = model.encode(docs)  # encode the corpus once, not once per query
    correct = 0
    for q, true_doc in zip(queries, ground_truth):
        q_emb = model.encode([q])                      # shape (1, dim)
        scores = cosine_similarity(q_emb, doc_emb)[0]  # shape (len(docs),)
        top_k = np.argsort(scores)[-k:]
        if true_doc in [docs[i] for i in top_k]:
            correct += 1
    return correct / len(queries)
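
Top-K accuracy does not reveal where in the top K the right document landed. A common complement is mean reciprocal rank (MRR). The sketch below computes both from already-ranked results; the function names and document ids are illustrative, and it assumes exactly one relevant document per query.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant doc appears in the first k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant doc (rank starts at 1), or 0.0 if it is missing."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate_rankings(results, k=10):
    """results: list of (ranked_ids, relevant_id) tuples, one per query."""
    n = len(results)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }

# Three toy queries: hit at rank 1, hit at rank 3, and a miss
results = [
    (["d1", "d2", "d3"], "d1"),
    (["d4", "d5", "d6"], "d6"),
    (["d7", "d8", "d9"], "d0"),
]
metrics = evaluate_rankings(results, k=2)
```

Running the same function on the base model and the fine-tuned model over a held-out set gives a direct before/after comparison.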

Domain-specific models (off-the-shelf)

Code:
- microsoft/codebert-base
- jinaai/jina-embeddings-v2-base-code

Legal:
- nlpaueb/legal-bert-base-uncased

Medical:
- emilyalsentzer/Bio_ClinicalBERT
- microsoft/BiomedNLP-PubMedBERT

Multi-language:
- BAAI/bge-m3
- intfloat/multilingual-e5-large

→ Try a domain model before fine-tuning.

Voyage AI (best general)

import { VoyageAIClient } from 'voyageai';

// apiKey: your Voyage API key, loaded elsewhere
const voyage = new VoyageAIClient({ apiKey });

// General
const generalResult = await voyage.embed({
  model: 'voyage-3.5',
  input: ['text1', 'text2'],
});

// Code
const codeResult = await voyage.embed({
  model: 'voyage-code-3',  // code-specific
  input: ['function ...', 'class ...'],
});

→ General + domain options.

Cohere (multilingual)

const r = await cohere.v2.embed({
  model: 'embed-multilingual-v3.0',
  inputType: 'search_document',  // or 'search_query' for queries
  embeddingTypes: ['float'],     // the v2 endpoint expects the output type to be stated
  texts: ['안녕'],
});

→ 100+ languages.

Asymmetric (query vs document)

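// Some models use a different instruction for queries than for documents
// (the prefix below is the one BGE v1.5 English models use for queries)
const queryEmb = await embed('Represent this sentence for searching relevant passages: ' + query);
const docEmb = await embed(doc);

// Or built-in (Voyage shown; Cohere uses 'search_query' / 'search_document')
const voyageQueryEmb = await voyage.embed({ model: 'voyage-3.5', input: [query], inputType: 'query' });
const voyageDocEmb = await voyage.embed({ model: 'voyage-3.5', input: [doc], inputType: 'document' });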
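
On the Python side, the same asymmetry can be kept in one place so that queries and documents never get mixed up. This is a small sketch; the helper name `prepare_texts` is hypothetical, and the default prefix is the query instruction used by BGE v1.5 English models (other models use different prefixes or none).

```python
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def prepare_texts(texts, role, query_prefix=BGE_QUERY_PREFIX):
    """Add the query instruction to queries only; documents are embedded as-is."""
    if role == "query":
        return [query_prefix + t for t in texts]
    if role == "document":
        return list(texts)
    raise ValueError(f"role must be 'query' or 'document', got {role!r}")

query_inputs = prepare_texts(["How to refund?"], role="query")
doc_inputs = prepare_texts(["Refund policy: 30 days..."], role="document")
```

The returned lists are what would be passed to `model.encode(...)`. Applying the prefix to documents as well, or forgetting it on queries, is the symmetric-use mistake listed under the anti-patterns.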

Matryoshka (variable dimensions)

// OpenAI 3-large, Voyage
const r = await openai.embeddings.create({
  model: 'text-embedding-3-large',
  input: text,
  dimensions: 256,  // instead of the default 3072
});

→ Smaller dim = lower storage and search cost, with roughly 90%+ of the accuracy retained.
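
When full-size vectors from a Matryoshka-trained model are already stored, a shorter vector can be derived locally by keeping the leading dimensions and re-normalizing. This is a sketch of that step under the assumption that the model was trained this way; truncating a model that was not will generally cost more accuracy. The function name and values are illustrative.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]           # toy unit-length vector
short = truncate_and_normalize(full, 2)
```

Re-normalizing matters because cosine or dot-product search assumes unit-length vectors, and a truncated vector is no longer unit length.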

Rerank (refine after embedding search)

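// Wrapper name is illustrative; embeddingSearch is the first-stage retriever
async function searchWithRerank(query: string) {
  // 1. Embedding search → top 50
  const candidates = await embeddingSearch(query, 50);

  // 2. Rerank → top 5
  const reranked = await cohere.rerank({
    model: 'rerank-v3.5',
    query,
    documents: candidates.map(c => c.text),
    topN: 5,
  });

  // Map reranked positions back to the original candidates
  return reranked.results.map(r => candidates[r.index]);
}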

→ Large quality gain. The reranker is a cross-encoder.
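
The two-stage shape (cheap recall, then precise reordering) does not depend on any one vendor. Below is a minimal sketch with pluggable scoring functions. The toy scorers (`word_overlap`, `phrase_match`) only stand in for an embedding search and a cross-encoder, and all names are hypothetical.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, recall_k=50, top_n=5):
    """Stage 1 keeps recall_k candidates by a cheap score; stage 2 reorders them precisely."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:recall_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]

def word_overlap(query, doc):
    """Stand-in for embedding similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_match(query, doc):
    """Stand-in for a cross-encoder: rewards the exact phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0

docs = [
    "policy updates and refund news",
    "our refund policy lasts 30 days",
    "shipping policy",
    "today is sunny",
]
best = retrieve_then_rerank("refund policy", docs, word_overlap, phrase_match,
                            recall_k=2, top_n=1)
```

The first stage cannot tell the two refund documents apart (both share two words with the query). The second stage, which reads query and document together, promotes the one that actually states the policy.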

Quantization (storage savings)

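# float32 → int8 (4x smaller, accuracy mostly retained)
# `quantize` is a placeholder for an int8 scalar quantizer
embeddings_int8 = quantize(embeddings_float32)

# Or binary: keep only the sign bit, packed 8 dims per byte (32x smaller).
# Without packing, one uint8 per dim is only 4x smaller.
import numpy as np
embeddings_binary = np.packbits(embeddings_float32 > 0, axis=-1)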

→ Lower memory / cost plus faster search.
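
For intuition about what the quantizers do, here is a dependency-free sketch of symmetric int8 scalar quantization and sign-bit binarization. It is one simple scheme among several (production libraries often calibrate ranges per dimension over a sample of the corpus), and the function names and vector are illustrative.

```python
def quantize_int8(vec):
    """Symmetric scalar quantization: floats -> ints in [-127, 127] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return [v * scale for v in quantized]

def binarize(vec):
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)

def hamming(a, b):
    """Number of differing bits; the distance used to compare binary embeddings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

vec = [0.12, -0.5, 0.33, 0.0, -0.07, 0.9, -0.9, 0.45]
quantized, scale = quantize_int8(vec)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(vec, restored))
bits = binarize(vec)
```

The int8 round-trip error is bounded by half a quantization step, while the binary form keeps one bit per dimension. Measuring retrieval accuracy before and after on a real eval set is still needed, as the anti-patterns list notes.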

MTEB benchmark

Massive Text Embedding Benchmark.
Rankings by domain / task.

→ A guide for choosing a starting model.

Code embeddings

- voyage-code-3 (best 2024)
- jinaai/jina-embeddings-v2-base-code
- microsoft/codebert-base
- togethercomputer/m2-bert-80M-32k-retrieval

Use case:
- Code search (find function by query)
- Code completion ranking
- Bug similarity

Multi-modal embedding

# CLIP: text and images share one vector space
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

text_emb = model.encode(['a cat'])
image_emb = model.encode(Image.open('cat.jpg'))

# util.cos_sim returns a similarity matrix (here 1 x 1)
similarity = util.cos_sim(text_emb, image_emb)

→ Image search by text.

Inference optimization

# ONNX export (can be noticeably faster, especially on CPU; benchmark on your own hardware)
from optimum.onnxruntime import ORTModelForFeatureExtraction

model = ORTModelForFeatureExtraction.from_pretrained(
    'BAAI/bge-base-en-v1.5',
    export=True,
)

# Faster CPU inference
# Sentence Transformers with the ONNX backend
model = SentenceTransformer('BAAI/bge-base-en-v1.5', backend='onnx')

Self-host inference (Triton, vLLM)

# vLLM (serves both LLMs and embedding models)
vllm serve BAAI/bge-large-en-v1.5 --task=embed

# Or Sentence Transformers + Flask / FastAPI

CDC + embedding (auto re-index)

// Document changed → re-embed
on('document.updated', async (doc) => {
  const newEmb = await embed(doc.content);
  await vectorDB.upsert(doc.id, newEmb);
});

Cost (approximate)

OpenAI text-embedding-3-small: $0.02/1M tok
Voyage 3.5: $0.06/1M tok
Cohere embed-v3: $0.10/1M tok
Self-host: GPU cost only

→ Big volume = self-host (BGE / Voyage).
   Quality strict = Voyage 3 / Cohere v3.
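
The per-token prices above turn into a budget with one multiplication. The sketch below uses the approximate prices listed in this note; the corpus size (1M documents) and average length (500 tokens) are assumptions picked for the example, so substitute your own numbers and current pricing.

```python
# USD per 1M tokens, copied from the approximate figures in this note
PRICE_PER_MILLION_TOKENS = {
    "openai/text-embedding-3-small": 0.02,
    "voyage/voyage-3.5": 0.06,
    "cohere/embed-v3": 0.10,
}

def embedding_cost_usd(num_docs, avg_tokens_per_doc, price_per_million):
    """One-off cost of embedding a corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# Assumed corpus: 1M documents, about 500 tokens each (500M tokens in total)
costs = {
    name: embedding_cost_usd(1_000_000, 500, price)
    for name, price in PRICE_PER_MILLION_TOKENS.items()
}
```

Under these assumptions a full pass costs about $10, $30, and $50 respectively. The same amount is paid again whenever the model changes and the corpus must be re-embedded.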

Embedding cache

const cache = new Map<string, Float32Array>();

async function embed(text: string) {
  const hash = sha256(text);
  if (cache.has(hash)) return cache.get(hash)!;
  
  const emb = await api.embed(text);
  cache.set(hash, emb);
  return emb;
}
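
A Python counterpart of the cache, as a sketch. One design choice is added here as an assumption: the model name is part of the cache key, so vectors from different models never collide. `fake_embed` is a stand-in for a real API call.

```python
import hashlib

class EmbeddingCache:
    """In-memory cache keyed by a SHA-256 hash of (model name, text)."""

    def __init__(self, embed_fn, model_name):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store = {}

    def _key(self, text):
        payload = f"{self._model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def embed(self, text):
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)  # only called on a cache miss
        return self._store[key]

# Stand-in embedder that records how often it is called
calls = []
def fake_embed(text):
    calls.append(text)
    return [float(len(text))]

cache = EmbeddingCache(fake_embed, model_name="demo-model")
first = cache.embed("refund policy")
second = cache.embed("refund policy")   # served from the cache
third = cache.embed("shipping policy")
```

For anything beyond one process, the same key scheme works with an external store. The in-memory dict is only the simplest case.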

Drift / refresh

Domain changes / new languages / new abbreviations:
- Re-evaluate on a regular schedule
- Model update → re-embed every document
- High cost, so plan for it
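
One way to plan a refresh is to store the model name next to every vector and re-embed only what is stale, in batches. Vectors produced by different models are not comparable, so a mixed index has to be migrated rather than queried as-is. The record layout and function names below are assumptions for illustration.

```python
def find_stale(index, current_model):
    """Return ids of documents whose stored vector came from a different model."""
    return sorted(doc_id for doc_id, record in index.items()
                  if record["model"] != current_model)

def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy index: doc id -> metadata stored alongside the vector
index = {
    "doc1": {"model": "bge-base-v2-finetune"},
    "doc2": {"model": "bge-base-v1-finetune"},
    "doc3": {"model": "bge-base-v1-finetune"},
}

stale = find_stale(index, current_model="bge-base-v2-finetune")
plan = list(batches(stale, size=1))
```

The length of `stale` multiplied by the average document size gives the token count for the re-embedding budget.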

Hyperparameter

# Batch size: depends on GPU memory (32-128)
# Learning rate: 1e-5 to 5e-5
# Epochs: 1-5 (watch for overfitting)
# Margin (triplet): 0.5
# Temperature (contrastive): 0.05-0.1

🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| General web | OpenAI 3-small / Voyage |
| Code | Voyage code-3 |
| Legal / medical | Domain-specific BERT + fine-tune |
| Multi-language | Cohere multilingual / BGE-M3 |
| Self-host, privacy | BGE / Sentence Transformers |
| Very lightweight | Quantized BGE |

Anti-patterns

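  • Assuming a general embedding works for a specialized domain: results are weak, so fine-tune.
  • No hard negatives: weak fine-tune.
  • No testing, no eval: you cannot tell whether anything improved.
  • Overfitting (little data + many epochs): validate on held-out data.
  • Using an asymmetric model symmetrically: query and document prompts differ.
  • Quantizing without an accuracy check: verify.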

🤖 Hints for LLM Use

  • General = OpenAI / Voyage. Domain = fine-tune.
  • Pair generation is fast with an LLM.
  • Hard negatives + a reranker = large gains.
  • MTEB is the starting guide.

🔗 Related Documents