---
id: wiki-2026-0508-similarity-metrics
title: Similarity Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Distance Metrics, Similarity Functions]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [machine-learning, retrieval, embeddings, distance]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: NumPy/FAISS
---

# Similarity Metrics

## One-line summary
> **"A measure of how close two vectors are."** From classical IR (Salton, 1970s) to modern embedding-based retrieval (RAG with Claude Opus 4.7, 2026): cosine is the default, and task-specific tuning is critical.

## Core

### Main metrics
- **Cosine**: `cos(a,b) = (a·b)/(||a|| ||b||)` ∈ [-1, 1]. Direction only, scale-invariant.
- **Dot product**: `a·b`. Scale-sensitive — large norm dominates.
- **Euclidean (L2)**: `||a-b||₂`. Geometric distance, sensitive to magnitude.
- **Manhattan (L1)**: `Σ|aᵢ-bᵢ|`. Less sensitive to outliers than L2, because differences are not squared.
- **Jaccard**: `|A∩B|/|A∪B|`. Set similarity.
- **Hamming**: count of differing positions. Binary vectors.
- **Edit (Levenshtein)**: min insertions/deletions/substitutions. Strings.
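
Manhattan and Hamming distance are easy to state but easy to confuse in practice. The following is a minimal sketch of both, assuming NumPy inputs; the function names `manhattan` and `hamming` are illustrative and not taken from any library.

```python
import numpy as np

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("Hamming distance needs equal-length inputs")
    return int(np.count_nonzero(a != b))

x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 3.0]
l1 = manhattan(x, y)                        # |1-2| + |2-0| + |3-3| = 3.0
ham = hamming([1, 0, 1, 1], [1, 1, 0, 1])   # positions 1 and 2 differ -> 2
print(l1, ham)
```

Both are distances (smaller means closer), unlike cosine and dot product, where larger means closer.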

### Key relationships
- For normalized vectors (||v||=1): cosine = dot product = `1 - L2²/2`.
- Modern embedding models (OpenAI text-embedding-3, voyage-3, BGE-M3) typically return normalized output, so cosine ≡ dot product.
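
The identity for unit vectors follows from `||a-b||² = ||a||² + ||b||² - 2(a·b) = 2 - 2(a·b)`. A small numerical check, using random vectors as stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Project both vectors onto the unit sphere
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

dot = float(a @ b)
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
l2_sq = float(np.sum((a - b) ** 2))

# For unit vectors: cosine == dot product == 1 - L2^2 / 2
identity_holds = np.isclose(cos, dot) and np.isclose(dot, 1 - l2_sq / 2)
print(identity_holds)
```

One consequence: on normalized vectors, ranking by ascending L2 distance and ranking by descending cosine give the same order.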

### Applications
1. RAG retrieval (text embeddings + cosine).
2. Image search (CLIP embeddings).
3. Recommender systems (item-item).
4. Deduplication (near-duplicate detection).
5. Clustering (k-means uses Euclidean).

## 💻 Patterns

### Cosine similarity
```python
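import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (undefined if either has zero norm)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Batch: rows of A against rows of B, result has shape (len(A), len(B))
def cosine_matrix(A, B):
    # Normalize each row to unit length, then one matrix product gives all cosines
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T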
```

### Sklearn pairwise
```python
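import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

X, Y = np.random.randn(5, 16), np.random.randn(3, 16)  # placeholder data, rows = samples

sim = cosine_similarity(X, Y)     # similarities, shape (n_X, n_Y)
dist = euclidean_distances(X, Y)  # distances, shape (n_X, n_Y)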
```

### FAISS (large-scale)
```python
import faiss
import numpy as np

embeddings = np.random.randn(1_000_000, 768).astype('float32')
faiss.normalize_L2(embeddings)  # for cosine via inner product

index = faiss.IndexFlatIP(768)  # inner product = cosine on normalized
index.add(embeddings)

query = np.random.randn(1, 768).astype('float32')
faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # top-10
```

### RAG with embeddings (2026)
```python
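import numpy as np
import voyageai
from anthropic import Anthropic

vo = voyageai.Client()   # API key is read from the environment
client = Anthropic()     # for the generation step (not shown here)

# Placeholder inputs for illustration
corpus = ["Cosine similarity compares direction.", "BM25 is a lexical ranking function."]
query = "How do I compare embedding directions?"

def embed(texts):
    r = vo.embed(texts, model="voyage-3", input_type="document")
    return np.array(r.embeddings)

docs_emb = embed(corpus)
q_emb = vo.embed([query], model="voyage-3", input_type="query").embeddings[0]

scores = docs_emb @ np.array(q_emb)    # cosine, assuming normalized embeddings
top_k = np.argsort(scores)[-5:][::-1]  # indices of the top 5 (or fewer), best first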
```

### Jaccard for sets
```python
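def jaccard(a: set, b: set) -> float:
    # Two empty sets are treated as identical by convention
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# MinHash for approximate Jaccard at scale
from datasketch import MinHash

doc1 = "the quick brown fox"  # placeholder documents
doc2 = "the quick red fox"
m1, m2 = MinHash(), MinHash()
for w in doc1.split(): m1.update(w.encode())
for w in doc2.split(): m2.update(w.encode())
print(m1.jaccard(m2))  # estimate of the Jaccard similarity of the two word sets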
```

### Edit distance
```python
def levenshtein(a, b):
    if len(a) < len(b): a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                curr[-1] + 1,
                prev[j] + 1,
                prev[j-1] + (ca != cb)
            ))
        prev = curr
    return prev[-1]
```

### Hybrid search (BM25 + dense)
```python
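# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# bm25_search and vector_search are placeholders that return doc IDs, best first
bm25_top = bm25_search(query)
dense_top = vector_search(query)
fused = rrf([bm25_top, dense_top])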
```
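
To see how fusion behaves, here is a self-contained toy run. It is a sketch only: the document IDs and both rankings are made up, `rrf_fuse` is an illustrative name, and ranks are taken as 1-based, as in the usual statement of Reciprocal Rank Fusion.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several best-first rankings of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

# Hypothetical outputs of a lexical and a dense retriever
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = rrf_fuse([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # d1 and d3 appear in both lists, so they rise to the top
```

Because only ranks enter the score, the BM25 and cosine scores never need to be put on a common scale.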

## Decision criteria
| Situation | Metric |
|---|---|
| Text embeddings (LLM, BERT) | Cosine |
| Image embeddings (CLIP) | Cosine |
| Tabular numeric features | Euclidean (after StandardScaler) |
| Sparse binary features | Jaccard |
| Strings / typo-tolerant | Levenshtein |
| Hashes / fingerprints | Hamming |
| L1-sparse (Lasso embeddings) | Manhattan |

**Default**: normalized embeddings + cosine (= dot product on unit vectors).
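
The "Euclidean (after StandardScaler)" row can be illustrated with plain NumPy. This is a sketch on made-up data (age in years, income in dollars); the z-scoring below uses the population standard deviation, which is also what `StandardScaler` does by default.

```python
import numpy as np

# Hypothetical rows: (age in years, income in dollars)
X = np.array([
    [25.0, 50_000.0],
    [27.0, 53_000.0],
    [60.0, 51_000.0],
    [62.0, 59_000.0],
])

def nearest_to_first(M):
    """Index of the row closest (L2) to row 0, excluding row 0 itself."""
    d = np.linalg.norm(M[1:] - M[0], axis=1)
    return int(np.argmin(d)) + 1

# Z-score each column: zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nn = nearest_to_first(X)  # income dominates: picks the 60-year-old
std_nn = nearest_to_first(Z)  # both features count: picks the 27-year-old
print(raw_nn, std_nn)
```

On the raw data the income column decides the neighbor almost entirely, since its numeric range is about a thousand times larger than that of age.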

## 🔗 Graph
- Parent: [[Information Retrieval]] · [[Embeddings]]
- Applications: [[RAG]] · [[Recommender-Systems]] · [[Clustering]]
- Adjacent: [[FAISS]] · [[BM25]] · [[Locality-Sensitive-Hashing (LSH)|Locality-Sensitive-Hashing]]

## 🤖 LLM usage
**When**: RAG retrieval ranking. Semantic search. Deduplication. Clustering. Nearest-neighbor classification. Recommender similarity.
**When not**: Hierarchical / structured similarity (use tree edit distance, graph kernels). Temporal similarity between sequences that are shifted or run at different speeds (use DTW for time series).
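
For the time-series case, a minimal sketch of dynamic time warping (DTW) shows why a pointwise metric is the wrong tool. It assumes 1-D numeric sequences and absolute difference as the local cost; `dtw_distance` is an illustrative name, and there is no window constraint or other optimization.

```python
def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) DTW with |x - y| as the local cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = best cost of aligning s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # s advances, t stays
                cost[i][j - 1],      # t advances, s stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]  # same shape as a, delayed by one step
c = [2, 2, 2, 2, 2]     # different shape

shifted = dtw_distance(a, b)    # 0.0: the delay is absorbed by warping
different = dtw_distance(a, c)  # positive: no warping makes these match
print(shifted, different)
```

Note that `a` and `b` have different lengths, so Euclidean distance is not even defined for them without padding or resampling.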

## ❌ Anti-patterns
- **Cosine on un-normalized vectors**: if normalization is forgotten, inner-product scores are not cosine. Call `normalize_L2` explicitly.
- **Euclidean without scaling**: when features have different scales, the larger-scale feature dominates.
- **Jaccard on dense vectors**: Jaccard is a set similarity; do not apply it to dense float vectors.
- **Magnitude-blind cosine for ranking**: cosine captures direction only, so it does not capture popularity or confidence.
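
The first anti-pattern is easy to reproduce. In this sketch with made-up 2-D vectors, an inner-product search over un-normalized vectors returns a different top hit than cosine would:

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],  # nearly the same direction as the query, small norm
    [5.0, 5.0],  # 45 degrees away from the query, large norm
])

# Raw inner product: the large-norm vector wins
ip_scores = docs @ query
# Cosine: divide out both norms, so only direction matters
cos_scores = ip_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

ip_best = int(np.argmax(ip_scores))
cos_best = int(np.argmax(cos_scores))
print(ip_best, cos_best)
```

The same example reads in reverse for the last anti-pattern: if magnitude carries a signal you want (for instance popularity), cosine discards it.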

## 🧪 Verification / duplicates
- Verified (Salton & McGill IR text, MTEB benchmark 2025, FAISS docs).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — Similarity metrics with cosine/L2/Jaccard/edit, FAISS, RAG, hybrid |