---
id: wiki-2026-0508-similarity-metrics-in-ai
title: Similarity Metrics in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Similarity Measures, Distance Metrics, Vector Similarity]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [similarity, embeddings, retrieval, vector-search]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy/faiss/sentence-transformers
---
# Similarity Metrics in AI
## In One Line
> **"The choice of similarity metric determines retrieval, clustering, and matching quality."** Cosine dominates dense-embedding semantic search, Jaccard measures set overlap, and edit distance handles fuzzy string matching. The standard 2026 stack is normalized cosine plus ANN (HNSW / IVF-PQ).
## Key Points
### Vector metrics
- **Cosine similarity**: `dot(a,b) / (||a|| * ||b||)`. Magnitude-invariant; the default for embeddings.
- **Dot product**: equivalent to cosine on normalized embeddings, and faster (no division).
- **Euclidean (L2)**: raw straight-line distance; used for cluster centroids and k-means.
- **Manhattan (L1)**: less sensitive to outliers than L2; used for sparse features.
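
The relationships in the list above can be checked numerically. This is a minimal sketch, assuming NumPy is available; the vectors and the helper name `unit` are illustrative and not part of any library. It suggests why normalized dot product and cosine are interchangeable, and why, on unit vectors, L2 and cosine give the same neighbor ordering.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Return v scaled to unit L2 norm (hypothetical helper)."""
    return v / np.linalg.norm(v)


a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0])

# Cosine from the definition, and as a plain dot product of unit vectors.
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalized = float(np.dot(unit(a), unit(b)))

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
l2_squared = float(np.sum((unit(a) - unit(b)) ** 2))

l1 = float(np.sum(np.abs(a - b)))   # Manhattan distance on the raw vectors
l2 = float(np.linalg.norm(a - b))   # Euclidean distance on the raw vectors

print(cosine, dot_normalized, l2_squared, l1, l2)
```

Because squared L2 distance on unit vectors is a decreasing function of cosine, an index built with either metric should rank neighbors identically once embeddings are normalized.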
### Set / String metrics
- **Jaccard**: `|A ∩ B| / |A ∪ B|`. Set / token overlap.
- **Levenshtein (edit distance)**: character-level fuzzy matching.
- **Hamming**: distance between fixed-length binary strings or hashes.
- **Tanimoto**: chemistry / fingerprint similarity.
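
Exact Jaccard and Hamming need no library. The sketch below uses only the standard library; the function names are illustrative, and returning 1.0 for two empty sets is an assumed convention (some implementations return 0.0 or raise instead).

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; assumed to be 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(cx != cy for cx, cy in zip(x, y))


tokens_a = set("the quick brown fox".split())
tokens_b = set("the quick brown dog".split())
print(jaccard(tokens_a, tokens_b))  # 3 shared tokens / 5 distinct tokens = 0.6
print(hamming("10110", "10011"))    # differs at indices 2 and 4, so 2
```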
### Applications
1. **Semantic search**: sentence-transformer embedding + cosine + FAISS HNSW.
2. **Deduplication**: near-duplicate detection with MinHash + Jaccard.
3. **Recommendation**: cosine between user and item embeddings.
4. **Fuzzy matching**: Levenshtein / Jaro-Winkler for record linkage.
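
As a concrete instance of the recommendation case, a brute-force top-k by cosine can be sketched as follows. This assumes NumPy and small, in-memory embeddings; `top_k_cosine` and the toy vectors are hypothetical and not taken from any library.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, items: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Rank rows of `items` by cosine similarity to `query` (brute force)."""
    q = query / np.linalg.norm(query)
    m = items / np.linalg.norm(items, axis=1, keepdims=True)
    scores = m @ q                       # one cosine score per item
    order = np.argsort(-scores)[:k]      # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order]


# Toy 3-dimensional item embeddings (illustrative values only).
items = np.array([
    [1.0, 0.0, 0.0],   # item 0
    [0.9, 0.1, 0.0],   # item 1
    [0.0, 1.0, 0.0],   # item 2
    [0.0, 0.0, 1.0],   # item 3
])
user = np.array([2.0, 0.1, 0.0])

# Item 0 ranks first, then item 1; the user's vector length does not matter.
print(top_k_cosine(user, items, k=2))
```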
## 💻 Patterns
### Cosine similarity (numpy)
```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Small epsilon guards against division by zero for all-zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Row-normalize both matrices; one matmul then yields all pairwise cosines.
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A_n @ B_n.T
```
### Sentence embedding + FAISS (2026 stack)
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["alpha doc", "beta doc", "gamma doc"]
# normalize_embeddings=True returns unit vectors, so inner product equals cosine.
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

# Pass the metric at construction; assigning index.metric_type afterwards
# is not expected to change the metric of the underlying flat storage.
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(emb)

q = model.encode(["alpha"], normalize_embeddings=True).astype("float32")
D, I = index.search(q, k=3)  # D: similarity scores, I: row indices into docs
```
### Jaccard via MinHash (datasketch)
```python
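from datasketch import MinHash, MinHashLSH


def mh(tokens, num_perm=128):
    # Build a MinHash signature from an iterable of string tokens.
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m


# threshold is the approximate Jaccard similarity a candidate must reach.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("doc1", mh("the quick brown fox".split()))
lsh.insert("doc2", mh("the quick brown dog".split()))
# The query shares 4 of 5 tokens with doc1 (Jaccard 0.8) and 3 of 6 with doc2 (0.5).
print(lsh.query(mh("the quick brown fox jumps".split())))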
```
### Levenshtein (rapidfuzz)
```python
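from rapidfuzz.distance import Levenshtein
from rapidfuzz import fuzz, process

print(Levenshtein.distance("kitten", "sitting"))  # 3
# fuzz.ratio is normalized Indel similarity: 2 * LCS / (len1 + len2) * 100
print(fuzz.ratio("apple inc.", "apple, inc"))     # 90.0
choices = ["Acme Corp", "Apple Inc.", "Microsoft"]
# extractOne returns (best_match, score, index). In rapidfuzz 3.x no
# preprocessing is applied by default, so matching is case-sensitive.
print(process.extractOne("aple", choices, scorer=fuzz.ratio))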
```
### Euclidean vs cosine (when it matters)
```python
import numpy as np

# Cosine compares angle only; magnitude is ignored.
a = np.array([1.0, 0.0]); b = np.array([10.0, 0.0])
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (identical direction)
dist = np.linalg.norm(a - b)                                   # 9.0 (very different magnitude)
```
### Hybrid retrieval (BM25 + dense)
```python
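# Sparse + dense fusion, the default in modern RAG.
# Reuses `docs`, `emb`, and `q` from the FAISS example above.
from rank_bm25 import BM25Okapi
import numpy as np

tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores("alpha doc".split())
dense_scores = (emb @ q.T).flatten()


# Reciprocal Rank Fusion: each ranking is a list of doc ids, best first.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # RRF ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])


# Turn each score array into a ranking of doc indices, then fuse.
fused = rrf([np.argsort(-sparse_scores).tolist(), np.argsort(-dense_scores).tolist()])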
```
### Tanimoto (binary fingerprint)
```python
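import numpy as np


def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    # Expects binary fingerprints; cast so & and | act as logical AND / OR.
    a, b = a.astype(bool), b.astype(bool)
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return float(inter / union) if union else 0.0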
```
## Decision Criteria
| Situation | Approach |
|---|---|
| Dense embedding | Cosine (or normalized dot) |
| K-means / GMM | Euclidean |
| Token / set overlap | Jaccard |
| String fuzzy match | Levenshtein / Jaro-Winkler |
| Binary fingerprint | Hamming / Tanimoto |
| Large-scale ANN | HNSW (cosine) or IVF-PQ |
**Default**: normalized embedding + cosine + HNSW.
## 🔗 Graph
- Parent: [[Embeddings]] · [[Vector-Search]]
- Applications: [[Semantic Search|Semantic-Search]] · [[Deduplication]] · [[RAG]]
- Adjacent: [[Sentence-Transformers]] · [[FAISS]]
## 🤖 LLM Usage
**When**: semantic similarity, paraphrase detection, dedup of LLM outputs, eval (semantic equivalence).
**When not**: exact match required, or ordinal / numeric distance; use direct comparison instead.
## ❌ Anti-patterns
- **Unnormalized cosine**: skipping normalization and scoring with a raw dot product introduces magnitude bias.
- **L2 on sparse high-D**: suffers from the curse of dimensionality; cosine is more robust.
- **Single metric**: hybrid (sparse + dense) retrieval gives better recall.
- **Brute force at scale**: with >1M vectors, ANN is required.
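
The magnitude-bias anti-pattern is easy to reproduce. The following is a small sketch with made-up 2-D vectors, assuming NumPy; it is meant only to illustrate how a raw dot product can prefer a long, poorly aligned vector over a short, well-aligned one.

```python
import numpy as np

query = np.array([1.0, 0.0])
doc_vecs = np.array([
    [0.9, 0.1],   # doc 0: short vector, nearly the same direction as the query
    [5.0, 5.0],   # doc 1: long vector, 45 degrees away from the query
])

# Raw dot product: vector length leaks into the score.
raw_dot = doc_vecs @ query

# Cosine: normalize both sides first, so only direction counts.
unit_docs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
cosine = unit_docs @ (query / np.linalg.norm(query))

best_by_dot = int(np.argmax(raw_dot))     # doc 1 wins (5.0 vs 0.9)
best_by_cosine = int(np.argmax(cosine))   # doc 0 wins (about 0.994 vs 0.707)
print(best_by_dot, best_by_cosine)
```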
## 🧪 Verification / Duplicates
- Verified (FAISS docs, sentence-transformers, rapidfuzz).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — full content with metric patterns + hybrid retrieval |