---
id: wiki-2026-0508-similarity-metrics-in-ai
title: Similarity Metrics in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Similarity Measures, Distance Metrics, Vector Similarity]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [similarity, embeddings, retrieval, vector-search]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy/faiss/sentence-transformers
---

# Similarity Metrics in AI

## One-line summary

> **"The choice of similarity metric determines the quality of retrieval, clustering, and matching."** Cosine dominates semantic search over dense embeddings, Jaccard measures set overlap, and edit distance handles fuzzy string matching. As of 2026, the standard modern stack is normalized cosine plus ANN (HNSW or IVF-PQ).

## Core concepts

### Vector metrics

- **Cosine similarity**: `dot(a,b) / (||a|| * ||b||)`. Magnitude-invariant; the default for embeddings.
- **Dot product**: equivalent to cosine when the embeddings are L2-normalized, and faster because it skips the division.
- **Euclidean (L2)**: straight-line distance, sensitive to magnitude. Used for cluster centroids and k-means.
- **Manhattan (L1)**: less sensitive to outliers than L2. Often used with sparse features.

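To make the four vector metrics concrete, the sketch below evaluates them on one small pair of vectors. The vectors and the `l2_normalize` helper are illustrative assumptions, not part of any library.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    return v / np.linalg.norm(v)

# Toy vectors chosen so the arithmetic is easy to check by hand.
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalized = float(np.dot(l2_normalize(a), l2_normalize(b)))
euclidean = float(np.linalg.norm(a - b))   # L2 distance
manhattan = float(np.sum(np.abs(a - b)))   # L1 distance

print(cosine, dot_normalized, euclidean, manhattan)
```

Here `cosine` and `dot_normalized` agree (both 0.96), which is why a plain dot product is enough once embeddings are normalized.
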
### Set / String metrics

- **Jaccard**: `|A ∩ B| / |A ∪ B|`. Set or token overlap.
- **Levenshtein (edit distance)**: character-level fuzzy matching.
- **Hamming**: distance between fixed-length binary strings or hashes.
- **Tanimoto**: fingerprint similarity in chemistry; on binary vectors it coincides with Jaccard.

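The set and string metrics listed above are short enough to write out with the standard library alone. This is a minimal sketch for intuition, not a replacement for an optimized library; treating two empty sets as identical (score 1.0) is an assumption of this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(cx != cy for cx, cy in zip(x, y))

def levenshtein(s: str, t: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

print(jaccard({"the", "quick", "brown", "fox"}, {"the", "quick", "brown", "dog"}))
print(hamming("10110", "10011"))
print(levenshtein("kitten", "sitting"))
```

For a binary fingerprint stored as the set of its "on" bit positions, `jaccard` gives the Tanimoto score directly.
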
### Applications

1. **Semantic search**: sentence-transformer embeddings + cosine + FAISS HNSW.
2. **Deduplication**: near-duplicate detection with MinHash + Jaccard.
3. **Recommendation**: cosine between user and item embeddings.
4. **Fuzzy matching**: Levenshtein or Jaro-Winkler for record linkage.

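As an illustration of the recommendation case, the sketch below ranks items by cosine similarity to a user embedding. The item names, the 3-dimensional vectors, and the `top_k_cosine` helper are all hypothetical; real embeddings would come from a trained model.

```python
import numpy as np

# Hypothetical toy embeddings, for illustration only.
item_names = ["sci-fi novel", "space documentary", "cookbook"]
items = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.3, 0.1],
                  [0.0, 0.2, 0.9]])
user = np.array([1.0, 0.2, 0.0])

def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int) -> list[int]:
    """Return row indices of the k rows most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                       # cosine, since both sides are unit length
    return np.argsort(-scores)[:k].tolist()

recommended = [item_names[i] for i in top_k_cosine(user, items, k=2)]
print(recommended)
```
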
## 💻 Patterns

### Cosine similarity (numpy)

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # The epsilon guards against division by zero for all-zero vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Row-normalize both matrices; one matmul then yields all pairwise cosines.
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A_n @ B_n.T
```

### Sentence embedding + FAISS (2026 stack)

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["alpha doc", "beta doc", "gamma doc"]
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

# Pass the metric to the constructor: inner product on unit vectors equals cosine.
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(emb)

q = model.encode(["alpha"], normalize_embeddings=True).astype("float32")
D, I = index.search(q, k=3)  # D: similarity scores, I: row indices into docs
```

### Jaccard via MinHash (datasketch)

```python
from datasketch import MinHash, MinHashLSH

def mh(tokens, num_perm=128):
    # Build a MinHash signature from an iterable of string tokens.
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

# The query returns keys whose estimated Jaccard is likely >= threshold.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("doc1", mh("the quick brown fox".split()))
lsh.insert("doc2", mh("the quick brown dog".split()))
print(lsh.query(mh("the quick brown fox jumps".split())))
```

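To see why a MinHash signature approximates Jaccard, the standard-library sketch below builds signatures by hand: for each of `num_perm` salted hash functions it keeps the minimum hash over the token set, and the fraction of positions where two signatures agree estimates the Jaccard score. The `minhash_signature` and `estimate_jaccard` helpers are hypothetical names, and salted BLAKE2b stands in for the random permutations; this is not how datasketch is implemented internally.

```python
import hashlib

def minhash_signature(tokens, num_perm=256):
    """One minimum hash per salted hash function (assumed stand-in for permutations)."""
    signature = []
    for i in range(num_perm):
        salt = i.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest()
            for t in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions where the two minima coincide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox".split())
b = set("the quick brown dog".split())

exact = len(a & b) / len(a | b)
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(exact, estimate)
```

The estimate is noisy; more hash functions shrink the error at the cost of longer signatures.
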
### Levenshtein (rapidfuzz)

```python
from rapidfuzz.distance import Levenshtein
from rapidfuzz import fuzz, process

print(Levenshtein.distance("kitten", "sitting"))  # 3
print(fuzz.ratio("apple inc.", "apple, inc"))     # 90.0

choices = ["Acme Corp", "Apple Inc.", "Microsoft"]
# extractOne returns a (match, score, index) tuple for the best-scoring choice.
print(process.extractOne("aple", choices, scorer=fuzz.ratio))
```

### Euclidean vs cosine (when it matters)

```python
import numpy as np

# Cosine compares direction only; Euclidean also sees magnitude.
a = np.array([1.0, 0.0]); b = np.array([10.0, 0.0])
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
l2 = np.linalg.norm(a - b)
# cos = 1.0 (identical direction)
# l2  = 9.0 (very different magnitude)
```

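Once both vectors are L2-normalized the two metrics stop disagreeing: for unit vectors, the squared Euclidean distance equals `2 - 2 * cosine`, so ranking by L2 distance and ranking by cosine give the same order. The sketch below checks that identity numerically on random vectors; the seed and dimensionality are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 8))

# Normalize in place so both vectors have unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos = float(a @ b)
sq_l2 = float(np.sum((a - b) ** 2))

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos for unit vectors.
identity_holds = bool(np.isclose(sq_l2, 2.0 - 2.0 * cos))
print(identity_holds)
```
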
### Hybrid retrieval (BM25 + dense)

```python
# Default for modern RAG: sparse + dense fusion.
# Reuses docs, emb, and q from the FAISS example above.
from rank_bm25 import BM25Okapi
import numpy as np

tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores("alpha doc".split())
dense_scores = (emb @ q.T).flatten()

# Reciprocal Rank Fusion: each ranking is a list of doc ids, best first.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Turn each score array into a ranking (indices by descending score), then fuse.
fused = rrf([np.argsort(-sparse_scores).tolist(), np.argsort(-dense_scores).tolist()])
```

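Reciprocal Rank Fusion works on ranks rather than raw scores, which sidesteps the fact that BM25 scores and cosine scores live on different scales. The self-contained sketch below traces it on two toy rankings; the document ids and the `rrf_fuse` name are hypothetical, and ranks are taken as 1-based.

```python
def rrf_fuse(rankings, k=60):
    """Sum 1 / (k + rank) over all rankings; rank is 1-based, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

sparse_ranking = ["d1", "d2", "d3"]   # e.g. from BM25
dense_ranking = ["d2", "d3", "d1"]    # e.g. from embedding cosine

fused = rrf_fuse([sparse_ranking, dense_ranking])
fused_order = [doc_id for doc_id, _ in fused]
print(fused_order)
```

`d2` comes out on top because it sits near the head of both lists (1/62 + 1/61), while `d1` is first in one list but last in the other (1/61 + 1/63).
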
### Tanimoto (binary fingerprint)

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    # Expects boolean (or 0/1 integer) fingerprint arrays of equal length.
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return float(inter / union) if union else 0.0
```

## Decision criteria

| Situation | Approach |
|---|---|
| Dense embedding | Cosine (or normalized dot) |
| K-means / GMM | Euclidean |
| Token / set overlap | Jaccard |
| String fuzzy match | Levenshtein / Jaro-Winkler |
| Binary fingerprint | Hamming / Tanimoto |
| Large-scale ANN | HNSW (cosine) or IVF-PQ |

**Default**: normalized embedding + cosine + HNSW.

## 🔗 Graph

- Parent: [[Embeddings]] · [[Vector-Search]]
- Applications: [[Semantic Search|Semantic-Search]] · [[Deduplication]] · [[RAG]]
- Adjacent: [[Sentence-Transformers]] · [[FAISS]]

## 🤖 LLM usage

**When to use**: semantic similarity, paraphrase detection, deduplication of LLM outputs, evaluation (semantic equivalence).
**When not to use**: exact match required, or ordinal / numeric distance. Use direct comparison instead.

## ❌ Anti-patterns

- **Unnormalized cosine**: using a raw dot product as "cosine" without normalizing first lets vector magnitude bias the scores.
- **L2 on sparse high-D**: distances concentrate in high dimensions (curse of dimensionality); cosine is more robust.
- **Single metric**: hybrid (sparse + dense) retrieval gives better recall than either alone.
- **Brute force at scale**: beyond about 1M vectors, ANN is required.

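The magnitude bias from skipping normalization is easy to reproduce. In the sketch below, the document names and vectors are made up for illustration: one document points exactly along the query, the other is 45 degrees off but ten times longer.

```python
import numpy as np

query = np.array([1.0, 1.0])
docs = {
    "aligned_short": np.array([1.0, 1.0]),    # same direction as the query
    "off_axis_long": np.array([10.0, 0.0]),   # 45 degrees off, but 10x longer
}

def raw_dot(q: np.ndarray, d: np.ndarray) -> float:
    return float(q @ d)

def cosine(q: np.ndarray, d: np.ndarray) -> float:
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def rank(score_fn) -> list[str]:
    """Document names sorted by descending score under score_fn."""
    return sorted(docs, key=lambda name: -score_fn(query, docs[name]))

by_dot = rank(raw_dot)
by_cosine = rank(cosine)
print(by_dot, by_cosine)
```

The raw dot product puts the long, misaligned vector first (10 vs 2), while cosine puts the aligned one first (1.0 vs about 0.71).
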
## 🧪 Verification / Duplicates

- Verified (FAISS docs, sentence-transformers, rapidfuzz).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with metric patterns + hybrid retrieval |