---
id: wiki-2026-0508-similarity-metrics-in-ai
title: Similarity Metrics in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Similarity Measures, Distance Metrics, Vector Similarity]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [similarity, embeddings, retrieval, vector-search]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy/faiss/sentence-transformers
---

# Similarity Metrics in AI

## One-line summary

> **"The choice of similarity metric determines retrieval, clustering, and matching quality."** Cosine dominates semantic search over dense embeddings, Jaccard measures set overlap, and edit distance handles fuzzy string matching. As of 2026, the standard modern stack is normalized cosine plus ANN (HNSW / IVF-PQ).

## Core concepts

### Vector metrics

- **Cosine similarity**: `dot(a,b) / (||a|| * ||b||)`. Magnitude-invariant; the default for embeddings.
- **Dot product**: equivalent to cosine when embeddings are L2-normalized, and faster because it skips the division.
- **Euclidean (L2)**: raw straight-line distance. Used for cluster centroids and k-means.
- **Manhattan (L1)**: more robust to outliers than L2. Used for sparse features.

### Set / String metrics

- **Jaccard**: `|A ∩ B| / |A ∪ B|`. Set or token overlap.
- **Levenshtein (edit distance)**: character-level fuzzy matching.
- **Hamming**: distance between fixed-length binary vectors or hashes.
- **Tanimoto**: fingerprint similarity, common in chemistry.

### Applications

1. **Semantic search**: sentence-transformer embeddings + cosine + FAISS HNSW.
2. **Deduplication**: near-duplicate detection with MinHash + Jaccard.
3. **Recommendation**: cosine over user/item embeddings.
4. **Fuzzy matching**: record linkage with Levenshtein / Jaro-Winkler.

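The definitions above can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper names (`l2_normalize`, `jaccard`, `hamming`, `manhattan`) and the toy inputs are assumptions made for illustration. It shows that the dot product of L2-normalized vectors matches cosine, that squared L2 distance between unit vectors equals `2 - 2 * cosine` (which is why both give the same ranking on normalized embeddings), and how exact Jaccard, Hamming, and Manhattan behave on small inputs.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (hypothetical helper)."""
    return x / (np.linalg.norm(x) + 1e-12)

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Count positions where two equal-length arrays differ."""
    if a.shape != b.shape:
        raise ValueError("Hamming distance needs equal-length inputs")
    return int(np.count_nonzero(a != b))

def manhattan(a: np.ndarray, b: np.ndarray) -> float:
    """L1 distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

# Toy vectors chosen so the arithmetic is easy to follow by hand
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Cosine on the raw vectors
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain dot product after normalization gives the same value
a_n, b_n = l2_normalize(a), l2_normalize(b)
dot_normalized = float(np.dot(a_n, b_n))

# Squared L2 between unit vectors equals 2 - 2 * cosine
sq_l2 = float(np.sum((a_n - b_n) ** 2))

jac = jaccard(set("the quick brown fox".split()), set("the quick brown dog".split()))
ham = hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1]))
man = manhattan(a, b)
```

With these inputs, `cos` and `dot_normalized` agree up to floating-point error, and `sq_l2` agrees with `2 - 2 * cos`, so an index configured for either inner product or L2 returns the same neighbor ordering once embeddings are normalized.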
## 💻 Patterns

### Cosine similarity (numpy)

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Small epsilon guards against division by zero for all-zero vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Row-normalize both matrices, then one matmul gives all pairwise cosines
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A_n @ B_n.T
```

### Sentence embedding + FAISS (2026 stack)

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["alpha doc", "beta doc", "gamma doc"]
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

# Pass the metric to the constructor; assigning index.metric_type afterwards
# does not change the underlying flat storage, which defaults to L2.
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)  # cosine via normalized vectors
index.add(emb)

q = model.encode(["alpha"], normalize_embeddings=True).astype("float32")
D, I = index.search(q, 3)  # D: inner-product scores, I: document indices
```

### Jaccard via MinHash (datasketch)

```python
from datasketch import MinHash, MinHashLSH

def mh(tokens, num_perm=128):
    # Build a MinHash signature from an iterable of string tokens
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

# LSH returns candidates whose estimated Jaccard is around the threshold or higher;
# results are probabilistic, so borderline documents may or may not appear.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("doc1", mh("the quick brown fox".split()))
lsh.insert("doc2", mh("the quick brown dog".split()))
print(lsh.query(mh("the quick brown fox jumps".split())))
```

### Levenshtein (rapidfuzz)

```python
from rapidfuzz.distance import Levenshtein
from rapidfuzz import fuzz, process

print(Levenshtein.distance("kitten", "sitting"))  # 3
print(fuzz.ratio("apple inc.", "apple, inc"))     # 90.0

choices = ["Acme Corp", "Apple Inc.", "Microsoft"]
# Returns a (match, score, index) tuple for the best-scoring choice
print(process.extractOne("aple", choices, scorer=fuzz.ratio))
```

### Euclidean vs cosine (when it matters)

```python
import numpy as np

# Cosine compares direction only; magnitude is ignored
a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (identical direction)
l2 = np.linalg.norm(a - b)                             # 9.0 (very different magnitude)
```

### Hybrid retrieval (BM25 + dense)

```python
# Default for modern RAG: sparse + dense fusion.
# Reuses `docs`, `emb`, and `q` from the FAISS example above.
from rank_bm25 import BM25Okapi
import numpy as np

tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores("alpha doc".split())
dense_scores = (emb @ q.T).flatten()

# Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank), rank starting at 1
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Turn each score array into a best-first ranking of document indices, then fuse
sparse_rank = np.argsort(-sparse_scores).tolist()
dense_rank = np.argsort(-dense_scores).tolist()
fused = rrf([sparse_rank, dense_rank])
```

### Tanimoto (binary fingerprint)

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    # Treat inputs as bit vectors so & and | are well-defined for any 0/1 dtype
    a, b = a.astype(bool), b.astype(bool)
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return float(inter / union) if union else 0.0
```

## Decision criteria

| Situation | Approach |
|---|---|
| Dense embedding | Cosine (or normalized dot) |
| K-means / GMM | Euclidean |
| Token / set overlap | Jaccard |
| String fuzzy match | Levenshtein / Jaro-Winkler |
| Binary fingerprint | Hamming / Tanimoto |
| Large-scale ANN | HNSW (cosine) or IVF-PQ |

**Default**: normalized embedding + cosine + HNSW.

## 🔗 Graph

- Parent: [[Embeddings]] · [[Vector-Search]]
- Applications: [[Semantic Search|Semantic-Search]] · [[Deduplication]] · [[RAG]]
- Adjacent: [[Sentence-Transformers]] · [[FAISS]]

## 🤖 LLM usage

**When to use**: semantic similarity, paraphrase detection, dedup of LLM outputs, eval (semantic equivalence).

**When not to use**: exact match required, or ordinal / numeric distance. Use direct comparison instead.

## ❌ Anti-patterns

- **Unnormalized cosine**: treating a raw dot product as cosine without normalizing first introduces magnitude bias.
- **L2 on sparse high-D**: the curse of dimensionality makes distances less informative; cosine is more robust.
- **Single metric**: hybrid retrieval (sparse + dense) gives better recall than either alone.
- **Brute force at scale**: beyond roughly 1M vectors, ANN is required.

## 🧪 Verification / Duplicates

- Verified (FAISS docs, sentence-transformers, rapidfuzz).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with metric patterns + hybrid retrieval |