---
id: wiki-2026-0508-similarity-metrics
title: Similarity Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Distance Metrics, Similarity Functions]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [machine-learning, retrieval, embeddings, distance]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: NumPy/FAISS
---

# Similarity Metrics

## One-line summary

> **"A measure of how close two vectors are."** From classical IR (Salton, 1970s) to modern embedding-based retrieval (RAG with Claude Opus 4.7, 2026): cosine is the default, and task-specific tuning is critical.

## Key points

### Main metrics

- **Cosine**: `cos(a,b) = (a·b)/(||a|| ||b||)` ∈ [-1, 1]. Direction only, scale-invariant.
- **Dot product**: `a·b`. Scale-sensitive: vectors with large norms dominate.
- **Euclidean (L2)**: `||a-b||₂`. Geometric distance, sensitive to magnitude.
- **Manhattan (L1)**: `Σ|aᵢ-bᵢ|`. Less sensitive to outliers than L2.
- **Jaccard**: `|A∩B|/|A∪B|`. Set similarity.
- **Hamming**: count of differing positions. Binary vectors.
- **Edit (Levenshtein)**: minimum number of insertions, deletions, and substitutions. Strings.

### Key relationships

- For normalized vectors (||v||=1): cosine = dot product = `1 - L2²/2`.
- Modern embedding models (OpenAI text-embedding-3, voyage-3, BGE-M3) typically return normalized output, so cosine ≡ dot product.

### Applications

1. RAG retrieval (text embeddings + cosine).
2. Image search (CLIP embeddings).
3. Recommender systems (item-item).
4. Deduplication (near-duplicate detection).
5. Clustering (k-means uses Euclidean).

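
The unit-vector identity noted under the key relationships (cosine = dot product = `1 - L2²/2`) can be checked numerically. The following is a minimal sketch, not part of the original note: the random vectors, the seed, and the helper name `cosine` are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two non-zero 1-D vectors (illustrative helper)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
a = rng.standard_normal(8)
b = rng.standard_normal(8)

# Normalize to unit length so that ||a|| = ||b|| = 1.
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cos_ab = cosine(a, b)
dot_ab = np.dot(a, b)
l2_ab = np.linalg.norm(a - b)
from_l2 = 1.0 - l2_ab**2 / 2.0

# On unit vectors all three quantities agree up to floating-point error.
assert np.isclose(cos_ab, dot_ab)
assert np.isclose(dot_ab, from_l2)

# Scaling one vector leaves cosine unchanged but scales the dot product.
assert np.isclose(cosine(a, 3 * b), cos_ab)
assert np.isclose(np.dot(a, 3 * b), 3 * dot_ab)
```

Because `1 - L2²/2` decreases monotonically as L2 grows, ranking unit vectors by smallest Euclidean distance gives the same order as ranking them by largest cosine or dot product.
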
## 💻 Patterns

### Cosine similarity

```python
import numpy as np

def cosine(a, b):
    # Undefined (division by zero) if either vector has zero norm.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Batch: row-normalize both matrices, then one matmul yields all pairs.
def cosine_matrix(A, B):
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T  # shape (n_A, n_B)
```

### Sklearn pairwise

```python
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# X: (n_X, d) and Y: (n_Y, d) arrays, assumed to be defined by the caller.
sim = cosine_similarity(X, Y)    # (n_X, n_Y)
dist = euclidean_distances(X, Y)  # (n_X, n_Y)
```

### FAISS (large-scale)

```python
import faiss
import numpy as np

embeddings = np.random.randn(1_000_000, 768).astype('float32')
faiss.normalize_L2(embeddings)  # in place; for cosine via inner product

index = faiss.IndexFlatIP(768)  # inner product = cosine on normalized
index.add(embeddings)

query = np.random.randn(1, 768).astype('float32')
faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # top-10: D = scores, I = row indices
```

### RAG with embeddings (2026)

```python
import numpy as np
import voyageai
from anthropic import Anthropic

vo = voyageai.Client()
client = Anthropic()  # for the generation step (not shown here)

def embed(texts):
    r = vo.embed(texts, model="voyage-3", input_type="document")
    return np.array(r.embeddings)

# `corpus` (list of str) and `query` (str) are assumed to be defined by the caller.
docs_emb = embed(corpus)
q_emb = vo.embed([query], model="voyage-3", input_type="query").embeddings[0]

scores = docs_emb @ np.array(q_emb)  # cosine on normalized
top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 best matches, best first
```

### Jaccard for sets

```python
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

# MinHash for approximate Jaccard at scale
from datasketch import MinHash

# `doc1` and `doc2` (str) are assumed to be defined by the caller.
m1, m2 = MinHash(), MinHash()
for w in doc1.split():
    m1.update(w.encode())
for w in doc2.split():
    m2.update(w.encode())
print(m1.jaccard(m2))  # estimated Jaccard similarity
```

### Edit distance

```python
def levenshtein(a, b):
    # Keep the shorter string as `b` so each DP row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                curr[-1] + 1,             # insertion
                prev[j] + 1,              # deletion
                prev[j-1] + (ca != cb)    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

### Hybrid search (BM25 + dense)

```python
# Reciprocal Rank Fusion
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based, so the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# `bm25_search` and `vector_search` are placeholders; each is assumed to
# return a list of doc ids ordered best first.
bm25_top = bm25_search(query)
dense_top = vector_search(query)
fused = rrf([bm25_top, dense_top])
```

## Decision criteria

| Situation | Metric |
|---|---|
| Text embeddings (LLM, BERT) | Cosine |
| Image embeddings (CLIP) | Cosine |
| Tabular numeric features | Euclidean (after StandardScaler) |
| Sparse binary features | Jaccard |
| Strings / typo-tolerant | Levenshtein |
| Hashes / fingerprints | Hamming |
| L1-sparse (Lasso embeddings) | Manhattan |

**Default**: normalized embeddings + cosine (= dot product on unit vectors).

## 🔗 Graph

- Parent: [[Information Retrieval]] · [[Embeddings]]
- Applications: [[RAG]] · [[Recommender-Systems]] · [[Clustering]]
- Adjacent: [[FAISS]] · [[BM25]] · [[Locality-Sensitive-Hashing (LSH)|Locality-Sensitive-Hashing]]

## 🤖 LLM usage

**When**: RAG retrieval ranking. Semantic search. Deduplication. Clustering. Nearest-neighbor classification. Recommender similarity.

**When not**: Hierarchical / structured similarity (use tree edit distance, graph kernels). Time-series similarity under temporal misalignment (use DTW).

## ❌ Anti-patterns

- **Cosine on un-normalized vectors**: treating inner product as cosine after forgetting to normalize. Call `normalize_L2` explicitly.
- **Euclidean without scaling**: when features are on different scales, the larger-scale features dominate the distance.
- **Jaccard on dense vectors**: Jaccard is a set similarity and does not apply to dense float vectors.
- **Magnitude-blind cosine for ranking**: cosine captures direction only, so it does not capture popularity or confidence.

## 🧪 Verification / duplicates

- Verified (Salton & McGill IR text, MTEB benchmark 2025, FAISS docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: similarity metrics with cosine/L2/Jaccard/edit, FAISS, RAG, hybrid |