2nd/10_Wiki/Topics/Computer_Science_and_Theory/Similarity-Metrics.md
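koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 occurrences)
Replaced [[wikilinks]] that differed from their target only in spelling (notation variants) with the target document's canonical title,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the risk of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs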

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-similarity-metrics
title: Similarity Metrics
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Distance Metrics, Similarity Functions
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: machine-learning, retrieval, embeddings, distance
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python, framework NumPy/FAISS

Similarity Metrics

One-line summary

"A measure of how close two vectors are." The idea runs from classical IR (Salton, 1970s) to modern embedding-based retrieval (RAG with Claude Opus 4.7, 2026). Cosine is the default, and task-specific tuning is critical.

Key points

Main metrics

  • Cosine: cos(a,b) = (a·b)/(||a|| ||b||) ∈ [-1, 1]. Direction only, scale-invariant.
  • Dot product: a·b. Scale-sensitive — large norm dominates.
  • Euclidean (L2): ||a-b||₂. Geometric distance, sensitive to magnitude.
  • Manhattan (L1): Σ|aᵢ-bᵢ|. Less sensitive to outliers than L2, because differences are not squared.
  • Jaccard: |A∩B|/|A∪B|. Set similarity.
  • Hamming: count of differing positions. Binary vectors.
  • Edit (Levenshtein): min insertions/deletions/substitutions. Strings.
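
Manhattan and Hamming distance are defined in the list above but have no pattern further down this page, so here is a minimal pure-Python sketch of both. The function names `manhattan` and `hamming` are illustrative and not taken from any library.

```python
def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(manhattan([1, 2, 3], [4, 0, 3]))      # 3 + 2 + 0 = 5
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # positions 1 and 2 differ: 2
```

For packed binary fingerprints, the usual trick is XOR followed by a bit count, which gives the same number as the position-by-position loop.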

Key relationships

  • Normalized vectors (||v||=1): cosine = dot product = 1 - L2²/2.
  • Modern embedding models (OpenAI text-embedding-3, voyage-3, BGE-M3) return normalized output, so cosine ≡ dot.
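
The identity for unit vectors follows from expanding the squared distance: ||a-b||² = ||a||² + ||b||² - 2(a·b) = 2 - 2(a·b), so a·b = 1 - ||a-b||²/2. A quick numerical check, using random vectors that are an assumption of this sketch rather than real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Scale both vectors to unit length.
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

dot = float(a @ b)
cosine = dot / float(np.linalg.norm(a) * np.linalg.norm(b))
from_l2 = 1.0 - float(np.sum((a - b) ** 2)) / 2.0

# All three agree up to floating-point error.
print(np.isclose(dot, cosine), np.isclose(dot, from_l2))
```

One practical consequence is that, on unit vectors, ranking by largest inner product and ranking by smallest L2 distance return the same order.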

Applications

  1. RAG retrieval (text embeddings + cosine).
  2. Image search (CLIP embeddings).
  3. Recommender systems (item-item).
  4. Deduplication (near-duplicate detection).
  5. Clustering (k-means uses Euclidean).

💻 Patterns

Cosine similarity

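import numpy as np

def cosine(a, b):
    # Cosine similarity of two 1-D vectors; returns 0.0 by convention if either has zero norm
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Batch: rows of A against rows of B, result shape (n_A, n_B)
def cosine_matrix(A, B, eps=1e-12):
    # eps keeps all-zero rows from dividing by zero
    A_norm = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), eps)
    B_norm = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), eps)
    return A_norm @ B_norm.T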

Sklearn pairwise

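import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

X = np.random.rand(5, 16)  # placeholder data, shape (n_X, d)
Y = np.random.rand(3, 16)  # placeholder data, shape (n_Y, d)
sim = cosine_similarity(X, Y)     # similarities, shape (n_X, n_Y)
dist = euclidean_distances(X, Y)  # distances, shape (n_X, n_Y)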

FAISS (large-scale)

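import faiss
import numpy as np

# Generate float32 directly; randn() would first allocate a float64 array of about 6 GB
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1_000_000, 768), dtype=np.float32)  # about 3 GB
faiss.normalize_L2(embeddings)  # in place; for cosine via inner product

index = faiss.IndexFlatIP(768)  # exact search; inner product = cosine on normalized
index.add(embeddings)

query = rng.standard_normal((1, 768), dtype=np.float32)
faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # D: top-10 scores, I: their row indices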
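
For a corpus small enough to fit in memory, or to sanity-check what an index returns, the same exact inner-product search can be sketched in plain NumPy. This is an illustrative brute-force equivalent, not a description of FAISS internals; `top_k_inner_product` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def top_k_inner_product(corpus, query, k):
    """Exact top-k by inner product; returns (scores, indices), best first."""
    scores = corpus @ query                  # shape (n,)
    k = min(k, scores.shape[0])
    # argpartition selects the k largest without a full sort;
    # only those k entries are then sorted.
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return scores[idx], idx


rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64)).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-norm rows

query = corpus[42].copy()  # a query identical to document 42
scores, idx = top_k_inner_product(corpus, query, k=5)
print(idx[0], scores[0])   # document 42 ranks first with a score close to 1
```

Comparing these indices against the `I` array from a flat index on the same data is a cheap way to confirm that normalization was applied on both sides.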

RAG with embeddings (2026)

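import numpy as np
import voyageai
from anthropic import Anthropic

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
client = Anthropic()    # for the generation step, not shown here

corpus = ["first document text", "second document text"]  # placeholder documents
query = "example question"                                # placeholder query

def embed(texts):
    r = vo.embed(texts, model="voyage-3", input_type="document")
    return np.array(r.embeddings)

docs_emb = embed(corpus)
q_emb = vo.embed([query], model="voyage-3", input_type="query").embeddings[0]

scores = docs_emb @ np.array(q_emb)    # cosine, assuming unit-norm embeddings
top_k = np.argsort(scores)[-5:][::-1]  # indices of up to 5 best documents, best first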

Jaccard for sets

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

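# MinHash for approximate Jaccard at scale
from datasketch import MinHash

doc1 = "the quick brown fox jumps"   # placeholder texts
doc2 = "the quick brown dog sleeps"
m1, m2 = MinHash(), MinHash()        # default num_perm=128
for w in doc1.split(): m1.update(w.encode())
for w in doc2.split(): m2.update(w.encode())
print(m1.jaccard(m2))  # estimate of the Jaccard similarity of the two word sets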

Edit distance

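def levenshtein(a, b):
    # Keep b as the shorter string so each DP row has len(b) + 1 entries
    if len(a) < len(b): a, b = b, a
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                curr[-1] + 1,           # insertion
                prev[j] + 1,            # deletion
                prev[j-1] + (ca != cb)  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]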

Hybrid search (BM25 + dense)

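# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank), with rank starting at 1
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# bm25_search and vector_search are placeholders for your own retrievers;
# each should return a list of doc ids, best first.
bm25_top = bm25_search(query)
dense_top = vector_search(query)
fused = rrf([bm25_top, dense_top])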

Decision criteria

| Situation | Metric |
| --- | --- |
| Text embeddings (LLM, BERT) | Cosine |
| Image embeddings (CLIP) | Cosine |
| Tabular numeric features | Euclidean (after StandardScaler) |
| Sparse binary features | Jaccard |
| Strings / typo-tolerant | Levenshtein |
| Hashes / fingerprints | Hamming |
| L1-sparse (Lasso embeddings) | Manhattan |

Default: normalized embeddings + cosine (= dot product on unit vectors).
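
The tabular entry above recommends Euclidean distance only after scaling. The sketch below shows why, on made-up data. The manual z-score is a stand-in for what `StandardScaler` computes (zero mean, unit variance per column), and `nearest_to_first` is a hypothetical helper.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars].
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 52_000.0],
    [40.0, 90_000.0],
    [55.0, 20_000.0],
])


def nearest_to_first(M):
    """Index of the row closest (in L2) to row 0, excluding row 0 itself."""
    d = np.linalg.norm(M[1:] - M[0], axis=1)
    return int(np.argmin(d)) + 1


# Raw features: income differences are in the hundreds or thousands,
# so a 35-year age gap barely registers.
raw_nn = nearest_to_first(X)

# Z-score each column so both features contribute on a comparable scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_nn = nearest_to_first(Z)

print(raw_nn, scaled_nn)  # 1 2
```

On raw values the 60-year-old with a nearly identical income is the nearest neighbor of row 0. After scaling, it is the 26-year-old with a slightly higher income.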

🔗 Graph

🤖 LLM usage

When to use: RAG retrieval ranking, semantic search, deduplication, clustering, nearest-neighbor classification, recommender similarity.
When not to use: hierarchical or structured similarity (use tree edit distance or graph kernels); time-series similarity where sequences may be shifted or stretched in time (use DTW).
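
Since DTW is named as the alternative for time series, here is a minimal sketch of the classic dynamic-programming formulation. It assumes absolute difference as the local cost and no warping-window constraint; `dtw_distance` is an illustrative name, and a dedicated library is usually the better choice in production.

```python
def dtw_distance(s, t):
    """Dynamic time warping cost between two numeric sequences.

    O(len(s) * len(t)) dynamic program with |x - y| as the local cost.
    """
    inf = float("inf")
    n, m = len(s), len(t)
    # cost[i][j] = best alignment cost of s[:i] with t[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # s advances, t[j-1] is reused
                cost[i][j - 1],      # t advances, s[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]  # same shape, delayed by one step
print(dtw_distance(a, b))      # 0.0: warping absorbs the delay
```

A pointwise metric such as L2 cannot compare these two sequences directly, because their lengths differ, and would penalize the shift even after padding.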

Anti-patterns

  • Cosine on un-normalized vectors: if you forget to normalize, an inner-product index returns raw dot products rather than cosine. Call normalize_L2 explicitly.
  • Euclidean without scaling: when features are on different scales, the larger-scale feature dominates.
  • Jaccard on dense vectors: Jaccard is a set similarity and does not apply to dense float vectors.
  • Magnitude-blind cosine for ranking: cosine measures direction only, so it does not capture popularity or confidence.
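
A small illustration of how vector norm changes a ranking, using made-up 2-D vectors: the raw dot product rewards the long vector, while cosine rewards the aligned one. Which behavior is wanted depends on whether magnitude carries meaning in your data.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],    # same direction as the query, norm 1
    [10.0, 10.0],  # 45 degrees off the query, norm about 14.1
])

dot_scores = docs @ query
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(int(np.argmax(dot_scores)))  # 1: the long vector wins on raw dot product
print(int(np.argmax(cos_scores)))  # 0: the aligned vector wins on cosine
```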

🧪 Verification / duplicates

  • Verified (Salton & McGill IR text, MTEB benchmark 2025, FAISS docs).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: similarity metrics with cosine/L2/Jaccard/edit, FAISS, RAG, hybrid |