
---
id: wiki-2026-0508-similarity-metrics-in-ai
title: Similarity Metrics in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases:
  - Similarity Measures
  - Distance Metrics
  - Vector Similarity
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags:
  - similarity
  - embeddings
  - retrieval
  - vector-search
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy/faiss/sentence-transformers
---

Similarity Metrics in AI

In one line

"The choice of similarity metric determines the quality of retrieval, clustering, and matching." Cosine dominates semantic search over dense embeddings, Jaccard measures set overlap, and edit distance handles fuzzy string matching. As of 2026, the standard modern stack is normalized cosine plus ANN (HNSW / IVF-PQ).

Core concepts

Vector metrics

  • Cosine similarity: dot(a,b) / (||a|| * ||b||). Magnitude-invariant; the default for embeddings.
  • Dot product: equivalent to cosine when embeddings are normalized, and faster (no division).
  • Euclidean (L2): raw distance; used for cluster centroids / k-means.
  • Manhattan (L1): more robust to outliers than L2; used for sparse features.
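
The relationships between these four metrics can be checked numerically. The following is a minimal sketch in plain Python (standard library only); the helper names `dot`, `norm`, `normalize`, `cosine`, `l2`, and `l1` are illustrative and do not come from any library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

# On unit-length vectors the dot product equals cosine similarity.
assert math.isclose(dot(a_unit, b_unit), cosine(a, b))

# On unit-length vectors, squared L2 distance is 2 - 2 * cosine,
# so ranking by L2 and ranking by cosine give the same order.
assert math.isclose(l2(a_unit, b_unit) ** 2, 2 - 2 * cosine(a, b))

# Cosine ignores scale; L2 and L1 do not.
a_scaled = [10 * x for x in a]
assert math.isclose(cosine(a_scaled, b), cosine(a, b))
assert l2(a_scaled, b) > l2(a, b) and l1(a_scaled, b) > l1(a, b)
```

The second assertion is why normalized embeddings can be indexed with either an inner-product or an L2 index without changing the nearest-neighbor order.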

Set / String metrics

  • Jaccard: |A ∩ B| / |A ∪ B|. Set / token overlap.
  • Levenshtein (edit distance): character-level fuzzy matching.
  • Hamming: distance between fixed-length binary strings / hashes.
  • Tanimoto: chemistry / fingerprint similarity; on binary fingerprints it coincides with Jaccard.
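
Exact Jaccard and Hamming are small enough to write by hand. The sketch below uses only the standard library; the function names `jaccard`, `hamming`, and `hamming_int` are hypothetical helpers chosen for illustration.

```python
def jaccard(a, b):
    """Exact Jaccard similarity of two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(x, y))

def hamming_int(h1, h2):
    """Hamming distance between two integer hashes (count of differing bits)."""
    return bin(h1 ^ h2).count("1")

# 3 shared tokens out of 5 distinct tokens
print(jaccard("the quick brown fox".split(), "the quick brown dog".split()))  # 0.6
print(hamming("karolin", "kathrin"))  # 3
print(hamming_int(0b1011, 0b1001))    # 1
```

Exact Jaccard compares every pair directly, so it suits small collections; for large corpora an approximation such as MinHash is the usual substitute.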

Applications

  1. Semantic search: sentence-transformer embedding + cosine + FAISS HNSW.
  2. Deduplication: near-duplicate detection with MinHash + Jaccard.
  3. Recommendation: cosine over user/item embeddings.
  4. Fuzzy matching: Levenshtein / Jaro-Winkler for record linkage.

💻 Patterns

Cosine similarity (numpy)

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def cosine_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    A_n = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    B_n = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
    return A_n @ B_n.T

Sentence embedding + FAISS (2026 stack)

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["alpha doc", "beta doc", "gamma doc"]
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

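# Pass the metric to the constructor: assigning index.metric_type afterwards
# does not reconfigure the storage the HNSW graph was built on.
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)  # cosine via normalized vectors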
index.add(emb)

q = model.encode(["alpha"], normalize_embeddings=True).astype("float32")
D, I = index.search(q, k=3)

Jaccard via MinHash (datasketch)

from datasketch import MinHash, MinHashLSH

def mh(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("doc1", mh("the quick brown fox".split()))
lsh.insert("doc2", mh("the quick brown dog".split()))
print(lsh.query(mh("the quick brown fox jumps".split())))

Levenshtein (rapidfuzz)

from rapidfuzz.distance import Levenshtein
from rapidfuzz import fuzz, process

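print(Levenshtein.distance("kitten", "sitting"))  # 3
print(fuzz.ratio("apple inc.", "apple, inc"))      # 90.0 (normalized indel similarity, 0-100)

choices = ["Acme Corp", "Apple Inc.", "Microsoft"]
# extractOne returns (best_match, score, index); matching is case-sensitive by default
print(process.extractOne("aple", choices, scorer=fuzz.ratio))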

Euclidean vs cosine (when it matters)

import numpy as np

# Cosine compares angle only; magnitude is ignored
a = np.array([1.0, 0.0]); b = np.array([10.0, 0.0])
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (identical direction)
l2 = np.linalg.norm(a - b)                             # 9.0 (very different magnitude)

Hybrid retrieval (BM25 + dense)

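# The modern RAG default: sparse + dense fusion
# (reuses docs, emb, and q from the FAISS example above)
from rank_bm25 import BM25Okapi
import numpy as np

tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores("alpha doc".split())
dense_scores = (emb @ q.T).flatten()

# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank), rank starting at 1
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Turn each score vector into a best-first ranking of doc indices, then fuse
sparse_rank = np.argsort(-sparse_scores).tolist()
dense_rank = np.argsort(-dense_scores).tolist()
fused = rrf([sparse_rank, dense_rank])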

Tanimoto (binary fingerprint)

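import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    # Expects equal-length boolean or 0/1 integer fingerprints
    a, b = a.astype(bool), b.astype(bool)
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return float(inter / union) if union else 0.0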

Decision criteria

| Situation | Approach |
| --- | --- |
| Dense embedding | Cosine (or normalized dot) |
| K-means / GMM | Euclidean |
| Token / set overlap | Jaccard |
| String fuzzy match | Levenshtein / Jaro-Winkler |
| Binary fingerprint | Hamming / Tanimoto |
| Large-scale ANN | HNSW (cosine) or IVF-PQ |

Default: normalized embedding + cosine + HNSW.

🔗 Graph

🤖 LLM usage

When to use: semantic similarity, paraphrase detection, deduplication of LLM outputs, evaluation (semantic equivalence). When not to use: an exact match is required, or the distance is ordinal / numeric; use direct comparison instead.

Anti-patterns

  • Unnormalized cosine: skipping normalization and ranking by raw dot product introduces magnitude bias.
  • L2 on sparse high-D: suffers from the curse of dimensionality; cosine is more robust.
  • Single metric: hybrid (sparse + dense) retrieval gives better recall.
  • Brute force at scale: beyond roughly 1M vectors, ANN is required.
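
The magnitude-bias anti-pattern is easy to reproduce. The toy vectors below are made up for illustration (they are not real embeddings), and `dot` and `cosine` are hypothetical helpers written in plain Python.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
docs = {
    "on_topic_short": [1.0, 0.1],  # almost the same direction as the query, small norm
    "off_topic_long": [5.0, 5.0],  # 45 degrees away from the query, large norm
}

# Raw dot product rewards the long vector even though it points elsewhere.
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
# Cosine divides the norms out and picks the vector closest in direction.
best_by_cos = max(docs, key=lambda name: cosine(query, docs[name]))

print(best_by_dot)  # off_topic_long
print(best_by_cos)  # on_topic_short
```

Normalizing embeddings once at indexing time removes this bias while keeping the cheaper dot-product scoring.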

🧪 Verification / duplicates

  • Verified (FAISS docs, sentence-transformers, rapidfuzz).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full content with metric patterns + hybrid retrieval |