"The choice of similarity metric determines retrieval, clustering, and matching quality." Cosine dominates dense-embedding semantic search, Jaccard measures set overlap, and edit distance handles fuzzy string matching. As of 2026, the standard modern stack is normalized cosine plus ANN (HNSW / IVF-PQ).
Key points
Vector metrics
Cosine similarity: dot(a,b) / (||a|| * ||b||). Magnitude-invariant; the default for embeddings.
Dot product: equivalent to cosine when the embeddings are L2-normalized, and cheaper to compute (no division by the norms).
Euclidean (L2): raw straight-line distance. Used for cluster centroids and k-means.
Manhattan (L1): more robust to outliers than L2. Used for sparse features.
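The four vector metrics listed above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production implementation: the helper names (`dot`, `norm`, `l2_normalize`, and so on) and the sample vectors are assumptions made for this example, and a real pipeline would normally use NumPy or the vector index's built-in metric.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """dot(a, b) / (||a|| * ||b||): depends only on the angle between a and b."""
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

def euclidean_distance(a, b):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """L1 distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Illustrative vectors (hypothetical values chosen for easy arithmetic)
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_ab = cosine_similarity(a, b)                   # 24 / (5 * 5)
dot_unit = dot(l2_normalize(a), l2_normalize(b))   # same value as cos_ab
l2_ab = euclidean_distance(a, b)                   # sqrt(2)
l1_ab = manhattan_distance(a, b)                   # 2.0
```

Comparing `cos_ab` with `dot_unit` shows why a plain inner product is enough once the vectors are normalized: the division by the norms has already been done.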
Set / String metrics
Jaccard: |A ∩ B| / |A ∪ B|. Measures set or token overlap.
Levenshtein (edit distance): character-level fuzzy matching.
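Both set and string metrics are small enough to write out directly. The sketch below is an exact, unoptimized version meant only to show the definitions; the function names and the choice to return 1.0 for two empty sets are assumptions of this example, and large corpora would typically use an approximate method such as MinHash instead.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))          # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,                # delete cs
                curr[j - 1] + 1,            # insert ct
                prev[j - 1] + cost,         # substitute or match
            ))
        prev = curr
    return prev[-1]

tokens_a = set("the quick brown fox".split())
tokens_b = set("the quick brown dog".split())

overlap = jaccard(tokens_a, tokens_b)        # 3 shared tokens out of 5 distinct
edits = levenshtein("kitten", "sitting")     # 3
```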
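```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
docs = ["alpha doc", "beta doc", "gamma doc"]

# L2-normalize so that inner product equals cosine similarity
emb = model.encode(docs, normalize_embeddings=True).astype("float32")

# Pass the metric to the constructor; assigning index.metric_type after
# construction may not affect the storage that was already built
index = faiss.IndexHNSWFlat(emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
index.add(emb)

q = model.encode(["alpha"], normalize_embeddings=True).astype("float32")
D, I = index.search(q, 3)  # D: inner-product scores, I: indices into docs
```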
Jaccard via MinHash (datasketch)
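```python
from datasketch import MinHash, MinHashLSH

def mh(tokens, num_perm=128):
    # Build a MinHash signature from an iterable of string tokens
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

# threshold is the approximate Jaccard cutoff for a match
lsh = MinHashLSH(threshold=0.7, num_perm=128)
lsh.insert("doc1", mh("the quick brown fox".split()))
lsh.insert("doc2", mh("the quick brown dog".split()))

# Prints the keys of indexed documents that are likely above the threshold
print(lsh.query(mh("the quick brown fox jumps".split())))
```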
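```python
# Cosine looks at the angle only; magnitude is ignored
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0 (identical direction)
euclidean = np.linalg.norm(a - b)                          # 9.0 (very different magnitude)
```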
Hybrid retrieval (BM25 + dense)
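```python
# The default in modern RAG: sparse + dense fusion
# Reuses docs, emb, and q from the FAISS example above
from rank_bm25 import BM25Okapi
import numpy as np

tokenized = [d.split() for d in docs]
bm25 = BM25Okapi(tokenized)
sparse_scores = bm25.get_scores("alpha doc".split())
dense_scores = (emb @ q.T).flatten()

# Reciprocal Rank Fusion: each ranking is a list of doc ids, best first
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# Turn each score vector into a ranking of doc indices, then fuse
sparse_rank = np.argsort(-sparse_scores).tolist()
dense_rank = np.argsort(-dense_scores).tolist()
fused = rrf([sparse_rank, dense_rank])
```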
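To see what the fusion step does without loading a model or building an index, the following dependency-free sketch runs Reciprocal Rank Fusion on two toy rankings. The document ids and both rankings are hypothetical, `k=60` is taken from the snippet above, and ranks are counted from 1 here, which is the usual convention and an assumption of this sketch.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every ranking,
    with ranks starting at 1 for the best item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical rankings, best first
sparse_ranking = ["d1", "d2", "d3"]   # e.g. ordered by BM25 score
dense_ranking = ["d3", "d1", "d2"]    # e.g. ordered by embedding similarity

fused = rrf([sparse_ranking, dense_ranking])
fused_order = [doc_id for doc_id, _ in fused]   # ['d1', 'd3', 'd2']
```

`d1` wins because it sits near the top of both lists, while `d3` is first in one list but last in the other. Only ranks enter the formula, so the BM25 and embedding scores never need to be put on a common scale.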
When to use: semantic similarity, paraphrase detection, deduplication of LLM outputs, evaluation (semantic equivalence).
When not to use: an exact match is required, or the distance is ordinal / numeric. Use direct comparison instead.
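❌ Anti-patterns
Unnormalized cosine: using inner product as cosine without L2-normalizing the embeddings first introduces magnitude bias.
L2 on sparse high-dimensional data: distances suffer from the curse of dimensionality; cosine is more robust.
Single metric: hybrid (sparse + dense) retrieval typically gives better recall.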
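The first anti-pattern can be reproduced with two toy vectors. In this sketch the vectors and their names are hypothetical, chosen so that a long, off-axis document beats a perfectly aligned short one under raw inner product; it only illustrates the bias and is not a benchmark.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

query = [1.0, 0.0]
docs = {
    "aligned_short": [0.5, 0.0],   # same direction as the query, small norm
    "off_axis_long": [3.0, 4.0],   # different direction, large norm
}

# Raw inner product: the large norm dominates the score
raw_scores = {name: dot(query, vec) for name, vec in docs.items()}

# Inner product on unit vectors: equals cosine similarity
unit_query = l2_normalize(query)
unit_scores = {name: dot(unit_query, l2_normalize(vec)) for name, vec in docs.items()}

best_raw = max(raw_scores, key=raw_scores.get)     # 'off_axis_long'
best_unit = max(unit_scores, key=unit_scores.get)  # 'aligned_short'
```

Normalizing once at indexing time and once per query is usually enough to avoid this; after that, an inner-product index behaves like a cosine index.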