| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-similarity-metrics | Similarity Metrics | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
Similarity Metrics
In one line
"A measure of how close two vectors are." The idea runs from classical IR (Salton, 1970s) to modern embedding-based retrieval (RAG with Claude Opus 4.7, 2026). Cosine is the usual default, and task-specific tuning is critical.
Core concepts
Main metrics
- Cosine: `cos(a,b) = (a·b)/(||a|| ||b||)` ∈ [-1, 1]. Direction only, scale-invariant.
- Dot product: `a·b`. Scale-sensitive: a large norm dominates.
- Euclidean (L2): `||a-b||₂`. Geometric distance, sensitive to magnitude.
- Manhattan (L1): `Σ|aᵢ-bᵢ|`. Robust to outliers.
- Jaccard: `|A∩B|/|A∪B|`. Set similarity.
- Hamming: count of differing positions. Binary vectors.
- Edit (Levenshtein): min insertions/deletions/substitutions. Strings.
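Manhattan and Hamming are short enough to write by hand. The following is a minimal sketch; the function names `manhattan` and `hamming` are illustrative and not taken from any library.

```python
import numpy as np

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum())

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))

print(manhattan([1, 2, 3], [2, 0, 3]))  # 3.0
print(hamming("10110", "10011"))        # 2
```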
Key relationships
- Normalized vectors (||v||=1): cosine = dot product = `1 - L2²/2`.
- Modern embedding models (OpenAI text-embedding-3, voyage-3, BGE-M3) typically return unit-normalized vectors, in which case cosine ≡ dot.
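The unit-vector identity follows from ||a-b||² = ||a||² + ||b||² - 2a·b, which reduces to 2 - 2a·b when both norms are 1. A quick numerical check, as a sketch with random vectors (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Scale both vectors to unit length.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# Cosine on the raw vectors, dot product on the unit vectors,
# and the value recovered from the Euclidean distance between unit vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_unit = np.dot(a_hat, b_hat)
l2_unit = np.linalg.norm(a_hat - b_hat)
from_l2 = 1.0 - l2_unit**2 / 2.0

# All three agree up to floating-point error.
assert np.isclose(cosine, dot_unit)
assert np.isclose(cosine, from_l2)
```

One practical consequence: on unit vectors, ranking by smallest L2 distance and ranking by largest dot product give the same order.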
Applications
- RAG retrieval (text embeddings + cosine).
- Image search (CLIP embeddings).
- Recommender systems (item-item).
- Deduplication (near-duplicate detection).
- Clustering (k-means uses Euclidean).
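As one concrete instance of the deduplication use case, near-duplicates can be flagged by thresholding pairwise cosine similarity. This is a brute-force sketch for small collections; the function name `near_duplicate_pairs` and the 0.95 threshold are assumptions chosen for illustration, and the right threshold depends on the embedding model and data.

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold.

    Assumes no all-zero rows (their norm would be 0).
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                    # cosine on unit vectors
    # Keep only the strict upper triangle so each pair appears once
    # and self-similarity on the diagonal is ignored.
    upper = np.triu(np.ones(sims.shape, dtype=bool), k=1)
    i, j = np.where(upper & (sims >= threshold))
    return list(zip(i.tolist(), j.tolist()))

vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(near_duplicate_pairs(vecs))  # [(0, 1)]
```

The full similarity matrix is O(n²) in memory, so a large corpus would need an index such as FAISS or an LSH scheme instead.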
💻 Patterns
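Cosine similarity

    import numpy as np

    def cosine(a, b):
        # Cosine of the angle between two 1-D vectors; undefined if either norm is 0
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Batch: rows of A (n, d) against rows of B (m, d) -> (n, m) similarity matrix
    def cosine_matrix(A, B):
        A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
        B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
        return A_norm @ B_norm.T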
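Sklearn pairwise

    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

    # X: (n_X, d) and Y: (n_Y, d) feature arrays, assumed to be defined already
    sim = cosine_similarity(X, Y)     # (n_X, n_Y) similarities
    dist = euclidean_distances(X, Y)  # (n_X, n_Y) distances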
FAISS (large-scale)

    import faiss
    import numpy as np

    # Random stand-in for real embeddings; FAISS expects float32
    embeddings = np.random.randn(1_000_000, 768).astype('float32')
    faiss.normalize_L2(embeddings)   # in place; for cosine via inner product
    index = faiss.IndexFlatIP(768)   # inner product = cosine on normalized
    index.add(embeddings)

    query = np.random.randn(1, 768).astype('float32')
    faiss.normalize_L2(query)
    D, I = index.search(query, k=10)  # top-10 scores D and row indices I
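RAG with embeddings (2026)

    import numpy as np
    import voyageai
    from anthropic import Anthropic

    vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
    client = Anthropic()    # reads ANTHROPIC_API_KEY; the generation step is not shown here

    def embed(texts):
        r = vo.embed(texts, model="voyage-3", input_type="document")
        return np.array(r.embeddings)

    # corpus (list of str) and query (str) are assumed to be defined by the caller
    docs_emb = embed(corpus)
    q_emb = vo.embed([query], model="voyage-3", input_type="query").embeddings[0]
    scores = docs_emb @ np.array(q_emb)    # cosine on normalized
    top_k = np.argsort(scores)[-5:][::-1]  # indices of the 5 best documents, best first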
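Jaccard for sets

    def jaccard(a: set, b: set) -> float:
        if not a and not b:
            return 1.0  # two empty sets are treated as identical
        return len(a & b) / len(a | b)

    # MinHash for approximate Jaccard at scale
    from datasketch import MinHash

    # doc1 and doc2 are assumed to be strings defined by the caller
    m1, m2 = MinHash(), MinHash()
    for w in doc1.split():
        m1.update(w.encode("utf8"))
    for w in doc2.split():
        m2.update(w.encode("utf8"))
    print(m1.jaccard(m2))  # estimated Jaccard similarity of the two word sets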
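Edit distance

    def levenshtein(a, b):
        if len(a) < len(b):
            a, b = b, a  # keep b as the shorter string so each row stays small
        prev = list(range(len(b) + 1))  # distances from the empty prefix of a
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(
                    curr[-1] + 1,            # insertion
                    prev[j] + 1,             # deletion
                    prev[j-1] + (ca != cb),  # substitution (free on a match)
                ))
            prev = curr
        return prev[-1]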
Hybrid search (BM25 + dense)

    # Reciprocal Rank Fusion
    def rrf(rankings, k=60):
        scores = {}
        for ranking in rankings:
            # Ranks are 1-based in the usual RRF formula: 1 / (k + rank)
            for rank, doc_id in enumerate(ranking, 1):
                scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
        return sorted(scores.items(), key=lambda x: -x[1])

    # bm25_search and vector_search are placeholders, each assumed to return
    # a list of doc ids ordered best first
    bm25_top = bm25_search(query)
    dense_top = vector_search(query)
    fused = rrf([bm25_top, dense_top])
Decision criteria
| Situation | Metric |
|---|---|
| Text embeddings (LLM, BERT) | Cosine |
| Image embeddings (CLIP) | Cosine |
| Tabular numeric features | Euclidean (after StandardScaler) |
| Sparse binary features | Jaccard |
| Strings / typo-tolerant | Levenshtein |
| Hashes / fingerprints | Hamming |
| L1-sparse (Lasso embeddings) | Manhattan |
Default: normalized embeddings + cosine (= dot product on unit vectors).
🔗 Graph
- Parent: Information Retrieval · Embeddings
- Applications: RAG · Recommender-Systems · Clustering
- Adjacent: FAISS · BM25 · Locality-Sensitive-Hashing (LSH)
🤖 LLM usage
When to use: RAG retrieval ranking. Semantic search. Deduplication. Clustering. Nearest-neighbor classification. Recommender similarity.
When not to use: Hierarchical / structured similarity (use tree edit distance, graph kernels). Causal similarity (use DTW for time-series).
❌ Anti-patterns
- Cosine on un-normalized vectors: if normalization is forgotten, an inner-product search no longer returns cosine. Call `normalize_L2` explicitly.
- Euclidean without scaling: when features are on different scales, the larger-scale feature dominates.
- Jaccard on dense vectors: Jaccard is a set similarity and does not apply to dense float vectors.
- Magnitude-blind cosine for ranking: cosine captures direction only, so it does not capture popularity or confidence.
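The scaling anti-pattern is easy to reproduce. The sketch below uses made-up age/income rows (hypothetical data) and plain NumPy standardization, which performs the same zero-mean, unit-variance transform as scikit-learn's `StandardScaler`. The helper name `nearest_to_first` is illustrative.

```python
import numpy as np

# Hypothetical toy data: column 0 is age (years), column 1 is income (dollars).
X = np.array([
    [25.0, 50_000.0],  # row 0: the query
    [60.0, 50_500.0],  # row 1: very different age, nearly the same income
    [26.0, 52_000.0],  # row 2: nearly the same age, slightly different income
    [40.0, 90_000.0],
    [55.0, 20_000.0],
])

def nearest_to_first(X):
    """Index of the row closest to row 0 under Euclidean distance."""
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(d)) + 1  # +1 because row 0 was skipped

# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw = nearest_to_first(X)            # 1: income differences swamp age
scaled = nearest_to_first(X_scaled)  # 2: both features now contribute
print(raw, scaled)
```

On the raw data a 35-year age gap counts for less than a 2,000-dollar income gap, so row 1 wins. After standardization the neighbor that is close on both features (row 2) is selected.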
🧪 Verification / duplicates
- Verified (Salton & McGill IR text, MTEB benchmark 2025, FAISS docs).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: similarity metrics with cosine/L2/Jaccard/edit, FAISS, RAG, hybrid |