f8b21af4be
10_Wiki/Topics 대규모 정리: - 오류 캡처/미완성 stub 문서 227개 제거 - 교차폴더 중복 43클러스터 병합 (63파일 → redirect) - 링크명 정규화: 깨진 링크 수정·redirect 직결·개념 매핑 ~2,400건 - 카테고리 MOC 6개 신규 생성 - Graph 섹션 미해결 related-keyword 링크 10,058건 제거 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5.2 KiB
5.2 KiB
id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack
| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-locality-sensitive-hashing-lsh | Locality-Sensitive Hashing (LSH) | 10_Wiki/Topics | verified | self |
|
none | A | 0.9 | applied |
|
2026-05-10 | pending |
|
Locality-Sensitive Hashing (LSH)
매 한 줄
"매 similar item 의 same bucket 으로의 hash". LSH 는 hash function family
\mathcal{H}가 매 distance-preserving — close points 는 collide, far points 는 separate. Indyk & Motwani (1998) 이 도입했고, 2026 에서는 ANN 의 매 classical baseline 이며 dedup, plagiarism, blocking, near-duplicate retrieval 의 매 default 로 여전히 사용 (HNSW 가 dominate 하지만 LSH 는 streaming/external memory 에 유리).
매 핵심
매 Definition
\mathcal{H} is $(r_1, r_2, p_1, p_2)$-sensitive iff:
d(x,y) \le r_1 \Rightarrow \Pr[h(x)=h(y)] \ge p_1d(x,y) \ge r_2 \Rightarrow \Pr[h(x)=h(y)] \le p_2r_1 < r_2,p_1 > p_2
매 Families
- MinHash: Jaccard distance — set similarity
- SimHash (random hyperplane): cosine — sign of
w^\top x - p-stable LSH:
\ell_pnorms (Datar 2004) - Cross-polytope: spherical distance (state-of-art)
매 Amplification
- AND:
g(x) = (h_1(x), \dots, h_k(x))— reducesp_2top_2^k - OR:
Ltables, query all → reduces miss rate - tune
(k, L)for target precision/recall
매 응용
- Dedup: web crawl near-dup pages (MinHash + LSH).
- Plagiarism: shingled MinHash.
- Blocking: entity resolution candidate generation.
- ANN: cosine NN (SimHash baseline).
- Genomics: sketch-based read alignment.
💻 패턴
MinHash + LSH for Jaccard
from datasketch import MinHash, MinHashLSH
def shingles(text, k=5):
return {text[i:i+k] for i in range(len(text)-k+1)}
def make_mh(s, num_perm=128):
m = MinHash(num_perm=num_perm)
for sh in s: m.update(sh.encode())
return m
lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {"d1": "the quick brown fox", "d2": "the quick red fox"}
mhs = {k: make_mh(shingles(v)) for k, v in docs.items()}
for k, m in mhs.items(): lsh.insert(k, m)
q = make_mh(shingles("the quick brown fox jumps"))
print(lsh.query(q)) # candidate set
SimHash for cosine
import numpy as np
from collections import defaultdict
def simhash(x, planes):
return tuple((x @ planes.T > 0).astype(np.int8))
# k random hyperplanes
d, k, L = 128, 8, 10
tables = [np.random.randn(k, d) for _ in range(L)]
def index(X):
out = [defaultdict(list) for _ in range(L)]
for i, x in enumerate(X):
for li, planes in enumerate(tables):
out[li][simhash(x, planes)].append(i)
return out
def query(q, idx):
cands = set()
for li, planes in enumerate(tables):
cands |= set(idx[li].get(simhash(q, planes), []))
return cands
p-stable LSH (L2)
# h(x) = floor((a·x + b) / w), a ~ N(0, I), b ~ U[0, w]
def make_l2_lsh(d, w=4.0, k=8):
a = np.random.randn(k, d)
b = np.random.uniform(0, w, k)
return lambda x: tuple(np.floor((a @ x + b) / w).astype(np.int64))
LSH Forest (multi-resolution)
from datasketch import MinHashLSHForest
forest = MinHashLSHForest(num_perm=128)
for k, m in mhs.items(): forest.add(k, m)
forest.index()
print(forest.query(q, 5)) # top-5 approx Jaccard NN
Banding technique (k-AND, L-OR)
def banded_lsh(signatures, k_per_band, L_bands):
# signatures: (n, k_per_band * L_bands)
buckets = [defaultdict(list) for _ in range(L_bands)]
for i, sig in enumerate(signatures):
for b in range(L_bands):
band = tuple(sig[b*k_per_band:(b+1)*k_per_band])
buckets[b][band].append(i)
return buckets
매 결정 기준
| Distance | Family |
|---|---|
| Jaccard (sets) | MinHash |
| Cosine | SimHash / Cross-polytope |
\ell_2 |
p-stable (Gaussian) |
| Hamming | bit-sampling |
| edit distance | shingle + MinHash approx |
기본값: HNSW for general ANN (faster); LSH for dedup, streaming, external memory, exact-recall guarantee.
🔗 Graph
- 부모: Approximate-Nearest-Neighbor
- 변형: MinHash · SimHash
- 응용: Deduplication
- Adjacent: HNSW · Faiss
🤖 LLM 활용
언제: massive corpus dedup (e.g. pretraining cleanup), candidate blocking, streaming. 언제 X: small (n < 10⁵) 또는 high-precision recall — HNSW/IVF 가 더 빠름.
❌ 안티패턴
- (k, L) tuning 무시: default 사용 → too many false positives or misses.
- Wrong family: cosine 인데 MinHash 사용 → meaningless.
- Re-hash on every query: index 재build → use persistent lib (datasketch, faiss).
- Treating LSH as exact: 매 approximate — verify candidates with true distance.
🧪 검증 / 중복
- Verified (Indyk & Motwani 1998 STOC, Andoni & Indyk 2008 CACM, Leskovec MMDS textbook ch 3).
- 신뢰도 A.
🕓 Changelog
| 날짜 | 변경 |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — LSH families, MinHash/SimHash, banding, dedup patterns |