--- id: wiki-2026-0508-locality-sensitive-hashing-lsh title: Locality-Sensitive Hashing (LSH) category: 10_Wiki/Topics status: verified canonical_id: self aliases: [LSH, Approximate Nearest Neighbor, MinHash, SimHash] duplicate_of: none source_trust_level: A confidence_score: 0.9 verification_status: applied tags: [hashing, ann, retrieval, similarity-search] raw_sources: [] last_reinforced: 2026-05-10 github_commit: pending tech_stack: language: python framework: datasketch, faiss --- # Locality-Sensitive Hashing (LSH) ## 매 한 줄 > **"매 similar item 의 same bucket 으로의 hash"**. LSH 는 hash function family $\mathcal{H}$ 가 매 distance-preserving — close points 는 collide, far points 는 separate. Indyk & Motwani (1998) 이 도입했고, 2026 에서는 ANN 의 매 classical baseline 이며 dedup, plagiarism, blocking, near-duplicate retrieval 의 매 default 로 여전히 사용 (HNSW 가 dominate 하지만 LSH 는 streaming/external memory 에 유리). ## 매 핵심 ### 매 Definition $\mathcal{H}$ is $(r_1, r_2, p_1, p_2)$-sensitive iff: - $d(x,y) \le r_1 \Rightarrow \Pr[h(x)=h(y)] \ge p_1$ - $d(x,y) \ge r_2 \Rightarrow \Pr[h(x)=h(y)] \le p_2$ - $r_1 < r_2$, $p_1 > p_2$ ### 매 Families - **MinHash**: Jaccard distance — set similarity - **SimHash (random hyperplane)**: cosine — sign of $w^\top x$ - **p-stable LSH**: $\ell_p$ norms (Datar 2004) - **Cross-polytope**: spherical distance (state-of-art) ### 매 Amplification - AND: $g(x) = (h_1(x), \dots, h_k(x))$ — reduces $p_2$ to $p_2^k$ - OR: $L$ tables, query all → reduces miss rate - tune $(k, L)$ for target precision/recall ### 매 응용 1. **Dedup**: web crawl near-dup pages (MinHash + LSH). 2. **Plagiarism**: shingled MinHash. 3. **Blocking**: entity resolution candidate generation. 4. **ANN**: cosine NN (SimHash baseline). 5. **Genomics**: sketch-based read alignment. ## 💻 패턴 ### MinHash + LSH for Jaccard ```python from datasketch import MinHash, MinHashLSH def shingles(text, k=5): return {text[i:i+k] for i in range(len(text)-k+1)} def make_mh(s, num_perm=128): m = MinHash(num_perm=num_perm) for sh in s: m.update(sh.encode()) return m lsh = MinHashLSH(threshold=0.7, num_perm=128) docs = {"d1": "the quick brown fox", "d2": "the quick red fox"} mhs = {k: make_mh(shingles(v)) for k, v in docs.items()} for k, m in mhs.items(): lsh.insert(k, m) q = make_mh(shingles("the quick brown fox jumps")) print(lsh.query(q)) # candidate set ``` ### SimHash for cosine ```python import numpy as np from collections import defaultdict def simhash(x, planes): return tuple((x @ planes.T > 0).astype(np.int8)) # k random hyperplanes d, k, L = 128, 8, 10 tables = [np.random.randn(k, d) for _ in range(L)] def index(X): out = [defaultdict(list) for _ in range(L)] for i, x in enumerate(X): for li, planes in enumerate(tables): out[li][simhash(x, planes)].append(i) return out def query(q, idx): cands = set() for li, planes in enumerate(tables): cands |= set(idx[li].get(simhash(q, planes), [])) return cands ``` ### p-stable LSH (L2) ```python # h(x) = floor((a·x + b) / w), a ~ N(0, I), b ~ U[0, w] def make_l2_lsh(d, w=4.0, k=8): a = np.random.randn(k, d) b = np.random.uniform(0, w, k) return lambda x: tuple(np.floor((a @ x + b) / w).astype(np.int64)) ``` ### LSH Forest (multi-resolution) ```python from datasketch import MinHashLSHForest forest = MinHashLSHForest(num_perm=128) for k, m in mhs.items(): forest.add(k, m) forest.index() print(forest.query(q, 5)) # top-5 approx Jaccard NN ``` ### Banding technique (k-AND, L-OR) ```python def banded_lsh(signatures, k_per_band, L_bands): # signatures: (n, k_per_band * L_bands) buckets = [defaultdict(list) for _ in range(L_bands)] for i, sig in enumerate(signatures): for b in range(L_bands): band = tuple(sig[b*k_per_band:(b+1)*k_per_band]) buckets[b][band].append(i) return buckets ``` ## 매 결정 기준 | Distance | Family | |---|---| | Jaccard (sets) | MinHash | | Cosine | SimHash / Cross-polytope | | $\ell_2$ | p-stable (Gaussian) | | Hamming | bit-sampling | | edit distance | shingle + MinHash approx | **기본값**: HNSW for general ANN (faster); LSH for dedup, streaming, external memory, exact-recall guarantee. ## 🔗 Graph - 부모: [[Approximate-Nearest-Neighbor]] - 변형: [[MinHash]] · [[SimHash]] - 응용: [[Deduplication]] - Adjacent: [[HNSW]] · [[Faiss]] ## 🤖 LLM 활용 **언제**: massive corpus dedup (e.g. pretraining cleanup), candidate blocking, streaming. **언제 X**: small (n < 10⁵) 또는 high-precision recall — HNSW/IVF 가 더 빠름. ## ❌ 안티패턴 - **(k, L) tuning 무시**: default 사용 → too many false positives or misses. - **Wrong family**: cosine 인데 MinHash 사용 → meaningless. - **Re-hash on every query**: index 재build → use persistent lib (datasketch, faiss). - **Treating LSH as exact**: 매 approximate — verify candidates with true distance. ## 🧪 검증 / 중복 - Verified (Indyk & Motwani 1998 STOC, Andoni & Indyk 2008 CACM, Leskovec MMDS textbook ch 3). - 신뢰도 A. ## 🕓 Changelog | 날짜 | 변경 | |---|---| | 2026-05-08 | Phase 1 | | 2026-05-10 | Manual cleanup — LSH families, MinHash/SimHash, banding, dedup patterns |