id: wiki-2026-0508-locality-sensitive-hashing-lsh
title: Locality-Sensitive Hashing (LSH)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: LSH, Approximate Nearest Neighbor, MinHash, SimHash
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: hashing, ann, retrieval, similarity-search
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: datasketch, faiss

Locality-Sensitive Hashing (LSH)

One-line summary

"Hash similar items into the same bucket." LSH uses a hash function family $\mathcal{H}$ that is locality-sensitive: close points collide with high probability, while far points mostly land in separate buckets. Indyk & Motwani (1998) introduced it. As of 2026 it is the classical baseline for ANN and is still a common default for dedup, plagiarism detection, blocking, and near-duplicate retrieval (HNSW dominates general ANN, but LSH is better suited to streaming and external-memory settings).

Core concepts

Definition

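A family $\mathcal{H}$ is $(r_1, r_2, p_1, p_2)$-sensitive for a distance $d$ if, for $h$ drawn uniformly at random from $\mathcal{H}$ and any points $x, y$:

  • $d(x,y) \le r_1 \Rightarrow \Pr[h(x)=h(y)] \ge p_1$
  • $d(x,y) \ge r_2 \Rightarrow \Pr[h(x)=h(y)] \le p_2$
  • $r_1 < r_2$ and $p_1 > p_2$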
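
As a concrete instance of the definition, the random-hyperplane family has collision probability $1 - \theta/\pi$ for two vectors at angle $\theta$, so a small angle gives a high $p_1$ and a large angle a low $p_2$. The sketch below checks this empirically; the helper name `hyperplane_collision_rate`, the trial count, and the test vectors are illustrative assumptions, not part of the original note.

```python
import numpy as np

def hyperplane_collision_rate(x, y, n_trials=20000, seed=0):
    """Empirical Pr[h(x) = h(y)] for h(v) = sign(w . v), with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_trials, x.shape[0]))  # one hyperplane per trial
    return float(np.mean((w @ x > 0) == (w @ y > 0)))

x = np.array([1.0, 0.0])
near = np.array([np.cos(0.2), np.sin(0.2)])  # 0.2 rad away from x
far = np.array([np.cos(2.0), np.sin(2.0)])   # 2.0 rad away from x

p_near = hyperplane_collision_rate(x, near)  # theory: 1 - 0.2 / pi
p_far = hyperplane_collision_rate(x, far)    # theory: 1 - 2.0 / pi
print(p_near, p_far)
```

The two rates play the roles of $p_1$ and $p_2$ for the chosen angular radii $r_1 = 0.2$ and $r_2 = 2.0$.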

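Families

  • MinHash: Jaccard similarity between sets; $\Pr[h(A)=h(B)] = J(A,B)$
  • SimHash (random hyperplane): cosine similarity; $h(x) = \operatorname{sign}(w^\top x)$ with Gaussian $w$
  • p-stable LSH: $\ell_p$ norms for $p \in (0, 2]$ (Datar et al. 2004)
  • Cross-polytope: angular distance on the unit sphere (state of the art for this metric)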
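
To make the MinHash entry concrete: the fraction of positions where two signatures agree is an unbiased estimate of the Jaccard similarity. The following is a minimal from-scratch sketch, assuming seeded SHA-1 digests as a stand-in for random permutations; the helper names and the 256-hash signature length are hypothetical choices, and a real pipeline would more likely use `datasketch`.

```python
import hashlib

def seeded_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); stands in for a random permutation."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=256):
    """For each seed, keep the minimum hash value over the set."""
    return [min(seeded_hash(seed, it) for it in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two min-hashes agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def jaccard(a, b):
    return len(a & b) / len(a | b)

A = set(range(0, 80))
B = set(range(20, 100))
true_j = jaccard(A, B)  # |A ∩ B| = 60, |A ∪ B| = 100
est_j = estimate_jaccard(minhash_signature(A), minhash_signature(B))
print(true_j, est_j)
```

The estimate's standard error shrinks as $1/\sqrt{\text{num\_hashes}}$, which is why signature length trades memory for accuracy.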

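Amplification

  • AND: $g(x) = (h_1(x), \dots, h_k(x))$; a collision requires all $k$ hashes to agree, so $p_2$ drops to $p_2^k$ (and $p_1$ to $p_1^k$)
  • OR: build $L$ independent tables and query all of them; a pair is missed only if it collides in no table, so the miss rate falls to $(1 - p_1^k)^L$
  • Tune $(k, L)$ for the target precision/recall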
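
Combining the AND and OR constructions gives the familiar S-curve: a pair whose single-hash collision probability is $p$ becomes a candidate with probability $1 - (1 - p^k)^L$. A small sketch of that formula follows; the function name `candidate_prob` and the example values of $k$, $L$, $p_1$, $p_2$ are illustrative assumptions.

```python
def candidate_prob(p, k, L):
    """Probability that a pair collides in at least one of L tables of k ANDed hashes."""
    return 1.0 - (1.0 - p ** k) ** L

k, L = 8, 16
p1, p2 = 0.9, 0.5  # assumed single-hash collision rates for near / far pairs

near = candidate_prob(p1, k, L)  # above 0.999: near pairs are almost never missed
far = candidate_prob(p2, k, L)   # about 0.06: most far pairs are filtered out

# Rule-of-thumb location of the S-curve's steep rise.
threshold = (1.0 / L) ** (1.0 / k)
print(near, far, threshold)
```

Raising $k$ pushes the curve to the right (fewer false positives), while raising $L$ pushes it to the left (fewer misses), so the two are usually tuned together.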

Applications

  1. Dedup: web crawl near-dup pages (MinHash + LSH).
  2. Plagiarism: shingled MinHash.
  3. Blocking: entity resolution candidate generation.
  4. ANN: cosine NN (SimHash baseline).
  5. Genomics: sketch-based read alignment.

💻 Patterns

MinHash + LSH for Jaccard

from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    return {text[i:i+k] for i in range(len(text)-k+1)}

def make_mh(s, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for sh in s: m.update(sh.encode())
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {"d1": "the quick brown fox", "d2": "the quick red fox"}
mhs = {k: make_mh(shingles(v)) for k, v in docs.items()}
for k, m in mhs.items(): lsh.insert(k, m)

q = make_mh(shingles("the quick brown fox jumps"))
print(lsh.query(q))  # candidate set

SimHash for cosine

import numpy as np
from collections import defaultdict

def simhash(x, planes):
    return tuple((x @ planes.T > 0).astype(np.int8))

# L independent tables, each with k random hyperplanes in d dimensions
d, k, L = 128, 8, 10
tables = [np.random.randn(k, d) for _ in range(L)]

def index(X):
    out = [defaultdict(list) for _ in range(L)]
    for i, x in enumerate(X):
        for li, planes in enumerate(tables):
            out[li][simhash(x, planes)].append(i)
    return out

def query(q, idx):
    cands = set()
    for li, planes in enumerate(tables):
        cands |= set(idx[li].get(simhash(q, planes), []))
    return cands

p-stable LSH (L2)

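import numpy as np

# h(x) = floor((a·x + b) / w), a ~ N(0, I), b ~ U[0, w]
def make_l2_lsh(d, w=4.0, k=8):
    a = np.random.randn(k, d)        # k Gaussian (2-stable) projection vectors
    b = np.random.uniform(0, w, k)   # random offset per projection
    # Concatenate the k bucket indices into one hashable key (AND construction)
    return lambda x: tuple(np.floor((a @ x + b) / w).astype(np.int64))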
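
To see the locality property of this family, one can draw many independent single-projection hashes and count how often a nearby point and a distant point share a bucket with the query. The sketch below does that under assumed values (16 dimensions, `w = 4.0`, distances 0.5 and 8.0, 2000 hashes); `make_l2_hash` is a hypothetical single-hash variant of the function above that takes an explicit random generator for reproducibility.

```python
import numpy as np

def make_l2_hash(d, w, rng):
    """One p-stable hash h(x) = floor((a . x + b) / w) with Gaussian a."""
    a = rng.standard_normal(d)
    b = rng.uniform(0.0, w)
    return lambda x: int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(0)
d, w = 16, 4.0
x = rng.standard_normal(d)
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                # unit direction
near, far = x + 0.5 * u, x + 8.0 * u  # L2 distance 0.5 and 8.0 from x

hashes = [make_l2_hash(d, w, rng) for _ in range(2000)]
near_rate = float(np.mean([h(x) == h(near) for h in hashes]))
far_rate = float(np.mean([h(x) == h(far) for h in hashes]))
print(near_rate, far_rate)
```

The bucket width `w` sets the scale: widening it raises both rates, so it is typically chosen relative to the distance at which pairs should still count as neighbors.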

LSH Forest (multi-resolution)

from datasketch import MinHashLSHForest
forest = MinHashLSHForest(num_perm=128)
for k, m in mhs.items(): forest.add(k, m)
forest.index()
print(forest.query(q, 5))  # top-5 approx Jaccard NN

Banding technique (k-AND, L-OR)

from collections import defaultdict

def banded_lsh(signatures, k_per_band, L_bands):
    # signatures: (n, k_per_band * L_bands); one bucket table per band
    buckets = [defaultdict(list) for _ in range(L_bands)]
    for i, sig in enumerate(signatures):
        for b in range(L_bands):
            band = tuple(sig[b*k_per_band:(b+1)*k_per_band])
            buckets[b][band].append(i)
    return buckets
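
The band buckets are usually turned into candidate pairs next: any two items that share a bucket in at least one band become a pair to check. A minimal sketch of that step follows, with a compact re-statement of the bucketing so the block runs on its own; `band_buckets`, `candidate_pairs`, and the toy signatures are hypothetical names and data introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations

def band_buckets(signatures, k_per_band, L_bands):
    """One dict per band, mapping the band's tuple of values to item indices."""
    buckets = [defaultdict(list) for _ in range(L_bands)]
    for i, sig in enumerate(signatures):
        for b in range(L_bands):
            band = tuple(sig[b * k_per_band:(b + 1) * k_per_band])
            buckets[b][band].append(i)
    return buckets

def candidate_pairs(buckets):
    """Pairs (i, j), i < j, that share a bucket in at least one band (OR over bands)."""
    pairs = set()
    for table in buckets:
        for ids in table.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs

signatures = [
    [1, 2, 3, 4],  # item 0
    [1, 2, 9, 9],  # item 1: matches item 0 on band 0
    [7, 7, 3, 4],  # item 2: matches item 0 on band 1
    [5, 5, 5, 5],  # item 3: matches nothing
]
pairs = candidate_pairs(band_buckets(signatures, k_per_band=2, L_bands=2))
print(sorted(pairs))  # [(0, 1), (0, 2)]
```

Large buckets produce quadratically many pairs, so in practice very common band values are often capped or skipped.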

Decision criteria

| Distance | Family |
| --- | --- |
| Jaccard (sets) | MinHash |
| Cosine | SimHash / Cross-polytope |
| $\ell_2$ | p-stable (Gaussian) |
| Hamming | bit sampling |
| Edit distance | shingling + MinHash (approximation) |

Default: HNSW for general ANN (faster); LSH for dedup, streaming, external memory, or when provable (probabilistic) recall guarantees are needed.

🔗 Graph

🤖 LLM usage

When to use: massive corpus dedup (e.g. pretraining cleanup), candidate blocking, streaming. When not to use: small datasets (n < 10⁵) or workloads that need high recall at high precision; HNSW/IVF is faster there.

Anti-patterns

  • Ignoring (k, L) tuning: relying on defaults leads to too many false positives or too many misses.
  • Wrong family: using MinHash when the metric is cosine gives meaningless results.
  • Re-hashing on every query: rebuilding the index each time is wasteful; use a persistent library (datasketch, faiss).
  • Treating LSH as exact: it is approximate, so verify candidates with the true distance.
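
The last point can be sketched as a post-filter: take whatever candidate IDs the index returns, recompute the exact similarity, and keep only those above the threshold. The example below assumes Jaccard similarity over Python sets; `verify_candidates`, `sets_by_id`, and the sample data are hypothetical and stand in for the real index output and document store.

```python
def jaccard(a, b):
    """Exact Jaccard similarity; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def verify_candidates(query_set, candidates, sets_by_id, threshold):
    """Re-score LSH candidates exactly and keep those at or above the threshold."""
    scored = [(cid, jaccard(query_set, sets_by_id[cid])) for cid in candidates]
    kept = [(cid, s) for cid, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

sets_by_id = {
    "d1": {1, 2, 3, 4},
    "d2": {1, 2, 3, 9},
    "d3": {7, 8, 9},
}
candidates = ["d1", "d2", "d3"]  # e.g. the IDs returned by an LSH query
hits = verify_candidates({1, 2, 3, 4}, candidates, sets_by_id, threshold=0.5)
print(hits)  # [('d1', 1.0), ('d2', 0.6)]
```

This removes false positives only; pairs the index never surfaced stay missed, which is what tuning (k, L) addresses.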

🧪 Verification / duplicates

  • Verified (Indyk & Motwani 1998 STOC; Andoni & Indyk 2008 CACM; Leskovec, Rajaraman & Ullman, Mining of Massive Datasets, ch. 3).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: LSH families, MinHash/SimHash, banding, dedup patterns |