"매 similar item 의 same bucket 으로의 hash". LSH 는 hash function family \mathcal{H} 가 매 distance-preserving — close points 는 collide, far points 는 separate. Indyk & Motwani (1998) 이 도입했고, 2026 에서는 ANN 의 매 classical baseline 이며 dedup, plagiarism, blocking, near-duplicate retrieval 의 매 default 로 여전히 사용 (HNSW 가 dominate 하지만 LSH 는 streaming/external memory 에 유리).
매 핵심
매 Definition
\mathcal{H} is $(r_1, r_2, p_1, p_2)$-sensitive iff:
d(x,y) \le r_1 \Rightarrow \Pr[h(x)=h(y)] \ge p_1
d(x,y) \ge r_2 \Rightarrow \Pr[h(x)=h(y)] \le p_2
r_1 < r_2, p_1 > p_2
매 Families
MinHash: Jaccard distance — set similarity
SimHash (random hyperplane): cosine — sign of w^\top x
fromdatasketchimportMinHash,MinHashLSHdefshingles(text,k=5):return{text[i:i+k]foriinrange(len(text)-k+1)}defmake_mh(s,num_perm=128):m=MinHash(num_perm=num_perm)forshins:m.update(sh.encode())returnmlsh=MinHashLSH(threshold=0.7,num_perm=128)docs={"d1":"the quick brown fox","d2":"the quick red fox"}mhs={k:make_mh(shingles(v))fork,vindocs.items()}fork,minmhs.items():lsh.insert(k,m)q=make_mh(shingles("the quick brown fox jumps"))print(lsh.query(q))# candidate set
SimHash for cosine
importnumpyasnpfromcollectionsimportdefaultdictdefsimhash(x,planes):returntuple((x@planes.T>0).astype(np.int8))# k random hyperplanesd,k,L=128,8,10tables=[np.random.randn(k,d)for_inrange(L)]defindex(X):out=[defaultdict(list)for_inrange(L)]fori,xinenumerate(X):forli,planesinenumerate(tables):out[li][simhash(x,planes)].append(i)returnoutdefquery(q,idx):cands=set()forli,planesinenumerate(tables):cands|=set(idx[li].get(simhash(q,planes),[]))returncands
p-stable LSH (L2)
# h(x) = floor((a·x + b) / w), a ~ N(0, I), b ~ U[0, w]defmake_l2_lsh(d,w=4.0,k=8):a=np.random.randn(k,d)b=np.random.uniform(0,w,k)returnlambdax:tuple(np.floor((a@x+b)/w).astype(np.int64))
LSH Forest (multi-resolution)
fromdatasketchimportMinHashLSHForestforest=MinHashLSHForest(num_perm=128)fork,minmhs.items():forest.add(k,m)forest.index()print(forest.query(q,5))# top-5 approx Jaccard NN