Files

T

Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

10_Wiki/Topics 대규모 정리:
- 오류 캡처/미완성 stub 문서 227개 제거
- 교차폴더 중복 43클러스터 병합 (63파일 → redirect)
- 링크명 정규화: 깨진 링크 수정·redirect 직결·개념 매핑 ~2,400건
- 카테고리 MOC 6개 신규 생성
- Graph 섹션 미해결 related-keyword 링크 10,058건 제거

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

5.2 KiB

Raw Blame History

id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Locality-Sensitive Hashing (LSH)

매 한 줄

"매 similar item 의 same bucket 으로의 hash". LSH 는 hash function family \mathcal{H} 가 매 distance-preserving — close points 는 collide, far points 는 separate. Indyk & Motwani (1998) 이 도입했고, 2026 에서는 ANN 의 매 classical baseline 이며 dedup, plagiarism, blocking, near-duplicate retrieval 의 매 default 로 여전히 사용 (HNSW 가 dominate 하지만 LSH 는 streaming/external memory 에 유리).

매 핵심

매 Definition

\mathcal{H} is $(r_1, r_2, p_1, p_2)$-sensitive iff:

d(x,y) \le r_1 \Rightarrow \Pr[h(x)=h(y)] \ge p_1
d(x,y) \ge r_2 \Rightarrow \Pr[h(x)=h(y)] \le p_2
r_1 < r_2, p_1 > p_2

매 Families

MinHash: Jaccard distance — set similarity
SimHash (random hyperplane): cosine — sign of w^\top x
p-stable LSH: \ell_p norms (Datar 2004)
Cross-polytope: spherical distance (state-of-art)

매 Amplification

AND: g(x) = (h_1(x), \dots, h_k(x)) — reduces p_2 to p_2^k
OR: L tables, query all → reduces miss rate
tune (k, L) for target precision/recall

매 응용

Dedup: web crawl near-dup pages (MinHash + LSH).
Plagiarism: shingled MinHash.
Blocking: entity resolution candidate generation.
ANN: cosine NN (SimHash baseline).
Genomics: sketch-based read alignment.

💻 패턴

MinHash + LSH for Jaccard

from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    return {text[i:i+k] for i in range(len(text)-k+1)}

def make_mh(s, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for sh in s: m.update(sh.encode())
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)
docs = {"d1": "the quick brown fox", "d2": "the quick red fox"}
mhs = {k: make_mh(shingles(v)) for k, v in docs.items()}
for k, m in mhs.items(): lsh.insert(k, m)

q = make_mh(shingles("the quick brown fox jumps"))
print(lsh.query(q))  # candidate set

SimHash for cosine

import numpy as np
from collections import defaultdict

def simhash(x, planes):
    return tuple((x @ planes.T > 0).astype(np.int8))

# k random hyperplanes
d, k, L = 128, 8, 10
tables = [np.random.randn(k, d) for _ in range(L)]

def index(X):
    out = [defaultdict(list) for _ in range(L)]
    for i, x in enumerate(X):
        for li, planes in enumerate(tables):
            out[li][simhash(x, planes)].append(i)
    return out

def query(q, idx):
    cands = set()
    for li, planes in enumerate(tables):
        cands |= set(idx[li].get(simhash(q, planes), []))
    return cands

p-stable LSH (L2)

# h(x) = floor((a·x + b) / w), a ~ N(0, I), b ~ U[0, w]
def make_l2_lsh(d, w=4.0, k=8):
    a = np.random.randn(k, d)
    b = np.random.uniform(0, w, k)
    return lambda x: tuple(np.floor((a @ x + b) / w).astype(np.int64))

LSH Forest (multi-resolution)

from datasketch import MinHashLSHForest
forest = MinHashLSHForest(num_perm=128)
for k, m in mhs.items(): forest.add(k, m)
forest.index()
print(forest.query(q, 5))  # top-5 approx Jaccard NN

Banding technique (k-AND, L-OR)

def banded_lsh(signatures, k_per_band, L_bands):
    # signatures: (n, k_per_band * L_bands)
    buckets = [defaultdict(list) for _ in range(L_bands)]
    for i, sig in enumerate(signatures):
        for b in range(L_bands):
            band = tuple(sig[b*k_per_band:(b+1)*k_per_band])
            buckets[b][band].append(i)
    return buckets

매 결정 기준

Distance	Family
Jaccard (sets)	MinHash
Cosine	SimHash / Cross-polytope
`\ell_2`	p-stable (Gaussian)
Hamming	bit-sampling
edit distance	shingle + MinHash approx

기본값: HNSW for general ANN (faster); LSH for dedup, streaming, external memory, exact-recall guarantee.

🔗 Graph

부모: Approximate-Nearest-Neighbor
변형: MinHash · SimHash
응용: Deduplication
Adjacent: HNSW · Faiss

🤖 LLM 활용

언제: massive corpus dedup (e.g. pretraining cleanup), candidate blocking, streaming. 언제 X: small (n < 10⁵) 또는 high-precision recall — HNSW/IVF 가 더 빠름.

❌ 안티패턴

(k, L) tuning 무시: default 사용 → too many false positives or misses.
Wrong family: cosine 인데 MinHash 사용 → meaningless.
Re-hash on every query: index 재build → use persistent lib (datasketch, faiss).
Treating LSH as exact: 매 approximate — verify candidates with true distance.

🧪 검증 / 중복

Verified (Indyk & Motwani 1998 STOC, Andoni & Indyk 2008 CACM, Leskovec MMDS textbook ch 3).
신뢰도 A.

🕓 Changelog

날짜	변경
2026-05-08	Phase 1
2026-05-10	Manual cleanup — LSH families, MinHash/SimHash, banding, dedup patterns

5.2 KiB Raw Blame History