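koriweb d8a80f6272 chore(wiki): normalize dangling links to canonical titles (768 files / 1,200 links)

Replaced [[wikilinks]] that differed from their target only in spelling with the
target document's canonical title, reconnecting 1,200 broken links. Only normalized
title/filename matches were applied; alias matching was excluded because of the risk
of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs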

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-08 12:24:15 +09:00



title

K-Nearest Neighbors (k-NN)

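One-line summary

"For each query, find the K closest training points and take their vote (classification) or average (regression)." k-NN is lazy learning: there is no training phase, only a stored dataset. It is a simple but effective baseline. In modern systems it is the backbone of vector search (FAISS, Pinecone), and RAG retrieval is ultimately k-NN as well.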

Core concepts

Tasks

Classification: majority vote among the K neighbors.
Regression: average of the K neighbors' targets.
Density estimation.
Anomaly detection.
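
The first two tasks can be sketched from scratch in a few lines. This is a minimal illustration using only the standard library, not how sklearn implements it; the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Average of the targets of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Toy data (invented for illustration): two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_classify(points, labels, (0.5, 0.5), k=3)   # neighbors are all "a"
value = knn_regress(points, targets, (5.5, 5.5), k=3)   # mean of 10, 11, 12
```

Note that there is no `fit` step: all the work happens at query time, which is what "lazy learning" refers to.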

Distance metrics

Euclidean (L2).
Cosine (text, embeddings).
Manhattan (L1).
Hamming (binary).
Custom (e.g. Mahalanobis).
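
To make the differences concrete, here is a small sketch of four of these metrics in plain Python. The function names are chosen for this example; Mahalanobis is omitted because it needs a covariance matrix.

```python
import math

def euclidean(a, b):
    """L2 distance: straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: depends on direction only, not magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

u, v, w = (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)

d_l2 = euclidean(u, v)            # same direction, different length
d_cos_same = cosine_distance(u, v)
d_cos_orth = cosine_distance(u, w)
d_l1 = manhattan(u, w)
d_ham = hamming((1, 0, 1, 1), (1, 1, 0, 1))
```

`u` and `v` point in the same direction, so their cosine distance is zero even though their Euclidean distance is not. That is why cosine is the usual choice for embeddings, where vector length often carries little meaning.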

Efficiency

Brute force: O(N·d) per query (N points, d dimensions).
KD-tree (low-dim).
Ball tree.
HNSW (graph-based approximate search, available in FAISS).
IVF (inverted file).
PQ (product quantization).
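
The inverted-file idea can be sketched without any library: assign every point to its nearest centroid, then scan only the lists of the `nprobe` centroids closest to the query. This is a toy illustration under simplifying assumptions (hand-picked centroids instead of k-means, no quantization), not FAISS's implementation; all names are invented for the example.

```python
import math

def nearest_centroids(centroids, x, n):
    """Indices of the n centroids closest to x."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], x))
    return order[:n]

def build_ivf(points, centroids):
    """Inverted file: map each centroid index to the ids of the points assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for i, p in enumerate(points):
        lists[nearest_centroids(centroids, p, 1)[0]].append(i)
    return lists

def ivf_search(points, centroids, lists, query, k, nprobe):
    """Scan only the nprobe lists closest to the query, then rank candidates exactly."""
    candidates = [i for c in nearest_centroids(centroids, query, nprobe) for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(points[i], query))
    return candidates[:k]

# Toy data with hand-picked centroids (an assumption; a real index learns them).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
centroids = [(0.5, 0.5), (10.5, 10.5)]
lists = build_ivf(points, centroids)

query = (6, 6)
approx = ivf_search(points, centroids, lists, query, k=1, nprobe=1)  # scans one list
exact = ivf_search(points, centroids, lists, query, k=1, nprobe=2)   # scans every list
```

Here the true nearest neighbor `(5, 5)` sits in the list of the other centroid, so `nprobe=1` misses it and `nprobe=2` finds it. That is the speed versus recall trade-off that `nprobe` controls.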

Applications

Image retrieval.
Recommendation.
RAG retrieval.
Anomaly detection.
Baseline classifier.

💻 Patterns

Basic (sklearn)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

Cosine (for embeddings)

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')

KD-tree (for low-dim)

from sklearn.neighbors import KDTree
tree = KDTree(X)
distances, indices = tree.query(X_query, k=5)

FAISS (large-scale)

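import faiss
import numpy as np

d = 768  # embedding dimension
# FAISS expects C-contiguous float32 arrays of shape (n, d).
X = np.ascontiguousarray(X, dtype=np.float32)
query = np.ascontiguousarray(query, dtype=np.float32)

index = faiss.IndexFlatIP(d)  # exact inner-product search
faiss.normalize_L2(X)  # in place; inner product on unit vectors equals cosine similarity
index.add(X)

faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # D: similarities, I: neighbor ids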
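
Why normalize before an inner-product search? A small stdlib-only check, with made-up vectors, shows that the inner product of L2-normalized vectors equals cosine similarity, while the raw inner product favors long vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented 2-D "embeddings" for illustration.
docs = [[3.0, 4.0], [10.0, 0.0], [0.5, 0.5]]
query = [1.0, 1.0]

unit_query = l2_normalize(query)
ip_scores = [dot(l2_normalize(d), unit_query) for d in docs]  # normalized inner product
cos_scores = [cosine(d, query) for d in docs]                 # cosine similarity
raw_scores = [dot(d, query) for d in docs]                    # unnormalized inner product

ranking = sorted(range(len(docs)), key=lambda i: -ip_scores[i])
raw_ranking = sorted(range(len(docs)), key=lambda i: -raw_scores[i])
```

`ranking` puts the short vector `[0.5, 0.5]` first because it points exactly along the query, whereas `raw_ranking` puts `[10.0, 0.0]` first purely because of its length.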

FAISS HNSW (approximate, fast)

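index = faiss.IndexHNSWFlat(d, 32)  # M = 32 links per node; pass positionally, the SWIG constructor takes no keyword arguments
index.hnsw.efConstruction = 200  # build-time candidate list size: higher = better graph, slower build
index.add(X)
index.hnsw.efSearch = 50  # query-time candidate list size: higher = better recall, slower search
D, I = index.search(query, k=10)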

FAISS IVF + PQ (massive scale)

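nlist = 100  # number of coarse clusters (inverted lists)
quantizer = faiss.IndexFlatL2(d)  # assigns each vector to its nearest coarse centroid
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each (d must be divisible by 8)
index.train(X)  # learns the coarse centroids and PQ codebooks; required before add()
index.add(X)
index.nprobe = 10  # lists scanned per query: higher = better recall, slower search
D, I = index.search(query, k=10)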

Annoy (alternative)

from annoy import AnnoyIndex
index = AnnoyIndex(d, 'angular')  # angular distance, which ranks like cosine
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(n_trees=10)  # more trees = better recall, larger index
neighbors = index.get_nns_by_vector(query, 10)

Custom distance

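import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def custom_dist(a, b):
    return np.sum(np.abs(a - b))  # Manhattan distance, written out by hand

# A Python callable is invoked per pair of points, so this is much slower than a built-in metric.
knn = KNeighborsClassifier(n_neighbors=5, metric=custom_dist)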

Weighted by distance

knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Closer neighbors get a higher weight in the vote (weight = 1 / distance).
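
A toy comparison shows when distance weighting changes the outcome. This is a simplified sketch of inverse-distance voting, not sklearn's code; the function names and the neighbor list are invented for the example.

```python
from collections import defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label). Every neighbor counts equally."""
    counts = defaultdict(int)
    for _, label in neighbors:
        counts[label] += 1
    return max(counts, key=counts.get)

def weighted_vote(neighbors):
    """Each neighbor votes with weight 1 / distance."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        if dist == 0:
            return label  # simplification: an exact match decides the outcome
        scores[label] += 1.0 / dist
    return max(scores, key=scores.get)

# One very close "a" neighbor and two distant "b" neighbors.
neighbors = [(0.1, "a"), (2.0, "b"), (2.5, "b")]

uniform_label = majority_vote(neighbors)   # "b" wins 2 to 1
weighted_label = weighted_vote(neighbors)  # "a" scores 10.0, "b" scores 0.9
```

Weighting mainly helps when K is large enough to pull in neighbors that are much farther away than the closest ones.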

k-NN regression

from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=5).fit(X, y)

Anomaly detection (LOF)

from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
anomalies = lof.fit_predict(X) == -1
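
LOF compares each point's local density with that of its neighbors. A simpler relative of it, sketched below, scores each point by its mean distance to its k nearest neighbors. Note that this is a swapped-in, simpler technique (k-NN distance scoring), not LOF itself, and the data is made up.

```python
import math

def knn_distance_scores(points, k):
    """Mean distance from each point to its k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]

scores = knn_distance_scores(points, k=2)
outlier = max(range(len(points)), key=lambda i: scores[i])  # index of the highest score
```

Unlike LOF, this score does not adapt to regions of differing density, so it suits data where normal points are roughly evenly spread.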

k-NN with normalization (always!)

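from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling inside the pipeline means the scaler is fit only on the data passed to fit().
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X, y)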
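
The reason for "always" is easy to demonstrate: without scaling, the feature with the largest numeric range decides who the nearest neighbor is. A stdlib-only sketch with hypothetical (age, income) rows:

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]

def nearest(rows, i):
    """Index of the row closest to rows[i], excluding itself."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))

# Hypothetical features: (age in years, income in dollars).
rows = [(25, 50_000), (60, 50_500), (26, 52_000), (45, 80_000)]

raw_nn = nearest(rows, 0)                  # income dominates: picks the 60-year-old
scaled_nn = nearest(standardize(rows), 0)  # both features count: picks the 26-year-old
```

On raw values a 500-dollar income gap outweighs a 35-year age gap. After standardization the 26-year-old with a similar income becomes the nearest neighbor.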

Choose K (CV)

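from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation over odd values of K (odd K avoids tied votes in binary problems).
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 11, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)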
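
What the grid search does can be sketched by hand. The version below uses leave-one-out validation instead of 5-fold for brevity (an assumption made for this example), on invented 1-D data that contains one mislabeled point.

```python
import math
from collections import Counter

def predict(train, query, k):
    """train: list of (point, label). Majority vote among the k nearest points."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict(rest, point, k) == label
    return hits / len(data)

# Two clusters on a line, plus one noisy "b" sitting inside the "a" cluster.
data = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"),
        ((10.0,), "b"), ((11.0,), "b"), ((12.0,), "b"), ((13.0,), "b"),
        ((1.4,), "b")]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 7)}
best_k = max(scores, key=scores.get)
```

K=1 is thrown off by the noisy point (overfitting), K=7 reaches into the other cluster (underfitting), and K=3 scores best on this data.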

RAG retrieval (k-NN over embeddings)

import faiss
from sentence_transformers import SentenceTransformer

m = SentenceTransformer('all-mpnet-base-v2')
doc_embs = m.encode(documents)  # float32 array of shape (n_docs, dim)

index = faiss.IndexFlatIP(doc_embs.shape[1])  # exact inner-product search
faiss.normalize_L2(doc_embs)  # in place, so inner product equals cosine similarity
index.add(doc_embs)

def retrieve(query, k=5):
    """Return the k documents most similar to the query string."""
    q_emb = m.encode([query])
    faiss.normalize_L2(q_emb)
    _, I = index.search(q_emb, k)
    return [documents[i] for i in I[0]]

kNN-LM (LLM augmentation)

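def knn_lm_predict(context, llm, datastore, k=10, lam=0.25):
    """kNN-LM (Khandelwal et al., 2020): interpolate LM and k-NN next-token distributions.

    Pseudocode: `llm` and `datastore` are hypothetical interfaces whose methods
    return probability vectors over the vocabulary.
    """
    p_lm = llm.next_token_probs(context)
    p_knn = datastore.knn_probs(llm.context_embedding(context), k=k)
    return lam * p_knn + (1 - lam) * p_lm  # interpolate probabilities, not logits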
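
The interpolation step can be run on toy numbers. In this sketch the k-NN distribution is a softmax over negative neighbor distances, summed per next token, following the formulation in Khandelwal et al. (2020). The tokens, distances, and LM probabilities are invented, and the function names are chosen for the example.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors):
    """neighbors: list of (distance, next_token). Softmax over -distance, summed per token."""
    weights = [math.exp(-d) for d, _ in neighbors]
    total = sum(weights)
    probs = defaultdict(float)
    for w, (_, token) in zip(weights, neighbors):
        probs[token] += w / total
    return dict(probs)

def interpolate(p_lm, p_knn, lam=0.25):
    """p(y) = lam * p_knn(y) + (1 - lam) * p_lm(y)."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Invented numbers: the LM slightly prefers "Paris", the retrieved neighbors prefer "London".
p_lm = {"Paris": 0.5, "London": 0.4, "Rome": 0.1}
neighbors = [(0.0, "London"), (0.0, "London"), (math.log(2), "Paris")]

p_knn = knn_distribution(neighbors)           # London 0.8, Paris 0.2
p = interpolate(p_lm, p_knn, lam=0.25)
top = max(p, key=p.get)
```

Because both inputs are probability distributions, the mixture still sums to one. Here the retrieved evidence is enough to flip the top prediction from "Paris" to "London".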

Decision criteria

Situation	Approach
Small data	sklearn brute / KD-tree
High-dim	FAISS HNSW
Massive scale	FAISS IVF+PQ
Production search	Pinecone / Weaviate
Anomaly	LOF
RAG	FAISS / vector DB

Defaults: always normalize features, use cosine for embeddings, FAISS HNSW in production, K tuned by cross-validation, and distance-weighted voting.

🔗 Graph

Parent: Machine-Learning · Information Retrieval
Variants: HNSW
Applications: FAISS · RAG
Adjacent: Embeddings

🤖 LLM usage

When to use: as a baseline, for retrieval, for RAG.
When not to use: directly on raw high-dimensional data (embed it first).

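❌ Anti-patterns

No normalization: features with large magnitudes dominate the distance.
Brute force at scale: query latency grows linearly with dataset size.
Wrong K: too small overfits to noise, too large underfits.
No thought given to the metric: picking cosine where L2 is appropriate, or the reverse.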

🧪 Verification / duplicates

Verified (Cover & Hart 1967, FAISS docs, Khandelwal kNN-LM 2020).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: k-NN plus sklearn / FAISS / HNSW / IVF / RAG / kNN-LM code
