| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-k-nearest-neighbors-k-nn | K-Nearest Neighbors (k-NN) | 10_Wiki/Topics | verified | self | | none | A | 0.96 | applied | | | 2026-05-10 | pending | |
K-Nearest Neighbors (k-NN)
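One-line summary
"For each query, find the K closest training points and take their vote (classification) or average (regression)." k-NN is lazy learning: there is no training phase, and all work is deferred to query time. It is a simple but effective baseline. Modern use: it is the backbone of vector search systems (FAISS, Pinecone), and RAG retrieval is ultimately k-NN as well.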
Core concepts
Tasks
- Classification: majority vote among the K neighbors.
- Regression: average of the K neighbors' targets.
- Density estimation.
- Anomaly detection.
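The vote and average rules above can be sketched directly in NumPy. This is a minimal illustration, not the scikit-learn implementation; the function name `knn_predict` and its `task` parameter are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict for one query x from its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    if task == "classification":
        # Majority vote over the neighbors' labels.
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: plain average of the neighbors' targets.
    return float(np.mean(y_train[nearest]))

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_class = np.array([0, 0, 0, 1, 1])
y_reg = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
query = np.array([0.2, 0.2])

label = knn_predict(X_train, y_class, query, k=3)
value = knn_predict(X_train, y_reg, query, k=3, task="regression")
```

Nothing is fitted here: the "model" is the stored training set, which is why the cost shows up at query time.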
Distance metrics
- Euclidean (L2).
- Cosine (text and embeddings).
- Manhattan (L1).
- Hamming (binary vectors).
- Custom (e.g. Mahalanobis).
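As a rough reference, the metrics listed above can be written in a few lines of NumPy. The helper names are illustrative only. Hamming is returned as the fraction of differing positions (an assumption; some libraries return a raw count instead).

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    # 1 - cosine similarity; undefined for zero vectors.
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hamming(a, b):
    # Fraction of positions where the two vectors differ.
    return float(np.mean(a != b))

def mahalanobis(a, b, cov_inv):
    # cov_inv is the inverse covariance matrix of the data.
    diff = a - b
    return float(np.sqrt(diff @ cov_inv @ diff))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
d_l2, d_l1 = euclidean(a, b), manhattan(a, b)

# On unit-length vectors, squared L2 distance equals 2 * cosine distance,
# so both metrics rank neighbors identically after normalization.
u = np.array([1.0, 0.0])
v = np.array([0.6, 0.8])
gap = abs(euclidean(u, v) ** 2 - 2.0 * cosine_distance(u, v))
```

That last identity is why an L2 or inner-product index can serve cosine search once the vectors are normalized.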
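Efficiency
- Brute force: O(N·d) per query (N points, d dimensions).
- KD-tree (low-dimensional data only).
- Ball tree.
- HNSW (graph-based approximate search; available in FAISS).
- IVF (inverted file: search only the nearest clusters).
- PQ (product quantization: compressed vectors).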
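To make the IVF idea concrete, here is a toy sketch in NumPy. It is a simplification under stated assumptions: centroids are random data points rather than k-means centers (which real IVF indexes train), and the names `build_ivf`, `search_ivf`, and `brute_force` are hypothetical.

```python
import numpy as np

def build_ivf(X, nlist, seed=0):
    """Toy inverted file: assign every point to its nearest centroid."""
    rng = np.random.default_rng(seed)
    # Simplification: sample data points as centroids instead of running k-means.
    centroids = X[rng.choice(len(X), size=nlist, replace=False)]
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists

def search_ivf(X, centroids, lists, q, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to q."""
    cell_d2 = ((centroids - q) ** 2).sum(axis=1)
    probe = np.argsort(cell_d2)[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d2 = ((X[cand] - q) ** 2).sum(axis=1)
    return cand[np.argsort(d2)[:k]]

def brute_force(X, q, k):
    """Exact search: compare q against every point."""
    return np.argsort(((X - q) ** 2).sum(axis=1))[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
q = rng.normal(size=8)
centroids, lists = build_ivf(X, nlist=10)

exact = brute_force(X, q, k=5)
approx = search_ivf(X, centroids, lists, q, k=5, nprobe=2)    # fast, may miss neighbors
full = search_ivf(X, centroids, lists, q, k=5, nprobe=10)     # probes every cell
recall = len(set(approx.tolist()) & set(exact.tolist())) / 5
```

Raising `nprobe` trades speed for recall. When every cell is probed, the search degenerates to brute force and returns the exact neighbors.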
Applications
- Image retrieval.
- Recommendation.
- RAG retrieval.
- Anomaly detection.
- Baseline classifier.
💻 Patterns
Basic (sklearn)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)
preds = knn.predict(X_test)
Cosine (for embeddings)
knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
KD-tree (for low-dim)
from sklearn.neighbors import KDTree
tree = KDTree(X)
distances, indices = tree.query(X_query, k=5)
FAISS (large-scale)
import faiss
import numpy as np
d = 768
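index = faiss.IndexFlatIP(d)  # inner product; equals cosine similarity on L2-normalized vectors
faiss.normalize_L2(X)  # in place; X must be a C-contiguous float32 array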
index.add(X)
faiss.normalize_L2(query)
D, I = index.search(query, k=10)
FAISS HNSW (approximate, fast)
index = faiss.IndexHNSWFlat(d, 32)  # M = 32 links per node; pass positionally (the SWIG wrapper rejects keyword arguments)
index.hnsw.efConstruction = 200
index.add(X)
index.hnsw.efSearch = 50
D, I = index.search(query, k=10)
FAISS IVF + PQ (massive scale)
nlist = 100
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each
index.train(X)  # learns the coarse centroids and PQ codebooks
index.add(X)
index.nprobe = 10  # cells probed per query: higher = better recall, slower search
D, I = index.search(query, k=10)
Annoy (alternative)
from annoy import AnnoyIndex
index = AnnoyIndex(d, 'angular')  # angular distance (cosine-based)
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(n_trees=10)
neighbors = index.get_nns_by_vector(query, 10)
Custom distance
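import numpy as np
from sklearn.neighbors import KNeighborsClassifier
def custom_dist(a, b):
    return np.sum(np.abs(a - b))  # Manhattan distance, as an example
# A Python callable metric is much slower than a built-in metric name.
knn = KNeighborsClassifier(n_neighbors=5, metric=custom_dist)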
Weighted by distance
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# closer neighbors get a higher weight in the vote
k-NN regression
from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=5).fit(X, y)
Anomaly detection (LOF)
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
anomalies = lof.fit_predict(X) == -1
k-NN with normalization (always!)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(5))])
pipe.fit(X, y)
Choose K (CV)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 11, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
RAG retrieval (k-NN over embeddings)
from sentence_transformers import SentenceTransformer
m = SentenceTransformer('all-mpnet-base-v2')
doc_embs = m.encode(documents)
import faiss
index = faiss.IndexFlatIP(doc_embs.shape[1])
faiss.normalize_L2(doc_embs)
index.add(doc_embs)
def retrieve(query, k=5):
    q_emb = m.encode([query])
    faiss.normalize_L2(q_emb)
    _, I = index.search(q_emb, k)
    return [documents[i] for i in I[0]]
kNN-LM (LLM augmentation)
def knn_lm_predict(context, llm, datastore, k=10, lam=0.25):
    """kNN-LM (Khandelwal et al., 2020): interpolate LM and k-NN next-token probabilities."""
    # llm and datastore are placeholder interfaces, not a real library API.
    p_lm = llm.next_token_probs(context)
    p_knn = datastore.knn_probs(context_emb=llm.encode(context), k=k)
    return lam * p_knn + (1 - lam) * p_lm  # interpolate distributions, not logits
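In Khandelwal et al. (2020), the kNN-LM interpolation is defined over next-token probabilities rather than raw logits: the retrieved neighbors form a distribution via a softmax over negative distances, which is then mixed with the LM distribution using a weight λ. Below is a minimal NumPy sketch of that step. The function name and argument layout are assumptions for illustration, and retrieval itself is left out.

```python
import numpy as np

def knn_lm_interpolate(p_lm, neighbor_dists, neighbor_tokens, lam=0.25):
    """Mix an LM next-token distribution with a k-NN distribution.

    p_lm: LM probabilities over the vocabulary (sums to 1).
    neighbor_dists: distances from the query context to each retrieved key.
    neighbor_tokens: target token id stored with each retrieved key.
    """
    p_lm = np.asarray(p_lm, dtype=float)
    d = np.asarray(neighbor_dists, dtype=float)
    w = np.exp(-(d - d.min()))            # softmax over negative distances (stabilized)
    w /= w.sum()
    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, np.asarray(neighbor_tokens), w)  # sum weights per target token
    return lam * p_knn + (1.0 - lam) * p_lm

p = knn_lm_interpolate([0.5, 0.3, 0.2], neighbor_dists=[0.0, 0.0],
                       neighbor_tokens=[2, 2], lam=0.25)
```

Because both inputs are probability distributions, the result is one too. In this toy call, both neighbors vote for token 2, which shifts probability mass toward it.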
Decision criteria
| Situation | Approach |
|---|---|
| Small data | sklearn brute / KD-tree |
| High-dim | FAISS HNSW |
| Massive scale | FAISS IVF+PQ |
| Production search | Pinecone / Weaviate |
| Anomaly | LOF |
| RAG | FAISS / vector DB |
Defaults: always normalize, use cosine for embeddings, use FAISS HNSW in production, tune K by cross-validation, and weight votes by distance.
🔗 Graph
- Parents: Machine-Learning · Information Retrieval
- Variants: HNSW
- Applications: FAISS · RAG
- Adjacent: Embeddings
🤖 LLM usage
When to use: baselines, retrieval, RAG. When not to use: high-dimensional raw data (embed it first).
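❌ Anti-patterns
- No normalization: large-magnitude features dominate the distance.
- Brute force at scale: query latency grows linearly with dataset size.
- Wrong K: too small overfits to noise, too large underfits.
- No thought given to the metric: choosing cosine where L2 is needed, or the reverse.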
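The first anti-pattern is easy to reproduce. In this toy example (made-up data, NumPy only), the second feature is measured in much larger units than the first, so it dictates the nearest neighbor until both features are standardized.

```python
import numpy as np

# Feature 0 lives in [0, 1]; feature 1 is in the thousands.
X = np.array([[1.0, 1000.0],
              [0.0, 1100.0],
              [0.5, 3000.0],
              [0.2, 5000.0]])
q = np.array([0.0, 1000.0])

def nearest(X, q):
    """Index of the Euclidean nearest neighbor of q in X."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

raw_nn = nearest(X, q)

# Standardize each feature to zero mean and unit variance (what StandardScaler does).
mu, sigma = X.mean(axis=0), X.std(axis=0)
scaled_nn = nearest((X - mu) / sigma, (q - mu) / sigma)
```

On the raw data, a gap of 100 in feature 1 outweighs the full range of feature 0, so point 0 wins. After scaling, that gap is small relative to feature 1's spread, and point 1, which matches the query exactly on feature 0, becomes the nearest neighbor.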
🧪 Verification / duplicates
- Verified (Cover & Hart 1967, FAISS docs, Khandelwal kNN-LM 2020).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: k-NN + sklearn / FAISS / HNSW / IVF / RAG / kNN-LM code |