2nd/10_Wiki/Topics/AI_and_ML/K-Nearest-Neighbors-K-NN.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
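Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections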

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-k-nearest-neighbors-k-nn
title: K-Nearest Neighbors (k-NN)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: k-NN, kNN, nearest neighbor, lazy learning, FAISS, instance-based
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: machine-learning, knn, classification, regression, faiss, retrieval
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: scikit-learn / FAISS / Annoy

K-Nearest Neighbors (k-NN)

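In one line

For each query, find the K closest training points and take their majority vote (classification) or average (regression). It is lazy learning: there is no training phase, only stored examples. A simple but effective baseline. In modern systems, nearest-neighbor search is the backbone of vector databases and libraries (FAISS, Pinecone), and RAG retrieval is ultimately k-NN as well.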

Core concepts

Tasks

  • Classification: majority vote.
  • Regression: average.
  • Density estimation.
  • Anomaly detection.
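
A minimal from-scratch sketch of the classification and regression tasks listed above, assuming Euclidean distance and a brute-force scan. The function name `knn_predict` and the toy arrays are illustrative and not part of any library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point by brute-force k-NN (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    if task == "classification":
        # Majority vote over the neighbors' labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' target values
    return float(np.mean(y_train[nearest]))

# Toy data: three points near the origin, two points near (5, 5)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_class = np.array([0, 0, 0, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

label = knn_predict(X_train, y_class, np.array([0.1, 0.1]), k=3)
value = knn_predict(X_train, y_value, np.array([0.1, 0.1]), k=3, task="regression")
```

Nothing is learned ahead of time: all the work happens at query time, which is why k-NN is called lazy.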

Distance metrics

  • Euclidean (L2).
  • Cosine (text/embeddings).
  • Manhattan (L1).
  • Hamming (binary).
  • Custom (e.g., Mahalanobis).
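
The metrics above can be written out in a few lines of NumPy. This is a sketch on made-up vectors. Hamming is shown as the fraction of differing positions, and the Mahalanobis covariance is estimated from random toy data generated purely for illustration.

```python
import numpy as np

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))                              # L2
manhattan = np.sum(np.abs(a - b))                                      # L1
cosine_dist = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # 1 - cosine similarity
hamming = np.mean(np.array([1, 0, 1, 1]) != np.array([1, 1, 0, 1]))    # fraction of differing bits

# Mahalanobis: Euclidean distance after accounting for the data covariance.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])        # toy data, very different scales
VI = np.linalg.inv(np.cov(data, rowvar=False))                         # inverse covariance matrix
diff = a - b
mahalanobis = np.sqrt(diff @ VI @ diff)
```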

Efficiency

  • Brute force: O(N·d) per query.
  • KD-tree (low-dimensional data).
  • Ball tree.
  • HNSW (graph-based approximate search; available in FAISS).
  • IVF (inverted file).
  • PQ (product quantization).
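
To make the brute-force versus IVF trade-off concrete, here is a toy NumPy sketch. It is not how FAISS implements IVF: the coarse centroids are simply sampled data points rather than trained with k-means, and every name (`ivf_search`, `nprobe`, `nlist`) is chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16)).astype(np.float32)   # toy database
q = rng.normal(size=16).astype(np.float32)           # toy query
k, nlist = 5, 20

def sq_dists(A, v):
    """Squared Euclidean distance from every row of A to v."""
    return np.sum((A - v) ** 2, axis=1)

# Brute force: scan all N vectors, then keep the k smallest distances.
exact = np.argsort(sq_dists(X, q))[:k]

# IVF-style index: assign each vector to its nearest coarse centroid.
centroids = X[rng.choice(len(X), size=nlist, replace=False)]   # untrained centroids (assumption)
assign = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(q, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = np.argsort(sq_dists(centroids, q))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, probe))
    return cand[np.argsort(sq_dists(X[cand], q))[:k]], len(cand)

approx, scanned = ivf_search(q, nprobe=4)             # scans a fraction of the data
full, scanned_full = ivf_search(q, nprobe=nlist)      # probing every list is exact again
recall = len(set(exact) & set(approx)) / k
```

Raising `nprobe` scans more candidates and moves the result toward the exact answer, which is the recall versus latency knob that IVF indexes expose.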

Applications

  1. Image retrieval.
  2. Recommendation.
  3. RAG retrieval.
  4. Anomaly detection.
  5. Baseline classifier.

💻 Patterns

Basic (sklearn)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

Cosine (for embeddings)

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')  # cosine is not supported by the tree indexes, so scikit-learn uses brute-force search

KD-tree (for low-dim)

from sklearn.neighbors import KDTree
tree = KDTree(X)
distances, indices = tree.query(X_query, k=5)

FAISS (large-scale)

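import faiss
import numpy as np

d = 768
# FAISS expects C-contiguous float32 arrays; X is (n, d), query is (n_queries, d)
X = np.ascontiguousarray(X, dtype=np.float32)
query = np.ascontiguousarray(query, dtype=np.float32)

index = faiss.IndexFlatIP(d)  # exact inner-product search
faiss.normalize_L2(X)  # in place; inner product on unit vectors equals cosine similarity
index.add(X)

faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # D: similarities, I: row indices into X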
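
The FAISS snippet above normalizes vectors before an inner-product search. The NumPy check below, on random toy vectors, illustrates why that works: on unit vectors the inner product is the cosine similarity, and squared L2 distance equals 2 − 2·cos, so both give the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # toy database vectors
q = rng.normal(size=(1, 8))        # toy query

# L2-normalize each row (the same operation faiss.normalize_L2 applies in place)
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
qn = q / np.linalg.norm(q, axis=1, keepdims=True)

ip = (Xn @ qn.T).ravel()                                              # inner product on unit vectors
cos = (X @ q.T).ravel() / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
l2_sq = np.sum((Xn - qn) ** 2, axis=1)                                # squared L2 on unit vectors

top_by_ip = np.argsort(-ip)[:10]      # highest similarity first
top_by_l2 = np.argsort(l2_sq)[:10]    # smallest distance first
```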

FAISS HNSW (approximate, fast)

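index = faiss.IndexHNSWFlat(d, 32)  # M = 32 links per node, passed positionally (the SWIG constructor does not take keywords)
index.hnsw.efConstruction = 200  # build-time candidate list size: higher gives a better graph but a slower build
index.add(X)
index.hnsw.efSearch = 50  # query-time candidate list size: higher gives better recall but slower search
D, I = index.search(query, k=10)  # default metric is L2; on normalized vectors the ranking matches cosine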

FAISS IVF + PQ (massive scale)

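nlist = 100  # number of coarse clusters (inverted lists)
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each; d must be divisible by 8
index.train(X)  # learns the coarse centroids and PQ codebooks from a representative sample
index.add(X)
index.nprobe = 10  # clusters scanned per query: higher gives better recall but slower search
D, I = index.search(query, k=10)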
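
The `8, 8` arguments above determine how small each stored vector becomes. A back-of-the-envelope calculation, assuming float32 input and a hypothetical collection of one million vectors, and counting only the PQ codes (not vector IDs or codebooks):

```python
d = 768            # vector dimension used in the snippets above
m, nbits = 8, 8    # PQ: 8 sub-quantizers, 8 bits per code
n = 1_000_000      # hypothetical collection size

flat_bytes = d * 4                  # float32 storage per vector in a flat index
pq_bytes = m * nbits // 8           # one 8-bit code per sub-vector
sub_dim = d // m                    # dimensions covered by each sub-quantizer
centroids_per_sub = 2 ** nbits      # codebook size per sub-quantizer

flat_gb = n * flat_bytes / 1e9      # about 3.07 GB uncompressed
pq_mb = n * pq_bytes / 1e6          # about 8 MB of PQ codes
compression = flat_bytes // pq_bytes
```

The 384x reduction is paid for in accuracy: distances are computed against quantized codes, so results are approximate.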

Annoy (alternative)

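from annoy import AnnoyIndex
index = AnnoyIndex(d, 'angular')  # angular distance, which ranks neighbors the same way as cosine
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(n_trees=10)  # more trees improve recall at the cost of a larger index
neighbors = index.get_nns_by_vector(query, 10)  # query is a single vector of length d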

Custom distance

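import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def custom_dist(a, b):
    return np.sum(np.abs(a - b))  # Manhattan (L1) distance between two 1-D vectors

# A Python callable is much slower than a built-in metric string such as 'manhattan'
knn = KNeighborsClassifier(n_neighbors=5, metric=custom_dist)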

Weighted by distance

knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Each neighbor's vote is weighted by the inverse of its distance, so closer neighbors count more
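
A small worked example of how distance weighting can change the outcome, assuming inverse-distance weights. The helper `weighted_vote` is illustrative, and the zero-distance guard is a simplification rather than scikit-learn's exact handling.

```python
import numpy as np

def weighted_vote(dists, labels):
    """Sum inverse-distance weights per class and return the heaviest class."""
    weights = 1.0 / np.maximum(dists, 1e-12)    # guard against division by zero
    classes = np.unique(labels)
    scores = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(scores)]

dists = np.array([0.1, 1.0, 1.0])     # one very close neighbor, two distant ones
labels = np.array([1, 0, 0])

uniform = np.bincount(labels).argmax()    # plain majority vote: class 0 wins 2 to 1
weighted = weighted_vote(dists, labels)   # class 1 wins with weight 10 against 1 + 1
```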

k-NN regression

from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=5).fit(X, y)

Anomaly detection (LOF)

from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
anomalies = lof.fit_predict(X) == -1  # fit_predict labels outliers -1 and inliers 1
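
For intuition, a simpler relative of LOF can be sketched directly: score each point by its mean distance to its k nearest neighbors. This is not the LOF algorithm (LOF compares local densities between a point and its neighbors), and the data and planted outlier are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 points around the origin plus one planted outlier in the last row
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), [[8.0, 8.0]]])

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
np.fill_diagonal(D, np.inf)                                  # ignore each point's distance to itself
k = 5
knn_score = np.sort(D, axis=1)[:, :k].mean(axis=1)           # mean distance to the k nearest neighbors
most_anomalous = int(np.argmax(knn_score))
```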

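k-NN with feature normalization (always!)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Scaling inside the pipeline means the scaler is fit on the training data only
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X, y)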
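
Why scaling matters, shown on hypothetical toy data: without it, the feature with the largest numeric range decides who the nearest neighbor is. The standardization below uses the per-column mean and population standard deviation, which is what `StandardScaler` estimates.

```python
import numpy as np

# Columns: age in years, income in dollars (made-up values)
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 52_000.0]])
query = np.array([27.0, 50_400.0])

mu, sigma = X.mean(axis=0), X.std(axis=0)

# Raw features: income differences swamp age differences
raw_nn = int(np.argmin(np.linalg.norm(X - query, axis=1)))

# Standardized features: both columns contribute on a comparable scale
Xs, qs = (X - mu) / sigma, (query - mu) / sigma
scaled_nn = int(np.argmin(np.linalg.norm(Xs - qs, axis=1)))
```

On raw features the nearest neighbor is the 60-year-old (row 1) because the incomes are close. After scaling it is the 25-year-old (row 0).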

Choose K (CV)

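from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 11, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the K with the best mean cross-validated accuracy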
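
The same idea can be sketched without scikit-learn using leave-one-out validation instead of 5-fold CV. The data are two synthetic Gaussian blobs, and the candidate list and helper name `loo_accuracy` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (toy data)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)), rng.normal(1.5, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # all pairwise distances
np.fill_diagonal(D, np.inf)                                  # a point may not vote for itself
order = np.argsort(D, axis=1)                                # each row: neighbors sorted by distance

def loo_accuracy(k):
    """Leave-one-out accuracy of majority-vote k-NN (odd k avoids ties with 2 classes)."""
    votes = y[order[:, :k]].mean(axis=1)       # fraction of the k neighbors labeled 1
    return float(np.mean((votes > 0.5) == y))

candidates = [1, 3, 5, 7, 11, 15]
scores = {k: loo_accuracy(k) for k in candidates}
best_k = max(scores, key=scores.get)
```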

RAG retrieval (k-NN over embeddings)

import faiss
from sentence_transformers import SentenceTransformer

m = SentenceTransformer('all-mpnet-base-v2')
doc_embs = m.encode(documents)  # documents: list of strings; returns an array of shape (n_docs, dim)

index = faiss.IndexFlatIP(doc_embs.shape[1])
faiss.normalize_L2(doc_embs)  # unit vectors, so inner product is cosine similarity
index.add(doc_embs)

def retrieve(query, k=5):
    q_emb = m.encode([query])
    faiss.normalize_L2(q_emb)
    _, I = index.search(q_emb, k)
    return [documents[i] for i in I[0] if i != -1]  # FAISS pads with -1 when fewer than k results exist

kNN-LM (LLM augmentation)

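from scipy.special import softmax

def knn_lm_predict(context, llm, datastore, k=10, lam=0.25):
    """kNN-LM (Khandelwal et al., 2020). `llm` and `datastore` are placeholder interfaces."""
    p_lm = softmax(llm.next_token_logits(context))
    # Query the datastore with the LM's representation of the context (hypothetical methods)
    p_knn = datastore.knn_probs(context_emb=llm.encode(context), k=k)
    return lam * p_knn + (1 - lam) * p_lm  # interpolate probabilities, not logits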
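
A self-contained toy version of the kNN-LM interpolation, p = λ·p_kNN + (1 − λ)·p_LM, where p_kNN is a softmax over negative distances to the retrieved keys, aggregated per next token. The datastore, the vocabulary size of 4, and the stand-in LM logits are all invented for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def knn_distribution(query, keys, next_tokens, vocab_size, k=3):
    """p_kNN: softmax over negative squared distances, summed per target token."""
    d2 = np.sum((keys - query) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    w = softmax(-d2[nearest])
    p = np.zeros(vocab_size)
    np.add.at(p, next_tokens[nearest], w)    # neighbors sharing a token pool their mass
    return p

# Toy datastore: context embedding -> token that followed it (hypothetical values)
keys = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [0.0, 0.2]])
next_tokens = np.array([2, 2, 0, 1])
vocab_size, lam = 4, 0.25

p_lm = softmax(np.array([2.0, 1.0, 0.5, 0.0]))     # stand-in for the LM's next-token distribution
p_knn = knn_distribution(np.array([0.05, 0.0]), keys, next_tokens, vocab_size)
p_final = lam * p_knn + (1.0 - lam) * p_lm
```

Because both inputs are probability distributions, the mixture is one too. Here the retrieved neighbors raise the probability of token 2 relative to the LM alone.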

Decision criteria

| Situation | Approach |
|---|---|
| Small data | sklearn brute force / KD-tree |
| High-dimensional | FAISS HNSW |
| Massive scale | FAISS IVF+PQ |
| Production search | Pinecone / Weaviate |
| Anomaly detection | LOF |
| RAG | FAISS / vector DB |

Defaults: always normalize features, use cosine for embeddings, FAISS HNSW in production, a CV-tuned K, and distance-weighted voting.

🔗 Graph

🤖 LLM usage

When to use: baselines, retrieval, RAG. When not to use: raw high-dimensional inputs (embed them first).

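Anti-patterns

  • No normalization: features with large magnitudes dominate the distance.
  • Brute force at scale: query latency grows linearly with the dataset size.
  • Wrong K: too small overfits to noise, too large underfits.
  • No thought given to the metric: picking cosine where L2 is appropriate, or the reverse.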

🧪 Verification / Duplicates

  • Verified (Cover & Hart 1967, FAISS docs, Khandelwal kNN-LM 2020).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: k-NN + sklearn / FAISS / HNSW / IVF / RAG / kNN-LM code |