id: wiki-2026-0508-k-nearest-neighbors-k-nn
title: K-Nearest Neighbors (k-NN)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: k-NN, kNN, nearest neighbor, lazy learning, FAISS, instance-based
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: machine-learning, knn, classification, regression, faiss, retrieval
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: scikit-learn / FAISS / Annoy

K-Nearest Neighbors (k-NN)

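One-line summary

"For each query, find the K closest training points and take their majority vote (classification) or average (regression)." It is lazy learning: there is no training phase, only stored examples. A simple but effective baseline. In modern systems it is the backbone of vector search (FAISS, Pinecone), and RAG retrieval is ultimately k-NN as well.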

Key points

Tasks

  • Classification: majority vote among the K neighbors.
  • Regression: average of the K neighbors' targets.
  • Density estimation.
  • Anomaly detection.
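
The first two tasks can be sketched without any library. This is a minimal illustration using only the standard library and Euclidean distance; the toy points, labels, and targets are made up, and the helper names (`k_nearest`, `knn_classify`, `knn_regress`) are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return order[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean of the targets of the k nearest neighbors."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / len(idx)

# Toy data (illustrative only): two well-separated clusters in 2-D.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]

label = knn_classify(X, labels, (0.3, 0.3), k=3)   # query sits in the "a" cluster
value = knn_regress(X, targets, (5.0, 5.1), k=3)   # query sits in the second cluster
print(label, value)
```

There is no `fit` step: all work happens at query time, which is what "lazy learning" refers to.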

Distance metrics

  • Euclidean (L2).
  • Cosine (text / embeddings).
  • Manhattan (L1).
  • Hamming (binary vectors).
  • Custom (e.g., Mahalanobis).
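
As a rough sketch, the first four metrics fit in a few lines of plain Python (Mahalanobis is left out because it needs a covariance matrix). The vectors are made-up examples. The last lines check a standard identity: for unit-length vectors, squared L2 distance equals twice the cosine distance, so both give the same neighbor ranking.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def unit(v):
    """Scale v to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
d_l2 = euclidean(a, b)
d_l1 = manhattan(a, b)
d_cos = cosine_distance(a, b)
d_ham = hamming([1, 0, 1, 1], [1, 1, 0, 1])

# On unit vectors: ||ua - ub||^2 == 2 * (1 - cos(ua, ub)), up to rounding.
ua, ub = unit(a), unit(b)
identity_gap = abs(euclidean(ua, ub) ** 2 - 2 * cosine_distance(ua, ub))
print(d_l2, d_l1, d_cos, d_ham, identity_gap)
```

This identity is why normalizing embeddings first lets an L2 or inner-product index stand in for cosine similarity.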

Efficiency

  • Brute force: O(N·d) per query.
  • KD-tree (low-dimensional data).
  • Ball tree.
  • HNSW (graph-based approximate search; available in FAISS).
  • IVF (inverted file index).
  • PQ (product quantization).
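
A back-of-the-envelope sketch of why IVF and PQ matter at scale. The dimension, PQ settings, `nlist`, and `nprobe` mirror the FAISS snippets in this note; the corpus size of one million vectors is an assumption for illustration. The formulas are deliberate simplifications: PQ codebooks, vector ids, and uneven list sizes are ignored.

```python
def flat_bytes(n, d, bytes_per_value=4):
    """Memory for n raw float32 vectors of dimension d."""
    return n * d * bytes_per_value

def pq_bytes(n, m, nbits=8):
    """Memory for n PQ codes: m sub-quantizers at nbits each (codebooks excluded)."""
    return n * m * nbits // 8

def ivf_vectors_scanned(n, nlist, nprobe):
    """Rough distance computations per query, assuming evenly filled lists:
    compare against all nlist centroids, then scan nprobe lists."""
    return nlist + (n // nlist) * nprobe

n = 1_000_000            # hypothetical corpus size
d, m, nbits = 768, 8, 8  # as in the IVF+PQ snippet
nlist, nprobe = 100, 10

raw = flat_bytes(n, d)
codes = pq_bytes(n, m, nbits)
scanned = ivf_vectors_scanned(n, nlist, nprobe)

print(f"raw vectors: {raw / 1e9:.2f} GB, PQ codes: {codes / 1e6:.2f} MB")
print(f"compression ratio: {raw // codes}x")
print(f"brute force scans {n} vectors, IVF scans about {scanned}")
```

The price of both savings is approximation: PQ distances are estimated from compressed codes, and IVF can miss neighbors that fall in lists it did not probe.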

Applications

  1. Image retrieval.
  2. Recommendation.
  3. RAG retrieval.
  4. Anomaly detection.
  5. Baseline classifier.

💻 Patterns

Basic (sklearn)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

Cosine (for embeddings)

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')

KD-tree (for low-dim)

from sklearn.neighbors import KDTree
tree = KDTree(X)
distances, indices = tree.query(X_query, k=5)

FAISS (large-scale)

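import faiss
import numpy as np

d = 768                                          # embedding dimension
X = np.ascontiguousarray(X, dtype=np.float32)    # FAISS expects contiguous float32, shape (N, d)
index = faiss.IndexFlatIP(d)                     # exact inner-product search
faiss.normalize_L2(X)                            # in place; on unit vectors, inner product = cosine similarity
index.add(X)

query = np.ascontiguousarray(query, dtype=np.float32)  # shape (n_queries, d)
faiss.normalize_L2(query)
D, I = index.search(query, k=10)                 # D: similarities, I: neighbor indices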

FAISS HNSW (approximate, fast)

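index = faiss.IndexHNSWFlat(d, 32)   # M = 32 links per node; pass it positionally (the SWIG constructor rejects keywords)
index.hnsw.efConstruction = 200      # build-time candidate list size: higher = better graph, slower build
index.add(X)
index.hnsw.efSearch = 50             # query-time candidate list size: higher = better recall, slower search
D, I = index.search(query, k=10)     # default metric is L2; on normalized vectors it ranks like cosine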
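
Because HNSW is approximate, it is worth measuring how many true neighbors it actually returns. A minimal sketch of recall@k follows: in practice `exact_ids` would be the `I` array from a flat (exact) index and `approx_ids` the `I` array from the HNSW index for the same queries. The small lists here are made up, and `recall_at_k` is a hypothetical helper, not a FAISS function.

```python
def recall_at_k(exact_ids, approx_ids):
    """Mean fraction of the true k neighbors that the approximate search returned."""
    hits = 0
    total = 0
    for exact_row, approx_row in zip(exact_ids, approx_ids):
        hits += len(set(exact_row) & set(approx_row))  # order within the top k is ignored
        total += len(exact_row)
    return hits / total

# Two queries, k = 4 (illustrative ids only).
exact = [[0, 1, 2, 3], [4, 5, 6, 7]]
approx = [[0, 1, 2, 9], [4, 5, 7, 6]]   # first query misses id 3

recall = recall_at_k(exact, approx)
print(f"recall@4 = {recall:.3f}")
```

Raising `efSearch` typically trades query latency for higher recall, so this number is the one to watch when tuning it.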

FAISS IVF + PQ (massive scale)

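nlist = 100                                          # number of coarse clusters (inverted lists)
quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer that assigns vectors to clusters
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each
index.train(X)                                       # learns centroids and PQ codebooks; required before add
index.add(X)
index.nprobe = 10                                    # clusters scanned per query: higher = better recall, slower
D, I = index.search(query, k=10)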

Annoy (alternative)

from annoy import AnnoyIndex
index = AnnoyIndex(d, 'angular')  # angular distance, which ranks neighbors like cosine
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(n_trees=10)
neighbors = index.get_nns_by_vector(query, 10)

Custom distance

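import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def custom_dist(a, b):
    # Manhattan (L1) distance; any callable taking two 1-D arrays works
    return np.sum(np.abs(a - b))

# A Python callable is much slower than a built-in metric name such as 'manhattan'
knn = KNeighborsClassifier(n_neighbors=5, metric=custom_dist)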

Weighted by distance

knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Closer neighbors get a higher weight in the vote (weight = 1 / distance)
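
To make the effect concrete, here is a small standard-library sketch in which inverse-distance weighting overturns a plain majority. The neighbor list is made up, and `majority_vote` / `weighted_vote` are hypothetical helpers that mimic, rather than call, scikit-learn's behavior.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label). Every neighbor counts equally."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors, eps=1e-12):
    """Each neighbor votes with weight 1 / distance (eps guards against zero distance)."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# Five nearest neighbors of some query: two very close "cat", three farther "dog".
neighbors = [(0.1, "cat"), (0.2, "cat"), (1.0, "dog"), (1.1, "dog"), (1.2, "dog")]

uniform = majority_vote(neighbors)    # 3 dog votes beat 2 cat votes
weighted = weighted_vote(neighbors)   # cat: 10 + 5 = 15, dog: about 2.74
print(uniform, weighted)
```

Weighting makes the prediction less sensitive to the exact choice of K, since distant neighbors contribute little.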

k-NN regression

from sklearn.neighbors import KNeighborsRegressor
knr = KNeighborsRegressor(n_neighbors=5).fit(X, y)

Anomaly detection (LOF)

from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20)
anomalies = lof.fit_predict(X) == -1

k-NN with normalization (always!)

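from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling inside the pipeline means the scaler is fit on training data only
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X, y)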
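
Why "always": a feature measured in large units swamps the others in a Euclidean distance. The standard-library sketch below uses made-up (age, income) rows and population standard deviation, which is what `StandardScaler` uses; `standardize` and `nearest` are hypothetical helpers.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column: subtract the mean, divide by the population std."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

def nearest(rows, q_idx):
    """Index of the row closest to rows[q_idx], excluding itself."""
    others = [i for i in range(len(rows)) if i != q_idx]
    return min(others, key=lambda i: math.dist(rows[i], rows[q_idx]))

# (age in years, income in dollars), illustrative values only.
people = [(25, 50_000), (60, 50_500), (26, 52_000), (40, 90_000)]

raw_nn = nearest(people, 0)                  # income dominates: picks the 60-year-old
scaled_nn = nearest(standardize(people), 0)  # both features count: picks the 26-year-old
print(raw_nn, scaled_nn)
```

On the raw data the 35-year age gap is invisible next to a 500-dollar income gap; after scaling both features are on comparable footing.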

Choose K (CV)

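from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation over odd K values (odd K avoids ties in binary problems)
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 11, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)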
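
The same idea can be sketched by hand. The example below swaps 5-fold CV for leave-one-out CV (simpler to write without a library) on a made-up 1-D dataset with one mislabeled point in each cluster. It is an illustration of the selection principle, not what `GridSearchCV` does internally.

```python
from collections import Counter

def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy of k-NN on 1-D data: predict each point from all the others."""
    correct = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        others = [j for j in range(len(xs)) if j != i]
        nearest = sorted(others, key=lambda j: abs(xs[j] - x))[:k]
        pred = Counter(ys[j] for j in nearest).most_common(1)[0][0]
        correct += pred == y
    return correct / len(xs)

# Two clusters; 2.9 and 12.9 carry the "wrong" label to simulate label noise.
xs = [0.0, 1.1, 2.3, 2.9, 3.6, 5.0, 10.0, 11.1, 12.3, 12.9, 13.6, 15.0]
ys = ["a", "a", "a", "b", "a", "a", "b", "b", "b", "a", "b", "b"]

scores = {k: loo_accuracy(xs, ys, k) for k in (1, 3, 11)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

K = 1 chases the noisy points, K = 11 averages over the entire dataset and loses the cluster structure, and the middle value wins.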

RAG retrieval (k-NN over embeddings)

from sentence_transformers import SentenceTransformer
m = SentenceTransformer('all-mpnet-base-v2')
doc_embs = m.encode(documents)

import faiss
index = faiss.IndexFlatIP(doc_embs.shape[1])
faiss.normalize_L2(doc_embs)
index.add(doc_embs)

def retrieve(query, k=5):
    q_emb = m.encode([query])
    faiss.normalize_L2(q_emb)
    _, I = index.search(q_emb, k)
    return [documents[i] for i in I[0] if i != -1]  # FAISS pads with -1 when fewer than k results exist

kNN-LM (LLM augmentation)

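def knn_lm_predict(context, llm, datastore, k=10, lam=0.25):
    """Interpolate the LM and k-NN next-token distributions (Khandelwal et al., 2020).

    `llm` and `datastore` are placeholder objects, not a real library API.
    """
    p_lm = llm.next_token_probs(context)  # softmax output of the base LM
    p_knn = datastore.knn_probs(context_emb=context.encode(), k=k)  # distribution over the neighbors' next tokens
    return lam * p_knn + (1 - lam) * p_lm  # kNN-LM mixes probabilities, not logits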
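
A self-contained sketch of the interpolation step, assuming the kNN-LM formulation in which the neighbor distribution is a softmax over negative distances, summed per next token, and then mixed with the LM distribution as λ·p_kNN + (1 − λ)·p_LM. The tokens, distances, and LM probabilities are made up; the 0.25 weight is the one used in the snippet above.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors):
    """neighbors: list of (distance, next_token).
    Softmax over negative distances, aggregated per token."""
    weights = [math.exp(-dist) for dist, _ in neighbors]
    z = sum(weights)
    probs = defaultdict(float)
    for w, (_, token) in zip(weights, neighbors):
        probs[token] += w / z
    return dict(probs)

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# The LM is undecided; the retrieved neighbors mostly continue with "Paris".
p_lm = {"Paris": 0.5, "London": 0.5}
neighbors = [(0.0, "Paris"), (0.0, "Paris"), (0.0, "London")]

p_knn = knn_distribution(neighbors)   # Paris 2/3, London 1/3
p_final = interpolate(p_lm, p_knn)
print(p_final)
```

Because both inputs are proper distributions, the mixture is one too, which is not guaranteed when raw logits are added together.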

Decision criteria

| Situation | Approach |
| --- | --- |
| Small data | sklearn brute force / KD-tree |
| High-dimensional data | FAISS HNSW |
| Massive scale | FAISS IVF+PQ |
| Production search | Pinecone / Weaviate |
| Anomaly detection | LOF |
| RAG | FAISS / vector DB |

Defaults: always normalize, use cosine for embeddings, FAISS HNSW in production, a CV-tuned K, and distance-weighted voting.

🔗 Graph

🤖 LLM usage

When to use: as a baseline, for retrieval, for RAG. When not to use: on raw high-dimensional data (embed it first).

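Anti-patterns

  • No normalization: large-magnitude features dominate the distance.
  • Brute force at scale: query latency grows linearly with the dataset.
  • Wrong K: too small overfits to noise, too large underfits.
  • Metric chosen without thought: cosine used where L2 fits, or the reverse.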

🧪 Verification / duplicates

  • Verified against Cover & Hart (1967), the FAISS docs, and Khandelwal et al., kNN-LM (2020).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: k-NN plus sklearn / FAISS / HNSW / IVF / RAG / kNN-LM code |