---
id: wiki-2026-0508-k-nearest-neighbors-k-nn
title: K-Nearest Neighbors (k-NN)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [k-NN, kNN, nearest neighbor, lazy learning, FAISS, instance-based]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [machine-learning, knn, classification, regression, faiss, retrieval]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / FAISS / Annoy
---

# K-Nearest Neighbors (k-NN)

## One-line summary

> **"For each query, find the K closest training points and take their vote or average."** A lazy learner: there is no training phase beyond storing (or indexing) the data. A simple but effective baseline. In modern systems it is the backbone of vector databases and similarity-search libraries (Pinecone, FAISS). RAG retrieval is ultimately k-NN as well.

## Core concepts

### Tasks

- **Classification**: majority vote among the K neighbors.
- **Regression**: average of the K neighbors' targets.
- **Density estimation**.
- **Anomaly detection**.

### Distance metrics

- **Euclidean** (L2).
- **Cosine** (text and embeddings).
- **Manhattan** (L1).
- **Hamming** (binary vectors).
- **Custom** (e.g. Mahalanobis).

### Search efficiency

- **Brute force**: exact; O(N) distance computations per query, each costing O(d).
- **KD-tree**: exact; effective in low dimensions.
- **Ball tree**.
- **HNSW**: graph-based approximate search (available in FAISS).
- **IVF**: inverted file index.
- **PQ**: product quantization.

### Applications

1. **Image retrieval**.
2. **Recommendation**.
3. **RAG retrieval**.
4. **Anomaly detection**.
5. **Baseline classifier**.

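The core idea above (find the K closest training points, then vote or average) can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn or FAISS implement it: the helper name `knn_predict`, its `task` parameter, and the toy data are assumptions made for this example, and it uses brute-force Euclidean distance only.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Brute-force k-NN for a single query point (hypothetical helper)."""
    # Euclidean (L2) distance from the query to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    if task == "regression":
        # Regression: average the neighbors' target values.
        return float(np.mean(y_train[nearest]))
    # Classification: majority vote over the neighbors' labels.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Toy data (assumed for illustration): two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

label = knn_predict(X_train, y_class, np.array([0.1, 0.1]), k=3)
value = knn_predict(X_train, y_value, np.array([5.0, 5.0]), k=3,
                    task="regression")
```

Every query scans all N training points, which is why the tree-based and approximate indexes listed under search efficiency matter once N grows.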
## 💻 Patterns

### Basic (sklearn)

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='euclidean')
knn.fit(X_train, y_train)  # lazy learner: fit only stores/indexes the data
preds = knn.predict(X_test)
```

### Cosine (for embeddings)

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, metric='cosine')
```

### KD-tree (for low-dim)

```python
from sklearn.neighbors import KDTree

tree = KDTree(X)
distances, indices = tree.query(X_query, k=5)
```

### FAISS (large-scale)

```python
import faiss
import numpy as np

d = 768
index = faiss.IndexFlatIP(d)  # inner product; equals cosine on unit vectors
# X and query must be float32, C-contiguous arrays of shape (n, d)
faiss.normalize_L2(X)  # in-place L2 normalization
index.add(X)
faiss.normalize_L2(query)
D, I = index.search(query, k=10)  # D: similarities, I: neighbor indices
```

### FAISS HNSW (approximate, fast)

```python
index = faiss.IndexHNSWFlat(d, 32)  # M = 32 links per node (positional argument)
index.hnsw.efConstruction = 200  # build-time quality/speed trade-off
index.add(X)
index.hnsw.efSearch = 50  # query-time recall/speed trade-off
D, I = index.search(query, k=10)
```

### FAISS IVF + PQ (massive scale)

```python
nlist = 100  # number of inverted lists (coarse clusters)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, 8, 8)  # 8 sub-quantizers, 8 bits each
index.train(X)  # IVF and PQ codebooks must be trained before adding vectors
index.add(X)
index.nprobe = 10  # lists probed per query: recall vs. speed trade-off
D, I = index.search(query, k=10)
```

### Annoy (alternative)

```python
from annoy import AnnoyIndex

index = AnnoyIndex(d, 'angular')  # angular distance (cosine-like)
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(10)  # 10 trees
neighbors = index.get_nns_by_vector(query, 10)
```

### Custom distance

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def custom_dist(a, b):
    return np.sum(np.abs(a - b))  # Manhattan (L1)

# A Python callable is much slower than a built-in metric such as 'manhattan'
knn = KNeighborsClassifier(n_neighbors=5, metric=custom_dist)
```

### Weighted by distance

```python
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# closer neighbors get a higher weight in the vote
```

### k-NN regression

```python
from sklearn.neighbors import KNeighborsRegressor

knr = KNeighborsRegressor(n_neighbors=5).fit(X, y)
```

### Anomaly detection (LOF)

```python
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
anomalies = lof.fit_predict(X) == -1  # True where a point is flagged as an outlier
```

### k-NN with normalization (always!)

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features so that no single feature dominates the distance
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X, y)
```

### Choose K (CV)

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7, 11, 15]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

### RAG retrieval (k-NN over embeddings)

```python
import faiss
from sentence_transformers import SentenceTransformer

m = SentenceTransformer('all-mpnet-base-v2')
doc_embs = m.encode(documents)  # float32 array of shape (n_docs, dim)

index = faiss.IndexFlatIP(doc_embs.shape[1])
faiss.normalize_L2(doc_embs)  # unit vectors, so inner product = cosine
index.add(doc_embs)

def retrieve(query, k=5):
    q_emb = m.encode([query])
    faiss.normalize_L2(q_emb)
    _, I = index.search(q_emb, k)
    return [documents[i] for i in I[0]]
```

### kNN-LM (LLM augmentation)

```python
import numpy as np

def knn_lm_predict(context, llm, datastore, k=10, lam=0.25):
    """Interpolate LM and k-NN next-token distributions (Khandelwal et al., 2020).
    `llm` and `datastore` are hypothetical interfaces, not a real library API."""
    logits = llm.next_token_logits(context)
    p_lm = np.exp(logits - logits.max())
    p_lm /= p_lm.sum()  # softmax over the vocabulary
    # Query the datastore with the LM's representation of the context
    p_knn = datastore.knn_probs(context_emb=llm.encode(context), k=k)
    return lam * p_knn + (1 - lam) * p_lm  # interpolate probabilities, not logits
```

## Decision criteria

| Situation | Approach |
|---|---|
| Small data | sklearn brute / KD-tree |
| High-dim | FAISS HNSW |
| Massive scale | FAISS IVF+PQ |
| Production search | Pinecone / Weaviate |
| Anomaly | LOF |
| RAG | FAISS / vector DB |

**Defaults**: always normalize, use cosine for embeddings, FAISS HNSW in production, K tuned by cross-validation, and distance-weighted voting.

## 🔗 Graph

- Parent: [[Machine-Learning]] · [[Information Retrieval]]
- Variants: [[HNSW]]
- Applications: [[FAISS]] · [[RAG]]
- Adjacent: [[Embeddings]]

## 🤖 LLM usage

**When to use**: baselines, retrieval, RAG.
**When not to use**: raw high-dimensional data (embed it first).

## ❌ Anti-patterns

- **No normalization**: features with large magnitudes dominate the distance.
- **Brute force at scale**: query latency grows linearly with the dataset size.
- **Wrong K**: too small a K overfits to noise; too large a K underfits.
- **Unconsidered metric**: choosing between cosine and L2 without checking which one suits the data.

## 🧪 Verification / Duplicates

- Verified (Cover & Hart 1967, FAISS docs, Khandelwal kNN-LM 2020).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: k-NN + sklearn / FAISS / HNSW / IVF / RAG / kNN-LM code |