---
id: wiki-2026-0508-non-parametric-models
title: Non-parametric Models
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Non-parametric Models, Nonparametric ML, Instance-based Learning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ml, non-parametric, knn, decision-trees, gaussian-process, kernel, sklearn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: scikit-learn }
---

# Non-parametric Models

## One-line summary
- Non-parametric models do not fix the number of parameters in advance; model complexity grows with the data. Representative families are k-NN, decision trees, kernel methods, and Gaussian processes.

## Key points
- **Definition**: the number of parameters grows with the dataset size (or the model keeps the training data itself in memory). There is "no fixed parametric form".
- **Representative algorithms**:
  - k-NN: lazy learner, distance-based.
  - Decision tree / Random Forest / Gradient Boosting: tree depth and tree count adapt to the data.
  - Kernel methods (SVM with RBF, Kernel Ridge): the number of support vectors (SVM) or dual coefficients (Kernel Ridge) depends on the data.
  - Gaussian Process: N×N covariance matrix → O(N³) training.
  - KDE (Kernel Density Estimation), Nadaraya-Watson regression.
- **Strengths**: weak distributional assumptions; captures complex nonlinear relationships.
- **Weaknesses**: data and compute requirements grow, curse of dimensionality, somewhat weaker interpretability.
- **vs parametric**: linear/logistic regression and GLMs have a fixed set of parameters → they work with little data and can extrapolate.

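The definition above (model size growing with the data) can be checked empirically. The following is a minimal sketch, not a benchmark: it assumes a synthetic dataset from `make_classification` with 10% label noise, and the helper name `complexity_by_sample_size` is hypothetical. It compares the number of fitted coefficients in a logistic regression with the node count of an unpruned decision tree as the sample size increases.

```python
# Sketch: model size of a parametric vs. a non-parametric learner as N grows.
# The synthetic data and the helper name are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def complexity_by_sample_size(sizes, n_features=10, seed=0):
    """Return one dict per sample size with the size of each fitted model."""
    rows = []
    for n in sizes:
        # flip_y adds label noise, so an unpruned tree keeps splitting as N grows
        X, y = make_classification(
            n_samples=n, n_features=n_features, flip_y=0.1, random_state=seed
        )
        logreg = LogisticRegression(max_iter=1000).fit(X, y)
        tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
        rows.append({
            "n": n,
            # weights + intercept: fixed by the number of features
            "logreg_params": logreg.coef_.size + logreg.intercept_.size,
            # total nodes in the fitted tree: depends on the data
            "tree_nodes": tree.tree_.node_count,
        })
    return rows


if __name__ == "__main__":
    for row in complexity_by_sample_size([200, 1000, 5000]):
        print(row)
```

Under these assumptions the logistic regression stays at `n_features + 1` parameters regardless of N, while the tree's node count keeps increasing. Setting `max_depth` or `min_samples_leaf` caps that growth, which is why those hyperparameters act as the main regularizers for trees.
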
## 💻 Patterns

```python
# k-NN classifier with distance weighting
# (X_train, y_train, X_test are assumed to exist; scale features first, k-NN is distance-based)
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=15, weights="distance", metric="minkowski", p=2)  # p=2 → Euclidean
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```

```python
# Decision tree depth tuning via CV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="f1_macro",
)
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)
```

```python
# Random Forest with OOB score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)
print("OOB:", rf.oob_score_)  # accuracy estimated on out-of-bag samples
```

```python
# Kernel Ridge Regression with RBF
from sklearn.kernel_ridge import KernelRidge

kr = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.1)
kr.fit(X_train, y_train)
```

```python
# Gaussian Process regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Signal kernel plus a white-noise term for observation noise
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=5)
gpr.fit(X_train, y_train)
mu, std = gpr.predict(X_test, return_std=True)  # predictive mean and standard deviation
```

```python
# KDE for density estimation
from sklearn.neighbors import KernelDensity
import numpy as np

kde = KernelDensity(bandwidth=0.5, kernel="gaussian").fit(X_train)
log_density = kde.score_samples(X_test)  # log-density at each test point
density = np.exp(log_density)
```

```python
# Nadaraya-Watson regressor (manual)
import numpy as np

def nw_regress(X_train, y_train, X_test, h=1.0):
    # Pairwise differences, shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Gaussian kernel weights with bandwidth h, shape (n_test, n_train)
    w = np.exp(-np.sum(diffs**2, axis=2) / (2 * h**2))
    # Kernel-weighted average of the targets; 1e-9 guards against division by zero
    return (w @ y_train) / (w.sum(axis=1) + 1e-9)
```

```python
# SVM with RBF kernel: support vectors grow with data
from sklearn.svm import SVC

svc = SVC(kernel="rbf", C=1.0, gamma="scale")
svc.fit(X_train, y_train)
print("n_SV:", len(svc.support_))
```

```python
# FAISS for large-scale k-NN (approximate)
import faiss
import numpy as np

index = faiss.IndexHNSWFlat(X.shape[1], 32)  # 32 = HNSW links per node (M)
index.add(X.astype(np.float32))              # FAISS expects float32
D, I = index.search(query.astype(np.float32), 10)  # distances and indices of the 10 nearest
```

## Decision criteria
- **Data size**:
  - <1k → GP and kernel ridge are fine.
  - 1k–100k → tree ensembles, k-NN (brute force or KD-tree).
  - >100k → tree ensembles, ANN (FAISS, ScaNN), or fall back to a parametric model.
- **Dimensionality**: high-dimensional (>50) → trees, gradient boosting. k-NN and KDE suffer from the curse of dimensionality.
- **Uncertainty needed**: GP > tree quantile regression > MC Dropout (parametric).
- **Interpretability**: shallow decision tree, k-NN (prototype analysis).

## 🔗 Graph
- Related: [[K-Nearest-Neighbors-K-NN]], [[Decision Tree]], [[Random-Forest]], [[Gradient-Boosting]], [[Gaussian-Process]], [[Kernel-Methods]]
- Tools: [[XGBoost]], [[LightGBM]], [[FAISS]]

## 🤖 LLM usage
- Generate a matrix of recommended model families by dataset size and dimensionality.
- Write sklearn pipeline + GridSearch boilerplate.
- Interpret results (explain feature importance and partial dependence plots).

## ❌ Anti-patterns
- Using GaussianProcessRegressor as-is on 1M rows (O(N³)).
- Relying on k-NN alone for high-dimensional sparse data.
- Writing off random forests as a "black box" and skipping SHAP/PDP.
- Applying an RBF kernel to data that has not been standardized.

## 🧪 Validation
- CV (StratifiedKFold); nested CV for hyperparameter tuning.
- Learning curve (performance as data is added) → confirms non-parametric behavior.
- Time/memory profiling (check that the N×N matrix fits in RAM).

## 🕓 Changelog
- 2026-05-08 Phase 1: auto-generated initial draft.
- 2026-05-10 Manual cleanup: expanded the body, added GP/KDE/NW/FAISS patterns, tidied the decision criteria.