2nd/10_Wiki/Topics/AI_and_ML/Non-parametric-Models.md
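koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in spelling with the canonical title of the target document,
reconnecting 1,200 broken links. Only title/filename normalization matches were applied; alias matching
was excluded because of the over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs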

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00

---
id: wiki-2026-0508-non-parametric-models
title: Non-parametric Models
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Non-parametric Models, Nonparametric ML, Instance-based Learning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [ml, non-parametric, knn, decision-trees, gaussian-process, kernel, sklearn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: scikit-learn }
---
# Non-parametric Models
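## In one line
- Non-parametric models do not fix the number of parameters in advance; model complexity grows with the data. Representative families are k-NN, decision trees, kernel methods, and Gaussian processes.
## Key points
- **Definition**: the number of parameters grows with the size of the data (or the data itself is kept in memory); there is "no fixed parametric form".
- **Representative algorithms**:
  - k-NN: lazy learner, distance-based.
  - Decision tree / Random Forest / Gradient Boosting: tree depth and tree count adapt to the data.
  - Kernel methods (SVM with RBF, Kernel Ridge): the number of support vectors in an SVM depends on the data; Kernel Ridge keeps one dual coefficient per training point.
  - Gaussian Process: N×N covariance matrix → O(N³) training time and O(N²) memory for exact inference.
  - KDE (Kernel Density Estimation), Nadaraya-Watson regression.
- **Strengths**: weak distributional assumptions; captures complex nonlinear relationships.
- **Weaknesses**: data and compute requirements grow, curse of dimensionality, somewhat weaker interpretability.
- **vs parametric**: linear/logistic regression and GLMs have a fixed set of parameters → they work with little data and can extrapolate within the assumed functional form.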
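
The definition above can be illustrated with a small sketch: fit a linear regression and an unpruned decision tree on growing samples and compare how many learned quantities each model keeps. The synthetic sine data, the sample sizes, and the helper name `model_sizes` are assumptions made for illustration, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor


def model_sizes(n_samples, seed=0):
    """Return (linear parameter count, tree node count) for a toy 1-D dataset."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(n_samples, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n_samples)

    lin = LinearRegression().fit(X, y)
    tree = DecisionTreeRegressor(random_state=0).fit(X, y)  # no depth limit

    n_linear = lin.coef_.size + 1    # slope plus intercept, independent of N
    n_tree = tree.tree_.node_count   # grows as the tree adds splits
    return n_linear, n_tree


sizes = {n: model_sizes(n) for n in (50, 500, 5000)}
for n, (n_linear, n_tree) in sizes.items():
    print(f"N={n:5d}  linear params={n_linear}  tree nodes={n_tree}")
```

The linear model should report the same parameter count at every sample size, while the unpruned tree keeps adding nodes as N grows.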
## 💻 Patterns
```python
# k-NN classifier with distance weighting
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=15, weights="distance", metric="minkowski", p=2)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
```
```python
# Decision tree depth tuning via CV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
                  cv=5, scoring="f1_macro")
gs.fit(X, y)
print(gs.best_params_, gs.best_score_)
```
```python
# Random Forest with OOB score
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X, y)
print("OOB:", rf.oob_score_)
```
```python
# Kernel Ridge Regression with RBF
from sklearn.kernel_ridge import KernelRidge
kr = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.1)
kr.fit(X_train, y_train)
```
```python
# Gaussian Process regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=5)
gpr.fit(X_train, y_train)
mu, std = gpr.predict(X_test, return_std=True)
```
```python
# KDE for density estimation
from sklearn.neighbors import KernelDensity
import numpy as np
kde = KernelDensity(bandwidth=0.5, kernel="gaussian").fit(X_train)
log_density = kde.score_samples(X_test)
density = np.exp(log_density)
```
```python
# Nadaraya-Watson regressor (manual)
import numpy as np
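def nw_regress(X_train, y_train, X_test, h=1.0):
    # Pairwise differences, shape (n_test, n_train, n_features)
    diffs = X_test[:, None, :] - X_train[None, :, :]
    # Gaussian kernel weights with bandwidth h
    w = np.exp(-np.sum(diffs**2, axis=2) / (2 * h**2))
    # Weighted average of training targets; the epsilon guards against all-zero weights
    return (w @ y_train) / (w.sum(axis=1) + 1e-9)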
```
```python
# SVM with RBF kernel — support vectors grow with data
from sklearn.svm import SVC
svc = SVC(kernel="rbf", C=1.0, gamma="scale")
svc.fit(X_train, y_train)
print("n_SV:", len(svc.support_))
```
```python
# FAISS for large-scale k-NN (approximate)
import faiss, numpy as np
index = faiss.IndexHNSWFlat(X.shape[1], 32)
index.add(X.astype(np.float32))
D, I = index.search(query.astype(np.float32), k=10)
```
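
Because an HNSW index returns approximate neighbors, it can help to measure recall against an exact search on a small sample before relying on it. The following is a NumPy-only sketch; `exact_knn`, `recall_at_k`, and the demo arrays are hypothetical names introduced here, and the ids from an approximate index (for example the `I` array above) would take the place of the stand-in.

```python
import numpy as np


def exact_knn(X, queries, k):
    """Brute-force k nearest neighbors by squared Euclidean distance."""
    # (n_queries, n_points) squared distances via ||q||^2 - 2 q.x + ||x||^2
    d2 = (
        np.sum(queries**2, axis=1, keepdims=True)
        - 2.0 * queries @ X.T
        + np.sum(X**2, axis=1)[None, :]
    )
    return np.argsort(d2, axis=1)[:, :k]


def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of the true neighbors that the approximate search returned."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits)) / exact_ids.shape[1]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 16))
q_demo = rng.normal(size=(5, 16))
exact_ids = exact_knn(X_demo, q_demo, k=10)

# Stand-in for ids from an approximate index: half of each row is marked as missing (-1).
approx_ids = exact_ids.copy()
approx_ids[:, 5:] = -1
print("recall@10:", recall_at_k(approx_ids, exact_ids))
```

Brute force costs O(N) per query, so this check is meant for a sampled subset of queries rather than the full workload.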
## Decision criteria
- **Data size**:
  - <1k → GP and kernel ridge are fine.
  - 1k-100k → tree ensemble, k-NN (brute force or KD-tree).
  - >100k → tree ensemble, ANN (FAISS, ScaNN), or fall back to a parametric model.
- **Dimensionality**: high-dimensional (>50) → trees, gradient boosting. k-NN/KDE suffer from the curse of dimensionality.
- **Uncertainty needed**: GP > tree quantile regression > MC Dropout (parametric).
- **Interpretability**: shallow decision tree, k-NN (prototype analysis).
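
The size and dimensionality rules above can be encoded as a small lookup. This is only a rule-of-thumb sketch: the function name `recommend_families`, the family labels, and the exact handling of boundary values are assumptions introduced for illustration.

```python
def recommend_families(n_samples, n_features, need_uncertainty=False):
    """Map dataset size and dimensionality to candidate model families (rule of thumb)."""
    if n_samples < 1_000:
        families = ["gaussian_process", "kernel_ridge"]
    elif n_samples <= 100_000:
        families = ["tree_ensemble", "knn"]
    else:
        families = ["tree_ensemble", "approximate_nn", "parametric"]

    if n_features > 50:
        # Exact k-NN degrades in high dimensions; prefer tree-based models.
        families = [f for f in families if f != "knn"]
        if "tree_ensemble" not in families:
            families.insert(0, "tree_ensemble")

    if need_uncertainty and n_samples >= 1_000:
        # GP is impractical at this size; tree quantile regression is the next option listed.
        families.append("tree_quantile_regression")

    return families


print(recommend_families(500, 10))
print(recommend_families(50_000, 200))
print(recommend_families(1_000_000, 10, need_uncertainty=True))
```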
## 🔗 Graph
- Related: [[K-Nearest-Neighbors-K-NN]], [[Decision Tree]], [[Random-Forest]], [[Gradient-Boosting]], [[Gaussian-Process]], [[Kernel-Methods]]
- Tools: [[XGBoost]], [[LightGBM]], [[FAISS]]
## 🤖 LLM usage
- Generate a matrix of recommended model families from dataset size and dimensionality.
- Write sklearn pipeline + GridSearch boilerplate.
- Interpret results (explain feature importance and partial dependence plots).
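## ❌ Anti-patterns
- Running GaussianProcessRegressor as-is on 1 million rows (O(N³)).
- Using k-NN alone on high-dimensional sparse data.
- Dismissing random forest as a "black box" and skipping SHAP/PDP.
- Applying an RBF kernel to data that has not been standardized.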
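
The last anti-pattern can be seen on a toy dataset where a small-scale feature carries the signal and a large-scale feature is pure noise. This is a sketch under assumed synthetic data; all variable names and scales are illustrative. An RBF kernel depends on Euclidean distance, so without scaling the noisy feature dominates it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, size=n)
informative = rng.normal(loc=np.where(y == 1, 2.0, -2.0), scale=1.0)  # small scale, carries the signal
noise = rng.normal(scale=1000.0, size=n)                               # large scale, pure noise
X = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same estimator, with and without standardization in front of it.
raw = SVC(kernel="rbf", gamma="scale").fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale")).fit(X_tr, y_tr)

raw_acc = raw.score(X_te, y_te)
scaled_acc = scaled.score(X_te, y_te)
print(f"raw={raw_acc:.3f}  standardized={scaled_acc:.3f}")
```

Putting the scaler inside the pipeline also keeps the scaling statistics from leaking across folds when the pipeline is passed to cross-validation.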
## 🧪 Validation
- CV (StratifiedKFold); nested CV for hyperparameter tuning.
- Learning curve (performance as data is added) → confirms non-parametric behavior.
- Time/memory profiling (check that the N×N matrix fits in RAM).
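
Two of the checks above can be sketched briefly: a learning curve with scikit-learn's `learning_curve`, and a back-of-the-envelope estimate of the memory a dense N×N matrix needs. The dataset from `make_classification`, the k-NN settings, and the helper `kernel_matrix_gib` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.neighbors import KNeighborsClassifier


def kernel_matrix_gib(n_samples, bytes_per_value=8):
    """Memory for a dense n x n matrix (float64 by default), in GiB."""
    return n_samples**2 * bytes_per_value / 2**30


X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=15),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
# One row per training size, one column per CV fold.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"train size={n:4d}  mean CV accuracy={score:.3f}")

for n in (10_000, 100_000, 1_000_000):
    print(f"N={n:>9,}  dense N x N float64 = {kernel_matrix_gib(n):,.1f} GiB")
```

The estimate counts a single matrix only; solvers that factorize or copy it need additional working memory.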
## 🕓 Changelog
- 2026-05-08 Phase 1: draft auto-generated.
- 2026-05-10 Manual cleanup: expanded the body, added GP/KDE/NW/FAISS patterns, tidied the decision criteria.