---
id: wiki-2026-0508-kernel-density-estimation-kde
title: Kernel Density Estimation (KDE)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [KDE, Parzen Window, Density Estimation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, non-parametric, density-estimation, kernel]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scipy, scikit-learn, KDEpy
---

# Kernel Density Estimation (KDE)

## One-liner

> **"A smooth generalization of the histogram."** KDE is a non-parametric density estimator: it places a kernel function on each sample point and sums the kernels to estimate a continuous PDF. It was established by Rosenblatt (1956) and Parzen (1962), and as of 2026 it is used in modern stats/ML for anomaly detection, generative sampling, and visualization.

## Core

### Formula

- $\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$
- $K$ = kernel (Gaussian, Epanechnikov, …)
- $h$ = bandwidth (smoothing parameter)
- multi-D: $\hat{f}_H(x) = \frac{1}{n|H|^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}(x - x_i)\right)$, where $H$ is a symmetric positive-definite bandwidth matrix

### Bandwidth selection

- Silverman's rule (1D, Gaussian kernel): $h = 1.06 \hat{\sigma} n^{-1/5}$
- Scott's rule ($d$ dimensions, per coordinate $j$): $h_j = \hat{\sigma}_j n^{-1/(d+4)}$
- cross-validation (likelihood)
- plug-in estimators (Sheather-Jones)

### Applications

1. EDA visualization (seaborn `kdeplot`).
2. Anomaly detection (low density = outlier).
3. Mode finding (mean-shift).
4. Bayesian non-parametric prior.
5. Generative sampling (smoothed bootstrap).

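
The 1D estimator, the $1.06\,\hat{\sigma}\,n^{-1/5}$ rule, and smoothed-bootstrap sampling listed above can be written out directly in NumPy. This is a minimal sketch, not a library implementation: the function names `normal_reference_bandwidth`, `gaussian_kde_1d`, and `smoothed_bootstrap` are hypothetical helpers introduced here, and the sample standard deviation (`ddof=1`) is an assumed choice for $\hat{\sigma}$.

```python
import numpy as np

def normal_reference_bandwidth(x):
    """h = 1.06 * sigma_hat * n^(-1/5), the 1-D rule of thumb quoted above."""
    x = np.asarray(x, dtype=float)
    return 1.06 * x.std(ddof=1) * x.size ** (-1 / 5)

def gaussian_kde_1d(x, x_eval, h):
    """Evaluate f_hat_h(x) = 1/(n h) * sum_i K((x - x_i) / h) with a Gaussian K."""
    x = np.asarray(x, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    u = (x_eval[:, None] - x[None, :]) / h        # shape (m, n): O(n*m) work
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard normal kernel
    return k.sum(axis=1) / (x.size * h)

def smoothed_bootstrap(x, size, h, rng):
    """Sample from the Gaussian KDE: resample the data, then add N(0, h^2) noise."""
    x = np.asarray(x, dtype=float)
    return rng.choice(x, size=size, replace=True) + h * rng.standard_normal(size)

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, 500)
h = normal_reference_bandwidth(sample)

grid = np.linspace(-6.0, 6.0, 1201)
density = gaussian_kde_1d(sample, grid, h)
area = density.sum() * (grid[1] - grid[0])  # Riemann sum; should be close to 1

draws = smoothed_bootstrap(sample, 1000, h, rng)
```

The direct evaluation costs O(n·m) for n samples and m grid points, which is fine for small inputs; it is meant for checking intuition against the formula rather than for large data.
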
## 💻 Patterns

### scipy KDE

```python
from scipy.stats import gaussian_kde
import numpy as np

x = np.random.normal(0, 1, 1000)
kde = gaussian_kde(x, bw_method="silverman")
xs = np.linspace(-4, 4, 200)
density = kde(xs)  # estimated PDF evaluated on the grid
```

### sklearn KernelDensity

```python
from sklearn.neighbors import KernelDensity
import numpy as np

X = np.random.randn(1000, 2)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)
log_dens = kde.score_samples(X)  # log-density at each point

# anomaly: lowest 1% as outliers
threshold = np.quantile(log_dens, 0.01)
outliers = X[log_dens < threshold]
```

### Bandwidth via cross-validation

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.randn(1000, 2)  # same kind of data as in the previous block

# KernelDensity.score is the total log-likelihood, so the search maximizes held-out likelihood
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X)
print(grid.best_params_)
```

### KDEpy fast FFT-based KDE

```python
import numpy as np
from KDEpy import FFTKDE

x = np.random.normal(0, 1, 1000)
# with no grid argument, evaluate() returns an equidistant grid and the density on it
x_grid, y = FFTKDE(kernel="gaussian", bw="silverman").fit(x).evaluate()
# O(n + m log m) instead of O(n*m)
```

### Adaptive bandwidth

```python
import numpy as np
from scipy.spatial import cKDTree

def adaptive_kde(x, x_eval, k=10):
    """1-D Gaussian KDE whose per-point bandwidth is the k-NN distance."""
    x = np.asarray(x, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    tree = cKDTree(x[:, None])
    dists, _ = tree.query(x[:, None], k=k + 1)  # k+1: the nearest hit is the point itself
    h_local = dists[:, -1]  # k-NN distance per point (0 if a value repeats more than k times)
    out = np.zeros_like(x_eval)
    for xi, hi in zip(x, h_local):
        out += np.exp(-0.5 * ((x_eval - xi) / hi) ** 2) / hi
    return out / (len(x) * np.sqrt(2 * np.pi))
```

### Visualization

```python
import seaborn as sns

# df: a pandas DataFrame with a numeric "feature" column and a categorical "class" column
sns.kdeplot(data=df, x="feature", hue="class", fill=True, common_norm=False)
```

## Decision criteria

| Situation | Method |
|---|---|
| 1D, small n | scipy gaussian_kde |
| low-D (grid-based), n > 10⁴ | FFTKDE |
| streaming | online KDE (Heinz 2008) |
| boundaries | reflection / log-transform |
| heavy-tail | adaptive bandwidth |

**Default**: Silverman + Gaussian kernel, then validate.

## 🔗 Graph

- Parent: [[Density-Estimation]]
- Application: [[Anomaly-Detection]]
- Adjacent: [[Kernel-Methods]]

## 🤖 LLM usage

**When**: small/mid n, distribution shape unknown.

**When not**: very high-D (curse of dimensionality), n < 30.

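
The reflection option listed for boundaries in the decision table can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: `reflected_kde` and `trapezoid_mass` are hypothetical helpers, the support is assumed to be $[\text{lower}, \infty)$, and the bandwidth is taken from the original (unreflected) data. Because the Gaussian kernel is symmetric, adding the density evaluated at the mirrored point $2\cdot\text{lower} - x$ is the same as adding a mirrored copy of every sample.

```python
import numpy as np
from scipy.stats import gaussian_kde

def reflected_kde(x, x_eval, lower=0.0, bw_method="silverman"):
    """Gaussian KDE on [lower, inf) with the leaked mass folded back by reflection."""
    x_eval = np.asarray(x_eval, dtype=float)
    kde = gaussian_kde(np.asarray(x, dtype=float), bw_method=bw_method)
    # f(x) + f(2*lower - x): the second term is the mirrored sample's contribution
    dens = kde(x_eval) + kde(2.0 * lower - x_eval)
    return np.where(x_eval >= lower, dens, 0.0)

def trapezoid_mass(y, grid):
    """Trapezoidal integral of y over grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(grid)))

rng = np.random.default_rng(0)
x = rng.exponential(1.0, 2000)            # true support is [0, inf)
grid = np.linspace(0.0, 12.0, 2401)

plain = gaussian_kde(x, bw_method="silverman")(grid)
reflected = reflected_kde(x, grid, lower=0.0)

mass_plain = trapezoid_mass(plain, grid)          # below 1: some mass sits at x < 0
mass_reflected = trapezoid_mass(reflected, grid)  # close to 1 on the support
```

Reflection restores the total mass, but it forces a zero slope at the boundary, so it may still misrepresent densities that are steep there; the log-transform alternative from the same table row avoids that at the cost of distorting the scale.
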
## ❌ Anti-patterns

- **Blind use of the default bandwidth**: Silverman's rule assumes roughly Gaussian data, so it over-smooths bimodal densities.
- **Ignoring boundary bias**: using a Gaussian kernel when the support is [0, ∞) leaks probability mass outside the support.
- **High-D KDE**: nearly useless for d > 6; use a vine copula or a normalizing flow instead.
- **Ignoring sample size**: with n < 50, the KDE result is mostly noise.

## 🧪 Verification / Duplicates

- Verified (Silverman 1986 textbook, Wand & Jones 1995, Chen 2017 review).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KDE math, bandwidth selection, scipy/sklearn/KDEpy |