---
id: wiki-2026-0508-kernel-density-estimation-kde
title: Kernel Density Estimation (KDE)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [KDE, Parzen Window, Density Estimation]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, non-parametric, density-estimation, kernel]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scipy, scikit-learn, KDEpy
---

# Kernel Density Estimation (KDE)

## One-liner
> **"A smooth generalization of the histogram."** KDE is a non-parametric density estimator: it places a kernel function on every sample point and sums the kernels to estimate a continuous PDF. The method was established by Rosenblatt (1956) and Parzen (1962); as of 2026 it is used in modern statistics and ML for anomaly detection, generative sampling, and visualization.

## Core concepts

### Formula
- $\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)$
- $K$ = kernel (Gaussian, Epanechnikov, …)
- $h$ = bandwidth (smoothing parameter)
- multi-D: $\hat{f}_H(x) = \frac{1}{n|H|^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}(x-x_i)\right)$, with $H$ a symmetric positive-definite bandwidth matrix
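
The 1D formula above can be written out directly. The following is a minimal NumPy sketch, assuming a Gaussian kernel; the names `gaussian_kernel` and `kde_1d`, the synthetic sample, and the bandwidth `h=0.3` are illustrative choices and not part of the original note.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_1d(x_eval, samples, h):
    """Evaluate f_hat_h(x) = 1/(n h) * sum_i K((x - x_i) / h) at each x in x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Pairwise scaled differences, shape (len(x_eval), n)
    u = (x_eval[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)   # hypothetical data
grid = np.linspace(-6.0, 6.0, 1201)
density = kde_1d(grid, samples, h=0.3)     # h chosen arbitrarily for illustration
```

This brute-force version costs O(n·m) for n samples and m evaluation points, which is fine for small data and useful as a reference when checking a library result.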

### Bandwidth selection
- Silverman's rule (1D, Gaussian reference): $h = 1.06\,\hat{\sigma}\, n^{-1/5}$
- Scott's rule ($d$ dimensions): $h_j = \hat{\sigma}_j\, n^{-1/(d+4)}$ for each dimension $j$
- cross-validation (likelihood)
- plug-in estimators (Sheather-Jones)
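
The rule-of-thumb bandwidths are simple to compute by hand. The sketch below implements the $1.06\,\hat{\sigma}\,n^{-1/5}$ rule and, as an addition not listed in this note, the robust variant $h = 0.9\,\min(\hat{\sigma}, \mathrm{IQR}/1.34)\,n^{-1/5}$ that also comes from Silverman (1986). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def normal_reference_bandwidth(x):
    """h = 1.06 * sigma_hat * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    return 1.06 * x.std(ddof=1) * x.size ** (-0.2)

def silverman_robust_bandwidth(x):
    """h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5) (assumed robust variant)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, size=1000)   # hypothetical sample with sigma = 2
h_ref = normal_reference_bandwidth(x)
h_rob = silverman_robust_bandwidth(x)
```

The robust variant is never larger than the normal-reference value, so it tends to smooth less when the data are skewed or contain outliers.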

### Applications
1. EDA visualization (seaborn `kdeplot`).
2. Anomaly detection (low-density = outlier).
3. Mode finding (mean-shift).
4. Bayesian non-parametric prior.
5. Generative sampling (smoothed bootstrap).
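
Generative sampling (item 5) can be sketched as a smoothed bootstrap: sampling from a Gaussian KDE is equivalent to resampling the data with replacement and adding Gaussian noise with standard deviation $h$. The function name `smoothed_bootstrap`, the exponential test data, and `h=0.2` are assumptions made for illustration.

```python
import numpy as np

def smoothed_bootstrap(samples, h, size, rng):
    """Draw from a Gaussian KDE: resample the data, then add N(0, h^2) noise."""
    samples = np.asarray(samples, dtype=float)
    picks = rng.choice(samples, size=size, replace=True)
    return picks + h * rng.standard_normal(size)

rng = np.random.default_rng(2)
data = rng.exponential(1.0, size=300)   # hypothetical observed data
new_points = smoothed_bootstrap(data, h=0.2, size=5000, rng=rng)
```

Unlike a plain bootstrap, the draws are not restricted to the observed values. Note that the added noise can push points outside the original support (here, below zero).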

## 💻 Patterns

### scipy KDE
```python
from scipy.stats import gaussian_kde
import numpy as np

x = np.random.normal(0, 1, 1000)
kde = gaussian_kde(x, bw_method="silverman")

xs = np.linspace(-4, 4, 200)
density = kde(xs)
```

### sklearn KernelDensity
```python
from sklearn.neighbors import KernelDensity
import numpy as np

X = np.random.randn(1000, 2)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)
log_dens = kde.score_samples(X)  # log-density at each point

# anomaly: lowest 1% as outliers
threshold = np.quantile(log_dens, 0.01)
outliers = X[log_dens < threshold]
```

### Bandwidth via cross-validation
```python
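import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.randn(1000, 2)

# Candidates are scored by held-out log-likelihood (KernelDensity.score)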
params = {"bandwidth": np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X)
print(grid.best_params_)
```

### KDEpy fast FFT-based KDE
```python
import numpy as np
from KDEpy import FFTKDE

x = np.random.normal(0, 1, 1000)
x_grid, y = FFTKDE(kernel="gaussian", bw="silverman").fit(x).evaluate()
# FFT on an m-point grid: O(n + m log m) instead of O(n*m) for n samples
```

### Adaptive bandwidth
```python
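import numpy as np
from scipy.spatial import cKDTree

def adaptive_kde(x, x_eval, k=10):
    """Gaussian KDE with a per-sample bandwidth equal to the k-NN distance."""
    x = np.asarray(x, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    tree = cKDTree(x[:, None])
    # k + 1 neighbours because each point's nearest neighbour is itself
    dists, _ = tree.query(x[:, None], k=k + 1)
    h_local = dists[:, -1]  # k-NN distance per sample point
    out = np.zeros_like(x_eval)
    for xi, hi in zip(x, h_local):
        out += np.exp(-0.5 * ((x_eval - xi) / hi) ** 2) / hi
    return out / (len(x) * np.sqrt(2 * np.pi))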
```

### Visualization
```python
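import seaborn as sns
# df: DataFrame with a numeric "feature" column and a categorical "class" column
# common_norm=False: each class density integrates to 1 on its own
sns.kdeplot(data=df, x="feature", hue="class", fill=True, common_norm=False)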
```

## Decision criteria
| Situation | Method |
|---|---|
| 1D, small n | scipy `gaussian_kde` |
| large n (n > 10⁴), low-D | FFTKDE |
| streaming | online KDE (Heinz 2008) |
| boundaries | reflection / log-transform |
| heavy-tail | adaptive bandwidth |

**Default**: Silverman's rule with a Gaussian kernel, then validate.
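
The reflection fix from the boundaries row can be sketched as follows, assuming data supported on [0, ∞) and scipy's default bandwidth. The helper names `reflected_kde` and `trapezoid_mass` and the exponential test data are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def reflected_kde(samples, x_eval):
    """KDE on [0, inf) via reflection at 0: f_hat(x) + f_hat(-x) for x >= 0."""
    x_eval = np.asarray(x_eval, dtype=float)
    kde = gaussian_kde(samples)
    dens = kde(x_eval) + kde(-x_eval)   # fold the leaked mass back in
    return np.where(x_eval >= 0, dens, 0.0)

def trapezoid_mass(density, grid):
    """Trapezoidal estimate of the integral of density over grid."""
    return float(((density[:-1] + density[1:]) * 0.5 * np.diff(grid)).sum())

rng = np.random.default_rng(3)
samples = rng.exponential(1.0, size=2000)   # hypothetical non-negative data
grid = np.linspace(0.0, 15.0, 3001)
plain = gaussian_kde(samples)(grid)         # leaks mass below 0
reflected = reflected_kde(samples, grid)
plain_mass = trapezoid_mass(plain, grid)
reflected_mass = trapezoid_mass(reflected, grid)
```

The plain estimate integrates to visibly less than 1 on the support because part of each kernel near 0 falls on the negative side. Reflection restores that mass and doubles the estimate at the boundary itself.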

## 🔗 Graph
- Parent: [[Density-Estimation]]
- Applications: [[Anomaly-Detection]]
- Adjacent: [[Kernel-Methods]]

## 🤖 LLM usage
**When**: small to mid-sized n, and the shape of the distribution is unknown.
**When not**: very high-D (curse of dimensionality), n < 30.

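## ❌ Anti-patterns
- **Blind use of the default bandwidth**: Silverman's rule assumes roughly Gaussian data, so it over-smooths bimodal distributions.
- **Ignoring boundary bias**: using a Gaussian kernel on data with support [0, ∞) leaks probability mass past the boundary.
- **High-D KDE**: nearly useless for d > 6; use a vine copula or a normalizing flow instead.
- **Ignoring sample size**: with n < 50 the KDE result is mostly noise.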
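
The first anti-pattern is easy to reproduce. The sketch below, on assumed synthetic data with two narrow modes at -3 and +3, compares the peak height of a rule-of-thumb KDE with the true density; the scalar factor `0.05` for the narrower fit is an arbitrary illustrative value.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(4)
# Two well-separated narrow modes (standard deviation 0.5 each)
samples = np.concatenate([rng.normal(-3.0, 0.5, 1000),
                          rng.normal(3.0, 0.5, 1000)])
grid = np.linspace(-6.0, 6.0, 1201)

true_pdf = 0.5 * norm.pdf(grid, -3.0, 0.5) + 0.5 * norm.pdf(grid, 3.0, 0.5)
rule_of_thumb = gaussian_kde(samples, bw_method="silverman")(grid)
# A scalar bw_method is a factor multiplied by the sample standard deviation
narrow = gaussian_kde(samples, bw_method=0.05)(grid)

# Ratio of estimated peak height to true peak height (1.0 = no flattening)
peak_ratio_rule = rule_of_thumb.max() / true_pdf.max()
peak_ratio_narrow = narrow.max() / true_pdf.max()
```

The rule of thumb scales the bandwidth by the overall standard deviation, which here is dominated by the gap between the modes rather than by their width, so the peaks come out much flatter than they should. Cross-validation or a plug-in selector avoids this.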

## 🧪 Verification / Duplicates
- Verified (Silverman 1986 textbook, Wand & Jones 1995, Chen 2017 review).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — KDE math, bandwidth selection, scipy/sklearn/KDEpy |