id: wiki-2026-0508-kernel-density-estimation-kde
title: Kernel Density Estimation (KDE)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: KDE, Parzen Window, Density Estimation
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: statistics, non-parametric, density-estimation, kernel
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: scipy, scikit-learn, KDEpy

Kernel Density Estimation (KDE)

In one line

"A smooth generalization of the histogram." KDE is a non-parametric density estimator: it places a kernel function at each sample point and sums the kernels to estimate a continuous PDF. The method was established by Rosenblatt (1956) and Parzen (1962). In modern statistics and ML (as of 2026) it is used for anomaly detection, generative sampling, and visualization.

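Core

Formulas

  • \hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)
  • K = kernel (Gaussian, Epanechnikov, …), a non-negative function that integrates to 1
  • h = bandwidth (smoothing parameter), h > 0
  • multi-D: \hat{f}_H(x) = \frac{1}{n|H|^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}(x - x_i)\right), where H is a symmetric positive-definite bandwidth matrix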
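As a rough illustration of the 1D formula, the sketch below evaluates \hat{f}_h(x) directly with NumPy and a Gaussian kernel. The function names (`gaussian_kernel`, `kde_1d`), the toy data, and the bandwidth h = 0.3 are assumptions chosen for illustration, not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_1d(x_eval, samples, h):
    """Evaluate f_hat_h(x) = 1/(n h) * sum_i K((x - x_i) / h) at each x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Pairwise scaled differences, shape (len(x_eval), n).
    u = (x_eval[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)   # toy data (assumption)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde_1d(grid, samples, h=0.3)     # h picked by hand for the demo

# Riemann-sum check: a valid density estimate integrates to about 1.
area = density.sum() * (grid[1] - grid[0])
```

The direct evaluation costs O(n·m) for n samples and m evaluation points, which is fine for small data and makes the role of h easy to see: halving h halves each kernel's width and doubles its height.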

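Bandwidth selection

  • Silverman's rule (Gaussian-reference form): h = 1.06 \hat{\sigma} n^{-1/5}; the robust variant is h = 0.9 \min(\hat{\sigma}, \mathrm{IQR}/1.34) n^{-1/5}
  • Scott's rule: h_j = \hat{\sigma}_j n^{-1/(d+4)} for each dimension j (the factor n^{-1/(d+4)} scales the sample standard deviation)
  • cross-validation (likelihood)
  • plug-in estimators (Sheather-Jones)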
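The rule-of-thumb bandwidths can be computed in a few lines. The sketch below is a minimal illustration under stated assumptions: the helper names are hypothetical, `scott_factor` and `silverman_factor` follow the scaling factors n^{-1/(d+4)} and (n(d+2)/4)^{-1/(d+4)} (the latter reduces to about 1.06 n^{-1/5} in 1D), and the robust variant uses 0.9 · min(σ̂, IQR/1.34).

```python
import numpy as np

def scott_factor(n, d=1):
    """Scott's scaling factor; multiply by the sample std to get h."""
    return n ** (-1.0 / (d + 4))

def silverman_factor(n, d=1):
    """Silverman's scaling factor; about 1.06 * n**(-1/5) when d == 1."""
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

def normal_reference_bandwidth(x):
    """h = 1.06 * sigma_hat * n**(-1/5), optimal if the data are Gaussian."""
    x = np.asarray(x, dtype=float)
    return 1.06 * x.std(ddof=1) * x.size ** (-0.2)

def silverman_robust_bandwidth(x):
    """h = 0.9 * min(sigma_hat, IQR / 1.34) * n**(-1/5)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1000)        # toy data (assumption)

h_ref = normal_reference_bandwidth(x)
h_robust = silverman_robust_bandwidth(x)
h_factor_style = silverman_factor(x.size) * x.std(ddof=1)
```

The robust variant is always smaller than the 1.06 form, which makes it a somewhat safer default when the data may be skewed or multimodal.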

Applications

  1. EDA visualization (seaborn kdeplot).
  2. Anomaly detection (low-density = outlier).
  3. Mode finding (mean-shift).
  4. Bayesian non-parametric prior.
  5. Generative sampling (smoothed bootstrap).
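Generative sampling from a Gaussian-kernel KDE (the smoothed bootstrap in item 5) amounts to resampling the data with replacement and adding kernel-shaped noise. The sketch below assumes a Gaussian kernel and a 1.06 σ̂ n^{-1/5} bandwidth; `smoothed_bootstrap` is a hypothetical helper name.

```python
import numpy as np

def smoothed_bootstrap(samples, h, size, rng):
    """Draw `size` points from a Gaussian-kernel KDE with bandwidth h."""
    samples = np.asarray(samples, dtype=float)
    # Step 1: pick kernel centres by resampling the data with replacement.
    idx = rng.integers(0, samples.size, size=size)
    # Step 2: jitter each centre with noise drawn from the kernel, N(0, h^2).
    noise = rng.normal(0.0, h, size=size)
    return samples[idx] + noise

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=300)      # toy data (assumption)
h = 1.06 * data.std(ddof=1) * data.size ** (-0.2)
new_points = smoothed_bootstrap(data, h, size=10_000, rng=rng)
```

Because each draw adds independent noise with variance h², the generated points have roughly the data variance plus h², so a large bandwidth visibly inflates the spread of the synthetic sample.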

💻 Patterns

scipy KDE

from scipy.stats import gaussian_kde
import numpy as np

x = np.random.normal(0, 1, 1000)
kde = gaussian_kde(x, bw_method="silverman")

xs = np.linspace(-4, 4, 200)
density = kde(xs)

sklearn KernelDensity

from sklearn.neighbors import KernelDensity
import numpy as np

X = np.random.randn(1000, 2)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)
log_dens = kde.score_samples(X)  # log-density at each point

# anomaly: lowest 1% as outliers
threshold = np.quantile(log_dens, 0.01)
outliers = X[log_dens < threshold]

Bandwidth via cross-validation

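import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

X = np.random.randn(1000, 2)  # same toy data as in the KernelDensity example
params = {"bandwidth": np.logspace(-1, 1, 20)}
# KernelDensity.score returns the total log-likelihood, which GridSearchCV maximizes
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X)
print(grid.best_params_)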

KDEpy fast FFT-based KDE

from KDEpy import FFTKDE
x_grid, y = FFTKDE(kernel="gaussian", bw="silverman").fit(x).evaluate()
# O(n + m log m) instead of O(n*m)

Adaptive bandwidth

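import numpy as np
from scipy.spatial import cKDTree

def adaptive_kde(x, x_eval, k=10):
    """Gaussian KDE whose per-point bandwidth is the distance to the k-th nearest neighbour."""
    x = np.asarray(x, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    tree = cKDTree(x[:, None])
    dists, _ = tree.query(x[:, None], k=k + 1)  # k+1 because the nearest hit is the point itself
    h_local = np.maximum(dists[:, -1], 1e-12)   # guard against zero bandwidth from duplicate points
    out = np.zeros_like(x_eval)
    for xi, hi in zip(x, h_local):
        out += np.exp(-0.5 * ((x_eval - xi) / hi) ** 2) / hi
    return out / (len(x) * np.sqrt(2 * np.pi))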

Visualization

import seaborn as sns
sns.kdeplot(data=df, x="feature", hue="class", fill=True, common_norm=False)

Decision criteria

| Situation | Method |
|---|---|
| 1D, small n | scipy gaussian_kde |
| large n (> 10⁴), low-D | FFTKDE |
| streaming | online KDE (Heinz 2008) |
| boundaries | reflection / log-transform |
| heavy-tail | adaptive bandwidth |

Default: Silverman + Gaussian kernel, then validate.
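For the "boundaries" case, reflection can be sketched as follows: mirror every sample across the boundary, sum the kernels of the original and mirrored points, and keep the result only on the valid side. This is a minimal illustration for support [0, ∞) with a Gaussian kernel; the helper names, the exponential toy data, and h = 0.2 are assumptions.

```python
import numpy as np

def kde_gauss(x_eval, samples, h):
    """Plain Gaussian-kernel KDE evaluated at x_eval."""
    u = (x_eval[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

def kde_reflect_at_zero(x_eval, samples, h):
    """KDE on [0, inf): add kernels centred at the mirrored points -x_i."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    dens = kde_gauss(x_eval, samples, h) + kde_gauss(x_eval, -samples, h)
    return np.where(x_eval >= 0.0, dens, 0.0)   # zero outside the support

def trapezoid(y, x):
    """Trapezoidal rule, written out to avoid NumPy version differences."""
    return float(((y[1:] + y[:-1]) * 0.5 * np.diff(x)).sum())

rng = np.random.default_rng(2)
samples = rng.exponential(1.0, size=1000)   # toy data with support [0, inf)
grid = np.linspace(0.0, 15.0, 3001)
h = 0.2

mass_plain = trapezoid(kde_gauss(grid, samples, h), grid)
mass_reflected = trapezoid(kde_reflect_at_zero(grid, samples, h), grid)
```

The plain estimate puts part of its mass below zero, so `mass_plain` falls short of 1 on the valid range, while the reflected estimate folds that mass back and integrates to about 1.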

🔗 Graph

🤖 LLM usage

When to use: small to mid n, when the shape of the distribution is unknown. When not to use: very high-D data (curse of dimensionality), n < 30.

Anti-patterns

  • Using the default bandwidth blindly: Silverman's rule assumes Gaussian data, so it over-smooths bimodal distributions.
  • Ignoring boundary bias: using a Gaussian kernel on data with support [0, ∞) leaks probability mass below 0.
  • High-D KDE: nearly useless for d > 6; use a vine copula or a normalizing flow instead.
  • Ignoring sample size: with n < 50 the KDE result is mostly noise.
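The first anti-pattern can be checked numerically. The sketch below compares the 1.06 σ̂ n^{-1/5} rule with a bandwidth chosen by leave-one-out likelihood cross-validation on a well-separated bimodal sample. The helper name `loo_log_likelihood`, the mixture parameters, and the candidate grid are assumptions for illustration.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Mean leave-one-out log-density of a Gaussian-kernel KDE."""
    n = samples.size
    u = (samples[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(k, 0.0)                 # exclude each point from its own estimate
    dens = k.sum(axis=1) / ((n - 1) * h)
    return np.log(np.maximum(dens, 1e-300)).mean()

rng = np.random.default_rng(3)
# Two narrow, well-separated modes (toy data, assumption).
samples = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

# Rule of thumb: driven by the overall std, which is dominated by the mode gap.
h_rule = 1.06 * samples.std(ddof=1) * samples.size ** (-0.2)

# Likelihood cross-validation over a log-spaced grid of candidate bandwidths.
candidates = np.logspace(-1.5, 0.5, 40)
scores = np.array([loo_log_likelihood(samples, h) for h in candidates])
h_cv = candidates[scores.argmax()]
```

On data like this the overall standard deviation mostly measures the distance between the modes, so `h_rule` comes out several times larger than `h_cv` and smears each mode well beyond its true width.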

🧪 Verification / Duplicates

  • Verified (Silverman 1986 textbook, Wand & Jones 1995, Chen 2017 review).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KDE math, bandwidth selection, scipy/sklearn/KDEpy |