Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

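10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed redirects directly at their targets, mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections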

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

Kernel Density Estimation (KDE)

One-line summary

"A smooth generalization of the histogram." KDE is a non-parametric density estimator: it places a kernel function on each sample point and sums the kernels to estimate a continuous PDF. It was established by Rosenblatt (1956) and Parzen (1962), and as of 2026 it is used in modern statistics/ML for anomaly detection, generative sampling, and visualization.

Key points

Formulas

\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - x_i}{h}\right)
K = kernel (Gaussian, Epanechnikov, …)
h = bandwidth (smoothing parameter)
multi-D: \hat{f}_H(x) = \frac{1}{n|H|^{1/2}}\sum_{i=1}^n K\left(H^{-1/2}(x - x_i)\right), where H is a symmetric positive-definite bandwidth matrix
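To make the 1D formula concrete, here is a minimal NumPy sketch that evaluates it term by term with a Gaussian kernel. The function names `gaussian_kernel` and `kde_1d`, the sample data, and the bandwidth `h=0.3` are illustrative assumptions, not part of any library.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde_1d(x_eval, samples, h):
    """Evaluate f_hat_h(x) = 1/(n h) * sum_i K((x - x_i) / h) at each x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Pairwise scaled differences, shape (len(x_eval), n)
    u = (x_eval[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=500)   # hypothetical data
grid = np.linspace(-6.0, 6.0, 1201)
density = kde_1d(grid, samples, h=0.3)      # h chosen by hand for illustration

# Riemann-sum check that the estimate integrates to roughly 1
area = density.sum() * (grid[1] - grid[0])
```

This direct evaluation costs O(n·m) for n samples and m evaluation points, which is fine for small data and is the cost that binned or FFT-based methods try to avoid.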

Bandwidth selection

Silverman's rule (1D, normal reference): h = 1.06 \hat{\sigma} n^{-1/5}
Scott's rule (per dimension j): h_j = \hat{\sigma}_j n^{-1/(d+4)}
cross-validation (likelihood)
plug-in estimators (Sheather-Jones)
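The rule-of-thumb bandwidths above are simple enough to compute by hand. The sketch below assumes 1D data for the Silverman variants and scales Scott's factor by the per-dimension sample standard deviation. The robust variant with the 0.9 constant and the IQR comes from Silverman (1986) and is not listed above. All function names are hypothetical.

```python
import numpy as np

def silverman_normal_reference(x):
    """h = 1.06 * sigma_hat * n^(-1/5), for 1D data."""
    x = np.asarray(x, dtype=float)
    return 1.06 * x.std(ddof=1) * x.size ** (-1 / 5)

def silverman_robust(x):
    """h = 0.9 * min(sigma_hat, IQR / 1.34) * n^(-1/5), less sensitive to outliers."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-1 / 5)

def scott(X):
    """Per-dimension h_j = sigma_hat_j * n^(-1/(d+4)) for data of shape (n, d)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a flat array as n samples in one dimension
    n, d = X.shape
    return X.std(axis=0, ddof=1) * n ** (-1 / (d + 4))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=1000)   # hypothetical 1D sample
h_ref = silverman_normal_reference(x)
h_rob = silverman_robust(x)
h_scott = scott(x)[0]
```

In one dimension the two rules differ only by the constant 1.06, so `h_ref / h_scott` is exactly 1.06. Both assume roughly Gaussian data, so they serve as starting points to be checked by cross-validation or a plug-in estimator.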

Applications

EDA visualization (seaborn kdeplot).
Anomaly detection (low-density = outlier).
Mode finding (mean-shift).
Bayesian non-parametric prior.
Generative sampling (smoothed bootstrap).
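Of the applications above, the smoothed bootstrap is the shortest to sketch. Sampling from a Gaussian KDE is equivalent to picking a data point uniformly at random and adding Gaussian noise with standard deviation h. The function name `smoothed_bootstrap`, the data, and `h=0.3` are assumptions for illustration.

```python
import numpy as np

def smoothed_bootstrap(samples, h, size, rng):
    """Draw from a Gaussian KDE: resample data with replacement, add N(0, h^2) noise."""
    samples = np.asarray(samples, dtype=float)
    centers = rng.choice(samples, size=size, replace=True)
    return centers + rng.normal(0.0, h, size=size)

rng = np.random.default_rng(0)
data = rng.normal(5.0, 1.0, size=400)   # hypothetical observed sample
new_points = smoothed_bootstrap(data, h=0.3, size=10_000, rng=rng)
```

The draws keep the sample mean but have variance inflated by about h², a known side effect of kernel smoothing.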

💻 Patterns

scipy KDE

from scipy.stats import gaussian_kde
import numpy as np

x = np.random.normal(0, 1, 1000)
kde = gaussian_kde(x, bw_method="silverman")

xs = np.linspace(-4, 4, 200)
density = kde(xs)

sklearn KernelDensity

from sklearn.neighbors import KernelDensity
import numpy as np

X = np.random.randn(1000, 2)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(X)
log_dens = kde.score_samples(X)  # log-density at each point

# anomaly: lowest 1% as outliers
threshold = np.quantile(log_dens, 0.01)
outliers = X[log_dens < threshold]

Bandwidth via cross-validation

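import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# X: array of shape (n_samples, n_features), e.g. from the sklearn block above
params = {"bandwidth": np.logspace(-1, 1, 20)}
# KernelDensity.score returns the total log-likelihood, which GridSearchCV maximizes
grid = GridSearchCV(KernelDensity(), params, cv=5)
grid.fit(X)
print(grid.best_params_)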

KDEpy fast FFT-based KDE

from KDEpy import FFTKDE
x_grid, y = FFTKDE(kernel="gaussian", bw="silverman").fit(x).evaluate()
# O(n + m log m) instead of O(n*m)

Adaptive bandwidth

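import numpy as np
from scipy.spatial import cKDTree

def adaptive_kde(x, x_eval, k=10):
    """1D Gaussian KDE whose per-point bandwidth is the k-NN distance."""
    x = np.asarray(x, dtype=float)
    x_eval = np.asarray(x_eval, dtype=float)
    tree = cKDTree(x[:, None])
    # k + 1 neighbours because the nearest neighbour of each point is itself
    dists, _ = tree.query(x[:, None], k=k + 1)
    h_local = np.maximum(dists[:, -1], 1e-12)  # guard against duplicate points
    out = np.zeros_like(x_eval)
    for xi, hi in zip(x, h_local):
        out += np.exp(-0.5 * ((x_eval - xi) / hi) ** 2) / hi
    return out / (len(x) * np.sqrt(2 * np.pi))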

Visualization

import seaborn as sns
sns.kdeplot(data=df, x="feature", hue="class", fill=True, common_norm=False)

Decision criteria

Situation	Method
1D, small n	scipy gaussian_kde
low-D, n > 10⁴	FFTKDE
streaming	online KDE (Heinz 2008)
boundaries	reflection / log-transform
heavy-tail	adaptive bandwidth

Default: Silverman + Gaussian kernel, then validate.
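The "boundaries" row can be illustrated with a minimal reflection sketch for data supported on [0, ∞). Each kernel is mirrored about the boundary, so the mass that would leak below 0 is folded back in. The helper names, the exponential test data, and `h = 0.2` are assumptions for illustration.

```python
import numpy as np

def gaussian_kde_1d(x_eval, samples, h):
    """Plain Gaussian KDE evaluated at x_eval."""
    u = (x_eval[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (samples.size * h * np.sqrt(2 * np.pi))

def reflected_kde(x_eval, samples, h, lower=0.0):
    """KDE on [lower, inf): add the mirror image of every kernel about the boundary."""
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    mirrored = 2.0 * lower - samples
    dens = gaussian_kde_1d(x_eval, samples, h) + gaussian_kde_1d(x_eval, mirrored, h)
    return np.where(x_eval >= lower, dens, 0.0)

def trapezoid_area(y, dx):
    """Trapezoid-rule integral on a uniform grid."""
    return float((y[:-1] + y[1:]).sum() * dx / 2.0)

rng = np.random.default_rng(0)
samples = rng.exponential(1.0, size=2000)   # hypothetical data with support [0, inf)
grid = np.linspace(0.0, 15.0, 3001)
dx = grid[1] - grid[0]
h = 0.2

plain = gaussian_kde_1d(grid, samples, h)
reflected = reflected_kde(grid, samples, h)
mass_plain = trapezoid_area(plain, dx)          # below 1: some mass sits under 0
mass_reflected = trapezoid_area(reflected, dx)  # close to 1 on the true support
```

Reflection forces a zero slope at the boundary, so it suits densities that are flat or decreasing near the edge. A log-transform is the usual alternative when the density drops to zero at the boundary.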

🔗 Graph

🤖 LLM usage

When to use: small to mid-sized n, when the shape of the distribution is unknown. When not to use: very high-D (curse of dimensionality), n < 30.

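❌ Anti-patterns

Blind use of the default bandwidth: Silverman's rule assumes Gaussian data, so it over-smooths bimodal distributions.
Ignoring boundary bias: using a Gaussian kernel when the support is [0, ∞) leaks probability mass across the boundary.
High-D KDE: nearly useless for d > 6; use a vine copula or a normalizing flow instead.
Ignoring sample size: KDE results with n < 50 are mostly noise.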
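The first anti-pattern can be checked numerically. The sketch below builds a hypothetical bimodal sample and compares the 1.06 σ̂ n^(-1/5) bandwidth with one chosen by leave-one-out likelihood cross-validation. The function `loo_log_likelihood`, the mixture parameters, and the candidate grid are all assumptions for illustration.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    n = samples.size
    u = (samples[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(k, 0.0)  # drop each point's own kernel
    loo_density = k.sum(axis=1) / ((n - 1) * h)
    return np.log(loo_density).sum()

rng = np.random.default_rng(0)
# Two well-separated modes: the overall standard deviation is dominated by the gap
samples = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

h_silverman = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)
candidates = np.logspace(-1.3, 0.5, 30)
scores = [loo_log_likelihood(samples, h) for h in candidates]
h_cv = candidates[int(np.argmax(scores))]
```

The rule of thumb sees a standard deviation of about 3 and returns a bandwidth several times larger than the cross-validated one, which smears each mode. This is why the default should be validated on data that is visibly multimodal.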

🧪 Verification / duplicates

Verified (Silverman 1986 textbook, Wand & Jones 1995, Chen 2017 review).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — KDE math, bandwidth selection, scipy/sklearn/KDEpy
