---
id: wiki-2026-0508-outlier-detection-techniques
title: Outlier Detection Techniques
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Outlier Detection, Anomaly Detection, Novelty Detection]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [outlier, anomaly, isolation-forest, lof, autoencoder, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: sklearn/pyod }
---

# Outlier Detection Techniques

## One-liner

A collection of statistical, distance-, density-, and reconstruction-based techniques for detecting observations that deviate from the distribution of normal data.

## Key points

- **Univariate**: IQR, Z-score, modified Z (MAD-based).
- **Distance/Density**: kNN, LOF (Local Outlier Factor).
- **Tree**: Isolation Forest, Extended IF.
- **Boundary**: One-Class SVM, SVDD.
- **Reconstruction**: Autoencoder, VAE error.
- **Time-series**: STL residual, Prophet, transformer-based.
- **Library**: scikit-learn, **PyOD** (40+ algorithms), Alibi Detect.
- Key decisions: the contamination rate (expected fraction of outliers) and univariate vs. multivariate.

## 💻 Patterns

```python
# 1. IQR method (univariate, robust)
import numpy as np
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
mask = (x < lo) | (x > hi)
print("Outliers:", x[mask].tolist())  # [200]
```

```python
# 2. Modified Z-score (MAD-based, robust)
from scipy.stats import median_abs_deviation
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
med = np.median(x)
# Use the raw MAD here: the 0.6745 factor below already rescales it to be
# comparable to a standard deviation, so scale="normal" would apply it twice.
mad = median_abs_deviation(x)
mz = 0.6745 * (x - med) / mad
print("Outliers idx:", np.where(np.abs(mz) > 3.5)[0])  # [7]
```

```python
# 3. Isolation Forest (multivariate, scalable)
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
# 200 Gaussian inliers plus 10 injected uniform points
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))]
clf = IsolationForest(contamination=0.05, random_state=0)
y_pred = clf.fit_predict(X)      # -1 outlier, 1 inlier
scores = clf.score_samples(X)    # lower = more outlying
print("Outliers:", (y_pred == -1).sum())
```

```python
# 4. Local Outlier Factor: compares local densities (reuses X from pattern 3)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)
neg_score = lof.negative_outlier_factor_  # more negative = more outlying
```

```python
# 5. One-Class SVM: boundary-based (reuses X from pattern 3)
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Fit the scaler and the model on the rows assumed to be clean,
# so the injected points do not influence either of them.
scaler = StandardScaler().fit(X[:200])
X_s = scaler.transform(X)
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc.fit(X_s[:200])     # train on assumed-clean data
y = oc.predict(X_s)   # -1 outlier
```

```python
# 6. Autoencoder reconstruction (PyTorch, deep)
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d=20, h=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, h))
        self.dec = nn.Sequential(nn.Linear(h, 16), nn.ReLU(), nn.Linear(16, d))

    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(1000, 20)
model = AE()
# The optimizer must hold the parameters of the model that is being trained.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.MSELoss(reduction="none")
for epoch in range(20):
    recon = model(X)
    err = loss(recon, X).mean()
    opt.zero_grad()
    err.backward()
    opt.step()

# Inference: per-sample reconstruction error
with torch.no_grad():
    err_per_sample = loss(model(X), X).mean(dim=1)
threshold = err_per_sample.quantile(0.95)
outliers = err_per_sample > threshold
```

```python
# 7. PyOD: unified library
# pip install pyod
from pyod.models.ecod import ECOD        # parameter-free, fast
from pyod.models.iforest import IForest  # other detectors share the same fit/labels_ API

m = ECOD(contamination=0.05)
m.fit(X.numpy())              # X is the torch tensor from pattern 6
labels = m.labels_            # 0 normal, 1 outlier
scores = m.decision_scores_   # higher = more outlying
```

```python
# 8. Time-series: Z-score of the STL residual
import statsmodels.api as sm
import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(365) + np.sin(np.linspace(0, 6.28, 365)))
ts.iloc[100] += 8  # inject anomaly
res = sm.tsa.STL(ts, period=30).fit()
resid_z = (res.resid - res.resid.mean()) / res.resid.std()
print("Anomaly idx:", np.where(np.abs(resid_z) > 3)[0])
```

## Decision criteria

| Data | Recommendation |
|---|---|
| 1D, close to normal | Z-score |
| 1D, skewed/heavy-tailed | IQR or modified Z (MAD) |
| Multivariate, needs speed and scalability | Isolation Forest |
| Large differences in cluster density | LOF |
| Small data, boundary-based | One-Class SVM |
| High-dimensional / images | Autoencoder / VAE |
| Time series | STL residual / Prophet / TS-Transformer |
| Quick baseline | PyOD ECOD |

## 🔗 Graph

- Related: `[[Anomaly-Detection]]`, `[[Data-Cleaning]]`, `[[Time-Series]]`

## 🤖 LLM usage

- Log anomalies: embed the log lines, then use IF/LOF to flag points that fall outside the clusters.
- Classify free-text reports and responses with an LLM, and escalate only the categories whose frequency is an outlier.

## ❌ Anti-patterns

- Using the mean/standard-deviation Z-score on heavy-tailed data.
- Accepting the default contamination rate without questioning it.
- Removing outliers unconditionally (they may be a rare-event signal).
- Training OneClassSVM on data that still contains outliers.

## 🧪 Validation

- Evaluate Precision@k and ROC-AUC on synthetic data with known anomalies.
- Sweep the contamination rate and check that the results stay stable.
- Visualization: 2D PCA/UMAP projection colored by outlier label.

## 🕓 Changelog

- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; expanded the PyOD/AE/STL material.
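
The checks listed in the validation section (Precision@k and ROC-AUC on synthetic data with known anomalies, plus a contamination sweep) can be sketched roughly as follows. This is a minimal illustration, not a definitive evaluation harness. The helper `precision_at_k`, the ring-shaped synthetic anomalies, the variable names, and the contamination grid are all assumptions introduced here and are not part of the original note.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score


def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring samples (hypothetical helper)."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))


# Synthetic data with known labels: 200 Gaussian inliers and
# 10 anomalies scattered on a ring of radius 8 to 10 (assumed setup).
rng = np.random.RandomState(0)
X_in = rng.randn(200, 2)
angle = rng.uniform(0, 2 * np.pi, 10)
radius = rng.uniform(8, 10, 10)
X_out = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
X_eval = np.vstack([X_in, X_out])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]

# score_samples is higher for inliers, so negate it to get an anomaly score.
clf = IsolationForest(random_state=0).fit(X_eval)
anomaly_score = -clf.score_samples(X_eval)

auc = roc_auc_score(y_true, anomaly_score)
p_at_10 = precision_at_k(y_true, anomaly_score, k=10)
print(f"ROC-AUC={auc:.3f}  Precision@10={p_at_10:.2f}")

# Contamination sweep: with a fixed random_state the forest and its scores
# are identical across runs; only the decision threshold moves.
flagged = {}
for c in (0.01, 0.05, 0.10):
    pred = IsolationForest(contamination=c, random_state=0).fit_predict(X_eval)
    flagged[c] = int((pred == -1).sum())
    print(f"contamination={c:.2f} -> flagged {flagged[c]}")
```

Because the ranking metrics (ROC-AUC, Precision@k) are computed from the raw scores, they do not depend on the contamination setting. Only the number of flagged points does, which is why the sweep is reported separately.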