---
id: wiki-2026-0508-outlier-detection-techniques
title: Outlier Detection Techniques
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Outlier Detection, Anomaly Detection, Novelty Detection]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [outlier, anomaly, isolation-forest, lof, autoencoder, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: sklearn/pyod }
---

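# Outlier Detection Techniques

## One-liner
A collection of statistical, distance-based, density-based, and reconstruction-based techniques for detecting observations that deviate from the normal distribution of the data.
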
## Key points
- **Univariate**: IQR, Z-score, modified Z (MAD-based).
- **Distance/Density**: kNN, LOF (Local Outlier Factor).
- **Tree**: Isolation Forest, Extended IF.
- **Boundary**: One-Class SVM, SVDD.
- **Reconstruction**: Autoencoder, VAE error.
- **Time-series**: STL residual, Prophet, transformer-based.
- **Library**: scikit-learn, **PyOD** (40+ algorithms), Alibi Detect.
- Key decisions: the contamination rate (expected fraction of outliers) and univariate vs. multivariate.

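kNN appears in the list above as a distance-based detector. A minimal sketch of one common variant, scoring each point by the distance to its k-th nearest neighbor, could look like this. The data, the hand-placed point at `(8, 8)`, and `k = 5` are illustrative assumptions, not values from this note.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
# 200 inliers from a standard normal, plus one hand-placed far point (index 200).
X_knn = np.vstack([rng.randn(200, 2), [[8.0, 8.0]]])

k = 5  # illustrative choice, not a recommendation
# Ask for k + 1 neighbors: when querying the training data itself,
# each point comes back as its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
dist, _ = nn.kneighbors(X_knn)
knn_score = dist[:, -1]  # distance to the k-th true neighbor; larger = more isolated

top = int(np.argmax(knn_score))
print("Most isolated index:", top)
```

Unlike LOF, this score is global: it does not adjust for regions that are naturally sparse, so it tends to suit data with roughly uniform density.
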
## 💻 Patterns

```python
# 1. IQR method (univariate, robust)
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences
mask = (x < lo) | (x > hi)
print("Outliers:", x[mask].tolist())  # [200]
```

```python
# 2. Modified Z-score (MAD-based, robust)
from scipy.stats import median_abs_deviation
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
med = np.median(x)
# Raw MAD (default scale=1.0); the 0.6745 factor below does the normal-consistency scaling.
# Passing scale="normal" as well would apply that correction twice.
mad = median_abs_deviation(x)
mz = 0.6745 * (x - med) / mad
print("Outliers idx:", np.where(np.abs(mz) > 3.5)[0])  # [7]
```

```python
# 3. Isolation Forest (multivariate, scalable)
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
# 200 Gaussian inliers + 10 uniformly scattered points (nominal outliers;
# some may land inside the inlier cloud)
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))]

clf = IsolationForest(contamination=0.05, random_state=0)
y_pred = clf.fit_predict(X)    # -1 outlier, 1 inlier
scores = clf.score_samples(X)  # lower = more anomalous
print("Outliers:", (y_pred == -1).sum())
```

```python
# 4. Local Outlier Factor: compares each point's local density with its neighbors'
# (reuses X from pattern 3)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)               # -1 outlier, 1 inlier
neg_score = lof.negative_outlier_factor_  # more negative = more anomalous
```

```python
# 5. One-Class SVM: boundary-based (reuses X from pattern 3)
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

X_s = StandardScaler().fit_transform(X)
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc.fit(X_s[:200])    # train on the first 200 rows, assumed clean
y = oc.predict(X_s)  # -1 outlier, 1 inlier
```

```python
# 6. Autoencoder reconstruction error (PyTorch, deep)
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d=20, h=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, h))
        self.dec = nn.Sequential(nn.Linear(h, 16), nn.ReLU(), nn.Linear(16, d))

    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(1000, 20)
model = AE()
# The optimizer must track the parameters of the model that is actually trained.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.MSELoss(reduction="none")  # keep element-wise errors for per-sample scoring

for epoch in range(20):
    recon = model(X)
    err = loss(recon, X).mean()
    opt.zero_grad()
    err.backward()
    opt.step()

# Inference: per-sample reconstruction error
with torch.no_grad():
    err_per_sample = loss(model(X), X).mean(dim=1)
    threshold = err_per_sample.quantile(0.95)  # flag the top 5% by error
    outliers = err_per_sample > threshold
```

```python
# 7. PyOD: unified library
# pip install pyod
from pyod.models.ecod import ECOD        # parameter-free, fast
from pyod.models.iforest import IForest  # same fit / labels_ / decision_scores_ interface

m = ECOD(contamination=0.05)
m.fit(X.numpy())             # X is the torch tensor from pattern 6
labels = m.labels_           # 0 normal, 1 outlier
scores = m.decision_scores_  # higher = more anomalous
```

```python
# 8. Time series: Z-score of the STL residual
from statsmodels.tsa.seasonal import STL
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(365) + np.sin(np.linspace(0, 2 * np.pi, 365)))
ts.iloc[100] += 8  # inject anomaly
res = STL(ts, period=30).fit()  # period is required because ts has no date index
resid_z = (res.resid - res.resid.mean()) / res.resid.std()
print("Anomaly idx:", np.where(np.abs(resid_z) > 3)[0])
```

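A rolling Z-score is a lighter alternative to a full decomposition: each point is compared with the mean and standard deviation of a trailing window. The sketch below assumes a 30-step window and a threshold of 3, both illustrative, and reuses the same kind of synthetic series with a spike injected at index 100.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts_roll = pd.Series(rng.standard_normal(365) + np.sin(np.linspace(0, 2 * np.pi, 365)))
ts_roll.iloc[100] += 8  # injected spike

window = 30  # hypothetical window length; tune to the series' time scale
# shift(1) keeps the current point out of its own baseline,
# so a spike cannot inflate the statistics it is judged against.
base = ts_roll.shift(1).rolling(window)
roll_z = (ts_roll - base.mean()) / base.std()

# The first `window` values are NaN and are never flagged.
flagged = np.where(roll_z.abs() > 3)[0]
print("Rolling-Z anomaly idx:", flagged)
```

The injected spike should be among the flagged indices, possibly alongside a few chance exceedances. Points shortly after a spike are judged against an inflated window, so consecutive anomalies can be missed.
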
## Decision criteria

| Data | Recommendation |
|---|---|
| 1D, close to normal | Z-score |
| 1D, skewed/heavy-tailed | IQR or modified Z (MAD) |
| Multivariate, needs speed and scale | Isolation Forest |
| Large density differences between clusters | LOF |
| Small dataset, boundary-based | One-Class SVM |
| High-dimensional / images | Autoencoder / VAE |
| Time series | STL residual / Prophet / TS-Transformer |
| Quick baseline | PyOD ECOD |

## 🔗 Graph
- Related: `[[Anomaly-Detection]]`, `[[Data-Cleaning]]`, `[[Time-Series]]`

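## 🤖 LLM usage
- Log anomalies: embed the log lines, then use IF/LOF to flag points that fall outside the clusters.
- Classify free-text report responses with an LLM and escalate only the categories whose frequency is an outlier.
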
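The log-anomaly idea above can be sketched end to end without a model. In the sketch below, TF-IDF vectors stand in for LLM embeddings (an assumption made so the code runs offline), and the log lines are hypothetical. With real embeddings, only the `vectors` line would change.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical log lines: twelve routine requests and one unrelated failure.
logs = [f"INFO request served path=/api/items status=200 ms={m}" for m in range(11, 23)]
logs.append("ERROR segfault in worker thread core dumped")

# Stand-in for an embedding model: one TF-IDF vector per log line.
vectors = TfidfVectorizer().fit_transform(logs).toarray()

forest = IsolationForest(n_estimators=200, random_state=0).fit(vectors)
scores = forest.score_samples(vectors)  # lower = more anomalous

suspect = int(np.argmin(scores))
print("Most anomalous line:", logs[suspect])
```

Isolation Forest is used here rather than LOF because near-duplicate log lines produce zero distances, which can make local density estimates degenerate.
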
## ❌ Anti-patterns
- Using the mean/standard-deviation Z-score on heavy-tailed data.
- Accepting the default contamination rate without questioning it.
- Removing outliers unconditionally (they may be rare-event signals).
- Training OneClassSVM on data that still contains outliers.

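The first item above can be shown on the ten-value sample used in the patterns. The extreme value inflates the mean and standard deviation it is measured against, so the classic Z-score misses it. With the population standard deviation, |z| cannot exceed sqrt(n - 1), which is exactly 3 for n = 10, so a threshold of 3 can never fire on a sample this small. The thresholds 3 and 3.5 are the conventional ones used elsewhere in this note.

```python
import numpy as np
from scipy.stats import median_abs_deviation

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12], dtype=float)

# Classic Z-score: the outlier itself inflates the mean and the standard deviation.
z = (x - x.mean()) / x.std()  # np.std defaults to the population form (ddof=0)
z_flags = np.abs(z) > 3

# Modified Z-score: median and MAD barely move when one value is extreme.
med = np.median(x)
mad = median_abs_deviation(x)  # raw MAD (scale=1.0)
mz = 0.6745 * (x - med) / mad
mz_flags = np.abs(mz) > 3.5

print("Z-score flags 200:   ", bool(z_flags[7]))   # False
print("Modified Z flags 200:", bool(mz_flags[7]))  # True
```
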
## 🧪 Validation
- Evaluate Precision@k and ROC-AUC on synthetic data with known anomalies.
- Sweep the contamination value and check that the results stay stable.
- Visualize: 2D PCA/UMAP projection colored by outlier label.

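A minimal sketch of the first two checks, assuming Isolation Forest as the detector under test and a synthetic set where anomalies sit on a ring far from a Gaussian core. The ring radius, sample sizes, and swept contamination values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n_in, n_out = 200, 10

# Synthetic ground truth: Gaussian inliers plus points on a distant ring.
inliers = rng.randn(n_in, 2)
angle = rng.uniform(0, 2 * np.pi, n_out)
radius = rng.uniform(6, 8, n_out)
ring = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
X_val = np.r_[inliers, ring]
y_true = np.r_[np.zeros(n_in), np.ones(n_out)]  # 1 = known anomaly

clf = IsolationForest(contamination=0.05, random_state=0).fit(X_val)
anomaly_score = -clf.score_samples(X_val)  # flip sign so higher = more anomalous

auc = roc_auc_score(y_true, anomaly_score)
top_k = np.argsort(anomaly_score)[::-1][:n_out]  # k = number of known anomalies
precision_at_k = y_true[top_k].mean()
print(f"ROC-AUC={auc:.3f}  Precision@{n_out}={precision_at_k:.2f}")

# Stability sweep: how many points get flagged as contamination changes.
flag_counts = {}
for c in (0.01, 0.05, 0.10):
    pred = IsolationForest(contamination=c, random_state=0).fit_predict(X_val)
    flag_counts[c] = int((pred == -1).sum())
print(flag_counts)
```

ROC-AUC and Precision@k depend only on the ranking of scores, so they are unaffected by `contamination`. The sweep shows what does change: the cut-off, and with it the number of flagged points. This synthetic case is deliberately easy, so real data will usually score lower.
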
## 🕓 Changelog
- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; expanded PyOD/AE/STL coverage.