---
id: wiki-2026-0508-outlier-detection-techniques
title: Outlier Detection Techniques
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Outlier Detection, Anomaly Detection, Novelty Detection]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [outlier, anomaly, isolation-forest, lof, autoencoder, statistics]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: sklearn/pyod }
---

# Outlier Detection Techniques

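## In one line
A collection of statistical, distance-, density-, and reconstruction-based techniques for detecting observations that deviate from the distribution of normal data.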

## Key points
- **Univariate**: IQR, Z-score, modified Z (MAD-based).
- **Distance/Density**: kNN, LOF (Local Outlier Factor).
- **Tree**: Isolation Forest, Extended IF.
- **Boundary**: One-Class SVM, SVDD.
- **Reconstruction**: Autoencoder, VAE error.
- **Time-series**: STL residual, Prophet, transformer-based.
- **Library**: scikit-learn, **PyOD** (40+ algorithms), Alibi Detect.
- Key decisions: contamination rate (the expected fraction of outliers), univariate vs. multivariate.

## 💻 Patterns

```python
# 1. IQR method (univariate, robust)
import numpy as np
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (x < lo) | (x > hi)
print("Outliers:", x[mask].tolist())  # [200]
```

```python
# 2. Modified Z-score (MAD-based, robust)
from scipy.stats import median_abs_deviation
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
med = np.median(x)
# Use the raw MAD (default scale=1.0): the 0.6745 factor below already applies
# the normal-consistency scaling, so scale="normal" would apply it twice.
mad = median_abs_deviation(x)
mz = 0.6745 * (x - med) / mad
print("Outliers idx:", np.where(np.abs(mz) > 3.5)[0])  # [7]
```

```python
# 3. Isolation Forest (multivariate, scalable)
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))]  # 10 outliers

clf = IsolationForest(contamination=0.05, random_state=0)
y_pred = clf.fit_predict(X)  # -1 outlier, 1 inlier
scores = clf.score_samples(X)  # lower = more anomalous
print("Outliers:", (y_pred == -1).sum())
```

```python
# 4. Local Outlier Factor: compares each point's local density with its neighbors'
# Reuses X from pattern 3.
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)  # -1 outlier, 1 inlier
neg_score = lof.negative_outlier_factor_  # more negative = more outlying
```
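
The Distance/Density group above also lists plain kNN, which has no pattern of its own. A minimal sketch, assuming the distance to the k-th nearest neighbor is used as the outlier score (one common variant; the data, `k = 5`, and the three injected points are illustrative choices, not values from this note):

```python
# kNN distance score: points far from their k-th neighbor are outlier candidates
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_inl = rng.randn(200, 2)                                   # inlier cloud
X_out = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])   # injected far-away points
X_knn = np.r_[X_inl, X_out]

k = 5
# n_neighbors=k+1 because each training point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
dist, _ = nn.kneighbors(X_knn)
knn_score = dist[:, -1]              # distance to the k-th neighbor; higher = more outlying

top3 = np.argsort(knn_score)[-3:]    # indices of the three highest scores
print("Top-3 idx:", sorted(top3.tolist()))
```

Unlike LOF, this score is global: it does not adapt to regions of different density, so it tends to work best when inliers form one cloud of roughly uniform density.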

```python
# 5. One-Class SVM (boundary-based); reuses X from pattern 3
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the assumed-clean rows only, so outliers do not skew the scaling
scaler = StandardScaler().fit(X[:200])
X_s = scaler.transform(X)
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc.fit(X_s[:200])  # train on assumed-clean
y = oc.predict(X_s)  # -1 outlier
```

```python
# 6. Autoencoder reconstruction (PyTorch, deep)
import torch, torch.nn as nn

class AE(nn.Module):
    def __init__(self, d=20, h=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, h))
        self.dec = nn.Sequential(nn.Linear(h, 16), nn.ReLU(), nn.Linear(16, d))
    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(1000, 20)
model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer must hold this model's parameters
loss = nn.MSELoss(reduction="none")

for epoch in range(20):
    recon = model(X)
    err = loss(recon, X).mean()
    opt.zero_grad(); err.backward(); opt.step()

# Inference: per-sample reconstruction error
with torch.no_grad():
    err_per_sample = loss(model(X), X).mean(dim=1)
threshold = err_per_sample.quantile(0.95)
outliers = (err_per_sample > threshold)
```

```python
# 7. PyOD: unified library
# pip install pyod
from pyod.models.ecod import ECOD       # parameter-free, fast
from pyod.models.iforest import IForest  # same fit / labels_ interface; swap in as needed

m = ECOD(contamination=0.05)
m.fit(X.numpy())             # X is the torch tensor from pattern 6
labels = m.labels_           # 0 normal, 1 outlier
scores = m.decision_scores_  # higher = more anomalous
```

```python
# 8. Time series: Z-score on the STL residual
import statsmodels.api as sm
import pandas as pd, numpy as np

rng = np.random.default_rng(0)
t = np.arange(365)
ts = pd.Series(rng.standard_normal(365) + np.sin(2 * np.pi * t / 30))  # seasonality matches period=30
ts.iloc[100] += 8  # inject anomaly
res = sm.tsa.STL(ts, period=30, robust=True).fit()  # robust fit keeps the spike in the residual
resid_z = (res.resid - res.resid.mean()) / res.resid.std()
print("Anomaly idx:", np.where(np.abs(resid_z) > 3)[0])
```

## Decision criteria

| Data | Recommendation |
|---|---|
| 1D, close to normal | Z-score |
| 1D, skewed/heavy-tail | IQR or modified Z (MAD) |
| Multivariate, needs speed and scale | Isolation Forest |
| Clusters with very different densities | LOF |
| Small data, boundary | One-Class SVM |
| High-dimensional/images | Autoencoder / VAE |
| Time series | STL residual / Prophet / TS-Transformer |
| Quick baseline | PyOD ECOD |

## 🔗 Graph
- Related: `[[Anomaly-Detection]]`, `[[Data-Cleaning]]`, `[[Time-Series]]`

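## 🤖 LLM usage
- Log anomalies: embed the log lines, then use IF/LOF to flag points that fall outside the clusters.
- Classify free-text report responses with an LLM and escalate only the frequency outliers.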
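
The log-anomaly idea can be sketched roughly as follows. This is only an illustration under loud assumptions: the log lines are made up, and TF-IDF stands in for a real embedding model, which this note does not name. Any function that maps text to fixed-length vectors could be substituted before the Isolation Forest step.

```python
# Log anomaly sketch: vectorize log lines, then score them with Isolation Forest
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical logs: 30 routine requests plus one unusual message at the end
paths = ["items", "users", "orders"]
logs = [f"INFO request served path=/api/{paths[i % 3]} status=200" for i in range(30)]
logs.append("ERROR kernel panic out of memory")

# Assumption: TF-IDF vectors are a stand-in for embeddings from an LLM or encoder
emb = TfidfVectorizer().fit_transform(logs).toarray()

clf = IsolationForest(random_state=0).fit(emb)
scores = clf.score_samples(emb)          # lower = more anomalous
most_anomalous = int(np.argmin(scores))  # index of the most isolated line
print(logs[most_anomalous])
```

Routine lines here are near-duplicates, which is why Isolation Forest is used rather than LOF in this sketch: many identical points can distort LOF's local density estimates.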

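## ❌ Anti-patterns
- Using mean/standard-deviation Z-scores on heavy-tailed data.
- Using the default contamination rate uncritically.
- Removing outliers unconditionally (they may be rare-event signals).
- Training OneClassSVM while outliers are still mixed into the training data.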
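
The first anti-pattern shows up even on the toy series from pattern 1. A small sketch, using the conventional cutoffs (3 for the Z-score, 3.5 for the modified Z-score): the extreme value inflates the mean and standard deviation so much that it masks itself.

```python
# Masking: a single extreme value hides from the mean/std Z-score
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12], dtype=float)

# Classic Z-score: mean and std are both dragged toward the outlier
z = (x - x.mean()) / x.std()
print("max |z|:", round(float(np.abs(z).max()), 3))          # ~2.999, below the 3.0 cutoff
print("Z-score flags:", np.where(np.abs(z) > 3)[0])          # nothing flagged

# Modified Z-score: median and MAD barely move
med = np.median(x)
mad = np.median(np.abs(x - med))
mz = 0.6745 * (x - med) / mad
print("Modified-Z flags:", np.where(np.abs(mz) > 3.5)[0])    # [7]
```

With the population standard deviation, no point in a sample of size n can have |z| above sqrt(n - 1), so with 10 observations a cutoff of 3 can never fire, however extreme the value.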

## 🧪 Validation
- Evaluate Precision@k and ROC-AUC on synthetic data with known anomalies.
- Sweep the contamination value and check that the results stay stable.
- Visualization: 2D PCA/UMAP projection colored by outlier label.
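
The first two checks might look like the sketch below. The synthetic data (a Gaussian cloud with anomalies placed on a surrounding ring), the choice of Isolation Forest, and the swept contamination values are all illustrative assumptions rather than a prescribed protocol.

```python
# Validation sketch: ROC-AUC, Precision@k, and a contamination sweep on synthetic data
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
X_in = rng.randn(500, 2)                                   # inliers
theta = rng.uniform(0, 2 * np.pi, 25)
radius = rng.uniform(7, 10, 25)
X_an = np.c_[radius * np.cos(theta), radius * np.sin(theta)]  # known anomalies on a ring
X_eval = np.r_[X_in, X_an]
y_true = np.r_[np.zeros(500), np.ones(25)]                 # 1 = anomaly

clf = IsolationForest(random_state=0).fit(X_eval)
anomaly_score = -clf.score_samples(X_eval)                 # flip sign: higher = more anomalous

auc = roc_auc_score(y_true, anomaly_score)
k = 25
top_k = np.argsort(anomaly_score)[-k:]                     # k highest-scoring points
precision_at_k = float(y_true[top_k].mean())
print(f"ROC-AUC={auc:.3f}  Precision@{k}={precision_at_k:.2f}")

# Contamination sweep: how many points get flagged at each setting?
flagged = {}
for c in (0.01, 0.05, 0.10):
    pred = IsolationForest(contamination=c, random_state=0).fit_predict(X_eval)
    flagged[c] = int((pred == -1).sum())
print(flagged)
```

ROC-AUC and Precision@k depend only on the score ranking, so they are unaffected by `contamination`; the sweep only shows how the label threshold moves.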

## 🕓 Changelog
- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; expanded PyOD/AE/STL.