| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-outlier-detection-techniques | Outlier Detection Techniques | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
Outlier Detection Techniques
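In one line
A collection of statistical, distance-based, density-based, and reconstruction-based techniques for detecting observations that deviate from the distribution of normal data.
Key points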
- Univariate: IQR, Z-score, modified Z (MAD-based).
- Distance/Density: kNN, LOF (Local Outlier Factor).
- Tree: Isolation Forest, Extended IF.
- Boundary: One-Class SVM, SVDD.
- Reconstruction: Autoencoder, VAE error.
- Time-series: STL residual, Prophet, transformer-based.
- Library: scikit-learn, PyOD (40+ algorithms), Alibi Detect.
- Key decisions: contamination rate (the expected fraction of outliers) and univariate vs. multivariate.
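The contamination rate effectively sets a cut-off on the anomaly scores: a detector flags roughly the top `contamination` fraction of points. The following is a minimal, dependency-free sketch of that idea, assuming higher score means more anomalous. The helper `threshold_from_contamination` and the sample scores are hypothetical and not part of any library.

```python
def threshold_from_contamination(scores, contamination):
    """Return the score cut-off that flags roughly the top `contamination` fraction.

    Hypothetical helper for illustration; assumes higher score = more anomalous.
    """
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    n_flag = max(1, round(contamination * len(scores)))
    ranked = sorted(scores, reverse=True)
    return ranked[n_flag - 1]  # the n_flag-th largest score

# Made-up scores: two clear anomalies (0.9, 0.95) and eight ordinary points.
scores = [0.1, 0.2, 0.15, 0.9, 0.12, 0.18, 0.11, 0.95, 0.14, 0.13]
for c in (0.1, 0.2, 0.3):
    cut = threshold_from_contamination(scores, c)
    flagged = sorted(s for s in scores if s >= cut)
    print(f"contamination={c}: cut-off={cut}, flagged={flagged}")
```

With `contamination=0.3` the point scoring 0.2 is flagged even though it sits next to the ordinary points: the rate forces a count, not a separation, so it is worth setting from domain knowledge.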
💻 Patterns
# 1. IQR method (univariate, robust)
import numpy as np
import pandas as pd
x = pd.Series([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (x < lo) | (x > hi)
print("Outliers:", x[mask].tolist()) # [200]
# 2. Modified Z-score (MAD-based, robust)
from scipy.stats import median_abs_deviation
import numpy as np
x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
med = np.median(x)
mad = median_abs_deviation(x)  # raw MAD; scale="normal" would apply the 0.6745 factor a second time
mz = 0.6745 * (x - med) / mad
print("Outliers idx:", np.where(np.abs(mz) > 3.5)[0])
# 3. Isolation Forest (multivariate, scalable)
from sklearn.ensemble import IsolationForest
import numpy as np
rng = np.random.RandomState(0)
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))] # 10 outliers
clf = IsolationForest(contamination=0.05, random_state=0)
y_pred = clf.fit_predict(X) # -1 outlier, 1 inlier
scores = clf.score_samples(X) # lower = more anomalous
print("Outliers:", (y_pred == -1).sum())
# 4. Local Outlier Factor (compares local densities)
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)  # -1 outlier, 1 inlier
neg_score = lof.negative_outlier_factor_ # more negative = more anomalous
# 5. One-Class SVM (boundary-based)
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X[:200])  # fit the scaling on the assumed-clean rows only
X_s = scaler.transform(X)
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc.fit(X_s[:200]) # train on the assumed-clean rows (first 200 of X from pattern 3)
y = oc.predict(X_s) # -1 outlier, 1 inlier
# 6. Autoencoder reconstruction (PyTorch, deep)
import torch, torch.nn as nn
class AE(nn.Module):
    def __init__(self, d=20, h=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, h))
        self.dec = nn.Sequential(nn.Linear(h, 16), nn.ReLU(), nn.Linear(16, d))
    def forward(self, x):
        return self.dec(self.enc(x))
X = torch.randn(1000, 20)
model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # the optimizer must wrap the trained model's parameters
loss = nn.MSELoss(reduction="none")
for epoch in range(20):
    recon = model(X)
    err = loss(recon, X).mean()
    opt.zero_grad(); err.backward(); opt.step()
# Inference: per-sample reconstruction error
with torch.no_grad():
    err_per_sample = loss(model(X), X).mean(dim=1)
threshold = err_per_sample.quantile(0.95)
outliers = (err_per_sample > threshold)
# 7. PyOD (unified library)
# pip install pyod
from pyod.models.ecod import ECOD # parameter-free, fast
from pyod.models.iforest import IForest
m = ECOD(contamination=0.05)
m.fit(X.numpy())
labels = m.labels_ # 0 normal, 1 outlier
scores = m.decision_scores_
# 8. Time-series: rolling Z + STL
import statsmodels.api as sm
import pandas as pd, numpy as np
ts = pd.Series(np.random.randn(365) + np.sin(np.linspace(0, 6.28, 365)))
ts.iloc[100] += 8 # inject anomaly
res = sm.tsa.STL(ts, period=30).fit()
resid_z = (res.resid - res.resid.mean()) / res.resid.std()
print("Anomaly idx:", np.where(np.abs(resid_z) > 3)[0])
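For streaming or non-seasonal series, a rolling Z-score is a lighter alternative to STL decomposition. The sketch below uses only the standard library and compares each point with the mean and standard deviation of the trailing window. The function name and the `window` and `threshold` defaults are illustrative assumptions, not taken from any library.

```python
from collections import deque
from statistics import mean, stdev

def rolling_z_anomalies(values, window=20, threshold=3.0):
    """Flag indices whose value deviates from the trailing window by more than `threshold` sigmas."""
    history = deque(maxlen=window)
    flagged = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                flagged.append(i)
                continue  # keep the anomaly out of the reference window
        history.append(v)
    return flagged

# Deterministic toy series: a small repeating pattern with one injected spike.
series = [10 + (i % 5) * 0.5 for i in range(60)]
series[40] += 8
print(rolling_z_anomalies(series))  # [40]
```

Skipping flagged points when updating the window stops a single spike from inflating the next standard deviation. The trade-off is that a genuine level shift keeps being flagged until the window is reset.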
Decision criteria
| Data | Recommendation |
|---|---|
| 1D, approximately normal | Z-score |
| 1D, skewed/heavy-tailed | IQR or modified Z (MAD) |
| Multivariate, needs speed and scalability | Isolation Forest |
| Clusters with very different densities | LOF |
| Small dataset, boundary-based | One-Class SVM |
| High-dimensional / images | Autoencoder / VAE |
| Time series | STL residual / Prophet / TS-Transformer |
| Quick baseline | PyOD ECOD |
🔗 Graph
- Related: [[Anomaly-Detection]], [[Data-Cleaning]], [[Time-Series]]
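🤖 LLM usage
- Log anomalies: embed the log lines, then use IF/LOF to find the points that fall outside the clusters.
- Classify free-text report responses with an LLM and escalate only the frequency outliers.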
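One possible reading of the second bullet is sketched here: count today's reports per category and escalate only the categories whose count is unusually high against their own history. The category names, the baseline counts, and the hard-coded `todays_labels` (standing in for the LLM classifier's output) are all hypothetical. The 3.5 cut-off mirrors the modified Z-score pattern.

```python
from collections import Counter
from statistics import median

def modified_z(value, history):
    """Modified Z-score (MAD-based) of `value` against `history`."""
    med = median(history)
    mad = median(abs(h - med) for h in history)
    if mad == 0:  # degenerate history: any deviation counts as extreme
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad

# Hypothetical daily counts per category from previous days.
baseline = {
    "login_issue": [4, 5, 3, 6, 4, 5, 4],
    "billing": [2, 3, 1, 4, 2, 3, 2],
    "crash": [1, 0, 2, 3, 1, 2, 0],
}
# Labels for today's reports; in practice these would come from the LLM classifier.
todays_labels = ["login_issue"] * 5 + ["billing"] * 2 + ["crash"] * 9

today = Counter(todays_labels)
escalate = [
    category for category, history in baseline.items()
    if modified_z(today.get(category, 0), history) > 3.5  # one-sided: spikes only
]
print(escalate)  # ['crash']
```

The test is one-sided on purpose, so a quiet day does not trigger an escalation. A two-sided check would use `abs(...)` instead.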
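❌ Anti-patterns
- Using a mean/standard-deviation Z-score on heavy-tailed data.
- Accepting the default contamination rate uncritically.
- Removing outliers unconditionally (they may be rare-event signals).
- Training OneClassSVM on data that still contains outliers.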
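The first anti-pattern can be seen on the toy series from the univariate patterns: the extreme value inflates the mean and standard deviation enough to hide itself. This is a minimal standard-library sketch. The thresholds 3 and 3.5 are the conventional ones used elsewhere in this note.

```python
from statistics import mean, median, stdev

x = [10, 12, 11, 13, 12, 11, 14, 200, 9, 12]

# Classical Z-score: the outlier itself inflates the mean and the standard deviation.
mu, sigma = mean(x), stdev(x)
z = [(v - mu) / sigma for v in x]

# Modified Z-score: median and MAD are barely affected by a single extreme value.
med = median(x)
mad = median(abs(v - med) for v in x)
mz = [0.6745 * (v - med) / mad for v in x]

flagged_z = [v for v, s in zip(x, z) if abs(s) > 3]
flagged_mz = [v for v, s in zip(x, mz) if abs(s) > 3.5]

print("classical z of 200:", round(z[7], 2))
print("modified z of 200: ", round(mz[7], 2))
print("flagged by |z| > 3:   ", flagged_z)   # []
print("flagged by |mz| > 3.5:", flagged_mz)  # [200]
```

With a sample standard deviation, no Z-score in a sample of size n can exceed (n - 1) / sqrt(n), which is about 2.85 for n = 10. A threshold of 3 therefore never fires on a sample this small.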
🧪 Validation
- Evaluate Precision@k and ROC-AUC on synthetic data with known anomalies.
- Sweep the contamination rate and check how stable the results are.
- Visualize: a 2D PCA/UMAP projection colored by outlier label.
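The first two checks can be sketched without any dependencies. The helpers `precision_at_k` and `roc_auc` below are illustrative re-implementations, not library functions, and the synthetic data and MAD-based score are assumptions chosen for brevity. In practice `sklearn.metrics.roc_auc_score` would normally replace the hand-written AUC.

```python
import random
from statistics import median

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(y_true[i] for i in top) / k

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random anomaly outscores a random inlier (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic data: 200 standard-normal inliers plus 10 known anomalies far from zero.
rng = random.Random(0)
inliers = [rng.gauss(0, 1) for _ in range(200)]
anomalies = [rng.choice([-1, 1]) * rng.uniform(4, 8) for _ in range(10)]
x = inliers + anomalies
y_true = [0] * 200 + [1] * 10

# Robust score: distance from the median in MAD units (higher = more anomalous).
med = median(x)
mad = median(abs(v - med) for v in x)
scores = [abs(v - med) / mad for v in x]

print("Precision@10:", precision_at_k(y_true, scores, 10))
print("ROC-AUC:", round(roc_auc(y_true, scores), 3))

# Stability sweep: how precision changes with the assumed contamination rate.
for c in (0.02, 0.05, 0.10):
    k = round(c * len(x))
    print(f"contamination={c}: flagged={k}, precision={precision_at_k(y_true, scores, k):.2f}")
```

At `contamination=0.10` the sweep flags 21 points while only 10 anomalies exist, so precision cannot exceed 10/21 (about 0.48). A drop like this is the signal that the assumed rate is too high.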
🕓 Changelog
- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: 8 patterns; expanded PyOD/AE/STL.