---
id: wiki-2026-0508-outlier-detection-techniques
title: Outlier Detection Techniques
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases:
  - Outlier Detection
  - Anomaly Detection
  - Novelty Detection
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags:
  - outlier
  - anomaly
  - isolation-forest
  - lof
  - autoencoder
  - statistics
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: sklearn/pyod
---

Outlier Detection Techniques

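One-line summary

A collection of statistical, distance-based, density-based, and reconstruction-based techniques for detecting observations that deviate from the data's normal (expected) distribution.

Key points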

  • Univariate: IQR, Z-score, modified Z (MAD-based).
  • Distance/Density: kNN, LOF (Local Outlier Factor).
  • Tree: Isolation Forest, Extended IF.
  • Boundary: One-Class SVM, SVDD.
  • Reconstruction: Autoencoder, VAE error.
  • Time-series: STL residual, Prophet, transformer-based.
  • Library: scikit-learn, PyOD (40+ algorithms), Alibi Detect.
  • Key decisions: contamination rate (expected fraction of outliers), univariate vs. multivariate.
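
The list above names kNN among the distance-based detectors, but the patterns below do not cover it. A minimal sketch, assuming the common "distance to the k-th nearest neighbor" score and a synthetic dataset with one injected far-away point (the data, `k = 5`, and the variable names are illustrative choices, not from the original note):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
# 200 standard-normal points plus one injected point far from the cloud (index 200)
X_knn = np.r_[rng.randn(200, 2), [[8.0, 8.0]]]

k = 5
# Ask for k + 1 neighbors: when querying the training data itself,
# each point comes back as its own nearest neighbor at distance 0.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
dist, _ = nn.kneighbors(X_knn)

knn_score = dist[:, -1]  # distance to the k-th true neighbor; larger = more isolated
top = int(np.argmax(knn_score))
print("Most isolated index:", top)
```

The score is unbounded and scale-dependent, so features usually need standardizing first, and the cut-off is typically taken as a quantile of `knn_score` rather than a fixed value.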

💻 Patterns

# 1. IQR method (univariate, robust)
import numpy as np
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (x < lo) | (x > hi)
print("Outliers:", x[mask].tolist())  # [200]
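
# 2. Modified Z-score (MAD-based, robust)
from scipy.stats import median_abs_deviation
import numpy as np

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12])
med = np.median(x)
# Use the raw MAD (default scale=1.0). scale="normal" already divides by 0.6745,
# so combining it with the 0.6745 factor below would apply the correction twice.
mad = median_abs_deviation(x)
mz = 0.6745 * (x - med) / mad
print("Outliers idx:", np.where(np.abs(mz) > 3.5)[0])  # [7]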

# 3. Isolation Forest (multivariate, scalable)
from sklearn.ensemble import IsolationForest
import numpy as np

rng = np.random.RandomState(0)
# 200 Gaussian inliers + 10 uniformly scattered points (some may land inside the inlier cloud)
X = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, (10, 2))]

clf = IsolationForest(contamination=0.05, random_state=0)
y_pred = clf.fit_predict(X)  # -1 = outlier, 1 = inlier
scores = clf.score_samples(X)  # lower = more anomalous
print("Outliers:", (y_pred == -1).sum())

# 4. Local Outlier Factor: compares each point's local density with its neighbors'
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
neg_score = lof.negative_outlier_factor_  # more negative = more anomalous

# 5. One-Class SVM (boundary-based)
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the assumed-clean rows only, so outliers do not skew the scaling
scaler = StandardScaler().fit(X[:200])
X_s = scaler.transform(X)
oc = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
oc.fit(X_s[:200])  # train on assumed-clean data
y = oc.predict(X_s)  # -1 = outlier, 1 = inlier

# 6. Autoencoder reconstruction error (PyTorch, deep)
import torch, torch.nn as nn

class AE(nn.Module):
    def __init__(self, d=20, h=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, h))
        self.dec = nn.Sequential(nn.Linear(h, 16), nn.ReLU(), nn.Linear(16, d))
    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(1000, 20)
model = AE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # must wrap this model's parameters
loss = nn.MSELoss(reduction="none")  # keep element-wise errors for per-sample scoring

for epoch in range(20):
    recon = model(X)
    err = loss(recon, X).mean()
    opt.zero_grad(); err.backward(); opt.step()

# Inference: per-sample reconstruction error
with torch.no_grad():
    err_per_sample = loss(model(X), X).mean(dim=1)
threshold = err_per_sample.quantile(0.95)  # flag the top 5% of errors
outliers = (err_per_sample > threshold)
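
# 7. PyOD: unified library
# pip install pyod
from pyod.models.ecod import ECOD       # parameter-free, fast
from pyod.models.iforest import IForest  # alternative detector with the same fit/labels_ API

m = ECOD(contamination=0.05)
m.fit(X.numpy())           # X is the torch tensor from pattern 6
labels = m.labels_         # 0 = normal, 1 = outlier
scores = m.decision_scores_  # higher = more anomalous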

# 8. Time series: Z-score on the STL residual
import statsmodels.api as sm
import pandas as pd, numpy as np

ts = pd.Series(np.random.randn(365) + np.sin(np.linspace(0, 2 * np.pi, 365)))
ts.iloc[100] += 8  # inject anomaly
res = sm.tsa.STL(ts, period=30).fit()  # split into trend, seasonal, and residual
resid_z = (res.resid - res.resid.mean()) / res.resid.std()
print("Anomaly idx:", np.where(np.abs(resid_z) > 3)[0])
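
A rolling Z-score is a lighter alternative to the STL residual for time series. The sketch below assumes a 30-point trailing window and a threshold of 4; both are illustrative values, not taken from the original note. The baseline is shifted by one step so the current point does not contribute to its own mean and standard deviation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ts_roll = pd.Series(rng.normal(size=365) + np.sin(np.linspace(0, 2 * np.pi, 365)))
ts_roll.iloc[100] += 8  # inject anomaly

window = 30  # assumed window length
baseline = ts_roll.shift(1).rolling(window)  # trailing window that excludes the current point
roll_z = (ts_roll - baseline.mean()) / baseline.std()

# The first `window` values are NaN and compare as False
roll_idx = np.where((roll_z.abs() > 4).to_numpy())[0]
print("Rolling-Z anomaly idx:", roll_idx)
```

Unlike STL, this needs no seasonal period, but a window much shorter than the seasonal cycle will treat slow seasonal swings as part of the baseline.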

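Decision criteria

| Data | Recommendation |
|---|---|
| 1D, close to normal | Z-score |
| 1D, skewed/heavy-tailed | IQR or modified Z (MAD) |
| Multivariate, needs speed and scale | Isolation Forest |
| Large differences in cluster density | LOF |
| Small data, boundary-based | One-Class SVM |
| High-dimensional / images | Autoencoder / VAE |
| Time series | STL residual / Prophet / TS-Transformer |
| Quick baseline | PyOD ECOD |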
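
The first row of the table has no matching pattern above. A minimal sketch of the plain Z-score, assuming roughly Gaussian 1D data; the synthetic sample, the injected value, and the |z| > 3 cut-off are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x_norm = rng.normal(loc=50.0, scale=5.0, size=1000)
x_norm[10] = 95.0  # injected value, roughly 9 standard deviations from the mean

z = (x_norm - x_norm.mean()) / x_norm.std(ddof=1)
z_idx = np.where(np.abs(z) > 3)[0]
print("Z-score outlier idx:", z_idx)
```

On genuinely normal data, about 0.3% of points exceed |z| > 3 by chance, so a few inliers can be flagged alongside the injected value.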

🔗 Graph

  • Related: [[Anomaly-Detection]], [[Data-Cleaning]], [[Time-Series]]

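🤖 LLM applications

  • Log anomalies: embed the log lines, then use IF/LOF to flag points that fall outside the clusters.
  • Classify free-text report responses with an LLM and escalate only the categories whose frequency is an outlier.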
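
The embed-then-detect idea for logs can be sketched as follows. The note does not name an embedding model, so the vectors here are random stand-ins for real embeddings: two tight clusters play the role of routine log templates and three scattered vectors play the role of unusual lines. The dimensions, cluster sizes, and names are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
dim = 32  # hypothetical embedding size

# Stand-in "embeddings": two clusters of routine lines, then 3 unusual lines (indices 300-302)
centers = rng.randn(2, dim) * 5
routine_a = centers[0] + 0.5 * rng.randn(150, dim)
routine_b = centers[1] + 0.5 * rng.randn(150, dim)
unusual = rng.randn(3, dim) * 5
emb = np.r_[routine_a, routine_b, unusual]

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(emb)

# Most negative factor = lowest density relative to neighbors
top3 = np.argsort(lof.negative_outlier_factor_)[:3]
print("Log lines to review:", sorted(top3.tolist()))
```

With real embeddings the clusters are far less clean, so ranking by score and reviewing the top few lines tends to be safer than trusting a fixed contamination threshold.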

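Anti-patterns

  • Using a mean/standard-deviation Z-score on heavy-tailed data.
  • Accepting the default contamination rate uncritically.
  • Removing outliers unconditionally (they may be rare-event signals).
  • Training OneClassSVM on data that still contains outliers.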
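
The first anti-pattern can be seen on the small sample used in the patterns section: the extreme value inflates the mean and standard deviation so much that its own Z-score stays under 3 (masking), while the MAD-based score flags it. This is a sketch; the thresholds 3 and 3.5 are the conventional ones already used above.

```python
import numpy as np
from scipy.stats import median_abs_deviation

x = np.array([10, 12, 11, 13, 12, 11, 14, 200, 9, 12], dtype=float)

# Classic Z-score: the outlier itself drags the mean and std upward
z = (x - x.mean()) / x.std()
z_flags = np.where(np.abs(z) > 3)[0]

# Modified Z-score: median and raw MAD are barely affected by one extreme value
mz = 0.6745 * (x - np.median(x)) / median_abs_deviation(x)
mz_flags = np.where(np.abs(mz) > 3.5)[0]

print("Z-score flags:", z_flags.tolist())            # []
print("Modified Z-score flags:", mz_flags.tolist())  # [7]
```

With the population standard deviation, no point in a sample of n values can have |z| above sqrt(n - 1), which is exactly 3 for n = 10, so a |z| > 3 rule cannot fire on this sample at all.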

🧪 Validation

  • Evaluate Precision@k and ROC-AUC on synthetic data with known anomalies.
  • Sweep the contamination rate and check that the results stay stable.
  • Visualization: PCA/UMAP 2D projection colored by outlier label.
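
The first two checks can be sketched together. The example assumes a synthetic dataset with 25 injected anomalies placed well outside a Gaussian cloud and uses Isolation Forest as the detector under test; the data layout, `k`, and the contamination grid are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X_in = rng.randn(500, 2)
# Known anomalies: each coordinate has magnitude in [6, 12] and a random sign
X_out = rng.uniform(6, 12, (25, 2)) * rng.choice([-1, 1], size=(25, 2))
X_eval = np.r_[X_in, X_out]
y_true = np.r_[np.zeros(500), np.ones(25)].astype(int)

clf = IsolationForest(random_state=0).fit(X_eval)
anomaly_score = -clf.score_samples(X_eval)  # flip sign so higher = more anomalous

auc = roc_auc_score(y_true, anomaly_score)

k = 25
top_k = np.argsort(anomaly_score)[::-1][:k]
precision_at_k = y_true[top_k].mean()  # fraction of true anomalies among the k highest scores
print(f"ROC-AUC={auc:.3f}  Precision@{k}={precision_at_k:.2f}")

# Contamination sweep: how many points get flagged at each setting
flagged = {}
for c in (0.01, 0.05, 0.10):
    pred = IsolationForest(contamination=c, random_state=0).fit_predict(X_eval)
    flagged[c] = int((pred == -1).sum())
print("Flagged per contamination:", flagged)
```

ROC-AUC and Precision@k depend only on the score ranking, so they are unaffected by `contamination`; the sweep shows how strongly the hard labels depend on that one parameter.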

🕓 Changelog

  • 2026-05-08 Phase 1: initial draft.
  • 2026-05-10 Manual cleanup: 8 patterns; expanded PyOD/AE/STL coverage.