id: wiki-2026-0508-imbalanced-data-handling
title: Imbalanced Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: imbalanced data, class imbalance, SMOTE, oversampling, undersampling, class weight
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: machine-learning, imbalanced, smote, oversampling, class-weight, fraud
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: Python (language); imbalanced-learn / scikit-learn (framework)

Imbalanced Data Handling

One-line summary

"The class distribution is imbalanced, so the majority class dominates." This is common in fraud detection, rare-disease diagnosis, and anomaly detection. Methods: oversampling (SMOTE), undersampling, class weights, focal loss, threshold tuning. Evaluation: accuracy is uninformative here; use PR-AUC, F1, and MCC instead.

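Why accuracy fails can be seen with a toy example. The sketch below assumes a hypothetical dataset of 990 negatives and 10 positives and a "model" that always predicts the majority class; the numbers are illustrative, not taken from any real dataset.

```python
# Hypothetical toy data: 990 negatives, 10 positives (about 1:99).
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts the majority class.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.99 recall=0.00 f1=0.00
```

The classifier looks 99% accurate while finding none of the positives, which is why the minority-focused metrics are preferred.
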
Key points

Methods

  • Resampling:
    • Random oversampling (minority class).
    • SMOTE: synthetic minority samples interpolated between neighbors.
    • ADASYN: adaptive variant that generates more samples in harder regions.
    • Random undersampling (majority class).
    • Tomek links: cleans the class boundary.
  • Class weights.
  • Loss-based: focal loss, weighted cross-entropy.
  • Threshold tuning: the default 0.5 is rarely the right cutoff.
  • Anomaly detection (one-class).
  • Cost-sensitive learning.
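
To make the SMOTE entry concrete, here is a minimal pure-Python sketch of its core idea: pick a minority point, pick one of its k nearest minority neighbors, and interpolate between them. The function name `smote_sketch` and the toy points are hypothetical; this is not the imbalanced-learn implementation, which handles sampling ratios and neighbor search far more carefully.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a random
    one of the k nearest minority neighbors of a random minority point."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbors of x (excluding x), by Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Hypothetical 2-D minority samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Every synthetic point lies on a segment between two existing minority points, so SMOTE fills in the minority region rather than duplicating samples.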

Metrics

  • PR-AUC (Average Precision).
  • F1 / Macro-F1.
  • MCC (Matthews Correlation Coefficient).
  • Cohen's κ.
  • Confusion matrix.
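
MCC and Cohen's κ can both be computed directly from the four confusion-matrix counts. The sketch below uses the standard binary formulas; the counts in the example are hypothetical and only serve to show that both scores stay modest even when accuracy is high.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the row and column marginals
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

# Hypothetical counts: 1000 samples, 10 positives, accuracy 0.978
tp, fp, fn, tn = 8, 20, 2, 970
print(f"MCC={mcc(tp, fp, fn, tn):.3f} kappa={cohen_kappa(tp, fp, fn, tn):.3f}")
```

For an always-majority classifier (`tp=0, fp=0`), both functions return 0, in contrast to its high accuracy.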

Applications

  1. Fraud detection.
  2. Medical (rare disease).
  3. Anomaly detection.
  4. Customer churn.
  5. Click prediction.

💻 Patterns

Class weight (sklearn)

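import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import class_weight

# 'balanced' gives each class the weight n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = class_weight.compute_class_weight('balanced', classes=classes, y=y)
weight_dict = dict(zip(classes, weights))

# Equivalent shortcut: LogisticRegression(class_weight='balanced')
model = LogisticRegression(class_weight=weight_dict).fit(X, y)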

SMOTE (imbalanced-learn)

from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)

SMOTE-NC (mixed numerical + categorical)

from imblearn.over_sampling import SMOTENC
smnc = SMOTENC(categorical_features=[0, 2, 5], random_state=0)
X_res, y_res = smnc.fit_resample(X, y)

Random undersample

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = rus.fit_resample(X, y)

Combined (SMOTE + Tomek)

from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
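
For intuition about the Tomek part of this combination: a Tomek link is a pair of samples from different classes that are each other's nearest neighbor. A minimal pure-Python sketch of that definition follows; the function name and the toy points are hypothetical, and the library version also decides which member of each link to drop.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbors and carry different class labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical toy data: points 2 and 3 sit on the class boundary
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
print(links)  # prints: [(2, 3)]
```

Points 0 and 1 are also mutual nearest neighbors, but they share a label, so they are not a link.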

Pipeline (avoid leakage)

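from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

pipe = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=0)),
    ('clf', LogisticRegression()),
])
# SMOTE is applied only to the training part of each fold, so validation folds stay untouched (CV-safe)
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')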

Focal loss (PyTorch)

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets: float tensor of 0/1 labels with the same shape as logits
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # down-weights easy examples
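
The effect of the `(1 - p_t) ** gamma` factor is easiest to see numerically. The sketch below is a scalar, pure-Python restatement of the per-example term (the helper names `focal_term` and `ce_term` are illustrative); it sets the class weight to 1 so that only the modulating factor is visible.

```python
import math

def ce_term(p_t):
    """Cross-entropy for one example, given the probability of its true class."""
    return -math.log(p_t)

def focal_term(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss for one example: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    return alpha_t * (1 - p_t) ** gamma * ce_term(p_t)

for p_t in (0.9, 0.5, 0.1):
    ratio = focal_term(p_t, alpha_t=1.0) / ce_term(p_t)  # equals (1 - p_t) ** gamma
    print(f"p_t={p_t}: CE={ce_term(p_t):.3f}, focal/CE={ratio:.2f}")
# prints:
# p_t=0.9: CE=0.105, focal/CE=0.01
# p_t=0.5: CE=0.693, focal/CE=0.25
# p_t=0.1: CE=2.303, focal/CE=0.81
```

With gamma = 2, a well-classified example (p_t = 0.9) keeps only 1% of its cross-entropy loss, while a badly misclassified one (p_t = 0.1) keeps 81%, so training focuses on the hard, often minority, examples.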

Threshold tuning

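from sklearn.metrics import precision_recall_curve

y_score = model.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, y_score)
# prec and rec have one more entry than thr; drop the final (precision=1, recall=0) point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thr = thr[f1_scores.argmax()]
y_pred = (y_score >= best_thr).astype(int)  # the curve is defined with score >= threshold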

XGBoost scale_pos_weight

import numpy as np
import xgboost as xgb

# scale_pos_weight = (number of negatives) / (number of positives)
ratio = np.sum(y == 0) / np.sum(y == 1)
model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(X, y)

Eval metrics (proper)

from sklearn.metrics import classification_report, average_precision_score, matthews_corrcoef, confusion_matrix
print(classification_report(y_val, y_pred))
print(f'PR-AUC: {average_precision_score(y_val, y_score):.3f}')
print(f'MCC: {matthews_corrcoef(y_val, y_pred):.3f}')
print(confusion_matrix(y_val, y_pred))

Cost-sensitive learning

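import numpy as np

# Rows = true class, columns = predicted class; a false negative costs 10x a false positive
COST_MATRIX = np.array([[0, 1], [10, 0]])

def cost_sensitive_predict(probs, cost=COST_MATRIX):
    # probs: (n_samples, 2) class probabilities; choose the prediction with the lowest expected cost
    expected_cost = probs @ cost
    return expected_cost.argmin(axis=1)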
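
For the binary case, minimizing expected cost is equivalent to moving the decision threshold. Predicting positive is cheaper when `(1 - p) * cost_fp < p * cost_fn`, that is, when `p > cost_fp / (cost_fp + cost_fn)`. A small sketch under the same 1:10 costs as `COST_MATRIX` (the helper names are illustrative):

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting the positive class has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def predict_min_cost(p_pos, cost_fp=1.0, cost_fn=10.0):
    """Pick the decision with the lower expected cost for one example."""
    cost_if_predict_neg = p_pos * cost_fn        # we would miss a true positive
    cost_if_predict_pos = (1 - p_pos) * cost_fp  # we would raise a false alarm
    return int(cost_if_predict_pos < cost_if_predict_neg)

thr = cost_threshold(1.0, 10.0)
print(f"threshold={thr:.4f}")  # prints: threshold=0.0909
```

With false negatives ten times as costly, anything above roughly a 9% positive probability is flagged, far below the default 0.5.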

One-class anomaly

from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01).fit(X_majority)
anomalies = iso.predict(X_test) == -1

Weighted sampler (PyTorch)

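import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

# Inverse-frequency weight per sample; works for any label values, not only 0..K-1
classes, counts = np.unique(y, return_counts=True)
count_of = dict(zip(classes, counts))
weights = [1.0 / count_of[label] for label in y]
sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)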
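
Why inverse-frequency weights balance the batches: each class's weights sum to 1, so every class is drawn with equal total probability. The sketch below checks this with the standard library only, using `random.choices` as a stand-in for the sampler and a hypothetical 990:10 label list.

```python
import random
from collections import Counter

# Hypothetical labels: 990 negatives, 10 positives
labels = [0] * 990 + [1] * 10
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

# Total weight per class is about 1.0 for both classes
weight_per_class = {c: sum(w for w, l in zip(weights, labels) if l == c) for c in counts}

rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)  # sampling with replacement
share_pos = sum(drawn) / len(drawn)
print(f"positive share in drawn samples: {share_pos:.2f}")
```

The positive share comes out close to one half even though positives are 1% of the data. Because sampling is with replacement, the few minority samples are repeated many times per epoch.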

Stratified split

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # train_fold is a placeholder for your own training routine; X and y are NumPy arrays
    train_fold(X[train_idx], y[train_idx])

Borderline-SMOTE

from imblearn.over_sampling import BorderlineSMOTE
bsm = BorderlineSMOTE(random_state=0)
X_res, y_res = bsm.fit_resample(X, y)

Calibration check after handling

from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_val, y_score, n_bins=10)
# Oversampling changes the class prior seen in training, which distorts probability calibration
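
The size of this distortion can be worked out for the simplest case, random undersampling of the majority class. The sketch below assumes a model whose scores match the resampled distribution exactly and that a fraction `beta` of majority samples is kept; `beta` and both function names are illustrative. The correction follows from Bayes' rule for a shifted class prior.

```python
def prob_after_undersampling(p, beta):
    """Probability a well-fitted model would report after keeping
    only a fraction beta of the negative (majority) samples."""
    return p / (p + beta * (1 - p))

def correct_for_undersampling(p_s, beta):
    """Map a probability learned on undersampled data back to the original prior."""
    return beta * p_s / (beta * p_s - p_s + 1)

p_true = 0.02   # hypothetical true positive probability
beta = 0.1      # hypothetical: keep 10% of the majority class
p_s = prob_after_undersampling(p_true, beta)
p_back = correct_for_undersampling(p_s, beta)
print(f"inflated={p_s:.3f} corrected={p_back:.3f}")
# prints: inflated=0.169 corrected=0.020
```

A true 2% risk is reported as about 17% after resampling, which is why scores from resampled models should be recalibrated before being read as probabilities.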

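Decision criteria

| Situation | Approach |
| --- | --- |
| Mild imbalance (less skewed than 1:10) | Class weight |
| Moderate (1:10 to 1:100) | SMOTE / class weight |
| Severe (more skewed than 1:100) | Anomaly detection / focal loss |
| Tabular data | XGBoost scale_pos_weight |
| Deep learning | Focal loss + weighted sampler |
| Misclassification costs vary | Cost-sensitive learning |

Default: class weight + threshold tuning + PR-AUC evaluation. For severe imbalance, explore focal loss and anomaly detection. Use SMOTE with care, since it distorts calibration.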
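
The ratio-based rows of the table can be written as a small lookup. This is a hypothetical helper that only mirrors the rule of thumb above; the cutoffs 10 and 100 come from the table, and the boundary handling (inclusive on the moderate band) is an assumption.

```python
def suggest_approach(n_majority, n_minority):
    """Hypothetical helper: map the majority:minority ratio to the table's suggestion."""
    ratio = n_majority / n_minority
    if ratio < 10:
        return "class weight"
    if ratio <= 100:
        return "SMOTE / class weight"
    return "anomaly detection / focal loss"

print(suggest_approach(990, 10))  # ratio 99, prints: SMOTE / class weight
```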

🔗 Graph

🤖 LLM usage

When to use: fraud, medical, churn, and anomaly problems. When not to use: balanced data.

Anti-patterns

  • Accuracy as the metric on imbalanced data: misleading.
  • SMOTE before the train/validation split: leaks synthetic information into validation.
  • No threshold tuning: the default 0.5 is rarely the right cutoff.
  • Aggressive oversampling: breaks calibration.
  • Ignoring minority-class cost: false negatives are expensive.

🧪 Verification / Duplicates

  • Verified (Chawla SMOTE 2002, He & Garcia review 2009, Lin focal 2017).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods plus SMOTE / focal / threshold / scale_pos_weight code |