---
id: wiki-2026-0508-imbalanced-data-handling
title: Imbalanced Data Handling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [imbalanced data, class imbalance, SMOTE, oversampling, undersampling, class weight]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [machine-learning, imbalanced, smote, oversampling, class-weight, fraud]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: imbalanced-learn / scikit-learn
---

# Imbalanced Data Handling

## One-line summary

> **"The class distribution is imbalanced, so the majority class dominates."** This is common in fraud detection, rare-disease diagnosis, and anomaly detection. Methods: oversampling (SMOTE), undersampling, class weights, focal loss, threshold tuning. Evaluation: accuracy is uninformative here; use PR-AUC, F1, or MCC instead.

## Key points

### Methods

- **Resampling**:
  - **Random oversampling** (minority class).
  - **SMOTE**: synthetic minority samples.
  - **ADASYN**: adaptive synthetic sampling.
  - **Random undersampling** (majority class).
  - **Tomek links**: cleans the class boundary.
- **Class weight**.
- **Loss-based**: focal loss, weighted cross-entropy.
- **Threshold tuning**: the default 0.5 is rarely the right cutoff.
- **Anomaly detection** (one-class).
- **Cost-sensitive learning**.

### Metrics

- **PR-AUC** (Average Precision).
- **F1** / Macro-F1.
- **MCC** (Matthews Correlation Coefficient).
- **Cohen's κ**.
- **Confusion matrix**.

### Applications

1. **Fraud detection**.
2. **Medical** (rare disease).
3. **Anomaly detection**.
4. **Customer churn**.
5. **Click prediction**.
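
### Accuracy vs. imbalance-aware metrics (sketch)

The claim that accuracy is uninformative on imbalanced data can be illustrated with a small, dependency-free sketch. The 1:99 synthetic labels and the helper names `confusion_counts` and `summarize` are assumptions made up for this example, not part of any library. It compares a predictor that always outputs the majority class with a hypothetical detector that finds 8 of 10 positives at the cost of 20 false alarms.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(y_true, y_pred):
    """Compute accuracy, F1 and MCC; F1 and MCC fall back to 0.0 when undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}

# Synthetic data (assumption): 10 positives, 990 negatives
y_true = [1] * 10 + [0] * 990

# Predictor A: always predicts the majority class
always_majority = [0] * 1000

# Predictor B (hypothetical): 8 true positives, 2 misses, 20 false alarms
detector = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 970

baseline = summarize(y_true, always_majority)
useful = summarize(y_true, detector)
print("always-majority:", baseline)
print("detector:       ", useful)
```

In this setup the always-majority predictor scores higher on accuracy than the detector even though it never finds a single positive, while F1 and MCC rank the two the other way round.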
## 💻 Patterns

### Class weight (sklearn)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import class_weight

# 'balanced' weights each class by n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = class_weight.compute_class_weight('balanced', classes=classes, y=y)
weight_dict = dict(zip(classes, weights))

model = LogisticRegression(class_weight=weight_dict).fit(X, y)
```

### SMOTE (imbalanced-learn)

```python
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)
```

### SMOTE-NC (mixed numerical + categorical)

```python
from imblearn.over_sampling import SMOTENC

# categorical_features lists the column indices that hold categorical values
smnc = SMOTENC(categorical_features=[0, 2, 5], random_state=0)
X_res, y_res = smnc.fit_resample(X, y)
```

### Random undersample

```python
from imblearn.under_sampling import RandomUnderSampler

# A float sampling_strategy is the target minority/majority ratio (binary problems only)
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = rus.fit_resample(X, y)
```

### Combined (SMOTE + Tomek)

```python
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
```

### Pipeline (avoid leakage)

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

pipe = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE()),
    ('clf', LogisticRegression()),
])

# SMOTE is applied to the training part of each fold only (CV-safe)
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
```

### Focal loss (PyTorch)

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets: float tensor of 0/1 labels with the same shape as logits
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-class weighting
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

### Threshold tuning

```python
from sklearn.metrics import precision_recall_curve

y_score = model.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, y_score)
# prec and rec have one more entry than thr; drop the final (precision=1, recall=0) point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thr = thr[f1_scores.argmax()]
y_pred = (y_score >= best_thr).astype(int)  # thresholds are defined as score >= thr
```

### XGBoost scale_pos_weight

```python
import xgboost as xgb

# negatives / positives; assumes y is a NumPy array of 0/1 labels
ratio = (y == 0).sum() / (y == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(X, y)
```

### Eval metrics (proper)

```python
from sklearn.metrics import (
    average_precision_score,
    classification_report,
    confusion_matrix,
    matthews_corrcoef,
)

print(classification_report(y_val, y_pred))
print(f'PR-AUC: {average_precision_score(y_val, y_score):.3f}')  # uses scores, not hard labels
print(f'MCC: {matthews_corrcoef(y_val, y_pred):.3f}')
print(confusion_matrix(y_val, y_pred))
```

### Cost-sensitive learning

```python
import numpy as np

# COST_MATRIX[true, predicted]: an FN (true 1, predicted 0) costs 10x an FP
COST_MATRIX = np.array([[0, 1],
                        [10, 0]])

def cost_sensitive_predict(probs, cost):
    # probs: (n_samples, 2) class probabilities, e.g. from predict_proba
    expected_cost = probs @ cost  # column j = expected cost of predicting class j
    return expected_cost.argmin(axis=1)
```

### One-class anomaly

```python
from sklearn.ensemble import IsolationForest

# Fit on majority (normal) samples only; predict returns -1 for outliers
iso = IsolationForest(contamination=0.01).fit(X_majority)
anomalies = iso.predict(X_test) == -1
```

### Weighted sampler (PyTorch)

```python
import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler

# Assumes y holds integer labels 0..K-1, aligned with the order of `dataset`
class_counts = np.bincount(y)
weights = (1.0 / class_counts[y]).tolist()  # per-sample weight = inverse class frequency
sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```

### Stratified split

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # train_fold is a placeholder for your own training routine; evaluate on val_idx
    train_fold(X[train_idx], y[train_idx])
```

### Borderline-SMOTE

```python
from imblearn.over_sampling import BorderlineSMOTE

bsm = BorderlineSMOTE(random_state=0)
X_res, y_res = bsm.fit_resample(X, y)
```

### Calibration check after handling

```python
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_val, y_score, n_bins=10)
# Oversampling tends to distort probability calibration
```

## Decision criteria

| Situation | Approach |
|---|---|
| Mild (less skewed than 1:10) | Class weight |
| Moderate (1:10 to 1:100) | SMOTE / class weight |
| Severe (more skewed than 1:100) | Anomaly detection / focal loss |
| Tabular | XGBoost `scale_pos_weight` |
| Deep learning | Focal loss + weighted sampler |
| Costs vary by error type | Cost-sensitive learning |

**Default**: class weight + threshold tuning + PR-AUC evaluation. For severe imbalance, explore focal loss and anomaly detection. Use SMOTE with care, since it can distort calibration.

## 🔗 Graph

- Parent: [[Machine-Learning]] · [[Data-Preprocessing]]
- Variants: [[SMOTE]] · [[Focal-Loss]]
- Applications: [[Anomaly-Detection]]

## 🤖 LLM usage

**When**: fraud, medical, churn, and anomaly problems.
**When not**: balanced data.

## ❌ Anti-patterns

- **Accuracy as the metric on imbalanced data**: misleading.
- **SMOTE before the train/val split**: leakage.
- **No threshold tuning**: the default 0.5 is usually wrong.
- **Aggressive oversampling**: breaks calibration.
- **Ignoring minority-class cost**: false negatives are expensive.

## 🧪 Verification / duplicates

- Verified (Chawla SMOTE 2002, He & Garcia review 2009, Lin focal 2017).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + SMOTE / focal / threshold / scale_pos_weight code |