| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-imbalanced-data-handling | Imbalanced Data Handling | 10_Wiki/Topics | verified | self | | none | A | 0.95 | applied | | | 2026-05-10 | pending | |
Imbalanced Data Handling
One-line summary
"The class distribution is skewed, so the majority class dominates training." Class imbalance is common in fraud detection, rare-disease diagnosis, and anomaly detection. Methods: oversampling (SMOTE), undersampling, class weights, focal loss, and threshold tuning. Evaluation: accuracy is uninformative here; use PR-AUC, F1, or MCC instead.
Key points
Methods
- Resampling:
  - Random oversampling (minority class).
  - SMOTE: synthesizes new minority samples.
  - ADASYN: adaptive synthetic sampling.
  - Random undersampling (majority class).
  - Tomek links: cleans the class boundary.
- Class weights.
- Loss-based: focal loss, weighted cross-entropy.
- Threshold tuning: the default 0.5 cutoff is rarely the right one.
- Anomaly detection (one-class).
- Cost-sensitive learning.
Metrics
- PR-AUC (Average Precision).
- F1 / Macro-F1.
- MCC (Matthews Correlation Coefficient).
- Cohen's κ.
- Confusion matrix.
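To make the point about accuracy concrete, here is a minimal standard-library sketch. The 990:10 dataset and the always-negative "classifier" are made up for illustration, and the helper names are hypothetical. F1 and MCC are set to 0 when their denominator is 0, which is a convention chosen here, not the only option.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Illustrative data: 10 positives, 990 negatives, and a model that always says "negative".
y_true = [1] * 10 + [0] * 990
y_always_negative = [0] * 1000
counts = confusion_counts(y_true, y_always_negative)

print(f"accuracy={accuracy(*counts):.3f}  F1={f1_score(*counts):.3f}  MCC={mcc(*counts):.3f}")
```

The model never finds a single positive, yet its accuracy looks excellent, while F1 and MCC both report that it has no skill on the minority class.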
Applications
- Fraud detection.
- Medical (rare disease).
- Anomaly detection.
- Customer churn.
- Click prediction.
💻 Patterns
Class weight (sklearn)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import class_weight
# 'balanced' weights each class inversely to its frequency
classes = np.unique(y)
weights = class_weight.compute_class_weight('balanced', classes=classes, y=y)
weight_dict = dict(zip(classes, weights))
model = LogisticRegression(class_weight=weight_dict).fit(X, y)
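For intuition about what the 'balanced' option computes, the sketch below reproduces the heuristic scikit-learn documents, `n_samples / (n_classes * count_of_class)`, using only the standard library. The function name `balanced_class_weights` and the 90:10 label list are illustrative assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Illustrative 90:10 label vector.
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```

With these weights each class contributes the same total weight to the loss (90 × 0.556 ≈ 10 × 5.0 ≈ 50), which is what stops the majority class from dominating.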
SMOTE (imbalanced-learn)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=0)
X_res, y_res = sm.fit_resample(X, y)
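The core idea behind SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration of that idea, not the imbalanced-learn implementation: the function name `smote_sketch`, the random choice of base points, `k=3`, and the toy points are all assumptions made for this example.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (Euclidean distance)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + lam * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2-D minority class.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies. That is also why SMOTE can blur the class boundary when minority samples sit among majority ones.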
SMOTE-NC (mixed numerical + categorical)
from imblearn.over_sampling import SMOTENC
smnc = SMOTENC(categorical_features=[0, 2, 5], random_state=0)
X_res, y_res = smnc.fit_resample(X, y)
Random undersample
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = rus.fit_resample(X, y)
Combined (SMOTE + Tomek)
from imblearn.combine import SMOTETomek
smt = SMOTETomek(random_state=0)
X_res, y_res = smt.fit_resample(X, y)
Pipeline (avoid leakage)
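from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
pipe = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=0)),
    ('clf', LogisticRegression()),
])
# SMOTE is fit on the training part of each fold only, so the CV score is leakage-free
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')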
Focal loss (PyTorch)
import torch
import torch.nn.functional as F
def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets: float tensor of 0/1 labels with the same shape as logits
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing factor
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()       # down-weights easy examples
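Written out, the quantity the function above averages is the binary focal loss, using the same symbols as the code (`p_t`, `alpha_t`, `gamma`):

```latex
\mathrm{FL}(p_t) = -\,\alpha_t \,(1 - p_t)^{\gamma}\, \log(p_t),
\qquad
p_t =
\begin{cases}
p & \text{if } y = 1 \\
1 - p & \text{if } y = 0
\end{cases},
\qquad
\alpha_t =
\begin{cases}
\alpha & \text{if } y = 1 \\
1 - \alpha & \text{if } y = 0
\end{cases}
```

Here $-\log(p_t)$ is the ordinary binary cross-entropy (`bce` in the code). With $\gamma = 2$, a well-classified example with $p_t = 0.9$ is scaled by $(1 - 0.9)^2 = 0.01$, while a hard example with $p_t = 0.1$ is scaled by $(1 - 0.1)^2 = 0.81$, so training focuses on the hard, typically minority, cases. Setting $\gamma = 0$ recovers $\alpha$-weighted cross-entropy.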
Threshold tuning
from sklearn.metrics import precision_recall_curve
y_score = model.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, y_score)
# prec and rec have one more entry than thr; drop the final (recall = 0) point
f1_scores = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-9)
best_thr = thr[f1_scores.argmax()]
y_pred = (y_score >= best_thr).astype(int)  # thresholds are inclusive
XGBoost scale_pos_weight
import xgboost as xgb
# ratio of negative to positive samples (y is a NumPy array of 0/1 labels)
ratio = (y == 0).sum() / (y == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=ratio).fit(X, y)
Eval metrics (proper)
from sklearn.metrics import classification_report, average_precision_score, matthews_corrcoef, confusion_matrix
print(classification_report(y_val, y_pred))
print(f'PR-AUC: {average_precision_score(y_val, y_score):.3f}')
print(f'MCC: {matthews_corrcoef(y_val, y_pred):.3f}')
print(confusion_matrix(y_val, y_pred))
Cost-sensitive learning
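import numpy as np
# COST_MATRIX[true, predicted]: a false negative costs 10x a false positive
COST_MATRIX = np.array([[0, 1], [10, 0]])
def cost_sensitive_predict(probs, cost):
    # probs: (n_samples, 2) class probabilities, e.g. from predict_proba
    expected_cost = probs @ cost  # column j = expected cost of predicting class j
    return expected_cost.argmin(axis=1)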
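In the binary case the expected-cost rule above reduces to a fixed probability threshold. Predicting positive is cheaper when `(1 - p) * cost_fp < p * cost_fn`, which rearranges to `p > cost_fp / (cost_fp + cost_fn)`. The sketch below assumes correct predictions cost nothing, as in the matrix above. The function names are hypothetical.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which predicting the positive class has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def predict_with_costs(p_positive, cost_fp=1.0, cost_fn=10.0):
    """Apply the cost-derived threshold to a list of positive-class probabilities."""
    threshold = cost_optimal_threshold(cost_fp, cost_fn)
    return [int(p > threshold) for p in p_positive]

threshold = cost_optimal_threshold(1.0, 10.0)   # 1 / 11, roughly 0.091
labels = predict_with_costs([0.05, 0.2, 0.6])
print(threshold, labels)
```

With a false negative ten times as costly as a false positive, a sample is flagged as positive once its predicted probability exceeds about 9%, far below the default 0.5.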
One-class anomaly
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01).fit(X_majority)
anomalies = iso.predict(X_test) == -1
Weighted sampler (PyTorch)
import numpy as np
from torch.utils.data import DataLoader, WeightedRandomSampler
# y: NumPy array of integer labels 0..K-1
class_counts = np.bincount(y)
weights = 1.0 / class_counts[y]  # per-sample weight, inverse to class frequency
sampler = WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
Stratified split
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # train_fold is a placeholder for your own training routine
    train_fold(X[train_idx], y[train_idx])
Borderline-SMOTE
from imblearn.over_sampling import BorderlineSMOTE
bsm = BorderlineSMOTE(random_state=0)
X_res, y_res = bsm.fit_resample(X, y)
Calibration check after handling
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_val, y_score, n_bins=10)
# Oversampling shifts the class prior, which tends to distort predicted probabilities
Decision criteria
| Situation | Approach |
|---|---|
| Mild (below 1:10) | Class weight |
| Moderate (1:10 to 1:100) | SMOTE / class weight |
| Severe (beyond 1:100) | Anomaly detection / focal loss |
| Tabular | XGBoost scale_pos_weight |
| DL | Focal loss + weighted sampler |
| Cost varies | Cost-sensitive |
Default: class weights + threshold tuning + PR-AUC evaluation. For severe imbalance, explore focal loss and anomaly detection. Use SMOTE with care, since it can distort calibration.
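The ratio-based rows of the table can be sketched as a small helper. This is only an encoding of the rule of thumb above, not a substitute for validation. The function name `suggest_approach` and the handling of the exact boundaries (1:10 and 1:100 both treated as moderate) are assumptions.

```python
def suggest_approach(n_majority, n_minority):
    """Map a majority:minority ratio to the starting point suggested by the table."""
    if n_minority <= 0:
        raise ValueError("minority class must have at least one sample")
    ratio = n_majority / n_minority
    if ratio < 10:
        return "class weight"
    if ratio <= 100:
        return "SMOTE / class weight"
    return "anomaly detection / focal loss"

for majority, minority in [(900, 100), (990, 10), (9990, 10)]:
    print(f"{majority}:{minority} -> {suggest_approach(majority, minority)}")
```

Whichever branch applies, the default still holds: tune the threshold and evaluate with PR-AUC.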
🔗 Graph
- Parent: Machine-Learning · Data-Preprocessing
- Variants: SMOTE · Focal-Loss
- Applications: Anomaly-Detection
🤖 LLM usage
When to use: fraud, medical, churn, and anomaly problems. When not to use: balanced datasets.
❌ Anti-patterns
- Accuracy as the metric on imbalanced data: misleading.
- SMOTE before the train/validation split: leakage.
- No threshold tuning: the default 0.5 is usually wrong.
- Aggressive oversampling: breaks calibration.
- Ignoring minority-class cost: false negatives are expensive.
🧪 Verification / Duplicates
- Verified against Chawla et al. (SMOTE, 2002), He & Garcia (review, 2009), and Lin et al. (focal loss, 2017).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + SMOTE / focal / threshold / scale_pos_weight code |