---
id: wiki-2026-0508-ensemble-methods
title: Ensemble Methods
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ensemble, bagging, boosting, stacking, random forest, XGBoost, LightGBM, CatBoost]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: [machine-learning, ensemble, bagging, boosting, stacking, gbm, xgboost]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / XGBoost / LightGBM / CatBoost
---

# Ensemble Methods

## One-liner

> **"Combining several models often outperforms any single model."** Bagging reduces variance, boosting reduces bias, and stacking trains a meta-model on base-model predictions. Modern practice: GBDTs (XGBoost, LightGBM, CatBoost) dominate tabular data, and deep ensembles are used for uncertainty estimation.

## Core

### Types

- **Bagging**: models trained in parallel on bootstrap samples (Random Forest).
- **Boosting**: models trained sequentially, each correcting the errors of its predecessors (XGBoost).
- **Stacking**: a meta-learner fuses the outputs of the base models.
- **Voting / averaging**: simple combination of predictions.

### Well-known methods

- **Random Forest** (Breiman 2001).
- **AdaBoost** (Freund & Schapire 1997).
- **Gradient Boosting** (Friedman 1999).
- **XGBoost** (Chen & Guestrin 2016).
- **LightGBM** (Ke et al. 2017).
- **CatBoost** (Yandex).

### Modern context

- **Tabular**: GBDTs are still generally state of the art, ahead of deep learning.
- **Kaggle**: winning solutions are typically ensembles.
- **Deep ensemble**: uncertainty estimation (Lakshminarayanan et al.).
- **Mixture of Experts (MoE)**: a sparse ensemble, used in LLMs.

### Applications

1. **Tabular classification / regression**.
2. **Anomaly detection**: Isolation Forest.
3. **Click prediction**: GBDT.
4. **Risk scoring**: finance.
5. **Search ranking**: LambdaMART.
6. **Survival analysis**: random survival forest.
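
### Boosting from scratch (sketch)

The "sequential + correct" idea behind boosting, listed under the types above, can be sketched without any library. This is a minimal illustration only, assuming squared-error loss, 1-D inputs, and decision stumps as weak learners. The helper names (`fit_stump`, `fit_boosting`, `boosting_predict`) and the synthetic sine data are invented for this example and do not reflect how XGBoost or LightGBM are implemented.

```python
import math
import random

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_boosting(xs, ys, n_rounds=30, learning_rate=0.3):
    """Fit stumps one after another, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)                 # start from the constant model
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Store the shrunken leaf values so prediction is a plain sum.
        stump = (threshold, learning_rate * left_value, learning_rate * right_value)
        stumps.append(stump)
        preds = [p + stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps

def boosting_predict(base, stumps, x):
    return base + sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Synthetic demo data (hypothetical): noisy sine curve.
random.seed(0)
xs = [random.uniform(-3.0, 3.0) for _ in range(100)]
ys = [math.sin(x) + random.gauss(0.0, 0.1) for x in xs]

base, stumps = fit_boosting(xs, ys)
mse_single = mse(ys, [base] * len(ys))
mse_boosted = mse(ys, [boosting_predict(base, stumps, x) for x in xs])
print(f"constant model MSE: {mse_single:.3f}, boosted MSE: {mse_boosted:.3f}")
```

Each round fits only what the running prediction still gets wrong, which is why boosting mainly attacks bias. The numbers printed here are training error, so in practice the number of rounds would be chosen with a validation set and early stopping.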
## 💻 Patterns

### Random Forest

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_leaf=5,
    n_jobs=-1,                 # use all CPU cores
    class_weight='balanced',   # reweight classes inversely to their frequency
).fit(X_train, y_train)
print(rf.feature_importances_)
```

### XGBoost

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'device': 'cuda',          # needs XGBoost >= 2.0 and a GPU; use 'cpu' otherwise
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc',
}
# Stop when validation AUC has not improved for 50 rounds
model = xgb.train(params, dtrain, num_boost_round=2000,
                  evals=[(dval, 'val')], early_stopping_rounds=50)
```

### LightGBM (faster)

```python
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
}
model = lgb.train(params, train_data, num_boost_round=2000,
                  valid_sets=[val_data],
                  callbacks=[lgb.early_stopping(50)])
```

### CatBoost (native categorical support)

```python
from catboost import CatBoostClassifier

# Column names in cat_features assume X_train / X_val are pandas DataFrames
model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    cat_features=['city', 'device', 'segment'],
    eval_metric='AUC',
    early_stopping_rounds=50,
).fit(X_train, y_train, eval_set=(X_val, y_val))
```

### Stacking (sklearn)

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('xgb', xgb.XGBClassifier(n_estimators=200)),
        ('lgb', lgb.LGBMClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # meta-learner is trained on out-of-fold predictions
)
stack.fit(X_train, y_train)
```

### Voting (simple)

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import VotingClassifier

# rf is the Random Forest from above; VotingClassifier refits clones of all three
xgb_clf = xgb.XGBClassifier(n_estimators=200)
lgb_clf = lgb.LGBMClassifier(n_estimators=200)

vote = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb_clf), ('lgb', lgb_clf)],
    voting='soft',       # average predicted probabilities
    weights=[1, 2, 1],
).fit(X_train, y_train)
```

### Deep ensemble (uncertainty)

```python
import torch

@torch.no_grad()
def deep_ensemble_predict(models, x):
    preds = torch.stack([m(x) for m in models])   # (n_models, ...)
    mean = preds.mean(0)
    epistemic = preds.var(0)   # disagreement across members as an uncertainty proxy
    return mean, epistemic

# 5 models trained with different seeds; train_one is a user-supplied training helper
models = [train_one(seed=s) for s in range(5)]
```

### Snapshot ensemble (cheap)

```python
import copy

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def snapshot_ensemble(model, X_train, y_train, n_snapshots=5):
    # Assumptions: plain SGD with lr=0.1; train_one_epoch is a user-supplied helper
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    # Cosine annealing with warm restarts (SGDR): one restart every 10 epochs
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)
    snapshots = []
    for snap in range(n_snapshots):
        for epoch in range(10):
            train_one_epoch(model, X_train, y_train, optimizer)
            scheduler.step()
        # Save the weights reached at the low-LR end of each cycle
        snapshots.append(copy.deepcopy(model))
    return snapshots
```

### Isolation Forest (anomaly)

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, n_estimators=200).fit(X)
anomalies = iso.predict(X)  # -1 = anomaly, 1 = normal
```

### Custom bagging

```python
import numpy as np

def bagging(X, y, base_model_cls, n_estimators=10):
    # X and y are assumed to be NumPy arrays (integer-array indexing)
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(X) rows with replacement
        idx = np.random.choice(len(X), size=len(X), replace=True)
        m = base_model_cls().fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    # Average class probabilities across the bagged models
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```

### Out-of-Bag (OOB) validation

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
print(rf.oob_score_)  # hold-out estimate at no extra cost
```

### MoE (sparse ensemble, LLM)

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, n_experts=8, top_k=2, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)                                 # (..., n_experts)
        top_k_v, top_k_i = logits.topk(self.top_k, dim=-1)    # (..., top_k)
        weights = top_k_v.softmax(-1)                         # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_k_i[..., k] == e                   # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

## Decision criteria

| Situation | Method |
|---|---|
| Tabular (small) | Random Forest |
| Tabular (large) | XGBoost / LightGBM |
| Tabular + categorical | CatBoost |
| Anomaly | Isolation Forest |
| Production deep | Snapshot ensemble |
| Uncertainty | Deep ensemble |
| LLM scale | MoE |
| Kaggle competition | Stacking + diverse base |

**Defaults**: tabular = LightGBM/XGBoost with early stopping; categorical = CatBoost; deep = snapshot or 5x ensemble; uncertainty = deep ensemble.

## 🔗 Graph

- Parent: [[Machine-Learning]]
- Variants: [[Random-Forest]] · [[XGBoost]] · [[LightGBM]] · [[Stacking]]
- Applications: [[Anomaly-Detection]]
- Adjacent: [[Mixture-of-Experts]] · [[Deep-Ensemble]] · [[Bagging]] · [[Boosting]]

## 🤖 LLM usage

**When**: tabular ML; click / risk models; whenever uncertainty estimates are needed.

**When not**: vision / language tasks, where deep learning wins.

## ❌ Anti-patterns

- **Training without early stopping**: the GBDT overfits.
- **Stacking without CV**: the meta-learner sees in-sample predictions, which leaks the target.
- **Blending weights without validation**: the weights are effectively arbitrary.
- **Same model 5x**: no diversity, so little to gain from combining.
- **Massive ensemble at inference**: latency.

## 🧪 Verification / Duplicates

- Verified (Breiman 2001, Chen 2016, Lakshminarayanan 2017).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | ENSEMBLE auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: bagging/boost/stack + RF / XGB / LGB / CB / MoE code |