---
id: wiki-2026-0508-ensemble-methods
title: Ensemble Methods
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [ensemble, bagging, boosting, stacking, random forest, XGBoost, LightGBM, CatBoost]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: [machine-learning, ensemble, bagging, boosting, stacking, gbm, xgboost]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / XGBoost / LightGBM / CatBoost
---

# Ensemble Methods

## In one line

> **"Combining several models often outperforms any single one."** Bagging reduces variance, boosting reduces bias, and stacking learns a meta-model over base predictions. In current practice, GBDT libraries (XGBoost, LightGBM, CatBoost) dominate tabular data, and deep ensembles are used for uncertainty estimation.

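## Key points

### Types

- **Bagging**: trains models in parallel on bootstrap samples and averages them (e.g. Random Forest).
- **Boosting**: trains models sequentially, each one correcting the errors of the ensemble so far (e.g. XGBoost).
- **Stacking**: a meta-learner fuses the outputs of the base models.
- **Voting / averaging**: a simple combination of predictions.
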
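The sequential-correction idea behind boosting can be sketched in plain Python. This is a minimal illustration for squared error with one-split stumps, not how XGBoost or LightGBM are implemented; the helper names (`fit_stump`, `boost`) and the toy data are assumptions made for the example.

```python
# Minimal gradient-boosting sketch for squared error (illustrative only).
# Each round fits a one-split "stump" to the current residuals and adds a
# shrunken copy of it to the ensemble.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        # Residuals are the negative gradient of 0.5 * (y - p)^2 w.r.t. p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Hypothetical toy data: y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
base, stumps, history = boost(xs, ys)
```

`history` holds the training MSE after each round; it can only go down here because each stump is the least-squares fit to the residuals and the learning rate is below 1. Training error falling is not evidence of generalization, which is why the library examples in this note rely on a validation set and early stopping.
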
### Notable methods

- **Random Forest** (Breiman 2001).
- **AdaBoost** (Freund & Schapire 1997).
- **Gradient Boosting** (Friedman; 1999 technical report, published 2001).
- **XGBoost** (Chen & Guestrin 2016).
- **LightGBM** (Ke et al. 2017).
- **CatBoost** (Yandex).

### Modern context

- **Tabular**: GBDT is still state of the art on many tabular problems, often ahead of deep learning.
- **Kaggle**: winning solutions are frequently ensembles.
- **Deep ensemble**: used for uncertainty estimation (Lakshminarayanan).
- **Mixture of Experts (MoE)**: a sparse, ensemble-like architecture used in LLMs.

### Applications

1. **Tabular classification / regression**.
2. **Anomaly detection**: Isolation Forest.
3. **Click prediction**: GBDT.
4. **Risk scoring**: finance.
5. **Search ranking**: LambdaMART.
6. **Survival analysis**: random survival forest.

## 💻 Patterns

### Random Forest

```python
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: training features and labels, assumed to be defined upstream
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_leaf=5,
    n_jobs=-1,                # use all CPU cores
    class_weight='balanced',  # reweight classes inversely to their frequency
).fit(X_train, y_train)

print(rf.feature_importances_)  # impurity-based importances
```

### XGBoost

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'device': 'cuda',  # needs XGBoost >= 2.0 and a GPU; use 'cpu' otherwise
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc',
}
# Stop once validation AUC has not improved for 50 rounds
model = xgb.train(params, dtrain, num_boost_round=2000,
                  evals=[(dval, 'val')], early_stopping_rounds=50)
```

### LightGBM (faster)

```python
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
}
model = lgb.train(params, train_data, num_boost_round=2000,
                  valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])
```

### CatBoost (categorical native)

```python
from catboost import CatBoostClassifier

# Referring to categorical columns by name assumes X_train / X_val are
# pandas DataFrames containing these (example) columns
model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    cat_features=['city', 'device', 'segment'],
    eval_metric='AUC',
    early_stopping_rounds=50,
).fit(X_train, y_train, eval_set=(X_val, y_val))
```

### Stacking (sklearn)

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('xgb', xgb.XGBClassifier(n_estimators=200)),
        ('lgb', lgb.LGBMClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
```

### Voting (simple)

```python
from sklearn.ensemble import VotingClassifier

# rf, xgb_clf, lgb_clf: estimator instances assumed to be defined elsewhere;
# VotingClassifier fits clones of them on (X_train, y_train)
vote = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb_clf), ('lgb', lgb_clf)],
    voting='soft',  # average predicted probabilities
    weights=[1, 2, 1],
).fit(X_train, y_train)
```

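To make the effect of `voting='soft'` with `weights=[1, 2, 1]` concrete, the arithmetic for a single sample can be written out by hand. This is only a sketch of the weighted average; the function name `soft_vote` and the probability values are made up for illustration.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class-probability vectors for one sample."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three models for one sample: [P(class 0), P(class 1)]
probas = [[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]]
avg = soft_vote(probas, weights=[1, 2, 1])

# The ensemble predicts the class with the highest averaged probability.
predicted_class = max(range(len(avg)), key=avg.__getitem__)
```

Here the second model carries twice the weight, so the averaged probabilities come out at 0.425 and 0.575 and the ensemble predicts class 1.
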
### Deep ensemble (uncertainty)

```python
import torch

def deep_ensemble_predict(models, x):
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])  # (n_models, batch, ...)
    mean = preds.mean(0)
    epistemic = preds.var(0)  # disagreement between members as an uncertainty proxy
    return mean, epistemic

# 5 models trained with different seeds; train_one is a user-supplied helper
models = [train_one(seed=s) for s in range(5)]
```

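### Snapshot ensemble (cheap)

```python
import copy

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def snapshot_ensemble(model, X_train, y_train, n_snapshots=5, epochs_per_cycle=10):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr is an assumed value
    # Cosine annealing with warm restarts (SGDR): one restart per cycle
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=epochs_per_cycle)
    snapshots = []
    for _ in range(n_snapshots):
        for _ in range(epochs_per_cycle):
            # train_one_epoch is a user-supplied helper, not a library function
            train_one_epoch(model, X_train, y_train, optimizer)
            scheduler.step()
        snapshots.append(copy.deepcopy(model))  # snapshot at the end of each cycle
    return snapshots
```
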
### Isolation Forest (anomaly)

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, n_estimators=200).fit(X)
anomalies = iso.predict(X)  # -1 = anomaly, 1 = normal
```

### Bagging custom

```python
import numpy as np

def bagging(X, y, base_model_cls, n_estimators=10):
    # X and y are assumed to be NumPy arrays so they can be indexed by idx
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(X) rows with replacement
        idx = np.random.choice(len(X), size=len(X), replace=True)
        m = base_model_cls().fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    # Average the class probabilities of all members
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```

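Why averaging helps can be seen in a small simulation. The sketch below is an idealized case, assuming each base model's prediction is the true value plus independent Gaussian noise; the function names and numbers are illustrative and not part of any library.

```python
import random
import statistics

random.seed(0)


def noisy_prediction(true_value=1.0, noise_sd=1.0):
    """Stand-in for one base model: the truth plus independent Gaussian noise."""
    return random.gauss(true_value, noise_sd)


def ensemble_prediction(n_models):
    """Average the predictions of n_models independent stand-in models."""
    return statistics.fmean(noisy_prediction() for _ in range(n_models))


trials = 2000
# Spread of a single model's prediction across repeated experiments
single_var = statistics.pvariance([ensemble_prediction(1) for _ in range(trials)])
# Spread of a 10-model average across repeated experiments
bagged_var = statistics.pvariance([ensemble_prediction(10) for _ in range(trials)])
```

With independent errors the variance of an average of M models is roughly 1/M of a single model's variance. Real bagged models are trained on overlapping bootstrap samples, so their errors are correlated and the reduction is smaller; this is one reason Random Forest also subsamples features at each split.
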
### Out-of-Bag (OOB) validation

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
print(rf.oob_score_)  # accuracy on out-of-bag rows; no separate hold-out needed
```

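The OOB estimate comes "for free" because every bootstrap sample leaves some rows out. A quick stdlib simulation, given here as a sketch with an assumed helper name `oob_fraction`, shows how many:

```python
import random

random.seed(0)


def oob_fraction(n_rows, n_trees):
    """Average share of rows left out by a bootstrap sample of size n_rows."""
    shares = []
    for _ in range(n_trees):
        # Row indices drawn with replacement; the set keeps the distinct ones.
        in_bag = {random.randrange(n_rows) for _ in range(n_rows)}
        shares.append(1 - len(in_bag) / n_rows)
    return sum(shares) / n_trees


frac = oob_fraction(n_rows=1000, n_trees=50)
expected = (1 - 1 / 1000) ** 1000  # probability that a given row is never drawn
```

Each tree therefore has roughly 37% of the rows available as unseen data, and the OOB score evaluates every row using only the trees that did not train on it.
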
### MoE (sparse ensemble, LLM)

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, n_experts=8, top_k=2, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, dim)
        logits = self.gate(x)
        top_k_v, top_k_i = logits.topk(self.top_k, dim=-1)
        weights = top_k_v.softmax(-1)  # normalize over the selected experts only
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # A ModuleList cannot be indexed by a tensor, so route per expert
            token_idx, slot = (top_k_i == e).nonzero(as_tuple=True)
            if token_idx.numel() > 0:
                gated = weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
                out.index_add_(0, token_idx, gated)
        return out
```

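The gating step (keep the top-k router logits, then softmax over only those) can be checked by hand for one token. This is a plain-Python sketch of that arithmetic; `top_k_gate` and the logit values are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest logits.

    The weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Hypothetical router logits for one token over 4 experts
gates = top_k_gate([0.1, 2.0, -1.0, 2.0], k=2)
```

Experts 1 and 3 have equal logits, so each receives weight 0.5, and the other two experts are never evaluated for this token. That skipped computation is what makes the ensemble sparse.
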
## Decision criteria

| Situation | Method |
|---|---|
| Tabular (small) | Random Forest |
| Tabular (large) | XGBoost / LightGBM |
| Tabular + categorical | CatBoost |
| Anomaly | Isolation Forest |
| Production deep | Snapshot ensemble |
| Uncertainty | Deep ensemble |
| LLM scale | MoE |
| Kaggle competition | Stacking + diverse base models |

**Defaults**: for tabular data, LightGBM/XGBoost with early stopping; with categorical features, CatBoost; for deep models, a snapshot ensemble or 5 independently trained models; for uncertainty, a deep ensemble.

## 🔗 Graph

- Parent: [[Machine-Learning]]
- Variants: [[Random-Forest]] · [[XGBoost]] · [[LightGBM]] · [[Stacking]]
- Applications: [[Anomaly-Detection]]
- Adjacent: [[Mixture-of-Experts]] · [[Deep-Ensemble]] · [[Bagging]] · [[Boosting]]

## 🤖 LLM usage

**When**: tabular ML; click or risk models; when uncertainty estimates are needed.
**When not**: vision or language tasks, where deep learning wins.

## ❌ Anti-patterns

- **Training without early stopping**: GBDT overfits.
- **Stacking without cross-validation**: the meta-learner is trained on leaked, in-sample base predictions.
- **Blend weights chosen without validation**: the weights are effectively arbitrary.
- **Same model 5×**: no diversity.
- **Massive ensemble at inference**: latency.

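The stacking leak can be shown with a deliberately overfit toy model. The sketch below is stdlib-only and uses hypothetical names (`Memorizer`, `k_fold_indices`, `out_of_fold_predictions`) and made-up data; in practice scikit-learn's `StackingClassifier(cv=...)` handles the out-of-fold step.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, holdout_idx) pairs; folds are contiguous for simplicity."""
    sizes = [n_rows // n_folds + (1 if f < n_rows % n_folds else 0)
             for f in range(n_folds)]
    start = 0
    for size in sizes:
        holdout = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, holdout
        start += size


class Memorizer:
    """Deliberately overfit toy model: remembers each training label by feature value."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.default = sum(y) / len(y)  # fallback for unseen feature values
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


def out_of_fold_predictions(model_cls, X, y, n_folds=5):
    """Predict every row with a model that was not trained on that row."""
    oof = [None] * len(X)
    for train_idx, holdout_idx in k_fold_indices(len(X), n_folds):
        model = model_cls().fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in holdout_idx])
        for i, p in zip(holdout_idx, preds):
            oof[i] = p
    return oof


X = list(range(20))     # hypothetical unique feature values
y = [i % 2 for i in X]  # hypothetical labels

in_sample = Memorizer().fit(X, y).predict(X)    # reproduces y exactly: leaked labels
oof = out_of_fold_predictions(Memorizer, X, y)  # honest: falls back to the train mean
```

A meta-learner fed `in_sample` would conclude that the memorizer is perfect and lean on it entirely; fed `oof`, it sees that the model carries no signal on unseen rows.
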
## 🧪 Verification / duplicates

- Verified (Breiman 2001, Chen 2016, Lakshminarayanan 2017).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | ENSEMBLE auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: bagging/boosting/stacking plus RF / XGB / LGB / CB / MoE code |