| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-ensemble-methods | Ensemble Methods | 10_Wiki/Topics | verified | self | | none | A | 0.98 | applied | | | 2026-05-10 | pending | |
Ensemble Methods
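In one line
"Combining several models usually outperforms any single one." Bagging lowers variance, boosting lowers bias, and stacking trains a meta-model on the base models' predictions. In current practice, GBDT libraries (XGBoost, LightGBM, CatBoost) dominate tabular data, and deep ensembles are used for uncertainty estimation.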
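Core
Types
- Bagging: models trained in parallel on bootstrap samples (Random Forest).
- Boosting: models trained sequentially, each correcting the errors of the previous ones (XGBoost).
- Stacking: a meta-learner fuses the outputs of the base models.
- Voting / averaging: a simple combination of predictions.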
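The variance argument behind bagging and averaging can be checked with a small simulation. The sketch below is a dependency-free illustration, not a benchmark: `noisy_prediction` and `prediction_variance` are hypothetical helpers, and each "model" is idealized as the true value plus independent Gaussian noise. Real bagged models are correlated, so the reduction in practice is smaller than the roughly 1/n seen here.

```python
import random
import statistics

def noisy_prediction(rng, truth=1.0, noise_sd=1.0):
    """One hypothetical model's prediction: the truth plus independent Gaussian noise."""
    return truth + rng.gauss(0.0, noise_sd)

def prediction_variance(n_models, n_trials=2000, seed=0):
    """Empirical variance of the average of n_models independent predictions."""
    rng = random.Random(seed)
    averages = [
        statistics.fmean(noisy_prediction(rng) for _ in range(n_models))
        for _ in range(n_trials)
    ]
    return statistics.pvariance(averages)

single_var = prediction_variance(n_models=1)
ensemble_var = prediction_variance(n_models=10)
print(f"single model: {single_var:.3f}, 10-model average: {ensemble_var:.3f}")
```

Averaging leaves the expected prediction unchanged, which is why bagging mainly helps high-variance base learners such as deep trees and does little for bias.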
Well-known methods
- Random Forest (Breiman 2001).
- AdaBoost (Freund & Schapire 1997).
- Gradient Boosting (Friedman 1999).
- XGBoost (Chen & Guestrin 2016).
- LightGBM (Ke 2017).
- CatBoost (Yandex).
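To make the sequential "fit the residuals" idea behind gradient boosting concrete, here is a minimal sketch for squared loss with one-split decision stumps on a single feature. It is an illustration under simplifying assumptions, not how XGBoost, LightGBM, or CatBoost are implemented: those libraries add regularization, second-order information, and histogram-based splits. The helper names (`fit_stump`, `boost`, `predict`, `mse`) are hypothetical.

```python
import math

def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(xs, ys, n_rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Shrink each stump's contribution by the learning rate.
        preds = [p + lr * (lv if x <= t else rv) for x, p in zip(xs, preds)]
    return base, stumps

def predict(base, stumps, x, lr=0.1):
    """Sum the shrunken stump outputs; lr must match the value used in boost()."""
    return base + sum(lr * (lv if x <= t else rv) for t, lv, rv in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression target: one period of a sine wave.
xs = [i * 2 * math.pi / 99 for i in range(100)]
ys = [math.sin(x) for x in xs]
base, stumps = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, [predict(base, stumps, x) for x in xs])
print(f"mean-only MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
```

Each round reduces training error a little, which is also why training error alone is a poor stopping signal and a validation set is needed.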
Modern context
- Tabular: GBDT is still generally state of the art (vs. deep learning).
- Kaggle: ensembles are a staple of winning solutions.
- Deep ensembles: uncertainty estimation (Lakshminarayanan).
- Mixture of Experts (MoE): a sparse ensemble, used in LLMs.
Applications
- Tabular classification / regression.
- Anomaly detection: Isolation Forest.
- Click prediction: GBDT.
- Risk scoring: finance.
- Search ranking: LambdaMART.
- Survival analysis: random survival forest.
💻 Patterns
Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_leaf=5,
    n_jobs=-1,
    class_weight='balanced',
).fit(X_train, y_train)
print(rf.feature_importances_)
XGBoost
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'device': 'cuda',  # needs XGBoost >= 2.0 and a CUDA GPU; use 'cpu' otherwise
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc',
}
# Stop when validation AUC has not improved for 50 rounds.
model = xgb.train(params, dtrain, num_boost_round=2000,
                  evals=[(dval, 'val')], early_stopping_rounds=50)
LightGBM (faster)
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
}
model = lgb.train(params, train_data, num_boost_round=2000,
                  valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])
CatBoost (categorical native)
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    # Column names assume X_train is a DataFrame containing these columns.
    cat_features=['city', 'device', 'segment'],
    eval_metric='AUC',
    early_stopping_rounds=50,
).fit(X_train, y_train, eval_set=(X_val, y_val))
Stacking (sklearn)
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('xgb', xgb.XGBClassifier(n_estimators=200)),
        ('lgb', lgb.LGBMClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-learner is trained on out-of-fold base predictions
)
stack.fit(X_train, y_train)
Voting (simple)
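from sklearn.ensemble import VotingClassifier

# rf, xgb_clf, and lgb_clf are estimator instances defined elsewhere;
# VotingClassifier fits fresh clones of them.
vote = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb_clf), ('lgb', lgb_clf)],
    voting='soft',  # average predicted probabilities
    weights=[1, 2, 1],
).fit(X_train, y_train)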
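Hard-coded voting weights are a guess until they are checked on held-out data. As a minimal sketch of choosing a blend weight by validation, the following grid-searches a single weight between two models' predicted probabilities and keeps the one with the lowest validation log loss. The function names and the toy numbers are made up for illustration; with more than two models the same idea extends to a search over weight vectors.

```python
import math

def log_loss(y_true, probs, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_blend_weight(y_val, probs_a, probs_b, steps=20):
    """Grid-search w in [0, 1] for the blend w * a + (1 - w) * b on validation data."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blended = [w * a + (1 - w) * b for a, b in zip(probs_a, probs_b)]
        loss = log_loss(y_val, blended)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy validation labels and two models' predicted probabilities (made-up numbers).
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
probs_a = [0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.7, 0.3]
probs_b = [0.6, 0.4, 0.9, 0.5, 0.1, 0.3, 0.4, 0.2]
w, loss = best_blend_weight(y_val, probs_a, probs_b)
print(f"best weight on model A: {w:.2f}, validation log loss: {loss:.4f}")
```

Because the grid includes w = 0 and w = 1, the selected blend is never worse on the validation set than either model alone; a separate test set is still needed for an unbiased estimate of the blended model.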
Deep ensemble (uncertainty)
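import torch

def deep_ensemble_predict(models, x):
    # Stack per-model outputs: shape (n_models, batch, ...).
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    mean = preds.mean(0)
    epistemic = preds.var(0)  # disagreement across members as an epistemic-uncertainty proxy
    return mean, epistemic

# Five models trained from different seeds; train_one is a user-supplied training function.
models = [train_one(seed=s) for s in range(5)]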
Snapshot ensemble (cheap)
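import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def snapshot_ensemble(model, X_train, y_train, n_snapshots=5, cycle_len=10):
    # One optimizer and one SGDR scheduler shared by all cycles (lr=0.1 is an assumed value).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=cycle_len)
    snapshots = []
    for snap in range(n_snapshots):
        for epoch in range(cycle_len):
            # train_one_epoch is a user-supplied helper that runs one epoch with this optimizer.
            train_one_epoch(model, X_train, y_train, optimizer)
            scheduler.step()
        # End of a cosine cycle: the model has just finished its lowest-learning-rate epoch.
        snapshots.append(copy.deepcopy(model))
    return snapshots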
Isolation Forest (anomaly)
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, n_estimators=200).fit(X)
anomalies = iso.predict(X)  # -1 = anomaly, 1 = normal
Bagging custom
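import numpy as np

def bagging(X, y, base_model_cls, n_estimators=10):
    # X and y are assumed to be NumPy arrays so that integer-array indexing works.
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(X) row indices with replacement.
        idx = np.random.choice(len(X), size=len(X), replace=True)
        m = base_model_cls().fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    # Average class probabilities across the bagged models.
    return np.mean([m.predict_proba(X) for m in models], axis=0)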
Out-of-Bag (OOB) validation
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
print(rf.oob_score_)  # a "free" hold-out estimate from rows each tree never saw
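The OOB score works because each bootstrap sample leaves out a sizable share of the rows. A short simulation, using only the standard library and a hypothetical helper `oob_fraction`, shows that the left-out share is close to the theoretical limit (1 - 1/n)^n → 1/e ≈ 0.368:

```python
import math
import random

def oob_fraction(n_samples, seed=0):
    """Share of rows that one bootstrap sample of size n_samples never draws."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(in_bag) / n_samples

frac = oob_fraction(10_000)
print(f"out-of-bag fraction: {frac:.3f} (theory: {math.exp(-1):.3f})")
```

So roughly a third of the rows are unseen by any given tree, and each row is scored only by the trees that did not train on it.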
MoE (sparse ensemble, LLM)
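import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, n_experts=8, top_k=2, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)                               # (..., n_experts)
        top_k_v, top_k_i = logits.topk(self.top_k, dim=-1)  # (..., top_k)
        weights = top_k_v.softmax(-1)                       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # A ModuleList cannot be indexed by a tensor, so loop over experts and mask tokens.
        for e, expert in enumerate(self.experts):
            chosen = top_k_i == e                           # (..., top_k) bool
            routed = chosen.any(-1)                         # tokens routed to expert e
            if routed.any():
                w = (weights * chosen).sum(-1, keepdim=True)
                out[routed] += w[routed] * expert(x[routed])
        return out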
Decision criteria
| Situation | Method |
|---|---|
| Tabular (small) | Random Forest |
| Tabular (large) | XGBoost / LightGBM |
| Tabular + categorical | CatBoost |
| Anomaly | Isolation Forest |
| Production deep | Snapshot ensemble |
| Uncertainty | Deep ensemble |
| LLM scale | MoE |
| Kaggle competition | Stacking + diverse base |
Defaults: tabular = LightGBM/XGBoost with early stopping; categorical = CatBoost; deep = snapshot or 5x ensemble; uncertainty = deep ensemble.
🔗 Graph
- Parent: Machine-Learning
- Variants: Random-Forest · XGBoost · LightGBM · Stacking
- Applications: Anomaly-Detection
- Adjacent: Mixture-of-Experts · Deep-Ensemble · Bagging · Boosting
🤖 LLM usage
When: tabular ML; click or risk models; whenever uncertainty estimates are needed. When not: vision or language tasks, where deep learning wins.
❌ Anti-patterns
- Training without early stopping: the GBDT overfits.
- Stacking without cross-validation: labels leak into the meta-learner.
- Blend weights chosen without validation: effectively random.
- The same model repeated 5 times: no diversity.
- Massive ensemble at inference: latency.
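The stacking leak comes from training the meta-learner on predictions the base models made for rows they had already seen. A minimal, dependency-free sketch of the usual remedy, out-of-fold predictions, follows. All names here (`k_fold_indices`, `out_of_fold_predictions`, and the toy 1-nearest-neighbour learner) are hypothetical, and the alternating labels are chosen to make the gap obvious. scikit-learn's `StackingClassifier` with `cv=5` produces its meta-features this way internally.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(X, y, fit, predict, k=5):
    """Predict every row with a model that never saw that row during fitting."""
    oof = [None] * len(X)
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in fold:
            oof[i] = predict(model, X[i])
    return oof

# Hypothetical base learner: a 1-nearest-neighbour memorizer on a 1-D feature.
def fit_1nn(X, y):
    return list(zip(X, y))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # alternating labels: nothing a neighbour can generalize

in_sample = [predict_1nn(fit_1nn(X, y), x) for x in X]
oof = out_of_fold_predictions(X, y, fit_1nn, predict_1nn, k=5)
in_sample_acc = sum(p == t for p, t in zip(in_sample, y)) / len(y)
oof_acc = sum(p == t for p, t in zip(oof, y)) / len(y)
print(f"in-sample accuracy: {in_sample_acc:.2f}, out-of-fold accuracy: {oof_acc:.2f}")
```

The in-sample predictions look perfect because the learner memorized the labels, so a meta-learner trained on them would over-trust this base model; the out-of-fold predictions show what it actually contributes on unseen rows.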
🧪 Verification / duplicates
- Verified (Breiman 2001, Chen 2016, Lakshminarayanan 2017).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | ENSEMBLE auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: bagging/boosting/stacking + RF / XGB / LGB / CB / MoE code |