2nd/10_Wiki/Topics/AI_and_ML/Ensemble-Methods.md
id: wiki-2026-0508-ensemble-methods
title: Ensemble Methods
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: ensemble, bagging, boosting, stacking, random forest, XGBoost, LightGBM, CatBoost
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
verification_status: applied
tags: machine-learning, ensemble, bagging, boosting, stacking, gbm, xgboost
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: scikit-learn / XGBoost / LightGBM / CatBoost

Ensemble Methods

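In one line

"Combining several models typically outperforms any single one." Bagging reduces variance, boosting reduces bias, and stacking fuses base-model outputs through a meta-learner. In modern practice, GBDT libraries (XGBoost, LightGBM, CatBoost) dominate tabular data, and deep ensembles are used for uncertainty estimation.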

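Key points

Types

  • Bagging: parallel training on bootstrap samples (e.g. Random Forest).
  • Boosting: sequential training, each model correcting its predecessors' errors (e.g. XGBoost).
  • Stacking: a meta-learner fuses the outputs of the base models.
  • Voting / averaging: simple combination of predictions.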
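
The variance-reduction effect behind bagging and averaging can be illustrated with a small simulation. This is a minimal sketch that assumes each model's error is independent zero-mean Gaussian noise around the true value; `TRUE_VALUE`, `noisy_model`, `ensemble_prediction`, and `prediction_variance` are hypothetical names made up for the demo. Real bagged models are correlated, so the reduction seen in practice is smaller.

```python
import random
import statistics

TRUE_VALUE = 3.0  # quantity every model tries to predict (assumed for the demo)


def noisy_model(rng, noise_std=1.0):
    """One unbiased but high-variance model: truth plus independent noise."""
    return TRUE_VALUE + rng.gauss(0.0, noise_std)


def ensemble_prediction(rng, n_models):
    """Average the predictions of n_models independent models."""
    return sum(noisy_model(rng) for _ in range(n_models)) / n_models


def prediction_variance(n_models, n_trials=2000, seed=0):
    """Empirical variance of the ensemble prediction over repeated trials."""
    rng = random.Random(seed)
    preds = [ensemble_prediction(rng, n_models) for _ in range(n_trials)]
    return statistics.pvariance(preds)


var_single = prediction_variance(n_models=1)
var_bagged = prediction_variance(n_models=25)
print(f"single model variance:     {var_single:.3f}")
print(f"25-model average variance: {var_bagged:.3f}")
```

Under the independence assumption the variance of the average shrinks roughly as 1/M for M models, while the mean prediction (and hence the bias) is unchanged.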

Well-known methods

  • Random Forest (Breiman 2001).
  • AdaBoost (Freund & Schapire 1997).
  • Gradient Boosting (Friedman 1999).
  • XGBoost (Chen & Guestrin 2016).
  • LightGBM (Ke et al. 2017).
  • CatBoost (Yandex).
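
The gradient boosting idea shared by several entries in this list can be sketched in a few lines. The following is a toy illustration, not any library's implementation: it assumes squared-error loss, one-dimensional sorted inputs, and depth-1 trees (stumps). The helper names `fit_stump` and `mse` are hypothetical. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
import math


def fit_stump(xs, residuals):
    """Find the threshold split that minimizes squared error on the residuals.

    Assumes xs is sorted in increasing order.
    """
    best = None
    for i in range(1, len(xs)):
        threshold = (xs[i - 1] + xs[i]) / 2
        left, right = residuals[:i], residuals[i:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [i * 0.1 for i in range(63)]  # sorted inputs on [0, 6.2]
ys = [math.sin(x) for x in xs]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(xs)  # round 0: predict the mean
mse_history = [mse(ys, preds)]

for _ in range(20):
    # Residuals are what the current ensemble still gets wrong.
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    mse_history.append(mse(ys, preds))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 20 rounds: {mse_history[-1]:.4f}")
```

With squared-error loss the residual equals the negative gradient, which is why fitting residuals is the simplest instance of gradient boosting. The training error never increases from round to round here, which is also why held-out validation is needed to decide when to stop.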

Modern context

  • Tabular: GBDT is still generally state of the art, ahead of deep learning.
  • Kaggle: ensembles feature in most winning solutions.
  • Deep ensembles: uncertainty estimation (Lakshminarayanan et al. 2017).
  • Mixture of Experts (MoE): a sparse ensemble, used in LLMs.

Applications

  1. Tabular classification / regression.
  2. Anomaly detection: Isolation Forest.
  3. Click prediction: GBDT.
  4. Risk scoring: finance.
  5. Search ranking: LambdaMART.
  6. Survival analysis: random survival forest.

💻 Patterns

Random Forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=20,
    min_samples_leaf=5,
    n_jobs=-1,
    class_weight='balanced',
).fit(X_train, y_train)

print(rf.feature_importances_)

XGBoost

import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'device': 'cuda',  # GPU training; needs XGBoost >= 2.0 and a CUDA device (remove to train on CPU)
    'eta': 0.05,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'auc',
}
model = xgb.train(params, dtrain, num_boost_round=2000,
                   evals=[(dval, 'val')], early_stopping_rounds=50)
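
The `early_stopping_rounds=50` argument above tells the library to stop once the validation metric has not improved for 50 consecutive rounds. The stopping rule itself can be sketched as follows. This is a simplified illustration for a higher-is-better metric, not the library's actual implementation; `early_stopping_round` and the `val_auc` values are hypothetical.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stopped_round) for a metric where higher is better.

    Training stops at the first round that is `patience` rounds past the
    best round without having beaten the best score.
    """
    best_round, best_score = 0, float("-inf")
    for round_idx, score in enumerate(val_scores):
        if score > best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_scores) - 1


# Hypothetical validation AUC per boosting round: improves, then degrades.
val_auc = [0.70, 0.74, 0.77, 0.79, 0.80, 0.795, 0.79, 0.788, 0.785, 0.78]
best, stopped = early_stopping_round(val_auc, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

The model used for prediction should be the one from the best round, not the round where training stopped.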

LightGBM (faster)

import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

params = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
}
model = lgb.train(params, train_data, num_boost_round=2000,
                   valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])

CatBoost (categorical native)

from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.05,
    depth=6,
    cat_features=['city', 'device', 'segment'],
    eval_metric='AUC',
    early_stopping_rounds=50,
).fit(X_train, y_train, eval_set=(X_val, y_val))

Stacking (sklearn)

import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200)),
        ('xgb', xgb.XGBClassifier(n_estimators=200)),
        ('lgb', lgb.LGBMClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
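
The `cv=5` setting matters because the meta-learner should be trained on predictions that each base model made for rows it did not see during fitting. The sketch below illustrates the idea with a deliberately leaky toy base model (1-nearest-neighbour on one feature). All names here (`kfold_indices`, `one_nn_predict`, `out_of_fold_predictions`, `accuracy`) are hypothetical and the data is synthetic.

```python
import random


def kfold_indices(n_rows, n_splits):
    """Split row indices 0..n_rows-1 into n_splits disjoint, interleaved folds."""
    return [list(range(k, n_rows, n_splits)) for k in range(n_splits)]


def one_nn_predict(train_x, train_y, x):
    """Toy base model: return the label of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def out_of_fold_predictions(xs, ys, n_splits=5):
    """Meta-feature for each row, predicted by a model that never saw that row."""
    oof = [None] * len(xs)
    for held_out in kfold_indices(len(xs), n_splits):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        for i in held_out:
            oof[i] = one_nn_predict(train_x, train_y, xs[i])
    return oof


def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [int(x + rng.gauss(0, 0.3) > 0.5) for x in xs]  # noisy synthetic labels

in_sample = [one_nn_predict(xs, ys, x) for x in xs]  # leaky: each row sees itself
oof = out_of_fold_predictions(xs, ys)

print(f"in-sample accuracy (leaky):    {accuracy(in_sample, ys):.2f}")
print(f"out-of-fold accuracy (honest): {accuracy(oof, ys):.2f}")
```

A meta-learner trained on the in-sample column would learn to trust this base model completely, because it simply echoes the training label. The out-of-fold column reflects how the base model behaves on unseen data.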

Voting (simple)

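from sklearn.ensemble import VotingClassifier
# rf, xgb_clf, lgb_clf: classifier instances defined elsewhere (VotingClassifier fits clones of them)
vote = VotingClassifier(
    estimators=[('rf', rf), ('xgb', xgb_clf), ('lgb', lgb_clf)],
    voting='soft',  # average the predicted probabilities
    weights=[1, 2, 1],
).fit(X_train, y_train)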
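
What `voting='soft'` with `weights=[1, 2, 1]` computes for a single sample can be written out by hand. This is an illustrative sketch with made-up probabilities; `soft_vote` is a hypothetical helper, not part of scikit-learn.

```python
def soft_vote(prob_rows, weights):
    """Weighted average of per-model class probabilities, then argmax."""
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * probs[c] for w, probs in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: avg[c])
    return avg, label


# Hypothetical probabilities from three models for one sample, classes [0, 1].
probs = [
    [0.60, 0.40],  # rf
    [0.30, 0.70],  # xgb (weight 2)
    [0.55, 0.45],  # lgb
]
avg, label = soft_vote(probs, weights=[1, 2, 1])
print(avg, label)
```

In this example two of the three models lean towards class 0, so an unweighted majority (hard) vote would pick class 0. The weighted soft vote picks class 1 because the double-weighted model is confident.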

Deep ensemble (uncertainty)

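import torch

def deep_ensemble_predict(models, x):
    # Stack each member's prediction: shape (n_models, batch, ...)
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    mean = preds.mean(0)
    epistemic = preds.var(0)  # disagreement between members
    return mean, epistemic

# Five models trained from different random seeds
# (train_one is a user-supplied training function, not shown here)
models = [train_one(seed=s) for s in range(5)]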
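
For classifiers, the variance across members is one option. Another common choice is an entropy-based split of the total uncertainty into a data (aleatoric) part and a model-disagreement (epistemic) part. The sketch below assumes each member outputs a probability vector for one input; `entropy` and `decompose_uncertainty` are hypothetical helper names and the member outputs are made up.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(member_probs):
    """Return (total, aleatoric, epistemic) in nats for one input."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    mean_probs = [sum(m[c] for m in member_probs) / n_members for c in range(n_classes)]
    total = entropy(mean_probs)  # entropy of the averaged prediction
    aleatoric = sum(entropy(m) for m in member_probs) / n_members  # mean member entropy
    return total, aleatoric, total - aleatoric  # remainder: disagreement


# Members agree that the input is ambiguous: data noise, no model disagreement.
agree = [[0.5, 0.5]] * 5
# Members are individually confident but contradict each other.
disagree = [[0.99, 0.01], [0.01, 0.99]] * 2 + [[0.5, 0.5]]

total_a, alea_a, epi_a = decompose_uncertainty(agree)
total_d, alea_d, epi_d = decompose_uncertainty(disagree)
print(f"agree:    total={total_a:.3f} epistemic={epi_a:.3f}")
print(f"disagree: total={total_d:.3f} epistemic={epi_d:.3f}")
```

Both cases have the same averaged prediction of 50/50, yet only the second one signals that more training data or a better model could help.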

Snapshot ensemble (cheap)

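import copy
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def snapshot_ensemble(model, optimizer, X_train, y_train, n_snapshots=5, cycle_len=10):
    # One SGDR schedule for the whole run; the learning rate restarts every cycle_len epochs
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=cycle_len)
    snapshots = []
    for snap in range(n_snapshots):
        for epoch in range(cycle_len):
            # train_one_epoch is a user-supplied helper: one pass over the training data
            train_one_epoch(model, optimizer, X_train, y_train)
            scheduler.step()
        # End of a cosine cycle: the last epoch ran at the cycle's lowest learning rate
        snapshots.append(copy.deepcopy(model))
    return snapshots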

Isolation Forest (anomaly)

from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, n_estimators=200).fit(X)
anomalies = iso.predict(X)  # -1 = anomaly, 1 = normal
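
The intuition behind Isolation Forest is that anomalies are separated from the rest of the data by fewer random splits than normal points. The sketch below shows this on one-dimensional synthetic data. It is a toy version under simplifying assumptions (single feature, no subsampling, no score normalization), and `isolation_depth` and `mean_depth` are hypothetical names.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Number of random splits needed to isolate `point` within `data` (1-D)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the points that fall on the same side of the split as `point`.
    same_side = [x for x in data if (x < split) == (point < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(256)] + [8.0]  # one obvious outlier at 8.0
median_point = sorted(data)[len(data) // 2]

outlier_depth = mean_depth(8.0, data)
inlier_depth = mean_depth(median_point, data)
print(f"outlier mean depth: {outlier_depth:.1f}")
print(f"inlier mean depth:  {inlier_depth:.1f}")
```

A shorter average path corresponds to a more anomalous point. The `contamination` parameter in the scikit-learn call then sets the score threshold that separates the two labels.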

Bagging custom

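import numpy as np

# X and y are NumPy arrays; base_model_cls is a scikit-learn style estimator class
def bagging(X, y, base_model_cls, n_estimators=10):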
    models = []
    for _ in range(n_estimators):
        idx = np.random.choice(len(X), size=len(X), replace=True)
        m = base_model_cls().fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    return np.mean([m.predict_proba(X) for m in models], axis=0)

Out-of-Bag (OOB) validation

rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
print(rf.oob_score_)  # accuracy on out-of-bag rows: a hold-out estimate at no extra cost
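
The out-of-bag estimate works because each bootstrap sample leaves out a sizeable share of the rows, so every tree has unseen rows to be scored on. A quick simulation shows how large that share is. This is a sketch with a hypothetical helper `oob_fraction`; it only simulates the sampling, not a forest.

```python
import random


def oob_fraction(n_rows, rng):
    """Fraction of rows left out of one bootstrap sample of size n_rows."""
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(drawn) / n_rows


rng = random.Random(0)
fractions = [oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
expected = (1 - 1 / 1000) ** 1000  # probability a given row is never drawn

print(f"simulated out-of-bag fraction: {mean_oob:.3f}")
print(f"theoretical value:             {expected:.3f}")
```

For large datasets the left-out share approaches 1/e, a little over a third of the rows per tree.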

MoE (sparse ensemble, LLM)

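import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, n_experts=8, top_k=2, dim=512):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)                                # (..., n_experts)
        top_k_v, top_k_i = logits.topk(self.top_k, dim=-1)   # (..., top_k)
        weights = top_k_v.softmax(-1)                        # renormalize over the selected experts
        out = torch.zeros_like(x)
        # A ModuleList cannot be indexed with a tensor, so loop over experts
        # and route to each one only the tokens that selected it.
        for e, expert in enumerate(self.experts):
            selected = top_k_i == e                          # (..., top_k) boolean mask
            routed = selected.any(-1)                        # tokens that picked expert e
            if routed.any():
                w = (weights * selected).sum(-1, keepdim=True)
                out[routed] += w[routed] * expert(x[routed])
        return out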

Decision criteria

Situation              Method
Tabular (small)        Random Forest
Tabular (large)        XGBoost / LightGBM
Tabular + categorical  CatBoost
Anomaly                Isolation Forest
Production deep        Snapshot ensemble
Uncertainty            Deep ensemble
LLM scale              MoE
Kaggle competition     Stacking + diverse base models

Defaults: for tabular data, LightGBM or XGBoost with early stopping; for categorical features, CatBoost; for deep models, a snapshot ensemble or a 5-model ensemble; for uncertainty, a deep ensemble.

🔗 Graph

🤖 LLM usage

When to use: tabular ML, click and risk models, or when uncertainty estimates are needed. When not to use: vision and language tasks, where deep learning wins.

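Anti-patterns

  • Training without early stopping: the GBDT overfits.
  • Stacking without cross-validation: the meta-learner trains on leaked predictions.
  • Blend weights chosen without validation: effectively random.
  • Five copies of the same model: no diversity.
  • A massive ensemble at inference: latency.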
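
The diversity point can be made precise with a standard identity. As a simplifying assumption, suppose the M members f_1, ..., f_M each have prediction variance σ² and every pair has the same correlation ρ. Then the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{M}\sum_{i=1}^{M} f_i(x)\right)
  = \frac{1}{M^2}\left(M\sigma^2 + M(M-1)\,\rho\,\sigma^2\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2
```

Only the second term shrinks as M grows. With ρ = 1 (identical models) averaging gives no reduction at all, and with ρ = 0 the variance falls as σ²/M. This is why adding members pays off only while they remain decorrelated, and why very large ensembles add latency for little gain once the ρσ² floor dominates.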

🧪 Verification / duplicates

  • Verified (Breiman 2001, Chen 2016, Lakshminarayanan 2017).
  • Trust level: A.

🕓 Changelog

Date        Change
2026-04-26  ENSEMBLE auto
2026-05-08  Phase 1
2026-05-10  Manual cleanup: bagging/boosting/stacking, plus RF / XGB / LGB / CB / MoE code