---
id: wiki-2026-0508-boosting-xgboost-lightgbm
title: Boosting Algorithms (XGBoost / LightGBM / CatBoost)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [boosting, gradient boosting, GBM, XGBoost, LightGBM, CatBoost, AdaBoost, ensemble]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [ml, boosting, xgboost, lightgbm, catboost, ensemble, tabular-data, kaggle, gradient-boosting]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: XGBoost / LightGBM / CatBoost / scikit-learn
---

# Boosting Algorithms

## 📌 One-Line Insight

> **"A legion of wrong-answer notebooks."** Weak learners are trained sequentially, and each new learner gives more weight to the errors of the previous ones. Boosting is still king on tabular data (vs deep learning) and the default choice on Kaggle. XGBoost / LightGBM / CatBoost form the trinity.

## 📖 Core Concepts

### Algorithm history

1. **AdaBoost** (1995): reweights misclassified samples each round.
2. **Gradient Boosting** (Friedman 1999): fits residuals.
3. **XGBoost** (2014): regularization + parallelism.
4. **LightGBM** (2017): GOSS + EFB.
5. **CatBoost** (2017): ordered boosting + categorical features.

### Gradient Boosting Machine (GBM): the origin

- Models are added sequentially.
- Each step fits the negative gradient of the loss (the pseudo-residual; for squared error it is the ordinary residual).
- Stage-wise: trees already added are not revisited.

### XGBoost (eXtreme Gradient Boosting)

- Regularization (L1, L2) on leaf weights.
- Second-order Taylor expansion of the loss.
- Sparsity-aware split finding.
- Parallel computation (across features during split search).
- Built-in missing-value handling.
- Cache-aware.

### LightGBM (Microsoft)

- **GOSS** (Gradient-based One-Side Sampling): keeps the high-gradient samples and randomly subsamples the low-gradient ones.
- **EFB** (Exclusive Feature Bundling): merges mutually exclusive sparse features.
- Leaf-wise growth (vs level-wise) → deeper trees.
- Typically the fastest; friendly to large datasets.

### CatBoost (Yandex)

- Native categorical feature support.
- Ordered boosting → mitigates target leakage.
- GPU support.
- Good defaults.

### Hyperparameters (cross-tool)

#### Tree

- `max_depth` (XGBoost) / `num_leaves` (LightGBM): `max_depth` is typically 5-10; keep `num_leaves` below 2^`max_depth`.
- `min_child_weight` / `min_data_in_leaf`: prevents overfitting.

#### Learning

- `learning_rate` (η): 0.01-0.3. A smaller value needs more trees.
- `n_estimators` / `num_boost_round`: 100-10000.
- `subsample`: row sampling (0.7-1.0).
- `colsample_bytree`: feature sampling per tree.

#### Regularization

- `reg_alpha` (L1): encourages sparsity.
- `reg_lambda` (L2): shrinks leaf weights.
- `gamma` / `min_split_loss`: minimum loss reduction required to split.

### Preventing overfitting

- **Early stopping**: stop when the validation metric plateaus.
- **Low learning rate + many trees**: the best practice.
- **Subsample rows + columns**.
- **Regularization** (`reg_alpha`, `reg_lambda`).
- **Max depth limit**.
- **Min child weight**.

### Tabular dominance (vs DL)

- Small-to-medium tabular data: boosting > NN.
- Categorical / mixed features: CatBoost wins.
- Large tabular data: LightGBM is fast.
- Image / text / audio: NNs dominate.
- Reason: tabular data lacks the spatial structure and invariances that NNs exploit.

### Modern competitors

- **TabNet, FT-Transformer**: tabular NNs.
- They come close, but boosting still matches them.
- Kaggle 2024-2026: LightGBM + ensembling dominates.
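
### Stage-wise fitting (toy sketch)

The stage-wise residual fitting described for GBM, together with the regularized leaf weight listed for XGBoost, can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not how any of the three libraries is implemented: it assumes squared-error loss (so the negative gradient is exactly the residual and the Hessian is 1), one-dimensional inputs, and depth-1 stumps. The names `fit_stump`, `fit_gbm`, `mse`, and `lam`, and the toy data, are hypothetical.

```python
def fit_stump(xs, residuals, lam=0.0):
    """Fit a depth-1 tree (stump) to residuals on 1-D inputs.

    Each leaf predicts sum(residuals) / (count + lam). With lam=0 this is the
    plain mean; lam > 0 shrinks the leaf value toward zero (an L2 penalty).
    """
    def leaf(vals):
        return sum(vals) / (len(vals) + lam) if vals else 0.0

    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        wl, wr = leaf(left), leaf(right)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, wl, wr)

    if best is None:  # all x identical: fall back to a single leaf
        w = leaf(list(residuals))
        return lambda x: w
    _, t, wl, wr = best
    return lambda x: wl if x <= t else wr


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1, lam=0.0):
    """Stage-wise boosting: each stump fits the current residuals."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - p)^2 with respect to p is (y - p).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals, lam)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy step-shaped data (hypothetical).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

mean_y = sum(ys) / len(ys)
baseline = mse(lambda x: mean_y, xs, ys)   # error of the constant model
model = fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1)

print("baseline MSE:", baseline)
print("boosted MSE: ", mse(model, xs, ys))
```

With squared error, the leaf value `sum(residuals) / (count + lam)` has the same form as the second-order regularized leaf weight -G / (H + λ), since G is the negative sum of residuals and H is the sample count. A smaller `learning_rate` needs more rounds to reach the same training error, which is the trade-off noted for `learning_rate` and `n_estimators`.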
## 💻 Patterns

### XGBoost (basic)

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X, y (training data) and X_test (held-out features) are assumed to be loaded already.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50,  # set in the constructor in recent XGBoost releases
    n_jobs=-1,
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
preds = model.predict_proba(X_test)[:, 1]  # probability of the positive class
```

### LightGBM (fast)

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,          # just under 2^6: a depth-6 tree has at most 64 leaves
    min_child_samples=20,
    feature_fraction=0.8,   # alias of colsample_bytree
    bagging_fraction=0.8,   # alias of subsample
    bagging_freq=5,         # alias of subsample_freq; bagging is off when this is 0
    reg_alpha=0.1,
    reg_lambda=0.1,
    objective='binary',
    metric='auc',
    n_jobs=-1,
    random_state=42,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)
```

### CatBoost (categorical-friendly)

```python
from catboost import CatBoostClassifier

# Column names of the categorical features (X_train is assumed to be a DataFrame).
cat_features = ['gender', 'country', 'product_id']

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    cat_features=cat_features,
    eval_metric='AUC',
    early_stopping_rounds=50,
    random_seed=42,
    verbose=100,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```

### Hyperparameter tuning (Optuna)

```python
import lightgbm as lgb
import optuna

def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',  # needed so best_score_ contains an 'auc' entry
        'n_estimators': 5000,
        'learning_rate': trial.suggest_float('lr', 0.01, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
        'min_child_samples': trial.suggest_int('mcs', 5, 100),
        'feature_fraction': trial.suggest_float('ff', 0.5, 1.0),
        'bagging_fraction': trial.suggest_float('bf', 0.5, 1.0),
        'bagging_freq': 1,  # bagging_fraction has no effect unless bagging_freq > 0
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
    }
    model = lgb.LGBMClassifier(**params, n_jobs=-1, random_state=42)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        callbacks=[lgb.early_stopping(50, verbose=False)],
    )
    return model.best_score_['valid_0']['auc']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

### SHAP (interpretability)

```python
import shap

# model is a fitted tree model from one of the patterns above.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Depending on the shap version and model, a binary classifier may return one
# array per class; if so, select the positive class before plotting.

# Global importance
shap.summary_plot(shap_values, X_test)

# Single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
```

### Stacking (meta-ensemble)

```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

# `...` stands for each model's own hyperparameters.
estimators = [
    ('xgb', xgb.XGBClassifier(...)),
    ('lgb', lgb.LGBMClassifier(...)),
    ('cat', CatBoostClassifier(...)),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),  # meta-learner on out-of-fold predictions
    cv=5,
    n_jobs=-1,
)
stack.fit(X_train, y_train)
```

→ A common winning combination on Kaggle.

### GPU acceleration

```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

# XGBoost GPU (the `device` parameter is available from XGBoost 2.0)
xgb.XGBClassifier(tree_method='hist', device='cuda')

# LightGBM GPU (requires a GPU-enabled LightGBM build)
lgb.LGBMClassifier(device='gpu')

# CatBoost GPU
CatBoostClassifier(task_type='GPU', devices='0')
```

## 🤔 Decision Criteria

| Situation | Tool |
|---|---|
| Default tabular | LightGBM |
| Small-medium dataset | XGBoost |
| Categorical-heavy | CatBoost |
| Large dataset (10M+) | LightGBM (GOSS) |
| GPU available | XGBoost / CatBoost GPU |
| Kaggle | LightGBM + ensemble |
| Production simple | LightGBM (fast) |
| Interpretability | XGBoost + SHAP |

**Default**: LightGBM as the baseline. CatBoost when categorical features dominate. Stack them for an ensemble.

## 🔗 Graph

- Parent: [[Ensemble-Methods]] · [[Decision Tree]] · [[데이터 사이언스 및 ML 엔지니어링|Gradient-Descent]]
- Variants: [[XGBoost]] · [[LightGBM]] · [[CatBoost]] · [[AdaBoost]] · [[GBM]]
- Applications: [[Kaggle]] · [[SHAP]] · [[Stacking]]
- Adjacent: [[Random-Forest]] · [[Bagging]] · [[Bias vs Variance Trade-off]] · [[Optuna]]

## 🤖 LLM Usage

**When to use**: tabular tasks.
Fraud detection. Kaggle. Risk scoring. Conversion prediction.

**When not to use**: image / text / audio (use DL). Sequences (RNN / Transformer).

## ❌ Anti-patterns

- **No early stopping**: overfits.
- **High learning rate (0.5+)**: unstable training.
- **Trusting defaults**: tune for the specific task.
- **One-hot encoding high-cardinality categoricals**: forfeits CatBoost's native handling.
- **No SHAP**: the model cannot be interpreted.
- **Forcing DL onto tabular data**: gives up boosting's advantage.
- **Single tool**: misses the gains from ensembling.

## 🧪 Verification / Duplicates

- Verified (Chen XGBoost, Ke LightGBM, Prokhorenkova CatBoost, Kaggle dominance).
- Trust level A.
- Related: [[XGBoost]] · [[LightGBM]] · [[CatBoost]] · [[Random-Forest]] · [[Bias vs Variance Trade-off]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: XGB / LGB / Cat + hyperparameters + SHAP + stacking + tuning code |