2nd/10_Wiki/Topics/AI_and_ML/Boosting-Algorithms-XGBoost-LightGBM.md

id: wiki-2026-0508-boosting-xgboost-lightgbm
title: Boosting Algorithms (XGBoost / LightGBM / CatBoost)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: boosting, gradient boosting, GBM, XGBoost, LightGBM, CatBoost, AdaBoost, ensemble
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: ml, boosting, xgboost, lightgbm, catboost, ensemble, tabular-data, kaggle, gradient-boosting
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: XGBoost / LightGBM / CatBoost / scikit-learn

Boosting Algorithms

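📌 One-Line Insight

"An army of error notebooks." Weak learners are trained sequentially, and each one puts more weight on the errors its predecessors made. Boosting is still the king of tabular data (vs. deep learning) and the default choice on Kaggle, with XGBoost / LightGBM / CatBoost forming the trinity.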

📖 Core Concepts

Algorithm history

  1. AdaBoost (1995): re-weights training samples so that misclassified ones count more.
  2. Gradient Boosting (Friedman 1999): fits each new learner to the residuals.
  3. XGBoost (2014): adds regularization and parallel training.
  4. LightGBM (2017): adds GOSS and EFB.
  5. CatBoost (2017): adds ordered boosting and native categorical handling.

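Origin: Gradient Boosting Machine (GBM)

  • Models are added sequentially.
  • At each step, the new model is fit to the negative gradient of the loss (the residual, for squared error).
  • The ensemble is built stage-wise: earlier models are never revised.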
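
The three points above can be sketched in a few lines. This is a minimal illustration, assuming squared-error loss (where the negative gradient is exactly the residual) and one-split "stumps" on a single feature as the weak learner. The function names (`fit_stump`, `fit_gbm`, `predict_gbm`) and the toy data are made up for this sketch and do not come from any library.

```python
# Gradient boosting for squared-error regression with one-split "stumps"
# as weak learners. Pure Python, for illustration only.

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Stage-wise fit: each stump is trained on the current residuals."""
    base = sum(ys) / len(ys)           # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)           # earlier stumps are never revised
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (hypothetical): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]

model = fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1)
train_mse = mse(ys, [predict_gbm(model, x) for x in xs])
```

Comparing `train_mse` for different values of `n_rounds` shows the training error falling as stumps accumulate. That is also why the overfitting controls later in this note matter.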

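XGBoost (eXtreme Gradient Boosting)

  • L1 and L2 regularization on leaf weights.
  • Second-order Taylor expansion of the loss (uses both gradient and Hessian).
  • Sparsity-aware split finding.
  • Parallel computation of split candidates across features.
  • Built-in handling of missing values.
  • Cache-aware data access.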
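
The second-order expansion and the leaf-weight regularization can be written out explicitly. The notation below follows the XGBoost paper and is not defined elsewhere in this note: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction for sample $i$, $I_j$ is the set of samples in leaf $j$, $w_j$ is the leaf weight, and $T$ is the number of leaves. Only the L2-only case is shown (that is, `reg_alpha = 0`); $\lambda$ corresponds to `reg_lambda` and $\gamma$ to `gamma` / `min_split_loss`.

```latex
% Approximate objective at boosting round t for a new tree f_t
\tilde{\mathcal{L}}^{(t)} = \sum_{i} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right]
  + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

% With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i,
% the optimal weight of leaf j is
w_j^{*} = -\frac{G_j}{H_j + \lambda}

% Gain of splitting a node into left (L) and right (R) children
\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
```

Read this way, a larger $\lambda$ shrinks every leaf weight toward zero, and a split is kept only when its gain before the $\gamma$ penalty exceeds $\gamma$.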

LightGBM (Microsoft)

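  • GOSS (Gradient-based One-Side Sampling): keeps every high-gradient sample and randomly subsamples the low-gradient ones.
  • EFB (Exclusive Feature Bundling): merges sparse features that are rarely non-zero at the same time.
  • Leaf-wise (best-first) tree growth instead of level-wise, which produces deeper, unbalanced trees.
  • Typically the fastest of the three, and well suited to large datasets.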
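
The GOSS idea can be sketched as a sampling rule. This is a minimal illustration, not LightGBM's implementation: `goss_sample` is a hypothetical helper, and only the parameter names `top_rate` and `other_rate` are borrowed from LightGBM. The sampled low-gradient rows are up-weighted by (1 - top_rate) / other_rate so that their total contribution stays roughly unbiased.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) selected by a GOSS-style rule."""
    n = len(gradients)
    n_top = round(n * top_rate)
    n_other = round(n * other_rate)

    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx = order[:n_top]            # always kept
    rest = order[n_top:]               # candidates for random sampling

    rng = random.Random(seed)
    other_idx = rng.sample(rest, min(n_other, len(rest)))

    # Up-weight the sampled small-gradient rows to compensate for the
    # rows that were dropped.
    amplify = (1.0 - top_rate) / other_rate
    indices = top_idx + other_idx
    weights = [1.0] * len(top_idx) + [amplify] * len(other_idx)
    return indices, weights


# Toy gradients (hypothetical): row i has gradient 0.01 * i.
gradients = [0.01 * i for i in range(100)]
indices, weights = goss_sample(gradients)
```

With the toy gradients, the 20 largest-gradient rows are always selected, and 10 of the remaining 80 are drawn at random and carry the larger weight.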

CatBoost (Yandex)

  • Native support for categorical features.
  • Ordered boosting, which mitigates target leakage.
  • GPU support.
  • Good defaults: it works well with little tuning.
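
The leakage that ordered boosting guards against is easiest to see in target encoding of a categorical feature. The sketch below shows the ordered target statistic idea under simplifying assumptions: a single random permutation, a binary target, and a hypothetical helper `ordered_target_stats` with illustrative `prior` and `prior_weight` arguments. CatBoost's own implementation differs in detail, for example by using several permutations.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row's category using only rows that precede it
    in a random permutation, so a row never sees its own target."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)

    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        s = sums.get(c, 0.0)
        k = counts.get(c, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[i] = (s + prior_weight * prior) / (k + prior_weight)
        # Only now is the row's own target added to the running totals.
        sums[c] = s + targets[i]
        counts[c] = k + 1
    return encoded


# Toy data (hypothetical).
categories = ["a", "b", "a", "a", "b", "c", "a", "b"]
targets = [1, 0, 1, 0, 1, 1, 1, 0]
encoded = ordered_target_stats(categories, targets)
```

Because each encoded value is computed before the row's own target enters the running totals, changing one row's label never changes that row's encoding, which is the property a plain per-category mean lacks.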

Hyperparameters (cross-tool)

Tree

  • max_depth (XGBoost) / num_leaves (LightGBM): max_depth is typically 5-10; num_leaves should stay below 2^max_depth.
  • min_child_weight / min_data_in_leaf: prevents overfitting by blocking very small leaves.

Learning

  • learning_rate (η): 0.01-0.3. A smaller rate needs more trees.
  • n_estimators / num_boost_round: 100-10,000.
  • subsample: fraction of rows sampled per tree (0.7-1.0).
  • colsample_bytree: fraction of features sampled per tree.

Regularization

  • reg_alpha (L1): encourages sparsity.
  • reg_lambda (L2): shrinks leaf weights.
  • gamma / min_split_loss: minimum loss reduction required to make a split.
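
Since the same knob goes by different names in each library, a small lookup table can help when porting a configuration. The mapping below is a sketch based on the scikit-learn style constructors of XGBoost and LightGBM and on CatBoost's constructor. `PARAM_NAMES` and `translate` are hypothetical helpers, and the mapped parameters are the closest counterparts rather than exact equivalents, so values should be re-tuned after porting.

```python
# Generic name -> library-specific keyword (None = no direct counterpart).
PARAM_NAMES = {
    "n_estimators":     {"xgboost": "n_estimators",     "lightgbm": "n_estimators",     "catboost": "iterations"},
    "learning_rate":    {"xgboost": "learning_rate",    "lightgbm": "learning_rate",    "catboost": "learning_rate"},
    "max_depth":        {"xgboost": "max_depth",        "lightgbm": "max_depth",        "catboost": "depth"},
    # LightGBM only applies row subsampling when subsample_freq > 0.
    "subsample":        {"xgboost": "subsample",        "lightgbm": "subsample",        "catboost": "subsample"},
    # CatBoost's rsm samples features per split selection, not per tree.
    "colsample_bytree": {"xgboost": "colsample_bytree", "lightgbm": "colsample_bytree", "catboost": "rsm"},
    "reg_alpha":        {"xgboost": "reg_alpha",        "lightgbm": "reg_alpha",        "catboost": None},
    "reg_lambda":       {"xgboost": "reg_lambda",       "lightgbm": "reg_lambda",       "catboost": "l2_leaf_reg"},
    "min_split_loss":   {"xgboost": "gamma",            "lightgbm": "min_split_gain",   "catboost": None},
}


def translate(params, library):
    """Rename generic parameter names for one library, dropping
    parameters that have no direct counterpart there."""
    out = {}
    for name, value in params.items():
        target = PARAM_NAMES[name][library]
        if target is not None:
            out[target] = value
    return out


generic = {"n_estimators": 500, "learning_rate": 0.05,
           "reg_alpha": 0.1, "reg_lambda": 1.0, "min_split_loss": 0.0}
catboost_params = translate(generic, "catboost")
lightgbm_params = translate(generic, "lightgbm")
```

The resulting dictionaries can be unpacked into the matching constructor, for example `CatBoostClassifier(**catboost_params)`.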

Preventing overfitting

  • Early stopping: stop when the validation metric plateaus.
  • Low learning rate plus many trees: the usual best practice.
  • Subsample both rows and columns.
  • Regularization (reg_alpha, reg_lambda).
  • Limit the maximum depth.
  • Raise the minimum child weight.
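
Early stopping, the first item above, amounts to a patience counter over the validation metric. The sketch below is a library-independent illustration. `best_iteration` is a hypothetical helper and the scores are made up. The libraries' own early-stopping options apply the same idea during training.

```python
def best_iteration(val_scores, patience=50, higher_is_better=True):
    """Return (best_index, stopped_at): the round with the best
    validation score and the round at which training would halt."""
    best_idx, best = 0, val_scores[0]
    stopped_at = 0
    for i in range(1, len(val_scores)):
        stopped_at = i
        score = val_scores[i]
        improved = score > best if higher_is_better else score < best
        if improved:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx, stopped_at


# Hypothetical validation AUC per boosting round.
scores = [0.70, 0.75, 0.80, 0.79, 0.78, 0.81, 0.80]
short_patience = best_iteration(scores, patience=2)
long_patience = best_iteration(scores, patience=3)
```

On these scores a patience of 2 halts before the later improvement at round 5 is seen, while a patience of 3 reaches it. This is the trade-off behind pairing a low learning rate with a generous patience value.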

Tabular dominance (vs. DL)

  • Small to medium tabular data: boosting usually beats neural networks.
  • Categorical or mixed-type features: CatBoost tends to win.
  • Large tabular data: LightGBM is the fast option.
  • Image / text / audio: neural networks dominate.
  • Reason: tabular data has no spatial structure or invariances for a neural network to exploit.

Modern competitors

  • TabNet, FT-Transformer: neural networks designed for tabular data.
  • They come close, but boosting still matches them.
  • Kaggle 2024-2026: LightGBM plus ensembling remains dominant.

💻 Patterns

XGBoost (basic)

import xgboost as xgb
from sklearn.model_selection import train_test_split

# X, y: feature matrix and binary labels, assumed to be loaded already.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42,
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
# X_test: a separate held-out set, assumed to exist.
preds = model.predict_proba(X_test)[:, 1]

LightGBM (fast)

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,           # 2^6 - 1: just under a full depth-6 tree
    min_child_samples=20,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
    reg_alpha=0.1,
    reg_lambda=0.1,
    objective='binary',
    metric='auc',
    n_jobs=-1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)

CatBoost (categorical-friendly)

from catboost import CatBoostClassifier

cat_features = ['gender', 'country', 'product_id']

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    cat_features=cat_features,
    eval_metric='AUC',
    early_stopping_rounds=50,
    random_seed=42,
    verbose=100,
)

model.fit(X_train, y_train, eval_set=(X_val, y_val))

Hyperparameter tune (Optuna)

import lightgbm as lgb
import optuna

def objective(trial):
    params = {
        'n_estimators': 5000,
        # Without metric='auc', best_score_ holds binary_logloss and has no 'auc' key.
        'metric': 'auc',
        'learning_rate': trial.suggest_float('lr', 0.01, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
        'min_child_samples': trial.suggest_int('mcs', 5, 100),
        'feature_fraction': trial.suggest_float('ff', 0.5, 1.0),
        'bagging_fraction': trial.suggest_float('bf', 0.5, 1.0),
        # bagging_fraction is ignored unless bagging_freq > 0.
        'bagging_freq': 1,
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
    }
    model = lgb.LGBMClassifier(**params, n_jobs=-1, random_state=42)
    # Early stopping caps the effective number of trees for each trial.
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    return model.best_score_['valid_0']['auc']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

SHAP (interpretability)

import shap

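explainer = shap.TreeExplainer(model)
# Depending on the model and shap version, a binary classifier may yield
# one set of values per class; if so, select the positive class before plotting.
shap_values = explainer.shap_values(X_test)

# Global importance
shap.summary_plot(shap_values, X_test)

# Single prediction (first row of X_test)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])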

Stacking (meta-ensemble)

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

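# The `...` placeholders stand for each model's hyperparameters.
# Leave out early_stopping_rounds here: StackingClassifier calls fit() without an eval_set.
estimators = [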
    ('xgb', xgb.XGBClassifier(...)),
    ('lgb', lgb.LGBMClassifier(...)),
    ('cat', CatBoostClassifier(...)),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    n_jobs=-1,
)
stack.fit(X_train, y_train)

→ A common winning combination on Kaggle.

GPU acceleration

# XGBoost GPU (the device parameter requires XGBoost 2.0+)
xgb.XGBClassifier(tree_method='hist', device='cuda')

# LightGBM GPU (requires a GPU-enabled LightGBM build)
lgb.LGBMClassifier(device='gpu')

# CatBoost GPU
CatBoostClassifier(task_type='GPU', devices='0')

🤔 Decision Criteria

| Situation | Tool |
|---|---|
| Default tabular | LightGBM |
| Small-medium dataset | XGBoost |
| Categorical-heavy | CatBoost |
| Large dataset (10M+) | LightGBM (GOSS) |
| GPU available | XGBoost / CatBoost GPU |
| Kaggle | LightGBM + ensemble |
| Production simple | LightGBM (fast) |
| Interpretability | XGBoost + SHAP |

Default: start with LightGBM as the baseline. Switch to CatBoost when categorical features dominate. Stack the models when an ensemble is worth the cost.

🔗 Graph

🤖 LLM Usage

When to use: tabular tasks, fraud detection, Kaggle, risk scoring, conversion prediction.
When not to use: image / text / audio (use DL), sequence data (use RNN / Transformer).

Anti-patterns

  • No early stopping: the model overfits.
  • High learning rate (0.5+): training becomes unstable.
  • Trusting the defaults: tune for the specific problem.
  • One-hot encoding high-cardinality categoricals: forfeits CatBoost's native handling.
  • No SHAP: the model cannot be interpreted.
  • Forcing DL onto tabular data: gives up boosting's advantage.
  • Relying on a single tool: gives up the gains from ensembling.

🧪 Verification / Duplicates

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: XGB / LGB / Cat + hyperparameters + SHAP + stacking + tuning code |