Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

8.3 KiB

Boosting Algorithms

📌 One-line insight

"An army of error notebooks." Weak learners are trained sequentially, and each round puts more weight on the errors of the previous ones. Still king on tabular data (vs deep learning). The default choice on Kaggle. The trinity: XGBoost / LightGBM / CatBoost.

📖 Core

Algorithm history

AdaBoost (1995): re-weights (or re-samples) the training examples, raising the weight of misclassified ones.
Gradient Boosting (Friedman 1999): fits each new model to the residuals.
XGBoost (2014): regularization + parallelism.
LightGBM (2017): GOSS + EFB.
CatBoost (2017): ordered boosting + categorical features.

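Origin: the Gradient Boosting Machine (GBM)

Models are added sequentially.
At each step, the new model is fit to the negative gradient of the loss (the residual, for squared error).
The ensemble is built stage-wise: earlier models are not revisited.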
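
The three points above can be sketched in plain Python. This is a teaching sketch under stated assumptions, not how any library implements it: squared-error loss, a single 1-D feature, and decision stumps as the weak learner. The names `fit_stump` and `fit_gbm` are hypothetical.

```python
# Minimal gradient boosting for squared-error regression on 1-D inputs.
# fit_stump / fit_gbm are hypothetical helper names for illustration only.

def fit_stump(xs, residuals):
    """Return the one-split regressor that minimizes squared error on residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (needs >= 2 distinct x)
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # stage 0: constant prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)  # stage-wise: earlier stumps are never refit
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model = fit_gbm(xs, ys, n_rounds=100, learning_rate=0.1)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

A smaller `learning_rate` shrinks each stump's contribution, which is why it usually has to be paired with more rounds.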

XGBoost (eXtreme Gradient Boosting)

Regularization (L1, L2) on leaf weights.
Second-order Taylor expansion of the loss.
Sparsity-aware split finding.
Parallel computation (across features).
Built-in missing value handling.
Cache-aware data layout.
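
To make the second-order expansion and the L2 term concrete, here is a small sketch of the closed-form leaf weight and split gain that follow from them. It assumes only the L2 penalty (reg_lambda) and the split penalty (gamma); the L1 term is left out for brevity. `G` and `H` are the sums of first and second derivatives of the loss over the samples in a leaf, and the function names are hypothetical.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf weight under the second-order objective: w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children, minus gamma."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda)
    children = leaf_score(GL, HL, reg_lambda) + leaf_score(GR, HR, reg_lambda)
    return 0.5 * (children - parent) - gamma


# Squared-error loss gives g_i = prediction - y_i and h_i = 1, so a leaf with
# four samples that are each under-predicted by 1.0 has G = -4.0 and H = 4.0.
w = leaf_weight(G=-4.0, H=4.0, reg_lambda=1.0)
gain = split_gain(GL=-4.0, HL=4.0, GR=4.0, HR=4.0, reg_lambda=1.0, gamma=0.0)
```

A larger reg_lambda pulls `w` toward zero, and a split is only worth making when the gain stays positive after subtracting gamma.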

LightGBM (Microsoft)

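GOSS (Gradient-based One-Side Sampling): keeps all high-gradient samples and randomly subsamples the low-gradient ones.
EFB (Exclusive Feature Bundling): merges mutually exclusive sparse features into a single feature.
Leaf-wise (vs level-wise) tree growth → deeper, less balanced trees.
Typically the fastest of the three, and friendly to large datasets.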
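
A rough sketch of the GOSS idea, assuming the usual description: keep the top fraction of samples by absolute gradient, randomly sample a fraction of the rest, and up-weight the sampled ones by (1 - top_rate) / other_rate so the gradient sums stay approximately unbiased. The function name and default rates are illustrative, not LightGBM's internals.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) for one boosting round, GOSS-style."""
    n = len(gradients)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top = order[:n_top]                      # always kept
    rest = order[n_top:]                     # low-gradient pool
    sampled = random.Random(seed).sample(rest, n_other)
    amplify = (1.0 - top_rate) / other_rate  # compensates for the dropped rows
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights


grads = [0.05 * i for i in range(100)]       # toy gradients, larger index = larger gradient
idx, w = goss_sample(grads)
```

With these rates only 30 of the 100 rows are used to find splits, while the 20 rows with the largest gradients are always among them.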

CatBoost (Yandex)

Native handling of categorical features.
Ordered boosting → mitigates target leakage.
GPU support.
Good defaults out of the box.
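
The leakage point can be illustrated with an ordered target statistic: each row's category is encoded using only the targets of rows that come before it, so a row never sees its own label. This is a simplified sketch; it treats the given row order as the permutation (CatBoost draws random permutations), and `prior` and `strength` are illustrative smoothing parameters rather than CatBoost's actual option names.

```python
def ordered_target_stats(categories, targets, prior=0.5, strength=1.0):
    """Encode each row from the targets of earlier rows with the same category."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded.append((s + strength * prior) / (c + strength))
        # Only now does this row's own target become visible to later rows.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stats(cats, ys)
# enc == [0.5, 0.5, 0.75, 0.5, 0.25]
```

A plain target mean per category would give every "a" row the same value computed from its own label, which is exactly the leakage the ordered scheme avoids.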

Hyperparameters (cross-tool)

Tree

max_depth (XGBoost): typically 5-10. num_leaves (LightGBM): the leaf-count counterpart (default 31); keep it below 2^max_depth.
min_child_weight / min_data_in_leaf: guard against over-fitting.

Learning

learning_rate (η): 0.01-0.3. A smaller value needs more trees.
n_estimators / num_boost_round: 100-10000.
subsample: row sampling ratio (0.7-1.0).
colsample_bytree: feature sampling ratio per tree.

Regularization

reg_alpha (L1): encourages sparsity.
reg_lambda (L2): shrinks leaf weights.
gamma / min_split_loss: minimum loss reduction required to make a split.

Preventing over-fitting

Early stopping: stop when the validation metric plateaus.
Low learning rate + many trees: the usual best practice.
Subsample row + col.
Regularization (reg_alpha, reg_lambda).
Max depth limit.
Min child weight.
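
The early-stopping rule at the top of this list can be sketched as a patience check over per-round validation scores. This is an illustrative helper (the name `best_iteration` and its arguments are hypothetical), assuming the common behavior: stop once the metric has not improved for `patience` consecutive rounds and keep the best round seen so far.

```python
def best_iteration(val_scores, patience, higher_is_better=True):
    """Return the index of the best round, stopping after `patience` stale rounds."""
    best_idx, best = 0, val_scores[0]
    for i, score in enumerate(val_scores[1:], start=1):
        improved = score > best if higher_is_better else score < best
        if improved:
            best_idx, best = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx


# Toy validation AUC per boosting round.
scores = [0.70, 0.75, 0.78, 0.79, 0.788, 0.785, 0.789, 0.80]
stop = best_iteration(scores, patience=3)
# stop == 3: rounds 4-6 never beat 0.79, so training halts before round 7
```

The example also shows the trade-off: with patience=3 the later improvement at round 7 is never reached, so patience should be generous when the learning rate is low.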

Tabular dominance (vs DL)

Small to medium tabular data: boosting > neural networks.
Categorical / mixed features: CatBoost tends to win.
Large tabular data: LightGBM is fast.
Image / text / audio: neural networks dominate.
Reason: tabular data lacks the spatial structure and invariances that neural networks exploit.

Modern competitors

TabNet, FT-Transformer: neural networks for tabular data.
They come close, but boosting still matches them.
Kaggle 2024-2026: LightGBM + ensembles remain dominant.

💻 Patterns

XGBoost (basic)

import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    min_child_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    objective='binary:logistic',
    eval_metric='auc',
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42,
)

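# eval_set drives early stopping; verbose=100 logs every 100 rounds.
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)
# X_test is assumed to be a separate held-out feature matrix (not defined above).
preds = model.predict_proba(X_test)[:, 1]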

LightGBM (fast)

import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,           # 2^6 - 1; keep below 2^max_depth
    min_child_samples=20,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
    reg_alpha=0.1,
    reg_lambda=0.1,
    objective='binary',
    metric='auc',
    n_jobs=-1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)

CatBoost (categorical-friendly)

from catboost import CatBoostClassifier

cat_features = ['gender', 'country', 'product_id']

model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3,
    cat_features=cat_features,
    eval_metric='AUC',
    early_stopping_rounds=50,
    random_seed=42,
    verbose=100,
)

model.fit(X_train, y_train, eval_set=(X_val, y_val))

Hyperparameter tune (Optuna)

import optuna

def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',      # needed so best_score_ has an 'auc' entry
        'n_estimators': 5000,
        'learning_rate': trial.suggest_float('lr', 0.01, 0.1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 16, 256),
        'min_child_samples': trial.suggest_int('mcs', 5, 100),
        'feature_fraction': trial.suggest_float('ff', 0.5, 1.0),
        'bagging_fraction': trial.suggest_float('bf', 0.5, 1.0),
        'bagging_freq': 1,    # bagging_fraction has no effect unless bagging_freq > 0
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
    }
    model = lgb.LGBMClassifier(**params, n_jobs=-1, random_state=42)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(50, verbose=False)])
    return model.best_score_['valid_0']['auc']

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

SHAP (interpretability)

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

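# Global importance
shap.summary_plot(shap_values, X_test)

# Single prediction (first row). Depending on the shap version and model type,
# shap_values may be a per-class list; if so, select the positive-class array first.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])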

Stacking (meta-ensemble)

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

estimators = [
    ('xgb', xgb.XGBClassifier(...)),
    ('lgb', lgb.LGBMClassifier(...)),
    ('cat', CatBoostClassifier(...)),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
    n_jobs=-1,
)
stack.fit(X_train, y_train)

→ A common winning combination on Kaggle.

GPU acceleration

# XGBoost GPU
xgb.XGBClassifier(tree_method='hist', device='cuda')

# LightGBM GPU
lgb.LGBMClassifier(device='gpu')

# CatBoost GPU
CatBoostClassifier(task_type='GPU', devices='0')

🤔 Decision criteria

Situation	Tool
Default tabular	LightGBM
Small-medium dataset	XGBoost
Categorical-heavy	CatBoost
Large dataset (10M+)	LightGBM (GOSS)
GPU available	XGBoost / CatBoost GPU
Kaggle	LightGBM + ensemble
Production simple	LightGBM (fast)
Interpretability	XGBoost + SHAP

Default: start with LightGBM as the baseline. Switch to CatBoost when categorical features dominate. Stack the models for an ensemble.

🔗 Graph

Parents: Ensemble-Methods · Decision-Tree · Gradient-Descent
Variants: XGBoost · LightGBM · CatBoost · AdaBoost · GBM
Applications: Kaggle · Tabular-ML · SHAP · Stacking
Adjacent: Random-Forest · Bagging · Bias-vs-Variance · Optuna

🤖 LLM usage

When to use: tabular tasks, fraud detection, Kaggle, risk scoring, conversion prediction. When not to use: image / text / audio (deep learning), sequence data (RNN / Transformer).

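❌ Anti-patterns

No early stopping: over-fits.
High learning rate (0.5+): unstable training.
Trusting the defaults: tune for the specific problem.
One-hot encoding high-cardinality categoricals: gives up the advantage of CatBoost's native handling.
No SHAP: the model cannot be interpreted.
Forcing deep learning onto tabular data: usually loses to boosting.
Single tool only: misses the gains from an ensemble.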

🧪 Verification / duplicates

Verified (Chen XGBoost, Ke LightGBM, Prokhorenkova CatBoost, Kaggle dominance).
Confidence: A.
Related: XGBoost · LightGBM · CatBoost · Random-Forest · Bias-vs-Variance.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: XGB / LGB / Cat + hyperparameters + SHAP + stacking + tuning code
