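"An army of wrong-answer notebooks." Boosting trains weak learners sequentially, and each new learner concentrates on the errors of the previous ones (AdaBoost raises the weights of misclassified samples; gradient boosting fits the residuals). It is still the strongest default for tabular data, ahead of deep learning, and the default choice on Kaggle. The three main implementations are XGBoost, LightGBM, and CatBoost.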
📖 Key points
Algorithm history
AdaBoost (1995): re-weights misclassified samples each round.
Gradient Boosting (Friedman, 1999): fits each new learner to the residuals.
XGBoost (2014): adds regularization and parallelized training.
LightGBM (2017): GOSS and EFB.
CatBoost (2017): ordered boosting and native categorical handling.
Origin: the Gradient Boosting Machine (GBM)
Models are added sequentially.
Each step fits the negative gradient of the loss (the residual, for squared loss).
Training is stage-wise: learners already added stay fixed.
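The stage-wise procedure above can be sketched in a few lines. This is a minimal illustration, assuming squared loss, a single feature, and depth-1 trees (stumps); the function names and toy data are hypothetical and not taken from any library.

```python
# Minimal stage-wise gradient boosting for squared loss with depth-1 trees
# (stumps) on one feature. Pure Python; all names are illustrative.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # split after each distinct value but the last
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # F_0: constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)  # earlier stumps are never revisited (stage-wise)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.1]
mse_few = mse(fit_gbm(xs, ys, n_rounds=5), xs, ys)
mse_many = mse(fit_gbm(xs, ys, n_rounds=100), xs, ys)
```

On this toy data the training error with 100 rounds is lower than with 5, which is the expected effect of adding more shrunken learners. Training error alone says nothing about overfitting.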
XGBoost (Extreme Gradient Boosting)
Regularization (L1, L2) on leaf weights.
Second-order Taylor expansion of the loss.
Sparsity-aware split finding.
Parallel split finding across features (trees are still added sequentially).
Built-in missing value handling.
Cache-aware data layout.
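To make the second-order and regularization points concrete, here is a small sketch of the leaf weight and split gain that follow from the second-order objective. It assumes the standard XGBoost formulation, with G and H the sums of first and second derivatives in a leaf and L1 applied as soft-thresholding of G; the function names are illustrative and not part of the XGBoost API.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value under the second-order approximation."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Structure score of a leaf; larger means a larger loss reduction."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right, minus gamma."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma

# Squared loss with current prediction 0: g_i = pred - y_i, h_i = 1.
# Left child holds three samples with y = 1, right child three with y = -1.
GL, HL = -3.0, 3.0
GR, HR = 3.0, 3.0
w_left = leaf_weight(GL, HL, reg_lambda=1.0)
gain = split_gain(GL, HL, GR, HR, reg_lambda=1.0)
gain_with_gamma = split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=3.0)
```

With `reg_lambda=1.0` the left leaf gets weight 0.75 rather than the unregularized mean of 1.0, and the split that gains 2.25 is rejected (negative gain) once `gamma=3.0`.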
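LightGBM (Microsoft)
GOSS (Gradient-based One-Side Sampling): keeps all high-gradient samples and randomly subsamples the low-gradient ones.
EFB (Exclusive Feature Bundling): merges mutually exclusive sparse features into one.
Leaf-wise (rather than level-wise) growth, which produces deeper, unbalanced trees.
Usually the fastest of the three to train, and well suited to large datasets.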
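The GOSS rule can be illustrated with a short sampling routine. This is a sketch of the idea as described in the LightGBM paper, not LightGBM's implementation; `goss_sample`, `top_rate`, and `other_rate` are names chosen for this example.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style rule.

    Keeps the top_rate fraction of samples with the largest |gradient|,
    draws other_rate * n samples at random from the remainder, and
    up-weights the drawn samples by (1 - top_rate) / other_rate so that
    the small-gradient part keeps its overall contribution.
    """
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights

rng = random.Random(1)
gradients = [rng.uniform(-1.0, 1.0) for _ in range(100)]
indices, weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1)
```

Here 30 of 100 samples are used for the next tree, yet the weights still sum to roughly 100, so gradient statistics stay on the same scale as the full data.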
CatBoost (Yandex)
Native support for categorical features.
Ordered boosting, which mitigates target leakage.
GPU support.
Good default settings.
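The leakage point is easiest to see with ordered target statistics, the categorical encoding CatBoost pairs with ordered boosting. The sketch below is a simplified illustration, assuming the rows are already in a random permutation order and using a single prior; `ordered_target_stats` and `prior_weight` are hypothetical names, not CatBoost parameters.

```python
def ordered_target_stats(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the rows that precede it.

    A row's own target never enters its encoding, which is what keeps
    the target from leaking into the feature.
    """
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + prior_weight * prior) / (c + prior_weight))
        # Update running statistics only after encoding the current row.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
prior = sum(targets) / len(targets)  # global mean, 0.6
encoded = ordered_target_stats(categories, targets, prior)
```

The first occurrence of each category falls back to the prior, and later rows see progressively more history. A plain (non-ordered) mean encoding would instead use every row's target, including the row being encoded.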
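Hyperparameters (across tools)
Tree
max_depth (XGBoost) / num_leaves (LightGBM): tree size. A max_depth of 5-10 is typical; num_leaves counts leaves rather than levels, so keep it below 2^max_depth.
min_child_weight / min_data_in_leaf: minimum leaf size (hessian sum or sample count), to prevent overfitting.
Learning
learning_rate (η): 0.01-0.3. Smaller values need more trees.
n_estimators / num_boost_round: 100-10000.
subsample: row sampling fraction (0.7-1.0).
colsample_bytree: feature sampling fraction per tree.
Regularization
reg_alpha (L1): encourages sparsity in leaf weights.
reg_lambda (L2): shrinks leaf weights.
gamma / min_split_loss: minimum loss reduction required to make a split.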
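Because the same knob goes by different names in each tool, a small renaming helper can be handy when porting a configuration. The mapping below covers only a subset and reflects the XGBoost and LightGBM parameter names as I understand them; the helper `to_lightgbm_params` is hypothetical, and equivalent names do not guarantee equivalent behavior or defaults.

```python
# XGBoost-style name -> LightGBM native name (illustrative subset).
XGB_TO_LGB = {
    "eta": "learning_rate",
    "subsample": "bagging_fraction",
    "colsample_bytree": "feature_fraction",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
    "gamma": "min_gain_to_split",
    "min_child_weight": "min_sum_hessian_in_leaf",
}

def to_lightgbm_params(xgb_params):
    """Rename known keys; pass unknown keys through unchanged."""
    params = {XGB_TO_LGB.get(k, k): v for k, v in xgb_params.items()}
    # LightGBM applies row subsampling only when bagging_freq > 0.
    if params.get("bagging_fraction", 1.0) < 1.0:
        params.setdefault("bagging_freq", 1)
    return params

lgb_params = to_lightgbm_params(
    {"eta": 0.05, "subsample": 0.8, "reg_lambda": 1.0, "max_depth": 6}
)
```

Note that `num_leaves` has no XGBoost counterpart in this mapping, so it still needs to be set explicitly on the LightGBM side.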
Preventing overfitting
Early stopping: stop when the validation metric plateaus.
Low learning rate with many trees: the usual best practice.
Subsample rows and columns.
Regularization (reg_alpha, reg_lambda).
Max depth limit.
Min child weight.
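Early stopping is the item on this list that is most often misunderstood, so here is the logic in isolation. It is a minimal sketch assuming one validation score per boosting round and a fixed patience; `early_stop` is a hypothetical helper, not a library function.

```python
def early_stop(val_scores, patience, higher_is_better=True):
    """Return (best_round, stopped_round) for a sequence of validation scores.

    Training stops at the first round that is `patience` rounds past the
    best round without a strict improvement.
    """
    best_idx = 0
    for i in range(1, len(val_scores)):
        if higher_is_better:
            improved = val_scores[i] > val_scores[best_idx]
        else:
            improved = val_scores[i] < val_scores[best_idx]
        if improved:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_scores) - 1

auc_per_round = [0.70, 0.75, 0.78, 0.79, 0.79, 0.785, 0.788, 0.80]
best_round, stopped_round = early_stop(auc_per_round, patience=3)
```

With `patience=3` training stops at round 6 and keeps round 3, never reaching the later score of 0.80. A patience that is too small can cut training short, which is one reason to pair early stopping with a low learning rate.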
Tabular dominance (vs. DL)
Small to medium tabular data: boosting usually beats neural networks.
Categorical or mixed features: CatBoost tends to win.
Large tabular data: LightGBM is fast.
Images, text, audio: neural networks dominate.
Reason: tabular data lacks the spatial structure and invariances that neural architectures exploit.
Modern competitors
TabNet, FT-Transformer: neural networks for tabular data.
They come close, but boosting still matches them.
Kaggle 2024-2026: LightGBM plus ensembling is dominant.
```python
import lightgbm as lgb

# X_train, y_train, X_val, y_val are assumed to be defined already.
model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.03,
    num_leaves=63,  # 2^max_depth - 1
    min_child_samples=20,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    bagging_freq=5,
    reg_alpha=0.1,
    reg_lambda=0.1,
    objective='binary',
    metric='auc',
    n_jobs=-1,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)
```
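```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: some shap versions return one array per class for binary classifiers.
# In that case, select the positive class (shap_values[1]) before plotting.

# Global importance
shap.summary_plot(shap_values, X_test)

# Single prediction (first test row)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
```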
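For intuition about what a SHAP value is, the brute-force computation below averages each feature's marginal contribution over all feature orderings. It is a toy sketch that assumes "missing" features are replaced by a fixed baseline; TreeExplainer uses a far more efficient tree-specific algorithm and may treat missing features differently. `shapley_values` and `toy_model` are hypothetical names.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every feature ordering.

    Only feasible for a handful of features: the cost grows as n!.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]        # reveal feature i
            new = f(current)
            phi[i] += new - prev     # its marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]

def toy_model(v):
    # Two additive terms plus one interaction between features 0 and 2.
    return 2.0 * v[0] + 3.0 * v[1] + v[0] * v[2]

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values sum to the difference between the prediction and the baseline output, and the interaction term is split evenly between the two features involved.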
When to use: tabular tasks, fraud detection, Kaggle competitions, risk scoring, conversion prediction.
When not to use: images, text, or audio (use DL); sequences (RNN / Transformer).
❌ Anti-patterns
No early stopping: the model overfits.
High learning rate (0.5+): training becomes unstable.
Trusting the defaults: tune for the specific task.
One-hot encoding high-cardinality categoricals: gives up what CatBoost handles natively.
No SHAP: the model cannot be interpreted.
Forcing DL onto tabular data: it loses to boosting.
Single tool: misses the gains from ensembling.
🧪 Verification / duplicates
Verified (Chen XGBoost, Ke LightGBM, Prokhorenkova CatBoost, Kaggle dominance).