---
id: wiki-2026-0508-decision-trees-and-random-forest
title: Decision Trees and Random Forests
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [decision tree, random forest, CART, Gini, entropy, bagging, ensemble, OOB]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [decision-tree, random-forest, bagging, ensemble, classical-ml, interpretable, scikit-learn]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn
---

# Decision Trees & Random Forests

## In one line

> **"An if-else tree plus an ensemble."** Interpretable and a strong baseline. CART stands for Classification And Regression Trees. A Random Forest is bagging over N trees. Compared with boosting (XGBoost, LightGBM): bagging mainly reduces variance, boosting mainly reduces bias.

## Core concepts

### Decision Tree

- Binary splits from the root down to the leaves.
- Split criteria: Gini, entropy, MSE.
- Key hyperparameters: `max_depth`, `min_samples_split`, `min_samples_leaf`.

### Split criterion

- **Gini**: the probability of misclassifying a sample from the node if it is labeled at random according to the node's class distribution.
- **Entropy**: splits are chosen to maximize information gain, i.e. the drop in entropy from parent to children.
- **MSE / variance reduction** (regression).

### CART vs ID3 vs C4.5

- **ID3** (Quinlan 1986): categorical features, entropy.
- **C4.5** (Quinlan 1993): ID3 plus support for continuous features.
- **CART** (Breiman 1984): binary splits, Gini; the approach scikit-learn uses by default.

### Random Forest (Breiman 2001)

- N trees (bagging plus random feature subsets).
- Each tree is trained on a random bootstrap sample of the data.
- Each split considers only a random subset of the features.
- Predictions are combined by vote (classification) or average (regression).
- The OOB (Out-of-Bag) error serves as a built-in validation estimate.

### Bagging vs Boosting

| Aspect | Bagging (RF) | Boosting (XGBoost) |
|---|---|---|
| Tree training | Parallel | Sequential |
| Goal | Variance ↓ | Bias ↓ |
| Sensitive to noise | Less | More |
| Default winner | Robust baseline | SOTA accuracy |

### Applications

1. **Tabular**: the baseline.
2. **Feature importance**: model interpretability.
3. **Variable selection**.
4. **Imbalanced (with class weight)**.
5. **Mixed type** (categorical + numeric).
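### Split criterion sketch

The Gini and entropy criteria listed under "Split criterion" can be made concrete with a small worked example. The sketch below is a minimal pure-Python illustration, not scikit-learn's implementation: the helper names (`gini`, `entropy`, `best_threshold`) and the toy data are assumptions introduced here. It scores every midpoint between consecutive distinct feature values and keeps the one with the largest impurity decrease.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: chance of mislabeling a random sample from this node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution at this node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels, impurity=gini):
    """Return (threshold, impurity_decrease) for the best split `x <= t`.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, 0.0) if no split reduces the impurity.
    """
    n = len(labels)
    parent = impurity(labels)
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Children are weighted by their share of the parent's samples.
        weighted = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


# Toy data (made up for illustration): one numeric feature, two classes.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]

print(gini(y), entropy(y))                     # 0.5 1.0
print(best_threshold(x, y))                    # (6.5, 0.5)
print(best_threshold(x, y, impurity=entropy))  # (6.5, 1.0)
```

On this toy data both criteria pick the same threshold, which is common in practice; they differ mainly in how strongly they penalize mixed nodes.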

### Strengths

- No feature scaling needed.
- Handles mixed feature types.
- Robust to outliers.
- Fast.
- Interpretable (single tree).

### Weaknesses

- A single tree has high variance.
- A Random Forest built from deep trees can still overfit.
- Weak on high-dimensional sparse data (e.g. NLP bag-of-words).
- Cannot extrapolate beyond the training range (regression).

## 💻 Patterns

### Decision Tree

```python
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# X_train, y_train, feature_names and class_names are assumed to be defined.
clf = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
)
clf.fit(X_train, y_train)

# Visualize the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()
```

### Random Forest

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_split=10,
    max_features='sqrt',      # consider sqrt(n_features) candidates per split
    bootstrap=True,
    oob_score=True,           # score each sample with trees that did not see it
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',  # for imbalanced classes
)
rf.fit(X_train, y_train)

print(f'OOB: {rf.oob_score_:.3f}')
print(f'Test: {rf.score(X_test, y_test):.3f}')
```

### Feature importance

```python
import pandas as pd

# Impurity-based importances from the fitted forest
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False)

print(importances.head(10))
```

### Permutation importance (more robust)

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)

# Top 10 features, keeping only those whose mean importance is
# more than two standard deviations above zero.
for i in result.importances_mean.argsort()[::-1][:10]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f'{feature_names[i]:<20} {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}')
```

### SHAP

```python
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Global view: summary over all test samples
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Local view: explanation for a single prediction
# Assumes shap_values is a list with one array per class (older shap releases);
# newer releases may return a single 3-D array, in which case the indexing differs.
shap.force_plot(explainer.expected_value[0], shap_values[0][0], X_test.iloc[0])
```

### Hyperparameter tuning (Optuna)

```python
import optuna
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_val and y_val are assumed to be defined.
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('mss', 2, 50),
        'min_samples_leaf': trial.suggest_int('msl', 1, 30),
        'max_features': trial.suggest_categorical('mf', ['sqrt', 'log2', None]),
    }
    rf = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    return rf.score(X_val, y_val)  # validation accuracy

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```

### Extra Trees (extreme RF)

```python
from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
    n_estimators=500,
    bootstrap=False,  # default: no bootstrap, each tree sees the full training set
    n_jobs=-1,
)
# More randomized than a Random Forest and faster to train.
```

### Cost-sensitive (imbalanced)

```python
from sklearn.ensemble import RandomForestClassifier

class_weight = {0: 1, 1: 10}  # 10× weight on the minority class

rf = RandomForestClassifier(
    n_estimators=300,
    class_weight=class_weight,  # a dict or 'balanced'
    n_jobs=-1,
)
```

### Decision rules extraction

```python
from sklearn.tree import _tree

def extract_rules(tree, feature_names):
    """Return a list of (path, value) pairs, one per leaf of a fitted tree."""
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else 'undefined'
        for i in tree_.feature
    ]

    def recurse(node, path):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: follow both branches, extending the rule path.
            name = feature_name[node]
            threshold = tree_.threshold[node]
            yield from recurse(tree_.children_left[node], path + [f'{name} <= {threshold:.2f}'])
            yield from recurse(tree_.children_right[node], path + [f'{name} > {threshold:.2f}'])
        else:
            # Leaf: emit the accumulated conditions and the leaf's value array.
            yield (path, tree_.value[node])

    return list(recurse(0, []))
```

## Decision criteria

| Situation | Algorithm |
|---|---|
| Quick baseline | Random Forest |
| Need interpretability | Single decision tree |
| Best accuracy (tabular) | XGBoost / LightGBM |
| Mixed types | RF |
| Imbalanced | RF + class_weight |
| Cross-functional explanation | RF + SHAP |
| Real-time inference | Decision tree (cheap) |

**Default**: Random Forest as the baseline, XGBoost as the upgrade.

## 🔗 Graph

- Parent: [[Ensemble-Methods]]
- Variants: [[CART]] · [[Random-Forest]] · [[Boosting-Algorithms-XGBoost-LightGBM]]
- Applications: [[Feature-Importance]] · [[SHAP]]
- Adjacent: [[Bias vs Variance Trade-off]] · [[Causal-Inference]] (Causal Forest) · [[Cross-Entropy Loss]]

## 🤖 LLM usage

**When to use**: tabular ML, baselines, interpretable models.

**When not to use**: image / NLP / sequence data (use a neural network); when maximum accuracy is required (use boosting).

## ❌ Anti-patterns

- **Default hyperparameters**: task-specific tuning is needed.
- **No regularization** (deep trees + small data): overfits.
- **Expecting a single tree to scale**: use an ensemble instead.
- **Feature importance from a single source**: cross-check with SHAP and permutation importance.
- **High-dim sparse data**: wrong tool.

## 🧪 Verification / Duplicates

- Verified (Breiman Random Forest 2001, scikit-learn docs, ESL).
- Trust level A.
- Related: [[Boosting-Algorithms-XGBoost-LightGBM]] · [[Bias vs Variance Trade-off]] · [[Causal-Inference]] · [[Cross-Entropy Loss]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: CART + RF + sklearn / Optuna / SHAP / rules code |