2nd/10_Wiki/Topics/AI_and_ML/Decision-Trees and Random Forests.md
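koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in name (spelling variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only title/filename normalization matches were applied; alias matching was
excluded because of the risk of over-merging (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs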

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-decision-trees-and-random-forest
title: Decision Trees and Random Forests
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: decision tree, random forest, CART, Gini, entropy, bagging, ensemble, OOB
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: decision-tree, random-forest, bagging, ensemble, classical-ml, interpretable, scikit-learn
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python, framework scikit-learn

Decision Trees & Random Forests

In one line

"An if-else tree plus an ensemble." Interpretable, and a strong baseline. CART stands for Classification And Regression Trees. A Random Forest is a bagged ensemble of N trees. Compared with boosting (XGBoost, LightGBM): bagging mainly reduces variance, while boosting mainly reduces bias.

Core concepts

Decision Tree

  • Binary splits from the root down to the leaves.
  • Split criteria: Gini, entropy, MSE.
  • Key hyperparameters: max_depth, min_samples_split, min_samples_leaf.

Split criterion

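  • Gini: the probability of misclassifying a sample from the node when its label is assigned at random according to the node's class distribution.
  • Entropy: the basis of information gain (the drop in entropy from the parent node to its children).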
  • MSE / variance reduction (regression).
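
The three criteria above can be sketched in plain Python to make the definitions concrete. This is a minimal illustration, not scikit-learn's implementation; the function names (`gini`, `entropy`, `variance`, `impurity_decrease`) are chosen here for the example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes in the node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum(p_k * log2(1 / p_k))."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def variance(values):
    """Population variance, i.e. the MSE around the node mean (regression)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = ["a", "a", "b", "b"]
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
# A perfect split removes all entropy, so the information gain is 1 bit.
print(impurity_decrease(parent, ["a", "a"], ["b", "b"], entropy))  # 1.0
```

With `impurity=entropy` the decrease is the information gain; with `impurity=variance` it is the variance reduction used for regression trees.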

CART vs ID3 vs C4.5

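  • ID3 (Quinlan 1986): categorical features, entropy-based information gain.
  • C4.5 (Quinlan 1993): ID3 plus support for continuous features.
  • CART (Breiman 1984): binary splits, Gini; the approach scikit-learn's trees follow (Gini is the default criterion).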
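
A CART-style binary split on one continuous feature can be sketched as an exhaustive search over candidate thresholds. The sketch below assumes midpoints between neighbouring sorted values as candidates and Gini as the criterion; `best_threshold` is a hypothetical helper name, and real implementations are considerably more optimized.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return (threshold, weighted_gini) for the best split of the form x <= t."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Size-weighted impurity of the two children; lower is better.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
print(best_threshold(xs, ys))  # (6.5, 0.0)
```

A full tree repeats this search over every feature at every node and recurses into the two children until a stopping rule such as max_depth or min_samples_leaf applies.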

Random Forest (Breiman 2001)

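  • N trees (bagging plus random feature subsets).
  • Each tree is trained on a random bootstrap sample of the rows.
  • Each split considers a random subset of the features.
  • Predictions are combined by vote (classification) or average (regression).
  • The OOB (Out-of-Bag) error serves as built-in validation.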
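
The sampling mechanics in the list above can be sketched without any ML library. This is only an illustration of bootstrap rows, per-split feature subsets, out-of-bag rows, and voting; the helper names are made up for the example and no trees are actually trained.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Sample n row indices with replacement; also return the out-of-bag rows."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def feature_subset(n_features, k, rng):
    """Pick k distinct candidate features for one split."""
    return rng.sample(range(n_features), k)

def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_rows = 10_000
in_bag, oob = bootstrap_indices(n_rows, rng)
# Roughly 1/e (about 37%) of the rows are left out of each bootstrap sample.
print(f"OOB fraction: {len(oob) / n_rows:.3f}")
print(feature_subset(n_features=16, k=4, rng=rng))  # sqrt(16) = 4 candidates
print(majority_vote(["spam", "ham", "spam"]))       # spam
```

Because each tree never sees its own out-of-bag rows, scoring every row only with the trees that left it out gives a validation estimate without a separate hold-out set.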

Bagging vs Boosting

| Aspect | Bagging (RF) | Boosting (XGBoost) |
| --- | --- | --- |
| Tree training | Parallel | Sequential |
| Goal | Variance ↓ | Bias ↓ |
| Sensitivity to noise | Less | More |
| Typical role | Robust baseline | Often the best accuracy on tabular data |
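
Why bagging targets variance can be made precise under a standard idealization that is not stated in this note: assume N identically distributed trees $T_i$, each with variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is then:

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} T_i(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{N}\,\sigma^2
```

Adding trees shrinks only the second term, so the floor is $\rho\,\sigma^2$. Random feature subsets at each split lower $\rho$, which is what lets a Random Forest improve on plain bagging. The bias of the individual trees is unchanged by averaging, which is the part boosting addresses.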

Applications

  1. Tabular data: the go-to baseline.
  2. Feature importance: model interpretability.
  3. Variable selection.
  4. Imbalanced data (with class weights).
  5. Mixed feature types (categorical + numeric).

Strengths

  • No feature scaling needed.
  • Mixed feature types are fine (scikit-learn still needs categoricals encoded as numbers).
  • Robust to outliers.
  • Fast.
  • Interpretable (single tree).

Weaknesses

  • A single tree has high variance.
  • Random forests built from very deep trees can still overfit.
  • Weak on high-dimensional sparse data (e.g. NLP).
  • Cannot extrapolate beyond the range of the training targets (regression).
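
The extrapolation limit follows from leaves storing a constant: any input beyond the training range lands in the outermost leaf and gets that leaf's mean. A minimal sketch with a hand-rolled 1-D regression tree (a toy written for this example, not scikit-learn's `DecisionTreeRegressor`):

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (n times the node MSE)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(xs, ys, depth=0, max_depth=3, min_leaf=2):
    """Tiny 1-D regression tree; each leaf stores the mean training target."""
    pairs = sorted(zip(xs, ys))
    leaf = {"value": sum(ys) / len(ys)}
    if depth == max_depth or len(pairs) < 2 * min_leaf:
        return leaf
    best = None  # (sum of squared errors, split position)
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal x values cannot be separated
        sse = sum_sq([y for _, y in pairs[:i]]) + sum_sq([y for _, y in pairs[i:]])
        if best is None or sse < best[0]:
            best = (sse, i)
    if best is None:
        return leaf
    i = best[1]
    left_x, left_y = zip(*pairs[:i])
    right_x, right_y = zip(*pairs[i:])
    return {
        "threshold": (pairs[i - 1][0] + pairs[i][0]) / 2,
        "left": fit_tree(left_x, left_y, depth + 1, max_depth, min_leaf),
        "right": fit_tree(right_x, right_y, depth + 1, max_depth, min_leaf),
    }

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while "threshold" in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

xs = [float(i) for i in range(20)]
ys = [2.0 * x for x in xs]        # true relationship: y = 2x
tree = fit_tree(xs, ys)
print(predict(tree, 19.0))        # inside the training range
print(predict(tree, 1000.0))      # far outside: same leaf, same value
```

Both calls print the same number even though the true value at x = 1000 is 2000. Averaging many such trees in a forest does not change this, since every tree is bounded by the targets it saw.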

💻 Patterns

Decision Tree

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
)
clf.fit(X_train, y_train)

# Visualize the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()

Random Forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_split=10,
    max_features='sqrt',  # consider sqrt(n_features) candidate features per split
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',  # reweight classes for imbalanced data
)
rf.fit(X_train, y_train)

print(f'OOB: {rf.oob_score_:.3f}')
print(f'Test: {rf.score(X_test, y_test):.3f}')

Feature importance

import pandas as pd

importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False)

print(importances.head(10))

Permutation importance (more robust)

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)

for i in result.importances_mean.argsort()[::-1][:10]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f'{feature_names[i]:<20} {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}')
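
What permutation importance measures can be sketched by hand: shuffle one column, re-score, and record how much the metric drops. The version below is a simplified stand-in with accuracy as the metric; `permutation_importance_sketch` and `toy_model` are hypothetical names for this example, and scikit-learn's `permutation_importance` should be used in practice.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(model, X_perm, y)
        drops.append(total / n_repeats)
    return drops

def toy_model(row):
    """Hypothetical model that uses feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
drops = permutation_importance_sketch(toy_model, X, y)
print(drops)  # large drop for feature 0, exactly 0.0 for feature 1
```

Because the score is measured on data rather than read from the tree structure, this approach is less biased toward high-cardinality features than impurity-based importances.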

SHAP

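import shap

explainer = shap.TreeExplainer(rf)
# Assumes an older SHAP release where shap_values is a list with one
# (n_samples, n_features) array per class. Newer releases may return a
# single 3-D array instead, in which case the indexing below needs adjusting.
shap_values = explainer.shap_values(X_test)

# Global view: which features matter across the whole test set
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Local view: explain the first test row for class 0
shap.force_plot(explainer.expected_value[0], shap_values[0][0], X_test.iloc[0])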

Hyperparameter tune (Optuna)

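import optuna
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    # Trial parameter names match the estimator's keyword arguments,
    # so study.best_params can be passed straight to RandomForestClassifier.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
    }
    rf = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    # Score on a held-out validation split, never on the test set
    return rf.score(X_val, y_val)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)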

Extra Trees (extreme RF)

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
    n_estimators=500,
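    bootstrap=False,  # the default: each tree sees the full training set
    n_jobs=-1,
)
# More randomized than a Random Forest (split thresholds are drawn at random) and usually faster to train.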

Cost-sensitive (imbalanced)

class_weight = {0: 1, 1: 10}  # give the minority class (1) 10× the weight

rf = RandomForestClassifier(
    n_estimators=300,
    class_weight=class_weight,  # a dict, or the string 'balanced'
    n_jobs=-1,
)
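
For reference, the weights that `class_weight='balanced'` stands for follow the formula documented by scikit-learn, `n_samples / (n_classes * count_of_class)`. A small plain-Python sketch of that formula (the helper name `balanced_class_weight` is made up here):

```python
from collections import Counter

def balanced_class_weight(y):
    """Weights inversely proportional to class frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y = [0] * 80 + [1] * 20
print(balanced_class_weight(y))  # {0: 0.625, 1: 2.5}
```

With an 80/20 split the minority class ends up with 4× the weight of the majority class, which is a reasonable starting point before hand-tuning a dict such as the one above.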

Decision rules extraction

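from sklearn.tree import _tree  # private module: constants may move between scikit-learn versions

def extract_rules(tree, feature_names):
    """Return one (conditions, leaf_value) pair per leaf of a fitted tree."""
    tree_ = tree.tree_
    # Map each node to the name of the feature it splits on (leaves have none)
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else 'undefined'
        for i in tree_.feature
    ]

    def recurse(node, path):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: follow both branches, extending the condition path
            name = feature_name[node]
            threshold = tree_.threshold[node]
            yield from recurse(tree_.children_left[node], path + [f'{name} <= {threshold:.2f}'])
            yield from recurse(tree_.children_right[node], path + [f'{name} > {threshold:.2f}'])
        else:
            # Leaf: emit the accumulated conditions and the leaf's value array
            yield (path, tree_.value[node])

    return list(recurse(0, []))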

Decision criteria

| Situation | Algorithm |
| --- | --- |
| Quick baseline | Random Forest |
| Need interpretability | Single decision tree |
| Best accuracy (tabular) | XGBoost / LightGBM |
| Mixed types | RF |
| Imbalanced | RF + class_weight |
| Cross-functional explanation | RF + SHAP |
| Real-time inference | Decision tree (cheap) |

Default: Random Forest as the baseline, XGBoost as the upgrade.

🔗 Graph

🤖 LLM usage

When to use: tabular ML, baselines, interpretable models.
When not to use: images / NLP / sequences (use neural networks); cases where accuracy is the overriding goal (use boosting).

Anti-patterns

  • Default hyperparameters: task-specific tuning is needed.
  • No regularization (deep trees + small data): overfits.
  • Expecting a single tree to scale: an ensemble is needed.
  • Relying on a single source of feature importance: cross-check with SHAP / permutation importance.
  • High-dimensional sparse data: wrong tool.

🧪 Verification / Duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: CART + RF + sklearn / Optuna / SHAP / rules code |