
Decision Trees & Random Forests

One-line summary

"If-else trees plus ensembles": interpretable and a strong baseline. CART stands for Classification And Regression Trees. A Random Forest is a bagged ensemble of N trees. Compared with boosting (XGBoost, LightGBM): bagging mainly reduces variance, while boosting mainly reduces bias.

Core concepts

Decision Tree

Binary splits from the root down to the leaves.
Split criteria: Gini, entropy, MSE.
Hyperparameters: max_depth, min_samples_split, min_samples_leaf.

Split criterion

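Gini: the probability of misclassifying a randomly drawn sample when it is labeled at random according to the node's class distribution.
Entropy: the basis for information gain (the drop in entropy after a split).
MSE / variance reduction (regression).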
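
The three criteria can be written out directly. This is a minimal pure-Python sketch for illustration, not scikit-learn's implementation; the function names (`gini`, `entropy`, `variance`, `impurity_decrease`) are chosen here and do not come from any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum_k p_k * log2(1 / p_k)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def variance(values):
    """Population variance, the regression analogue of impurity (MSE around the mean)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted

parent = ["a", "a", "b", "b"]
print(gini(parent))      # 0.5
print(entropy(parent))   # 1.0
# With entropy as the impurity, the decrease is the information gain of the split.
print(impurity_decrease(parent, ["a", "a"], ["b", "b"], entropy))  # 1.0
```

A perfect split drives both children to zero impurity, so the decrease equals the parent's impurity.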

CART vs ID3 vs C4.5

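ID3 (Quinlan 1986): categorical features, entropy-based splits.
C4.5 (Quinlan 1993): ID3 extended to continuous features.
CART (Breiman et al. 1984): binary splits, Gini impurity; scikit-learn's trees follow this approach, with Gini as the default criterion.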
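
A CART-style binary split on one numeric feature can be sketched as an exhaustive search over candidate thresholds. The helper `best_split` below is hypothetical and written for illustration; it assumes thresholds are placed at midpoints between consecutive distinct values and scores each split by weighted Gini impurity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best `x <= threshold` split, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate two equal feature values
        threshold = (lo + hi) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(xs, ys))  # (6.5, 0.0)
```

A full tree applies the same search to every feature at every node and recurses on the two resulting subsets until a stopping rule such as max_depth is hit.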

Random Forest (Breiman 2001)

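An ensemble of N trees (bagging plus random feature subsets).
Each tree is trained on a random bootstrap sample of the rows.
Each split considers only a random subset of the features.
Predictions are combined by majority vote (classification) or averaging (regression).
The OOB (Out-of-Bag) error serves as a built-in validation estimate.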
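
The sampling mechanics behind these points can be sketched without any ML library. The helper names below (`bootstrap_split`, `random_feature_subset`, `majority_vote`) are hypothetical and only illustrate the idea; a real forest draws a fresh feature subset at every split and fits a tree on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_split(n_samples, rng):
    """Draw n_samples row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features (done anew at each split in a real forest)."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Combine per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
in_bag, oob = bootstrap_split(1000, rng)
# About 1/e (roughly 37%) of the rows are expected to be left out of each bootstrap sample.
print(len(oob) / 1000)
print(random_feature_subset(16, 4, rng))   # 4 of 16 features, i.e. sqrt(16)
print(majority_vote(["a", "b", "a"]))      # a
```

The OOB estimate scores each row using only the trees whose bootstrap sample did not contain it, which is why no separate validation split is needed for a first check.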

Bagging vs Boosting

Aspect	Bagging (RF)	Boosting (XGBoost)
Tree training	Parallel	Sequential
Goal	Variance ↓	Bias ↓
Sensitive to noise	Less	More
Typical role	Robust baseline	Often the highest accuracy on tabular data
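
The "Variance ↓" entry for bagging can be made precise with a standard identity (it appears in ESL's random forest chapter). Assuming B identically distributed trees T_b, each with variance σ² and pairwise correlation ρ (symbols introduced here for illustration), the variance of their average is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks only the second term. Random feature subsets lower ρ, which shrinks the first.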

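Applications

Tabular data: a standard baseline.
Feature importance: model interpretability.
Variable selection.
Imbalanced data (with class weights).
Mixed feature types (categorical + numeric; scikit-learn still requires categoricals to be encoded as numbers).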

Strengths

No feature scaling required.
Handles mixed feature types.
Robust to outliers.
Fast to train and predict.
Interpretable (single tree).

Weaknesses

A single tree has high variance.
Very deep trees in an RF can still overfit noisy data.
Weak on high-dimensional sparse data (e.g. NLP bag-of-words).
Cannot extrapolate beyond the range of the training targets (regression).
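
The extrapolation limit follows from how a regression tree predicts: each leaf returns the mean of its training targets. The sketch below is a hypothetical one-feature regression tree in pure Python (`fit_tree` and `predict` are illustrative names, not a library API) that splits by minimizing the summed squared error of the two children.

```python
def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(points, max_depth=3, min_leaf=1):
    """Fit a 1-D regression tree on (x, y) pairs that are already sorted by x."""
    ys = [y for _, y in points]
    leaf = {"value": sum(ys) / len(ys)}
    if max_depth == 0 or len(points) < 2 * min_leaf:
        return leaf
    best = None
    for i in range(min_leaf, len(points) - min_leaf + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # identical x values cannot be separated
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None:
        return leaf
    i = best[1]
    return {
        "threshold": (points[i - 1][0] + points[i][0]) / 2,
        "left": fit_tree(points[:i], max_depth - 1, min_leaf),
        "right": fit_tree(points[i:], max_depth - 1, min_leaf),
    }

def predict(tree, x):
    """Walk from the root to a leaf and return that leaf's mean target."""
    while "value" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["value"]

# Training data follows y = 2x on x in [0, 10], so the true value at x = 100 is 200.
train = sorted((float(x), 2.0 * x) for x in range(11))
tree = fit_tree(train, max_depth=3)
print(predict(tree, 10.0), predict(tree, 100.0))  # both land in the same rightmost leaf
```

Any input beyond the training range falls into an edge leaf, so the prediction is capped by the largest target mean seen in training. Averaging many such trees in a forest does not change this.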

💻 Patterns

Decision Tree

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
)
clf.fit(X_train, y_train)

# Visualize the fitted tree (feature_names and class_names are assumed to be lists of strings)
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()

Random Forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_split=10,
    max_features='sqrt',  # consider sqrt(n_features) candidate features at each split
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',  # reweight classes inversely to their frequency (imbalanced data)
)
rf.fit(X_train, y_train)

print(f'OOB: {rf.oob_score_:.3f}')
print(f'Test: {rf.score(X_test, y_test):.3f}')

Feature importance

import pandas as pd

importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False)

print(importances.head(10))

Permutation importance (more robust)

from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)

for i in result.importances_mean.argsort()[::-1][:10]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f'{feature_names[i]:<20} {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}')
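
What permutation importance measures can be sketched in a few lines: shuffle one column, re-score the model, and record how much the score drops. This is a hypothetical pure-Python illustration (`permutation_importance_sketch` and the toy `predict` model are invented here), not scikit-learn's implementation.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy model (hypothetical): it looks at feature 0 only and ignores feature 1.
def predict(row):
    return int(row[0] > 0.5)

rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [predict(row) for row in X]

imp = permutation_importance_sketch(predict, X, y)
print(imp)  # large drop for feature 0, exactly 0.0 for the ignored feature 1
```

Because the score is measured on held-out data rather than derived from split statistics, the result is not inflated for high-cardinality features the way impurity-based importance can be.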

SHAP

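import numpy as np
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Depending on the SHAP version, a classifier yields either a list with one (n_samples, n_features)
# array per class or a single (n_samples, n_features, n_classes) array. Pick one class either way.
class_idx = 0
class_shap = shap_values[class_idx] if isinstance(shap_values, list) else shap_values[:, :, class_idx]
base_value = np.ravel(explainer.expected_value)[class_idx]

# Global view: per-feature impact across the test set
shap.summary_plot(class_shap, X_test, feature_names=feature_names)

# Local view: explanation for the first test row (X_test is assumed to be a DataFrame)
shap.force_plot(base_value, class_shap[0], X_test.iloc[0])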

Hyperparameter tune (Optuna)

import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
    }
    rf = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    return rf.score(X_val, y_val)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

Extra Trees (extreme RF)

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
    n_estimators=500,
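    bootstrap=False,  # the default: each tree sees the full training set
    n_jobs=-1,
)
# Split thresholds are drawn at random instead of optimized, so training is more randomized and usually faster.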

Cost-sensitive (imbalanced)

class_weight = {0: 1, 1: 10}  # weight the minority class (label 1) 10x

rf = RandomForestClassifier(
    n_estimators=300,
    class_weight=class_weight,  # a dict, or the string 'balanced'
    n_jobs=-1,
)
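
For comparison with the hand-set dictionary, the 'balanced' option derives its weights from class frequencies as n_samples / (n_classes * count), per the scikit-learn documentation. The helper below (`balanced_class_weights`, a name chosen here) reproduces that formula in pure Python as an illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights[1])  # 5.0
print(weights[0])  # about 0.556
```

With these weights every class contributes the same total weight (here 50 each), so a 9:1 imbalance translates into a 9x relative weight for the minority class.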

Decision rules extraction

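# Note: _tree is a private scikit-learn module, so this may break between releases.
from sklearn.tree import _tree

def extract_rules(tree, feature_names):
    """Return one (conditions, leaf_value) pair per leaf of a fitted tree."""
    tree_ = tree.tree_
    # Leaf nodes carry TREE_UNDEFINED as their feature index.
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else 'undefined'
        for i in tree_.feature
    ]

    def recurse(node, path):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: the left child takes x <= threshold, the right child x > threshold.
            name = feature_name[node]
            threshold = tree_.threshold[node]
            yield from recurse(tree_.children_left[node], path + [f'{name} <= {threshold:.2f}'])
            yield from recurse(tree_.children_right[node], path + [f'{name} > {threshold:.2f}'])
        else:
            # Leaf: emit the accumulated conditions together with the node's stored value.
            yield (path, tree_.value[node])

    return list(recurse(0, []))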
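
When a plain-text dump of the rules is enough, scikit-learn's public `export_text` helper avoids the private `_tree` module. A short sketch follows; the iris dataset and `max_depth=2` are arbitrary choices made here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# One indented line per node; leaves are printed with their predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The custom extractor is still useful when the rules are needed as data (for example, to filter by leaf value) rather than as a printed report.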

Decision criteria

Situation	Algorithm
Quick baseline	Random Forest
Need interpretability	Single decision tree
Best accuracy (tabular)	XGBoost / LightGBM
Mixed types	RF
Imbalanced	RF + class_weight
Cross-functional explanation	RF + SHAP
Real-time inference	Decision tree (cheap)

Default: Random Forest as the baseline, XGBoost as the upgrade.

🔗 Graph

Parent: Ensemble-Methods
Variants: CART · Random-Forest · Boosting-Algorithms-XGBoost-LightGBM
Applications: Feature-Importance · SHAP
Adjacent: Bias vs Variance Trade-off · Causal-Inference (Causal Forest) · Cross-Entropy Loss

🤖 LLM usage

When to use: tabular ML, baselines, interpretable models. When not to use: image / NLP / sequence data (use neural networks); when maximum accuracy is the priority (use boosting).

❌ Anti-patterns

Relying on default hyperparameters: tune them for the task.
No regularization (deep trees on small data): overfits.
Expecting a single tree to scale to hard problems: use an ensemble.
Relying on a single source of feature importance: cross-check with SHAP and permutation importance.
High-dimensional sparse data: the wrong tool.

🧪 Verification / Duplicates

Verified (Breiman, Random Forests, 2001; scikit-learn docs; ESL).
Confidence: A.
Related: Boosting-Algorithms-XGBoost-LightGBM · Bias vs Variance Trade-off · Causal-Inference · Cross-Entropy Loss.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: CART + RF + sklearn / Optuna / SHAP / rules code
