| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-decision-trees-and-random-forest | Decision Trees and Random Forests | 10_Wiki/Topics | verified | self | | none | A | 0.93 | applied | | | 2026-05-10 | pending | |
Decision Trees & Random Forests
In one line
"An if-else tree plus an ensemble." Decision trees are interpretable, and Random Forests are a strong baseline. CART stands for Classification And Regression Tree. A Random Forest is bagging over N trees. Compared with boosting (XGBoost, LightGBM): bagging mainly reduces variance, while boosting mainly reduces bias.
Core concepts
Decision Tree
- Binary splits from the root down to the leaves.
- Split criteria: Gini, entropy, MSE.
- Hyperparameters: max_depth, min_samples_split, min_samples_leaf.
Split criterion
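- Gini: the probability of misclassifying a randomly drawn sample when it is labeled at random according to the node's class distribution.
- Entropy: the basis of information gain (the drop in entropy from the parent node to its children).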
- MSE / variance reduction (regression).
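The three criteria above can be made concrete with a short sketch. The function names (`gini`, `entropy`, `variance`, `split_gain`) are illustrative and not part of scikit-learn; the code assumes only the textbook definitions of each measure.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def variance(values):
    """Mean squared deviation from the mean (the MSE of predicting the mean)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(parent, left, right, impurity):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                                    # 0.5
print(entropy(parent))                                 # 1.0
print(split_gain(parent, [0, 0, 0], [1, 1, 1], gini))  # 0.5
```

A perfectly mixed two-class node has the maximum Gini impurity (0.5) and entropy (1 bit); a split that separates the classes completely recovers all of it as gain.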
CART vs ID3 vs C4.5
- ID3 (Quinlan 1986): categorical features, entropy-based splits.
- C4.5 (Quinlan 1993): ID3 extended to continuous features.
- CART (Breiman et al. 1984): binary splits, Gini; the approach scikit-learn's trees are based on.
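As a rough illustration of what "binary split, Gini" means in CART, the sketch below searches one numeric feature for the threshold that minimizes the weighted Gini impurity of the two children. It is a simplified, unoptimized version written for clarity; `best_binary_split` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(xs, ys):
    """Return (threshold, weighted_gini) of the best `x <= threshold` split."""
    best = (None, float('inf'))
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = ['a', 'a', 'a', 'b', 'b', 'b']
print(best_binary_split(xs, ys))  # (6.5, 0.0)
```

A full tree repeats this search over every feature at every node and recurses into the two children until a stopping rule such as max_depth or min_samples_leaf applies.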
Random Forest (Breiman 2001)
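- N trees (bagging plus random feature subsets).
- Each tree is trained on a random bootstrap sample of the rows.
- Each split considers a random subset of the features.
- Predictions are combined by majority vote (classification) or averaging (regression).
- The OOB (Out-of-Bag) error provides built-in validation.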
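To see why OOB error works as a validation estimate, it helps to check how many rows a bootstrap sample leaves out. The sketch below is a small simulation using only the standard library; `oob_fraction` is an illustrative name. Each row is missed with probability (1 - 1/n)^n, which approaches 1/e (roughly 37%) as n grows, so every tree has a sizeable set of rows it never saw.

```python
import random

def oob_fraction(n_samples, seed=0):
    """Draw one bootstrap sample and return the share of rows left out of it."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples

n = 10_000
simulated = oob_fraction(n)
expected = (1 - 1 / n) ** n  # probability that a given row is never drawn
print(f'simulated OOB share: {simulated:.3f}, expected: {expected:.3f}')
```

The OOB score is then obtained by predicting each row using only the trees whose bootstrap sample did not contain it.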
Bagging vs Boosting
| Aspect | Bagging (RF) | Boosting (XGBoost) |
|---|---|---|
| Tree training | Parallel | Sequential |
| Goal | Variance ↓ | Bias ↓ |
| Sensitive to noise | Less | More |
| Default winner | Robust baseline | SOTA accuracy |
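The "Variance ↓" entry for bagging can be illustrated with a standard result (it appears in ESL): the average of N identically distributed predictions with variance σ² and pairwise correlation ρ has variance ρσ² + (1 − ρ)σ²/N. The sketch below just evaluates that formula; `ensemble_variance` is an illustrative name and the values of σ² and ρ are arbitrary.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees identically distributed predictions
    with per-tree variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees

for n in (1, 10, 100, 1000):
    print(n, round(ensemble_variance(1.0, 0.3, n), 4))
```

Adding trees shrinks only the second term, so the variance levels off at ρσ². This is one way to read why Random Forest samples features at each split: lowering the correlation between trees lowers that floor.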
Applications
- Tabular data: the default baseline.
- Feature importance: model interpretability.
- Variable selection.
- Imbalanced (with class weight).
- Mixed type (categorical + numeric).
Strengths
- No feature scaling required.
- Handles mixed feature types.
- Robust to outliers.
- Fast to train.
- Interpretable (single tree).
Weaknesses
- A single tree has high variance.
- Deep trees in an RF can still overfit.
- Weak on high-dimensional sparse data (e.g., NLP features).
- Cannot extrapolate beyond the range of the training targets (regression).
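The extrapolation weakness is easy to reproduce. The sketch below fits a `RandomForestRegressor` on toy data (y = 2x for x in [0, 10), an assumption made purely for illustration) and then predicts far outside that range. Because every leaf predicts an average of training targets, the forest cannot return a value above the largest target it has seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for illustration): y = 2x on x in [0, 10).
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

inside = rf.predict([[5.0]])[0]     # within the training range
outside = rf.predict([[100.0]])[0]  # far outside the training range
print(f'x=5   -> {inside:.2f}')
print(f'x=100 -> {outside:.2f} (true value 200, training max {y_train.max():.1f})')
```

When the target is expected to trend beyond the training range, a linear model or a hybrid (trend model plus a forest on the residuals) is a common workaround.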
💻 Patterns
Decision Tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Assumes X_train, y_train, feature_names, and class_names are already defined.
clf = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
)
clf.fit(X_train, y_train)
# Visualize the fitted tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_split=10,
    max_features='sqrt',      # consider sqrt(n_features) features at each split
    bootstrap=True,
    oob_score=True,           # requires bootstrap=True
    n_jobs=-1,
    random_state=42,
    class_weight='balanced',  # for imbalanced classes
)
rf.fit(X_train, y_train)
print(f'OOB: {rf.oob_score_:.3f}')
print(f'Test: {rf.score(X_test, y_test):.3f}')
Feature importance
import pandas as pd
# Impurity-based importances from the fitted forest
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': rf.feature_importances_,
}).sort_values('importance', ascending=False)
print(importances.head(10))
Permutation importance (more robust)
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
# Among the 10 highest-ranked features, print those whose mean importance
# is more than two standard deviations above zero.
for i in result.importances_mean.argsort()[::-1][:10]:
    if result.importances_mean[i] - 2 * result.importances_std[i] > 0:
        print(f'{feature_names[i]:<20} {result.importances_mean[i]:.3f} ± {result.importances_std[i]:.3f}')
SHAP
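import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# For a classifier, older shap releases return a list with one array per class;
# newer ones return one array of shape (n_samples, n_features, n_classes).
class_idx = 0
sv = shap_values[class_idx] if isinstance(shap_values, list) else shap_values[..., class_idx]
# Global view
shap.summary_plot(sv, X_test, feature_names=feature_names)
# Local view for the first test row (assumes X_test is a pandas DataFrame)
shap.force_plot(explainer.expected_value[class_idx], sv[0], X_test.iloc[0])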
Hyperparameter tune (Optuna)
import optuna
from sklearn.ensemble import RandomForestClassifier
def objective(trial):
    # Trial parameter names match the estimator's arguments, so
    # study.best_params can be passed straight back to RandomForestClassifier.
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 50),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 30),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
    }
    rf = RandomForestClassifier(**params, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    return rf.score(X_val, y_val)  # accuracy on a held-out validation split
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
Extra Trees (extreme RF)
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(
    n_estimators=500,
    bootstrap=False,  # the default: each tree sees the full training set
    n_jobs=-1,
)
# More randomized than RF (split thresholds are drawn at random) and usually faster to train.
Cost-sensitive (imbalanced)
from sklearn.ensemble import RandomForestClassifier
class_weight = {0: 1, 1: 10}  # weight the minority class (1) 10x
rf = RandomForestClassifier(
    n_estimators=300,
    class_weight=class_weight,  # a dict, or 'balanced'
    n_jobs=-1,
)
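For reference, the `'balanced'` option is documented by scikit-learn as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python to show what weights it implies for a 90/10 split; `balanced_class_weights` is an illustrative helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights proportional to inverse class frequency: n / (K * n_k)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # class 1 is weighted 9x more heavily than class 0
```

A hand-written dict such as `{0: 1, 1: 10}` is worth trying when the cost of a missed minority case is known to differ from what the class frequencies alone suggest.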
Decision rules extraction
from sklearn.tree import _tree  # private module; may change between scikit-learn versions
def extract_rules(tree, feature_names):
    """Return a list of (path_conditions, leaf_value) pairs for a fitted tree."""
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else 'undefined'
        for i in tree_.feature
    ]
    def recurse(node, path):
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # Internal node: branch on feature <= threshold
            name = feature_name[node]
            threshold = tree_.threshold[node]
            yield from recurse(tree_.children_left[node], path + [f'{name} <= {threshold:.2f}'])
            yield from recurse(tree_.children_right[node], path + [f'{name} > {threshold:.2f}'])
        else:
            # Leaf: emit the accumulated conditions and the leaf's value array
            yield (path, tree_.value[node])
    return list(recurse(0, []))
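When a printable rule listing is enough, scikit-learn's public `export_text` helper avoids reaching into the private `_tree` module. The sketch below uses the bundled iris dataset and a shallow tree purely as a stand-in for a real model.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data and model (assumptions for illustration only).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One indented line per node; leaves show the predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

The hand-written extractor above remains useful when the rules are needed as structured data (for example, to filter paths by leaf value) instead of as text.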
Decision criteria
| Situation | Algorithm |
|---|---|
| Quick baseline | Random Forest |
| Need interpretability | Single decision tree |
| Best accuracy (tabular) | XGBoost / LightGBM |
| Mixed types | RF |
| Imbalanced | RF + class_weight |
| Cross-functional explanation | RF + SHAP |
| Real-time inference | Decision tree (cheap) |
Default: Random Forest as the baseline, XGBoost as the upgrade.
🔗 Graph
- Parent: Ensemble-Methods
- Variants: CART · Random-Forest · Boosting-Algorithms-XGBoost-LightGBM
- Applications: Feature-Importance · SHAP
- Adjacent: Bias vs Variance Trade-off · Causal-Inference (Causal Forest) · Cross-Entropy Loss
🤖 LLM usage
When to use: tabular ML, baselines, interpretable models. When not to use: image / NLP / sequence data (use a neural network); cases where accuracy is the strict priority (use boosting).
❌ Anti-patterns
- Default hyperparameters: task-specific tuning is needed.
- No regularization (deep trees + small data): overfits.
- Expecting a single tree to scale: an ensemble is needed.
- Relying on a single source of feature importance: cross-check with SHAP / permutation importance.
- High-dimensional sparse data: wrong tool.
🧪 Verification / Duplicates
- Verified (Breiman Random Forest 2001, scikit-learn docs, ESL).
- Trust level: A.
- Related: Boosting-Algorithms-XGBoost-LightGBM · Bias vs Variance Trade-off · Causal-Inference · Cross-Entropy Loss.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: CART + RF + sklearn / Optuna / SHAP / rules code |