[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,88 +2,260 @@
 id: wiki-2026-0508-ensemble-methods
 title: Ensemble Methods
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [ENSEMBLE-001]
+aliases: [ensemble, bagging, boosting, stacking, random forest, XGBoost, LightGBM, CatBoost]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [machine-learning, ensemble, bagging, boosting, stacking]
+confidence_score: 0.98
+verification_status: applied
+tags: [machine-learning, ensemble, bagging, boosting, stacking, gbm, xgboost]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: Python
+  framework: scikit-learn / XGBoost / LightGBM / CatBoost
 ---

-# Ensemble Methods (앙상블 기법)
+# Ensemble Methods

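-## 📌 One-Line Insight (The Karpathy Summary)
-> "The consensus of many ordinary models is more accurate than a single genius": a technique that combines the predictions of several different machine learning models to obtain stronger and more stable predictive performance than any single model.
+## One-Line Summary
+> **"Combining several models outperforms any single one."** Bagging reduces variance, boosting reduces bias, and stacking learns a meta-model over base-model outputs. In current practice, GBDT libraries (XGBoost, LightGBM, CatBoost) dominate tabular data, and deep ensembles are used for uncertainty estimation.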
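The variance claim in the summary can be checked with a small simulation. This is a minimal sketch under a strong assumption: each "model" is an unbiased estimate with independent Gaussian noise (`noisy_model` and its parameters are hypothetical). Real bagged models are correlated, so the reduction is smaller in practice.

```python
import random
import statistics

def noisy_model(truth, noise_sd, rng):
    """One hypothetical 'model': an unbiased but noisy estimate of the target."""
    return truth + rng.gauss(0.0, noise_sd)

def ensemble_estimate(truth, noise_sd, n_models, rng):
    """Average the outputs of n_models independent noisy models."""
    return statistics.fmean(noisy_model(truth, noise_sd, rng) for _ in range(n_models))

rng = random.Random(0)
truth, noise_sd, trials = 1.0, 1.0, 20000

# Repeat the experiment many times to estimate the variance of each estimator
single = [ensemble_estimate(truth, noise_sd, 1, rng) for _ in range(trials)]
bagged = [ensemble_estimate(truth, noise_sd, 10, rng) for _ in range(trials)]

var_single = statistics.pvariance(single)
var_bagged = statistics.pvariance(bagged)
print(f"single-model variance:     {var_single:.3f}")
print(f"10-model average variance: {var_bagged:.3f}")
```

Under the independence assumption, averaging 10 models divides the variance by roughly 10 while leaving the mean (and therefore the bias) unchanged.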

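-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** An error-correction pattern that offsets the bias and variance errors of individual models through collective-intelligence algorithms such as voting or weighted summation.
- **Main strategies:**
-    - **Bagging (Bootstrap Aggregating):** Randomly splits the data and trains several models in parallel (e.g., Random Forest). Effective at reducing variance.
-    - **Boosting:** Trains each next model sequentially so that it compensates for the errors of the previous one (e.g., XGBoost, LightGBM). Effective at reducing bias.
-    - **Stacking:** Trains a meta-model that takes the outputs of several models as its input and produces the final result.
- **Significance:** An essential strategy for maximizing performance in data competitions such as Kaggle and in real industry settings.
+## Key Points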

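-## ⚠️ Contradictions & Updates
- **Conflict with past data:** The approach evolved from picking only the single best model to the view that the "diversity" of several models determines the robustness of the whole system.
- **Policy change:** To raise the accuracy of its document-classification agent, the Antigravity project ensembles the results of several classifiers, each using a different embedding model, to decide the final category.
+### Types
+- **Bagging**: trains models in parallel on bootstrap resamples and averages them (e.g., Random Forest).
+- **Boosting**: trains models sequentially, each one correcting the errors of its predecessors (e.g., XGBoost).
+- **Stacking**: a meta-learner fuses the outputs of the base models.
+- **Voting / averaging**: combines predictions with a simple (optionally weighted) vote or mean.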
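The voting / averaging entry above can be made concrete without any library. The sketch below is illustrative only: `hard_vote`, `soft_vote`, and the probability values are hypothetical, chosen so that the two rules disagree.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over predicted class labels (ties go to the first-seen label)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Weighted average of per-class probability vectors; returns (class index, averaged probs)."""
    weights = weights or [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg

# Three hypothetical binary classifiers scoring one sample: [P(class 0), P(class 1)]
probas = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
labels = [max(range(2), key=p.__getitem__) for p in probas]  # each model's own argmax

hard = hard_vote(labels)
soft, avg = soft_vote(probas)
print(hard, soft, [round(a, 3) for a in avg])
```

Two weakly confident models outvote one confident model under hard voting, while soft voting lets the confident model's probability mass decide. That difference is one reason to prefer soft voting when the base models are reasonably calibrated.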

-## 🔗 Knowledge Links (Graph)
- Decision-Trees-and-Random-Forests, Machine-Learning, [[Supervised-Learning-Foundations|Supervised-Learning-Foundations]], Cross-Validation
- **Raw Source:** 10_Wiki/Topics/AI/Ensemble-Methods.md
+### Well-Known Methods
+- **Random Forest** (Breiman 2001).
+- **AdaBoost** (Freund & Schapire 1997).
+- **Gradient Boosting** (Friedman 2001; technical report 1999).
+- **XGBoost** (Chen & Guestrin 2016).
+- **LightGBM** (Ke et al. 2017).
+- **CatBoost** (Prokhorenkova et al. 2018, Yandex).

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Modern Context
+- **Tabular**: GBDT remains state of the art on many tabular problems, often ahead of deep learning.
+- **Kaggle**: winning solutions are typically ensembles.
+- **Deep ensemble**: used for uncertainty estimation (Lakshminarayanan et al. 2017).
+- **Mixture of Experts (MoE)**: a sparse ensemble, used in LLMs.

-**When to use this knowledge:**
- *(TODO)*
+### Applications
+1. **Tabular classification / regression**.
+2. **Anomaly detection**: Isolation Forest.
+3. **Click prediction**: GBDT.
+4. **Risk scoring**: finance.
+5. **Search ranking**: LambdaMART.
+6. **Survival analysis**: random survival forest.

-**When not to use it:**
- *(TODO)*
+## 💻 Patterns

-## 🧪 Validation Status
+### Random Forest
+```python
+from sklearn.ensemble import RandomForestClassifier
+rf = RandomForestClassifier(
+    n_estimators=500,
+    max_depth=20,
+    min_samples_leaf=5,
+    n_jobs=-1,
+    class_weight='balanced',
+).fit(X_train, y_train)

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body needs verification.)*
-
-## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (auto-normalization)
- **Handling reason:** Phase 1 normalization: old template updated, missing fields filled in.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+print(rf.feature_importances_)
 ```

-## 🤔 Decision Criteria
+### XGBoost
+```python
+import xgboost as xgb
+dtrain = xgb.DMatrix(X_train, label=y_train)
+dval = xgb.DMatrix(X_val, label=y_val)

-**When to use option A:**
- *(TODO)*
+params = {
+    'objective': 'binary:logistic',
+    'tree_method': 'hist',
+    'device': 'cuda',  # assumes a CUDA GPU and XGBoost >= 2.0; use 'cpu' otherwise
+    'eta': 0.05,
+    'max_depth': 6,
+    'subsample': 0.8,
+    'colsample_bytree': 0.8,
+    'eval_metric': 'auc',
+}
+model = xgb.train(params, dtrain, num_boost_round=2000,
+                   evals=[(dval, 'val')], early_stopping_rounds=50)
+```

-**When to use option B:**
- *(TODO)*
+### LightGBM (faster)
+```python
+import lightgbm as lgb
+train_data = lgb.Dataset(X_train, label=y_train)
+val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

-**Default:**
-> *(TODO)*
+params = {
+    'objective': 'binary',
+    'metric': 'auc',
+    'learning_rate': 0.05,
+    'num_leaves': 63,
+    'feature_fraction': 0.8,
+    'bagging_fraction': 0.8,
+    'bagging_freq': 5,
+}
+model = lgb.train(params, train_data, num_boost_round=2000,
+                   valid_sets=[val_data], callbacks=[lgb.early_stopping(50)])
+```

-## ❌ Anti-Patterns
+### CatBoost (categorical native)
+```python
+from catboost import CatBoostClassifier
+model = CatBoostClassifier(
+    iterations=2000,
+    learning_rate=0.05,
+    depth=6,
+    cat_features=['city', 'device', 'segment'],  # column names; assumes X_train is a pandas DataFrame
+    eval_metric='AUC',
+    early_stopping_rounds=50,
+).fit(X_train, y_train, eval_set=(X_val, y_val))
+```
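All three GBDT snippets above rely on early stopping with a patience of 50 rounds. The stopping rule itself is simple, and the following is a library-independent sketch of it. The function `best_iteration` and the AUC values are hypothetical and are not taken from any of the three libraries.

```python
def best_iteration(val_scores, patience, higher_is_better=True):
    """Return (best_round, best_score), stopping after `patience` rounds without improvement."""
    best_round, best_score = 0, val_scores[0]
    for i, score in enumerate(val_scores[1:], start=1):
        improved = score > best_score if higher_is_better else score < best_score
        if improved:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_score

# Hypothetical validation AUC per boosting round: rises, then degrades as the model overfits
auc_per_round = [0.70, 0.74, 0.77, 0.79, 0.80, 0.795, 0.79, 0.78, 0.81]
stop_round, stop_score = best_iteration(auc_per_round, patience=3)
print(stop_round, stop_score)
```

The late score of 0.81 is never reached because training halts at round 7. This is the trade-off behind the patience setting: a small value saves compute but can stop before a late improvement, which is why a fairly generous value is paired with a low learning rate.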

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+### Stacking (sklearn)
+```python
+from sklearn.ensemble import RandomForestClassifier, StackingClassifier
+from sklearn.linear_model import LogisticRegression
+import xgboost as xgb
+import lightgbm as lgb
+
+stack = StackingClassifier(
+    estimators=[
+        ('rf', RandomForestClassifier(n_estimators=200)),
+        ('xgb', xgb.XGBClassifier(n_estimators=200)),
+        ('lgb', lgb.LGBMClassifier(n_estimators=200)),
+    ],
+    final_estimator=LogisticRegression(),
+    cv=5,  # the meta-learner is fit on 5-fold cross-validated base predictions
+)
+stack.fit(X_train, y_train)
+```
+
+### Voting (simple)
+```python
+from sklearn.ensemble import VotingClassifier
+# rf, xgb_clf, lgb_clf: estimator instances assumed to be defined earlier (not shown)
+vote = VotingClassifier(
+    estimators=[('rf', rf), ('xgb', xgb_clf), ('lgb', lgb_clf)],
+    voting='soft',  # average the predicted probabilities
+    weights=[1, 2, 1],
+).fit(X_train, y_train)
+```
+
+### Deep ensemble (uncertainty)
+```python
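+import torch
+
+@torch.no_grad()
+def deep_ensemble_predict(models, x):
+    # Stack per-model outputs: shape (n_models, ...)
+    preds = torch.stack([m(x) for m in models])
+    mean = preds.mean(0)
+    epistemic = preds.var(0)  # disagreement across members, a proxy for epistemic uncertainty
+    return mean, epistemic
+
+# Five models trained with different random seeds; train_one is a user-supplied helper (not shown)
+models = [train_one(seed=s) for s in range(5)]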
+```
+
+### Snapshot ensemble (cheap)
+```python
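+import copy
+import torch
+from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
+
+def snapshot_ensemble(model, X_train, y_train, n_snapshots=5, cycle_len=10):
+    # Optimizer settings are assumptions for illustration; tune them for the task
+    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
+    # Cosine annealing with warm restarts (SGDR): the learning rate restarts every cycle_len epochs
+    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=cycle_len)
+    snapshots = []
+    for snap in range(n_snapshots):
+        for epoch in range(cycle_len):
+            # train_one_epoch is a user-supplied helper (not shown)
+            train_one_epoch(model, X_train, y_train, optimizer)
+            scheduler.step()
+        snapshots.append(copy.deepcopy(model))  # save at the end of each cycle
+    return snapshots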
+```
+
+### Isolation Forest (anomaly)
+```python
+from sklearn.ensemble import IsolationForest
+iso = IsolationForest(contamination=0.01, n_estimators=200).fit(X)
+anomalies = iso.predict(X)  # -1 = anomaly, 1 = inlier
+```
+
+### Bagging custom
+```python
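+import numpy as np
+
+def bagging(X, y, base_model_cls, n_estimators=10):
+    # X and y are assumed to be NumPy arrays (integer-array indexing is used below)
+    models = []
+    for _ in range(n_estimators):
+        # Bootstrap: sample len(X) rows with replacement
+        idx = np.random.choice(len(X), size=len(X), replace=True)
+        m = base_model_cls().fit(X[idx], y[idx])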
+        models.append(m)
+    return models
+
+def bagging_predict(models, X):
+    return np.mean([m.predict_proba(X) for m in models], axis=0)
+```
+
+### Out-of-Bag (OOB) validation
+```python
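+from sklearn.ensemble import RandomForestClassifier
+rf = RandomForestClassifier(n_estimators=500, oob_score=True).fit(X, y)
+print(rf.oob_score_)  # accuracy on out-of-bag rows: a hold-out estimate at no extra cost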
+```
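Out-of-bag validation works because each bootstrap sample leaves a sizeable share of rows unused. The simulation below is a small sketch of that fact, with `oob_fraction` and the sample sizes chosen purely for illustration. The expected share is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows.

```python
import random

def oob_fraction(n_samples, rng):
    """Draw one bootstrap sample and return the fraction of rows left out of it."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples

rng = random.Random(0)
n_samples, n_trees = 1000, 200
fractions = [oob_fraction(n_samples, rng) for _ in range(n_trees)]
mean_oob = sum(fractions) / n_trees
expected = (1 - 1 / n_samples) ** n_samples  # probability that a given row is never drawn
print(f"simulated OOB fraction: {mean_oob:.3f}, expected: {expected:.3f}")
```

Each tree therefore has roughly a third of the training rows available as unseen data, which is what `oob_score=True` aggregates.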
+
+### MoE (sparse ensemble, LLM)
+```python
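+import torch
+import torch.nn as nn
+
+class MoE(nn.Module):
+    def __init__(self, n_experts=8, top_k=2, dim=512):
+        super().__init__()
+        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
+        self.gate = nn.Linear(dim, n_experts)
+        self.top_k = top_k
+
+    def forward(self, x):
+        logits = self.gate(x)                               # (..., n_experts)
+        top_k_v, top_k_i = logits.topk(self.top_k, dim=-1)  # (..., top_k)
+        weights = top_k_v.softmax(-1)                       # normalise over the selected experts only
+        out = torch.zeros_like(x)
+        for k in range(self.top_k):
+            for e, expert in enumerate(self.experts):
+                # A ModuleList cannot be indexed by a tensor, so route tokens with a mask
+                mask = top_k_i[..., k] == e
+                if mask.any():
+                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
+        return out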
+```
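The routing step in the MoE block can be isolated from PyTorch. This is a minimal sketch for a single token, where `top_k_gate` and the logit values are hypothetical. It keeps the k largest router logits and applies a softmax over those k only, so the unselected experts get exactly zero weight.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest logits, softmax-normalised over those k."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Hypothetical router logits for one token over 4 experts
gate = top_k_gate([0.1, 2.0, -1.0, 1.0], k=2)
print({i: round(w, 4) for i, w in gate.items()})
```

Only the selected experts run for that token, which is what makes the ensemble sparse: compute scales with `k`, not with the total number of experts.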
+
+## Decision Criteria
+| Situation | Method |
+|---|---|
+| Tabular (small) | Random Forest |
+| Tabular (large) | XGBoost / LightGBM |
+| Tabular + categorical | CatBoost |
+| Anomaly | Isolation Forest |
+| Production deep | Snapshot ensemble |
+| Uncertainty | Deep ensemble |
+| LLM scale | MoE |
+| Kaggle competition | Stacking + diverse base |
+
+**Default**: for tabular data, LightGBM or XGBoost with early stopping; with categorical features, CatBoost; for deep models, a snapshot ensemble or a 5-model ensemble; for uncertainty, a deep ensemble.
+
+## 🔗 Graph
+- Parent: [[Machine-Learning]]
+- Variants: [[Random-Forest]] · [[XGBoost]] · [[LightGBM]] · [[Stacking]]
+- Applications: [[Anomaly-Detection]] · [[Risk-Scoring]] · [[Click-Prediction]]
+- Adjacent: [[Mixture-of-Experts]] · [[Deep-Ensemble]] · [[Bagging]] · [[Boosting]]
+
+## 🤖 LLM Usage
+**When to use**: tabular ML, click or risk models, and cases that need uncertainty estimates.
+**When not to use**: vision and language tasks, where deep learning wins.
+
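+## ❌ Anti-Patterns
+- **Training without early stopping**: the GBDT overfits.
+- **Stacking without CV**: the meta-learner trains on leaked in-sample predictions.
+- **Blend weights without validation**: the weights are effectively random.
+- **Same model 5x**: no diversity, so little gain.
+- **Massive ensemble at inference**: latency cost.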
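For the blending anti-pattern above, the usual remedy is to pick the weight on held-out data. The sketch below grid-searches one blend weight by validation log loss. The helper names, labels, and probabilities are hypothetical, and a weight tuned this way should still be confirmed on a separate test split.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def pick_blend_weight(y_val, p_a, p_b, steps=20):
    """Grid-search w in [0, 1] for the blend w*p_a + (1-w)*p_b on held-out data."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(p_a, p_b)]
        loss = log_loss(y_val, blend)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Hypothetical validation labels and two models' predicted probabilities
y_val = [1, 0, 1, 1, 0, 0]
p_a = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]
p_b = [0.6, 0.1, 0.9, 0.5, 0.2, 0.3]
w, loss = pick_blend_weight(y_val, p_a, p_b)
print(f"best weight for model A: {w:.2f}, validation log loss: {loss:.4f}")
```

Because the grid includes w = 0 and w = 1, the selected blend is never worse on the validation set than either model alone.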
+
+## 🧪 Verification / Duplicates
+- Verified (Breiman 2001, Chen 2016, Lakshminarayanan 2017).
+- Trust level A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-04-26 | ENSEMBLE auto |
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: bagging/boosting/stacking plus RF / XGB / LGB / CB / MoE code |