2nd/10_Wiki/Topics/AI_and_ML/Causal-Inference.md
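koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 occurrences)
Replaced [[wikilinks]] that differed from their target only in spelling with the target document's canonical title,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs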

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00

---
id: wiki-2026-0508-causal-inference
title: Causal Inference
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [인과 추론, causal inference, do-calculus, Pearl, DAG, counterfactual, SCM, RCT, propensity score]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [statistics, causal-inference, dag, do-calculus, pearl, ab-testing, dowhy, econml, observational-study]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / R
  framework: DoWhy / EconML / CausalML / pgmpy
---
# Causal Inference
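## 📌 One-Line Insight
> **"Correlation is not causation."** Judea Pearl's ladder of causation frames the problem: observational data alone has limits, and RCTs, DAGs, and counterfactual reasoning are the ways around them. Causal reasoning is a foundational capability for modern AI and one of the weakest areas of LLMs. It is critical for policy, medical, and business decisions.
## 📖 Core Ideas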
### Pearl's Ladder of Causation
1. **Association** (P(y|x)): correlation; this is what standard ML learns.
2. **Intervention** (P(y|do(x))): "What if I change x?"
3. **Counterfactual** (P(y_x|x', y')): "What would have happened if x had been different?"
→ LLMs are mostly stuck on rung 1.
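
The three rungs can be contrasted on simulated data. The following is a minimal sketch, assuming a hypothetical structural model (Z → X, Z → Y, X → Y) with made-up coefficients; the function `outcome` and all numbers are illustrative, not taken from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def outcome(x_val, z_val, noise):
    # Hypothetical structural equation: the true effect of X on Y is 1.0
    return 1.0 * x_val + 2.0 * z_val + noise

z = rng.binomial(1, 0.5, n)                      # confounder
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))  # treatment depends on Z
noise = rng.normal(0, 1, n)
y = outcome(x, z, noise)

# Rung 1 (association): E[Y | X=1] - E[Y | X=0], inflated by Z
assoc = y[x == 1].mean() - y[x == 0].mean()

# Rung 2 (intervention): set X for everyone, leaving Z and the noise untouched
interv = outcome(1, z, noise).mean() - outcome(0, z, noise).mean()

# Rung 3 (counterfactual): same units, same noise, opposite treatment
y_cf = outcome(1 - x, z, noise)

print(f"association:  {assoc:.2f}")
print(f"intervention: {interv:.2f}")
```

The associational contrast is larger than the interventional one because treated units are disproportionately those with Z = 1. Rung 3 is only computable here because the simulation exposes each unit's noise term, which real data never does.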
### Key concepts
#### Confounder
- A variable Z that causes both X and Y, creating a spurious X-Y association.
- Example: ice cream sales ↔ drowning (Z = summer).
#### Mediator
- X → M → Y: M carries the effect of X to Y.
#### Collider
- X → Z ← Y.
- Conditioning on Z induces a spurious correlation between X and Y!
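
Collider bias is easy to reproduce numerically. A minimal sketch, assuming two independent standard-normal causes and a hypothetical selection rule `z > 1.0` standing in for "conditioning on Z":

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)                      # X and Y are generated independently
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)   # collider: X -> Z <- Y

corr_all = np.corrcoef(x, y)[0, 1]

# Conditioning on the collider, here by keeping only rows with large Z
selected = z > 1.0
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"corr(X, Y) overall:     {corr_all:+.3f}")
print(f"corr(X, Y) given Z > 1: {corr_selected:+.3f}")
```

Within the selected rows, a low X has to be compensated by a high Y for Z to clear the threshold, which produces a negative correlation that does not exist in the full population.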
#### Backdoor path
- X ← Z → Y.
- Controlling for Z closes the path.
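
Closing a backdoor path by stratification corresponds to the adjustment formula P(y | do(x)) = Σ_z P(y | x, z) P(z). A minimal sketch on simulated data, reusing the ice cream / drowning example; the helper `backdoor_ate` and all probabilities are assumptions made for illustration:

```python
import numpy as np

def backdoor_ate(x, y, z):
    """ATE of binary x on y, adjusting for a discrete confounder z by stratification."""
    x, y, z = map(np.asarray, (x, y, z))
    ate = 0.0
    for level in np.unique(z):
        stratum = z == level
        diff = y[stratum & (x == 1)].mean() - y[stratum & (x == 0)].mean()
        ate += diff * stratum.mean()          # weight by P(Z = level)
    return ate

rng = np.random.default_rng(2)
n = 200_000
summer = rng.binomial(1, 0.5, n)                               # Z
ice_cream = rng.binomial(1, np.where(summer == 1, 0.9, 0.2))   # X depends on Z
drownings = 3.0 * summer + rng.normal(size=n)                  # Y depends on Z only

naive = drownings[ice_cream == 1].mean() - drownings[ice_cream == 0].mean()
adjusted = backdoor_ate(ice_cream, drownings, summer)

print(f"naive difference:    {naive:.2f}")
print(f"backdoor-adjusted:   {adjusted:.2f}")
```

The adjusted estimate is close to zero because, within each season, ice cream sales carry no information about drownings in this simulation.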
#### Frontdoor
- X → M → Y, used when the confounder between X and Y is unobserved.
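
For reference, the front-door adjustment can be written as a formula. It is stated here under the usual assumptions: M intercepts every directed path from X to Y, there is no open backdoor path from X to M, and every backdoor path from M to Y is blocked by X.

$$
P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
$$

The outer sum is the effect of X on M, which is unconfounded; the inner sum is the effect of M on Y, with X used as the adjustment set.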
### Methods
#### RCT (gold standard)
- Randomization breaks the link between confounders and treatment assignment.
- Limited by ethics and cost.
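
Why randomization helps can be seen in a small simulation. This is a sketch under stated assumptions: an unobserved confounder `u`, a made-up `true_effect` of 0.5, and a hypothetical self-selection rule for the observational arm.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
true_effect = 0.5
u = rng.normal(size=n)                       # unobserved confounder

def simulate_outcome(t):
    # Hypothetical outcome model: treatment effect plus confounder plus noise
    return true_effect * t + 1.5 * u + rng.normal(size=n)

# Observational data: units with high u tend to select into treatment
t_obs = (u + rng.normal(size=n) > 0).astype(int)
y_obs = simulate_outcome(t_obs)
naive = y_obs[t_obs == 1].mean() - y_obs[t_obs == 0].mean()

# RCT: a coin flip assigns treatment, so t is independent of u
t_rct = rng.integers(0, 2, size=n)
y_rct = simulate_outcome(t_rct)
diff = y_rct[t_rct == 1].mean() - y_rct[t_rct == 0].mean()

# Standard error of a difference in means (unequal variances)
se = np.sqrt(
    y_rct[t_rct == 1].var(ddof=1) / (t_rct == 1).sum()
    + y_rct[t_rct == 0].var(ddof=1) / (t_rct == 0).sum()
)

print(f"observational estimate: {naive:.2f}")
print(f"RCT estimate: {diff:.2f} (95% CI +/- {1.96 * se:.2f})")
```

The same difference-in-means estimator is biased in the observational arm and close to the true effect in the randomized arm; only the assignment mechanism changed.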
#### Observational + adjustment
- **Propensity Score Matching (PSM)**.
- **Inverse Probability Weighting (IPW)**.
- **Regression discontinuity (RDD)**.
- **Difference-in-differences (DiD)**.
- **Instrumental variables (IV)**.
- **Synthetic control**.
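
Of the methods listed, instrumental variables can be illustrated with plain least squares. A minimal two-stage least squares (2SLS) sketch, assuming simulated data with a valid instrument `z` and a made-up true effect of 2.0; the helper `ols` is introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

u = rng.normal(size=n)                        # unobserved confounder
z = rng.normal(size=n)                        # instrument: moves x, no direct path to y
x = 0.8 * z + u + rng.normal(size=n)
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true effect of x on y is 2.0

def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)

# Naive OLS of y on x is biased because u drives both
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: predict x from the instrument
stage1 = np.column_stack([ones, z])
x_hat = stage1 @ ols(stage1, x)

# Stage 2: regress y on the predicted x
beta_iv = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.2f}")
print(f"2SLS estimate: {beta_iv:.2f}")
```

Running the two stages by hand gives a valid point estimate but not valid standard errors; a dedicated IV routine would be needed for inference.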
#### Causal graph (DAG)
- Makes the causal assumptions explicit.
- do-calculus determines whether the effect is identifiable from the graph.
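
To make the graph-based reasoning concrete, the sketch below lists the open backdoor paths between a treatment and an outcome in a DAG given as an edge list. It uses only the standard library; the function names are hypothetical, and it enumerates paths by brute force, so it is suited to small teaching graphs only.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(edges, start, end):
    """All simple paths between start and end, ignoring edge direction."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(nbrs.get(path[-1], ())):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])

def is_blocked(edges, path, given):
    """d-separation check for a single path given a conditioning set."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if is_collider:
            # A collider blocks unless it or one of its descendants is conditioned on
            if mid not in given and not (descendants(edges, mid) & given):
                return True
        elif mid in given:
            # A chain or fork node blocks when it is conditioned on
            return True
    return False

def open_backdoor_paths(edges, x, y, given=()):
    """Paths from x to y that start with an arrow into x and are not blocked."""
    given, edge_set = set(given), set(edges)
    return [
        p for p in simple_paths(edges, x, y)
        if (p[1], x) in edge_set and not is_blocked(edges, p, given)
    ]

confounded = [("Z", "X"), ("Z", "Y"), ("X", "Y")]
print(open_backdoor_paths(confounded, "X", "Y"))            # one open path via Z
print(open_backdoor_paths(confounded, "X", "Y", {"Z"}))     # closed by adjusting for Z

m_bias = [("U1", "X"), ("U1", "C"), ("U2", "C"), ("U2", "Y"), ("X", "Y")]
print(open_backdoor_paths(m_bias, "X", "Y"))                # collider C keeps it closed
print(open_backdoor_paths(m_bias, "X", "Y", {"C"}))         # adjusting for C opens it
```

The second graph shows why an explicit DAG matters: adjusting for C looks harmless but opens a path that was closed.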
#### ML-based
- **Causal forest** (Wager-Athey).
- **Double ML** (Chernozhukov).
- **CausalGAN / counterfactual VAE**.
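
The idea behind Double ML can be sketched without a dedicated library: residualize both treatment and outcome on the covariates using cross-fitting, then regress residual on residual. In this sketch the nuisance learner is plain polynomial least squares, chosen only to keep the example self-contained; any ML regressor could stand in. The data-generating process and the helper `fit_predict` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, theta = 20_000, 1.5

X = rng.uniform(-2, 2, size=n)
t = np.sin(X) + rng.normal(size=n)                       # treatment depends nonlinearly on X
y = theta * t + 2 * np.sin(X) + X + rng.normal(size=n)   # so does the outcome

def fit_predict(x_train, target_train, x_test, degree=5):
    """Stand-in nuisance learner: polynomial least squares."""
    coefs = np.polynomial.polynomial.polyfit(x_train, target_train, degree)
    return np.polynomial.polynomial.polyval(x_test, coefs)

# Two-fold cross-fitting: each row is residualized by a model that never saw it
folds = rng.permutation(n) % 2
t_res, y_res = np.empty(n), np.empty(n)
for k in (0, 1):
    train, test = folds != k, folds == k
    t_res[test] = t[test] - fit_predict(X[train], t[train], X[test])
    y_res[test] = y[test] - fit_predict(X[train], y[train], X[test])

# Final stage: regress outcome residuals on treatment residuals
theta_hat = (t_res @ y_res) / (t_res @ t_res)

# Naive slope of y on t, ignoring X entirely
naive = np.polyfit(t, y, 1)[0]

print(f"naive slope:    {naive:.2f}")
print(f"partialled-out: {theta_hat:.2f}")
```

Cross-fitting is what keeps overfitting in the nuisance models from leaking into the final estimate; with flexible learners that step matters more than it does for the polynomial used here.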
### Simpson's paradox
- The direction of an association reverses between the aggregate and the subgroups.
- Classic examples: Berkeley admissions, kidney stone treatment.
- → Stratify by the confounder.
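
The kidney stone example can be worked through directly. The counts below are the commonly cited figures from that study (successes, patients); the helpers `rate` and `adjusted_rate` are illustrative names.

```python
# (successes, patients) per treatment and stone size
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}

def rate(treatment, size=None):
    """Success rate for a treatment, overall or within one stone-size stratum."""
    cells = [v for (t, s), v in data.items() if t == treatment and size in (None, s)]
    return sum(c[0] for c in cells) / sum(c[1] for c in cells)

def adjusted_rate(treatment):
    """Stratified rate: weight each stratum by its share of all patients."""
    total = sum(n for _, n in data.values())
    result = 0.0
    for size in ("small", "large"):
        stratum_n = sum(n for (t, s), (_, n) in data.items() if s == size)
        result += rate(treatment, size) * stratum_n / total
    return result

for label in ("A", "B"):
    print(
        f"{label}: overall {rate(label):.3f}, "
        f"small {rate(label, 'small'):.3f}, "
        f"large {rate(label, 'large'):.3f}, "
        f"adjusted {adjusted_rate(label):.3f}"
    )
```

Treatment A wins in both strata yet loses in the aggregate, because it was given mostly to the harder large-stone cases. Stratifying by stone size restores the ordering seen within each stratum.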
### Applications
1. **A/B testing**, plus follow-up causal analysis.
2. **Pricing**: effect of price on demand.
3. **Marketing attribution**: effect of each channel on conversion.
4. **Medicine**: treatment effects.
5. **Policy**: e.g., minimum wage.
6. **Education**: program effects.
7. **Recommender systems**: a click is not the same as a caused conversion.
### Modern tools
- **DoWhy** (Microsoft): 4-step framework (model, identify, estimate, refute).
- **EconML** (Microsoft).
- **CausalML** (Uber).
- **pgmpy**: graphical models.
- **GeNIe / Hugin**: visual modeling tools.
- **DAGitty**: web-based DAG editor.
### Limits of LLMs
- Strong at association.
- Emit spurious relationships with confidence.
- Weak at causal reasoning.
- Hybrid approaches (LLM + symbolic causal models) are a growing trend.
## 💻 Patterns
### DoWhy (4-step framework)
```python
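from dowhy import CausalModel

# 1. Model: df is assumed to be a pandas DataFrame containing these columns
model = CausalModel(
    data=df,
    treatment='ad_exposure',
    outcome='conversion',
    common_causes=['age', 'income', 'past_purchases'],
)

# 2. Identify the estimand (fail rather than proceed if it is not identifiable)
estimand = model.identify_effect(proceed_when_unidentifiable=False)
print(estimand)

# 3. Estimate the effect with propensity score matching on the backdoor set
estimate = model.estimate_effect(estimand, method_name='backdoor.propensity_score_matching')
print(estimate.value)

# 4. Refute: rerun with a placebo treatment; the effect should drop toward 0
refutation = model.refute_estimate(estimand, estimate, method_name='placebo_treatment_refuter')
print(refutation)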
```
### Propensity Score Matching
```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Propensity score = P(treatment=1 | covariates)
# Assumes df has a binary `treatment`, a numeric `outcome`, and the columns named in `covariates`
ps_model = LogisticRegression()
ps_model.fit(df[covariates], df['treatment'])
df['ps'] = ps_model.predict_proba(df[covariates])[:, 1]

# Match each treated unit to its nearest control on the propensity score (with replacement)
treated = df[df['treatment'] == 1]
control = df[df['treatment'] == 0]
knn = NearestNeighbors(n_neighbors=1).fit(control[['ps']].to_numpy())
matches = knn.kneighbors(treated[['ps']].to_numpy(), return_distance=False)

# Matching treated units to controls estimates the ATT (effect on the treated), not the ATE
att = treated['outcome'].mean() - control.iloc[matches.flatten()]['outcome'].mean()
```
### IPW (Inverse Probability Weighting)
```python
import numpy as np

def ipw_ate(df, treatment, outcome, ps):
    """Normalized (Hajek) IPW estimate of the ATE; ps = P(treatment=1 | covariates)."""
    t, y = df[treatment].to_numpy(), df[outcome].to_numpy()
    # Weight treated units by 1/ps and controls by 1/(1 - ps)
    weight = np.where(t == 1, 1 / ps, 1 / (1 - ps))
    treated_avg = (y * t * weight).sum() / weight[t == 1].sum()
    control_avg = (y * (1 - t) * weight).sum() / weight[t == 0].sum()
    return treated_avg - control_avg
```
### Difference-in-Differences (DiD)
```python
import statsmodels.api as sm

# Panel data: pre/post x treated/control.
# `treatment_period` is assumed to be the first period in which treatment applies.
df['post'] = (df['period'] >= treatment_period).astype(int)
df['treated'] = (df['group'] == 'treated').astype(int)
df['interaction'] = df['post'] * df['treated']

model = sm.OLS(df['outcome'], sm.add_constant(df[['post', 'treated', 'interaction']])).fit()
# The coefficient on `interaction` is the DiD treatment effect
print(model.summary())
```
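
The regression above is the standard two-group, two-period DiD, which can be written out explicitly. Reading it as a causal effect assumes parallel trends: without treatment, the treated group would have changed by the same amount as the control group.

$$
\hat{\delta}_{\text{DiD}} = \left(\bar{Y}_{\text{treated,post}} - \bar{Y}_{\text{treated,pre}}\right) - \left(\bar{Y}_{\text{control,post}} - \bar{Y}_{\text{control,pre}}\right)
$$

$$
Y = \beta_0 + \beta_1\,\text{post} + \beta_2\,\text{treated} + \beta_3\,(\text{post} \times \text{treated}) + \varepsilon, \qquad \hat{\delta}_{\text{DiD}} = \hat{\beta}_3
$$

With only these three regressors the model is saturated, so the fitted $\hat{\beta}_3$ equals the difference of the four group means exactly.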
### Causal Forest (heterogeneous treatment effect)
```python
from econml.dml import CausalForestDML

cf = CausalForestDML(
    n_estimators=200,
    discrete_treatment=True,
    random_state=42,
)
# X: features the effect may vary with; W: additional confounders to control for
cf.fit(Y=df['outcome'], T=df['treatment'], X=df[features], W=df[confounders])

# Conditional average treatment effect (CATE) for each test row
ites = cf.effect(df_test[features])

# 95% confidence interval for each effect
lower, upper = cf.effect_interval(df_test[features], alpha=0.05)
```
### DAG + Pearl's do-calculus (pgmpy)
```python
# Note: newer pgmpy releases may expose this class as DiscreteBayesianNetwork
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference.CausalInference import CausalInference

# Graph: X → Y, X → Z → Y
model = BayesianNetwork([('X', 'Y'), ('X', 'Z'), ('Z', 'Y')])
# ... add CPDs ...

ci = CausalInference(model)
# P(Y | do(X = 1))
result = ci.query(variables=['Y'], do={'X': 1})
print(result)
```
### Synthetic control (state policy effect)
```python
# A weighted combination of control units is built to mimic the treated unit
# NOTE: `synthetic_control` is a hypothetical library; this shows the shape of the API only
from synthetic_control import SyntheticControl

sc = SyntheticControl(
    treated_unit='California',
    control_pool=other_states,
    pre_period=range(1990, 2000),
    post_period=range(2000, 2010),
)
sc.fit(predictors=['gdp', 'unemployment', 'income'])
effect = sc.treatment_effect()
```
### Refutation (sensitivity analysis)
```python
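from dowhy import CausalModel

# Reuses `model`, `estimand`, and `estimate` from the DoWhy pattern above

# 1. Placebo treatment: the effect should fall close to 0 if the estimate is robust
refute_placebo = model.refute_estimate(
    estimand, estimate, method_name='placebo_treatment_refuter',
)

# 2. Random common cause: adding a random covariate should barely change the estimate
refute_random = model.refute_estimate(
    estimand, estimate, method_name='random_common_cause',
)

# 3. Data subset: the estimate should stay stable on random subsets of the data
refute_subset = model.refute_estimate(
    estimand, estimate, method_name='data_subset_refuter',
)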
```
### Simpson's paradox detector
```python
def detect_simpson(df, x_col, y_col, group_col):
    """Flag a sign reversal between the overall and per-group correlations."""
    # Aggregate correlation
    overall_corr = df[x_col].corr(df[y_col])
    # Correlation within each subgroup
    subgroup_corrs = df.groupby(group_col)[[x_col, y_col]].apply(
        lambda g: g[x_col].corr(g[y_col])
    )
    if overall_corr > 0 and (subgroup_corrs < 0).all():
        return "Simpson's paradox: overall +, subgroups all -"
    if overall_corr < 0 and (subgroup_corrs > 0).all():
        return "Simpson's paradox: overall -, subgroups all +"
    return None
```
## 🤔 Decision Criteria
| Situation | Method |
|---|---|
| New feature launch | A/B test (RCT) |
| Historical data | DoWhy + matching |
| Heterogeneous effect | Causal Forest |
| Panel data | DiD |
| Cutoff threshold | RDD |
| Hidden confounder, valid instrument available | Instrumental Variables |
| Single treated unit | Synthetic Control |
| Many confounders modeled with ML | Double ML |
**Default**: RCT first. For observational data, use DoWhy plus sensitivity refutation.
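
RDD appears in the table but has no pattern of its own, so here is a minimal sharp-RDD sketch: fit a line on each side of the cutoff within a bandwidth and take the gap between the two fits at the cutoff. The simulated data, the bandwidth of 0.25, and the helper `rdd_estimate` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, cutoff, true_jump = 20_000, 0.0, 2.0

score = rng.uniform(-1, 1, size=n)             # running variable
treated = (score >= cutoff).astype(float)      # sharp design: the cutoff decides treatment
y = 1.0 + 0.8 * score + true_jump * treated + rng.normal(scale=0.5, size=n)

def rdd_estimate(score, y, cutoff, bandwidth):
    """Local linear fit on each side of the cutoff; returns the jump at the cutoff."""
    def intercept_at_cutoff(mask):
        # Centering the score makes the intercept the fitted value at the cutoff
        slope, intercept = np.polyfit(score[mask] - cutoff, y[mask], 1)
        return intercept

    left = (score < cutoff) & (score >= cutoff - bandwidth)
    right = (score >= cutoff) & (score <= cutoff + bandwidth)
    return intercept_at_cutoff(right) - intercept_at_cutoff(left)

jump = rdd_estimate(score, y, cutoff, bandwidth=0.25)
print(f"estimated jump at cutoff: {jump:.2f}")
```

The bandwidth trades bias for variance: a narrow window relies less on the linear fit being right but leaves fewer observations, so results are usually reported for several bandwidths.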
## 🔗 Graph
- Parent: [[Statistics]] · [[Decision Theory]]
- Variants: [[DAG]] · [[Do-Calculus]] · [[Counterfactual]]
- Adjacent: [[Bayesian Statistics]] · [[Anthropic-Principle]] · [[Beliefs]] · [[Algorithmic Fairness]]
## 🤖 LLM Usage
**When to use**: Policy decisions, marketing attribution, medical treatment evaluation, root cause analysis, and fairness counterfactuals.
**When not to use**: Pure prediction tasks, where standard ML is sufficient. Do not rely on an LLM alone, since it is weak on rungs 2-3.
## ❌ Anti-patterns
- **Correlation = causation**: the classic mistake.
- **Controlling for a collider**: induces a spurious correlation.
- **No DAG**: assumptions stay hidden.
- **Single method**: no sensitivity check across estimators.
- **No refutation**: the estimate is fragile.
- **Unaware of Simpson's paradox**: aggregate results mislead.
- **Trusting an LLM's causal claims**: they are association-level only.
## 🧪 Verification / Duplicates
- Verified against Pearl, "The Book of Why"; Hernán and Robins, "Causal Inference: What If"; and the DoWhy paper.
- Trust level: A.
- Related: [[Bayesian Statistics]] · [[Algorithmic Fairness]] · [[Bias-Correction-Algorithm]] · [[A/B Testing]] · [[Anthropic-Principle]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Pearl ladder + DAG + DoWhy / PSM / DiD / Causal Forest code |