---
id: wiki-2026-0508-statistical-analysis
title: Statistical Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Statistics, Inferential Statistics, Data Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, hypothesis-testing, regression, bayesian]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / R
  framework: scipy / statsmodels / pymc / R-tidyverse
---

# Statistical Analysis

## One-line summary
> **"Quantify the uncertainty in data."** From the Fisher–Neyman frequentist framework to Gelman's 2020s Bayesian workflow, the 2026 standard is to build reproducible inference on a statsmodels + PyMC 5.x + ArviZ pipeline.

## Core

### Two paradigms
- **Frequentist**: parameters are fixed and the data are random. p-values, confidence intervals, MLE.
- **Bayesian**: parameters are random too; prior + likelihood → posterior. Credible intervals, posterior predictive distributions.
- **2026 consensus**: both are tools. Prefer Bayesian for small n or strong priors, frequentist for large n or regulated settings.

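To make the contrast concrete, here is a minimal sketch that computes both kinds of interval for one binomial proportion. The function name `proportion_intervals`, the choice of the Wilson score interval, and the uniform `Beta(1, 1)` prior are illustrative assumptions, not something this note prescribes.

```python
import numpy as np
from scipy import stats

def proportion_intervals(successes, trials, alpha=0.05, prior=(1.0, 1.0)):
    """Return (wilson_ci, credible_interval) for a binomial proportion."""
    p_hat = successes / trials
    z = stats.norm.ppf(1 - alpha / 2)

    # Frequentist: Wilson score confidence interval.
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    wilson = (centre - half, centre + half)

    # Bayesian: Beta prior + binomial likelihood gives a Beta posterior.
    a, b = prior  # assumed prior pseudo-counts
    posterior = stats.beta(a + successes, b + trials - successes)
    credible = (posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2))
    return wilson, credible

# Illustrative numbers only: 7 successes out of 20 trials.
wilson_ci, credible_int = proportion_intervals(7, 20)
print("Wilson CI:", wilson_ci)
print("Equal-tailed credible interval:", credible_int)
```

With a flat prior the two intervals tend to land close together. They diverge as the prior becomes more informative or n shrinks, which is the situation where the choice of paradigm matters most.
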
### Core procedure
- **EDA**: distributions, missing values, outliers, correlation matrix.
- **Hypothesis test**: t-test, χ², Mann-Whitney, permutation. Always report effect size and CI alongside.
- **Regression**: OLS → GLM → mixed-effects → hierarchical Bayesian.
- **Model checking**: residual diagnostics, posterior predictive checks, k-fold CV.

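Of the tests listed above, the permutation test is the easiest to write by hand. The sketch below is a plain NumPy version for a two-sided difference in means. The function name, the default of 10,000 resamples, and the add-one p-value correction are assumptions chosen for illustration.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_resamples):
        # Shuffle group labels by permuting the pooled sample.
        perm = rng.permutation(pooled)
        diff = perm[:a.size].mean() - perm[a.size:].mean()
        count += abs(diff) >= abs(observed)

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (count + 1) / (n_resamples + 1)
    return observed, p_value

# Illustrative data only.
treated = np.arange(10) + 100.0
control = np.arange(10, dtype=float)
print(permutation_test_mean_diff(treated, control))
```

The observed difference doubles as an unstandardized effect size, so it can be reported alongside the p-value as the checklist above asks.
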
### Applications
1. A/B test analysis (web, ML model rollout).
2. Clinical trial efficacy.
3. Causal inference (DiD, IV, RDD, double ML).
4. Risk modeling (insurance, finance).

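As an example of the first application, a conversion-rate A/B test can be analyzed with a two-proportion z-test. This is a minimal sketch under the usual large-sample normal approximation. The function name and the pooled-SE test with an unpooled (Wald) confidence interval are assumptions for illustration, and the counts in the usage lines are made up.

```python
import numpy as np
from scipy import stats

def ab_test_proportions(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test and Wald CI for the difference p_b - p_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # The test statistic uses the pooled rate, which is valid under H0: p_a == p_b.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * stats.norm.sf(abs(z))

    # The confidence interval uses the unpooled standard error.
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    crit = stats.norm.ppf(1 - alpha / 2)
    return dict(diff=diff, z=z, p=p_value, ci=(diff - crit * se, diff + crit * se))

# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment.
print(ab_test_proportions(100, 1000, 150, 1000))
```

For low conversion rates or small samples the normal approximation degrades, and an exact or Bayesian analysis is usually the safer choice.
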
## 💻 Patterns

### Welch's t-test + effect size + CI (scipy 1.13+)
```python
import numpy as np
from scipy import stats

def welch_with_effect(a, b):
    # Accept lists as well as arrays.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    t, p = stats.ttest_ind(a, b, equal_var=False)
    n1, n2 = len(a), len(b)
    s1, s2 = a.var(ddof=1), b.var(ddof=1)  # sample variances
    # Cohen's d, standardized by the pooled SD
    pooled = np.sqrt(((n1-1)*s1 + (n2-1)*s2) / (n1+n2-2))
    cohen_d = (a.mean() - b.mean()) / pooled
    # Welch–Satterthwaite degrees of freedom
    df = (s1/n1 + s2/n2)**2 / ((s1/n1)**2/(n1-1) + (s2/n2)**2/(n2-1))
    se = np.sqrt(s1/n1 + s2/n2)
    crit = stats.t.ppf(0.975, df)  # 95% two-sided CI for the mean difference
    diff = a.mean() - b.mean()
    return dict(t=t, p=p, d=cohen_d, ci=(diff - crit*se, diff + crit*se))
```

### OLS regression with diagnostics (statsmodels)
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# `df` is assumed to be a pandas DataFrame with columns y, x1, x2, group.
# cov_type="HC3" requests heteroskedasticity-robust standard errors.
model = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit(cov_type="HC3")
print(model.summary())

# diagnostics: Breusch-Pagan test for heteroskedasticity
bp = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p:", bp[1])  # bp[1] is the LM-test p-value
```

### Hierarchical Bayesian (PyMC 5.x)
```python
import pymc as pm
import arviz as az

# Assumed inputs: x and y are 1-D arrays; group_idx maps each row
# to an integer group in 0..n_groups-1.
with pm.Model() as hier:
    # Group-level (hyper)priors for the varying intercepts
    mu_a = pm.Normal("mu_a", 0, 5)
    sigma_a = pm.HalfNormal("sigma_a", 1)
    a = pm.Normal("a", mu_a, sigma_a, shape=n_groups)
    b = pm.Normal("b", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    mu = a[group_idx] + b * x
    pm.Normal("y_obs", mu, sigma, observed=y)
    idata = pm.sample(2000, tune=1000, target_accept=0.95)

az.plot_trace(idata)
az.summary(idata, var_names=["mu_a", "sigma_a", "b"])
```

### Bootstrap CI
```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n=10_000, alpha=0.05, rng=None):
    # Percentile bootstrap for 1-D data; `stat` must accept an `axis` argument.
    data = np.asarray(data)
    if rng is None:
        rng = np.random.default_rng(42)
    # Each row is one resample drawn with replacement.
    boots = stat(rng.choice(data, size=(n, len(data)), replace=True), axis=1)
    lo, hi = np.quantile(boots, [alpha/2, 1-alpha/2])
    return stat(data), (lo, hi)
```

### Multiple testing correction
```python
from statsmodels.stats.multitest import multipletests

# `pvals` is assumed to be a 1-D array-like of raw p-values.
reject, pvals_corr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

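For intuition about what `method="fdr_bh"` computes, the Benjamini-Hochberg step-up rule can be written out by hand. This is a sketch for understanding only; the function name is an assumption and the p-values are made up. In practice the library call is preferable.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean rejection mask under the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]

    # Compare the i-th smallest p-value with alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds

    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: reject everything up to the largest rank that passes.
        k = np.nonzero(below)[0].max()
        reject[order[:k + 1]] = True
    return reject

# Illustrative p-values only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh = benjamini_hochberg(pvals)
bonferroni = np.asarray(pvals) <= 0.05 / len(pvals)
print("BH rejections:", bh.sum(), "Bonferroni rejections:", bonferroni.sum())
```

BH controls the expected share of false discoveries rather than the chance of any false positive, which is why it typically rejects more hypotheses than Bonferroni on the same inputs.
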
### Causal inference: double machine learning (EconML / DoubleML)
```python
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor

# Assumed inputs: Y outcome, T continuous treatment,
# X effect modifiers, W additional controls.
dml = LinearDML(
    model_y=GradientBoostingRegressor(),  # nuisance model for the outcome
    model_t=GradientBoostingRegressor(),  # nuisance model for the treatment
    discrete_treatment=False,
    cv=5,  # folds used for cross-fitting
)
dml.fit(Y, T, X=X, W=W)
# Point estimates and interval bounds for the conditional treatment effect
print(dml.effect(X), dml.effect_interval(X))
```

## Decision criteria
| Situation | Approach |
|---|---|
| 2-group mean comparison, roughly normal | Welch's t-test |
| Non-parametric, small n | Mann-Whitney / permutation |
| Multi-level data | Mixed-effects (lme4 / statsmodels) |
| Strong prior, small n | Bayesian (PyMC) |
| Causal effect from observational data | DML / IV / RDD |
| Many comparisons | FDR (BH), not Bonferroni unless ≤10 tests |

**Defaults**: statsmodels for frequentist, PyMC 5 + ArviZ for Bayesian, EconML for causal.

## 🔗 Graph
- Parent: [[Probability Theory]]
- Variants: [[Bayesian_Inference|Bayesian Inference]] · [[Causal Inference]]
- Adjacent: [[Machine Learning]] · [[Power Analysis]]

## 🤖 LLM usage
**When**: pipeline scaffolding, EDA narrative, model spec translation, generating plot code.
**When not**: computing numerical p-values directly; use a library instead. Never rely on statistics hallucinated by an LLM.

## ❌ Anti-patterns
- **p-hacking**: cherry-picking after running multiple tests. Pre-registration and multiplicity correction are required.
- **Confusing CI with PI**: a confidence interval is not a prediction interval. Keep the two clearly separate.
- **HARKing**: hypothesizing after the results are known. Separate exploratory from confirmatory analysis.
- **Naive default prior**: avoid PyMC `Normal(0, 100)`; use a domain-informed, weakly informative prior.
- **n=30 rule**: a myth. Decide based on the shape of the distribution.

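The CI versus PI distinction is easiest to see numerically. The sketch below assumes i.i.d. normal data and computes both intervals for a sample mean. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def mean_ci_and_pi(x, alpha=0.05):
    """Return (confidence interval for the mean, prediction interval for one new draw)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)
    crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

    # CI: uncertainty about the population mean; shrinks like 1/sqrt(n).
    ci_half = crit * sd / np.sqrt(n)
    # PI: uncertainty about a single future observation; never narrower than ~crit * sd.
    pi_half = crit * sd * np.sqrt(1 + 1 / n)
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)

# Simulated data for illustration only.
sample = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=50)
ci, pi = mean_ci_and_pi(sample)
print("95% CI for the mean:", ci)
print("95% PI for a new observation:", pi)
```

Under these assumptions the prediction interval is wider than the confidence interval by a factor of sqrt(n + 1), so reporting a CI where a PI is needed badly understates the spread of future observations.
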
## 🧪 Verification / duplicates
- Verified (Wasserman "All of Statistics", Gelman BDA3, statsmodels docs 0.14+, PyMC 5.x docs).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: frequentist + Bayesian + causal patterns |