2nd/10_Wiki/Topics/Other/Statistical-Analysis.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
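Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links straight at redirect targets, mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections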

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-statistical-analysis
title: Statistical Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Statistics, Inferential Statistics, Data Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, hypothesis-testing, regression, bayesian]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / R
  framework: scipy / statsmodels / pymc / R-tidyverse
---
# Statistical Analysis
## In one line
> **"Quantify the uncertainty in data."** From the Fisher-Neyman frequentist framework to Gelman's 2020s Bayesian workflow, the standard as of 2026 is to build reproducible inference on a statsmodels + PyMC 5.x + ArviZ pipeline.
## Core concepts
### Two paradigms
- **Frequentist**: parameters are fixed, data are random. p-value, confidence interval, MLE.
- **Bayesian**: parameters are random too; prior + likelihood → posterior. Credible interval, posterior predictive.
- **2026 consensus**: both are tools. Use Bayesian for small n or a strong prior, frequentist for large n or regulated settings.
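
To make the contrast between the two paradigms concrete, here is a minimal sketch that computes both kinds of interval for a single binomial proportion. The function name `proportion_intervals`, the flat Beta(1, 1) prior, and the choice of the Wald interval on the frequentist side are illustrative assumptions, not something this note prescribes.

```python
import numpy as np
from scipy import stats

def proportion_intervals(successes, trials, prior_a=1.0, prior_b=1.0, level=0.95):
    """Return (wald_ci, credible_interval) for a binomial proportion."""
    alpha = 1.0 - level
    p_hat = successes / trials  # maximum-likelihood estimate

    # Frequentist: normal-approximation (Wald) confidence interval around the MLE
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / trials)
    wald = (p_hat - z * se, p_hat + z * se)

    # Bayesian: Beta prior + binomial likelihood gives a Beta posterior (conjugacy)
    posterior = stats.beta(prior_a + successes, prior_b + trials - successes)
    credible = (posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2))
    return wald, credible

wald, credible = proportion_intervals(successes=7, trials=20)
print("Wald CI:", wald)
print("Equal-tailed credible interval:", credible)
```

With small n the two intervals can differ noticeably, and the credible interval always stays inside [0, 1], whereas the Wald interval may not.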
### Core procedure
- **EDA**: distribution, missing, outlier, correlation matrix.
- **Hypothesis test**: t-test, χ², Mann-Whitney, permutation. Always report effect size and CI alongside.
- **Regression**: OLS → GLM → mixed-effects → hierarchical Bayesian.
- **Model checking**: residual diagnostics, posterior predictive checks, k-fold CV.
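
Of the tests listed above, the permutation test is the easiest to write from scratch, so a small sketch may help show what it actually does. The helper name `permutation_test_mean_diff`, the default of 10,000 permutations, and the add-one p-value correction are assumptions made for illustration.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so shuffle and re-split
        perm = rng.permutation(pooled)
        diff = perm[:a.size].mean() - perm[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1

    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value

diff, p = permutation_test_mean_diff([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
print(f"mean difference = {diff:.2f}, p = {p:.4f}")
```

The returned mean difference doubles as a raw effect size, in line with the advice to report an effect alongside the p-value.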
### Applications
1. A/B test 분석 (web, ML model rollout).
2. Clinical trial efficacy.
3. Causal inference (DiD, IV, RDD, double ML).
4. Risk modeling (insurance, finance).
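
For the first application, A/B test analysis, a rough sketch of a two-proportion z-test with a confidence interval on the lift could look like the following. The function name `ab_test_proportions` and the conversion counts are hypothetical; the pooled standard error is used for the test and the unpooled one for the interval, which is one common convention rather than the only option.

```python
import numpy as np
from scipy import stats

def ab_test_proportions(conv_a, n_a, conv_b, n_b, level=0.95):
    """Two-sided z-test for p_b - p_a plus a Wald CI for the difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Test statistic: pooled standard error under H0 (p_a == p_b)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * stats.norm.sf(abs(z))

    # Interval: unpooled standard error, since H0 is not assumed here
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return dict(diff=diff, z=z, p=p_value, ci=(diff - crit * se, diff + crit * se))

# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment
print(ab_test_proportions(100, 1000, 150, 1000))
```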
## 💻 Patterns
### Welch's t-test + effect size + CI (scipy 1.13+)
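```python
import numpy as np
from scipy import stats

def welch_with_effect(a, b):
    """Welch's t-test plus Cohen's d and a 95% CI for the mean difference."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    t, p = stats.ttest_ind(a, b, equal_var=False)
    n1, n2 = len(a), len(b)
    s1, s2 = a.var(ddof=1), b.var(ddof=1)  # sample variances
    # Cohen's d is standardized by the pooled standard deviation
    pooled = np.sqrt(((n1-1)*s1 + (n2-1)*s2) / (n1+n2-2))
    cohen_d = (a.mean() - b.mean()) / pooled
    # Welch-Satterthwaite degrees of freedom
    df = (s1/n1 + s2/n2)**2 / ((s1/n1)**2/(n1-1) + (s2/n2)**2/(n2-1))
    se = np.sqrt(s1/n1 + s2/n2)
    crit = stats.t.ppf(0.975, df)
    diff = a.mean() - b.mean()
    return dict(t=t, p=p, d=cohen_d, ci=(diff - crit*se, diff + crit*se))
```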
### OLS regression with diagnostics (statsmodels)
```python
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# `df` is assumed to be a pandas DataFrame with columns y, x1, x2, group
model = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit(cov_type="HC3")  # HC3 robust SEs
print(model.summary())

# Diagnostics: Breusch-Pagan test for heteroskedasticity
bp = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p:", bp[1])  # bp[1] is the LM-test p-value
```
### Hierarchical Bayesian (PyMC 5.x)
```python
import pymc as pm
import arviz as az

# n_groups, group_idx, x, and y are assumed to be defined by the caller
with pm.Model() as hier:
    # Hyperpriors for the distribution of group intercepts
    mu_a = pm.Normal("mu_a", 0, 5)
    sigma_a = pm.HalfNormal("sigma_a", 1)
    # Varying intercepts (one per group) and a shared slope
    a = pm.Normal("a", mu_a, sigma_a, shape=n_groups)
    b = pm.Normal("b", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    mu = a[group_idx] + b * x
    pm.Normal("y_obs", mu, sigma, observed=y)
    idata = pm.sample(2000, tune=1000, target_accept=0.95)

az.plot_trace(idata)
az.summary(idata, var_names=["mu_a", "sigma_a", "b"])
```
### Bootstrap CI
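```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap CI; `stat` must accept an `axis` argument."""
    data = np.asarray(data)
    rng = rng if rng is not None else np.random.default_rng(42)
    # Draw all n resamples at once; each row is one bootstrap sample
    boots = stat(rng.choice(data, size=(n, len(data)), replace=True), axis=1)
    lo, hi = np.quantile(boots, [alpha/2, 1-alpha/2])
    return stat(data), (lo, hi)
```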
### Multiple testing correction
```python
from statsmodels.stats.multitest import multipletests
reject, pvals_corr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```
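
To see what `method="fdr_bh"` does, the Benjamini-Hochberg step-up rule can be sketched in a few lines of NumPy. The function `benjamini_hochberg` below is a hand-written illustration that returns only the reject mask, not a reimplementation of the statsmodels internals, and the p-values are made up for the example.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean reject mask using the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)            # indices that sort p ascending
    ranked = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m  # alpha * i / m for rank i
    below = ranked <= thresholds

    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up: find the largest rank that passes, then reject it and all smaller p-values
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Made-up p-values for illustration
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))
```

The step-up behavior matters: a p-value that misses its own threshold is still rejected if any larger p-value passes its threshold.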
### Causal inference: double machine learning (EconML / DoubleML)
```python
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor

# Y (outcome), T (treatment), X (effect modifiers), W (controls) are assumed to be defined
dml = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingRegressor(),
    discrete_treatment=False,
    cv=5,
)
dml.fit(Y, T, X=X, W=W)
print(dml.effect(X), dml.effect_interval(X))
```
## Decision criteria
| Situation | Approach |
|---|---|
| 2-group mean compare, normal-ish | Welch's t-test |
| Non-parametric, small n | Mann-Whitney / permutation |
| Multi-level data | Mixed-effects (lme4 / statsmodels) |
| Strong prior, small n | Bayesian (PyMC) |
| Causal effect from observational | DML / IV / RDD |
| Many comparisons | FDR (BH), not Bonferroni unless ≤10 tests |
**Defaults**: statsmodels for frequentist, PyMC 5 + ArviZ for Bayesian, EconML for causal.
## 🔗 Graph
- Parent: [[Probability Theory]]
- Variants: [[Bayesian_Inference|Bayesian Inference]] · [[Causal Inference]]
- Adjacent: [[Machine Learning]] · [[Power Analysis]]
## 🤖 LLM usage
**When to use**: pipeline scaffolding, EDA narrative, model spec translation, plot code generation.
**When not to use**: computing numerical p-values directly. Use a library instead, and never rely on a statistic an LLM may have hallucinated.
## ❌ Anti-patterns
- **p-hacking**: cherry-picking after running multiple tests. Pre-registration and multiple-testing correction are required.
- **Confusing CI and PI**: a confidence interval is not a prediction interval. Keep the two clearly separate.
- **HARKing**: hypothesizing after the results are known. Separate exploratory from confirmatory analysis.
- **Naive default prior**: avoid PyMC `Normal(0, 100)`; use a domain-informed, weakly informative prior.
- **n=30 rule**: a myth. Decide based on the shape of the distribution.
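
The CI versus PI distinction in the list above is easy to demonstrate numerically. The sketch below assumes roughly normal data and uses the textbook t-based formulas; the function name `mean_ci_and_pi` and the sample values are made up for illustration.

```python
import numpy as np
from scipy import stats

def mean_ci_and_pi(sample, level=0.95):
    """Return (confidence interval for the mean, prediction interval for one new value)."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)
    crit = stats.t.ppf(1 - (1 - level) / 2, df=n - 1)

    ci_half = crit * sd / np.sqrt(n)          # uncertainty about the population mean
    pi_half = crit * sd * np.sqrt(1 + 1 / n)  # uncertainty about a single future observation
    return (mean - ci_half, mean + ci_half), (mean - pi_half, mean + pi_half)

ci, pi = mean_ci_and_pi([4.8, 5.1, 5.0, 4.9, 5.2, 5.0])
print("CI for the mean:", ci)
print("PI for a new observation:", pi)
```

The prediction interval is wider by a factor of sqrt(n + 1), and unlike the confidence interval it does not shrink toward zero width as n grows.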
## 🧪 Verification / duplicates
- Verified (Wasserman "All of Statistics", Gelman BDA3, statsmodels docs 0.14+, PyMC 5.x docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — frequentist + Bayesian + causal patterns |