---
id: wiki-2026-0508-statistical-analysis
title: Statistical Analysis
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Statistics, Inferential Statistics, Data Analysis]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, hypothesis-testing, regression, bayesian]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python / R
  framework: scipy / statsmodels / pymc / R-tidyverse
---

# Statistical Analysis

## In one line

> **"Quantify the uncertainty in data."** From the Fisher–Neyman frequentist framework to Gelman's 2020s Bayesian workflow, the standard as of 2026 is to build reproducible inference with a statsmodels + PyMC 5.x + ArviZ pipeline.

## Core concepts

### Two paradigms

- **Frequentist**: the parameter is fixed and the data are random. p-values, confidence intervals, MLE.
- **Bayesian**: the parameter is also random; prior + likelihood → posterior. Credible intervals, posterior predictive.
- **2026 consensus**: both are tools. Use Bayesian methods for small n or strong priors, and frequentist methods for large n or regulated settings.

### Core procedure

- **EDA**: distributions, missing values, outliers, correlation matrix.
- **Hypothesis test**: t-test, χ², Mann-Whitney, permutation. Report effect size and CI alongside the p-value.
- **Regression**: OLS → GLM → mixed-effects → hierarchical Bayesian.
- **Model checking**: residual diagnostics, posterior predictive checks, k-fold CV.

### Applications

1. A/B test analysis (web, ML model rollout).
2. Clinical trial efficacy.
3. Causal inference (DiD, IV, RDD, double ML).
4. Risk modeling (insurance, finance).

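
The hypothesis-test step in the core procedure lists permutation tests as an option. A minimal sketch of a two-sample permutation test on the difference in means follows, using only the standard library. The function name `permutation_test`, the resample count, and the sample values are illustrative assumptions, not part of the original note.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly relabel the observations under the null of exchangeability.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Hypothetical samples, for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]
observed_diff, p_value = permutation_test(control, treated, n_resamples=2_000)
print(f"diff={observed_diff:.3f}, p={p_value:.4f}")
```

Because these two samples do not overlap at all, very few relabelings reproduce a difference as large as the observed one, so the estimated p-value is small. In line with the procedure above, an effect size and CI should still be reported next to it.
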
## 💻 Patterns

### Welch's t-test + effect size + CI (scipy 1.13+)

```python
import numpy as np
from scipy import stats

def welch_with_effect(a, b):
    # Accept lists as well as arrays; .var() and .mean() need ndarrays.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    t, p = stats.ttest_ind(a, b, equal_var=False)
    n1, n2 = len(a), len(b)
    s1, s2 = a.var(ddof=1), b.var(ddof=1)  # sample variances
    # Cohen's d uses the pooled standard deviation.
    pooled = np.sqrt(((n1-1)*s1 + (n2-1)*s2) / (n1+n2-2))
    cohen_d = (a.mean() - b.mean()) / pooled
    # Welch-Satterthwaite degrees of freedom.
    df = (s1/n1 + s2/n2)**2 / ((s1/n1)**2/(n1-1) + (s2/n2)**2/(n2-1))
    se = np.sqrt(s1/n1 + s2/n2)
    crit = stats.t.ppf(0.975, df)  # two-sided 95% CI
    diff = a.mean() - b.mean()
    return dict(t=t, p=p, d=cohen_d, ci=(diff - crit*se, diff + crit*se))
```

### OLS regression with diagnostics (statsmodels)

```python
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

# `df` is assumed to be a pandas DataFrame with columns y, x1, x2, group.
model = smf.ols("y ~ x1 + x2 + C(group)", data=df).fit(cov_type="HC3")
print(model.summary())

# Diagnostics: Breusch-Pagan test for heteroskedasticity.
bp = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p:", bp[1])  # bp[1] is the LM-test p-value
```

### Hierarchical Bayesian (PyMC 5.x)

```python
import pymc as pm
import arviz as az

# `n_groups`, `group_idx`, `x`, and `y` are assumed to be defined beforehand.
with pm.Model() as hier:
    # Hyperpriors for the group-level intercepts.
    mu_a = pm.Normal("mu_a", 0, 5)
    sigma_a = pm.HalfNormal("sigma_a", 1)
    a = pm.Normal("a", mu_a, sigma_a, shape=n_groups)  # varying intercepts
    b = pm.Normal("b", 0, 1)                           # shared slope
    sigma = pm.HalfNormal("sigma", 1)
    mu = a[group_idx] + b * x
    pm.Normal("y_obs", mu, sigma, observed=y)
    idata = pm.sample(2000, tune=1000, target_accept=0.95)

az.plot_trace(idata)
az.summary(idata, var_names=["mu_a", "sigma_a", "b"])
```

### Bootstrap CI

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n=10_000, alpha=0.05, rng=None):
    # Percentile bootstrap; `stat` must accept an `axis` argument.
    if rng is None:
        rng = np.random.default_rng(42)
    data = np.asarray(data)
    boots = stat(rng.choice(data, size=(n, len(data)), replace=True), axis=1)
    lo, hi = np.quantile(boots, [alpha/2, 1-alpha/2])
    return stat(data), (lo, hi)
```

### Multiple testing correction

```python
from statsmodels.stats.multitest import multipletests

# `pvals` is assumed to be an array-like of raw p-values.
reject, pvals_corr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

### Causal inference: double machine learning (EconML / DoubleML)

```python
from econml.dml import LinearDML
from sklearn.ensemble import GradientBoostingRegressor

# Y: outcome, T: continuous treatment, X: effect modifiers, W: controls
# (all assumed to be defined beforehand).
dml = LinearDML(
    model_y=GradientBoostingRegressor(),
    model_t=GradientBoostingRegressor(),
    discrete_treatment=False,
    cv=5,
)
dml.fit(Y, T, X=X, W=W)
print(dml.effect(X), dml.effect_interval(X))
```

## Decision criteria

| Situation | Approach |
|---|---|
| 2-group mean comparison, roughly normal | Welch's t-test |
| Non-parametric, small n | Mann-Whitney / permutation |
| Multi-level data | Mixed-effects (lme4 / statsmodels) |
| Strong prior, small n | Bayesian (PyMC) |
| Causal effect from observational data | DML / IV / RDD |
| Many comparisons | FDR (BH), not Bonferroni unless ≤10 tests |

**Defaults**: statsmodels for frequentist, PyMC 5 + ArviZ for Bayesian, EconML for causal.

## 🔗 Graph

- Parent: [[Probability Theory]]
- Variants: [[Bayesian_Inference|Bayesian Inference]] · [[Causal Inference]]
- Adjacent: [[Machine Learning]] · [[Power Analysis]]

## 🤖 LLM usage

**When to use**: pipeline scaffolding, EDA narrative, model spec translation, plot code generation.

**When not to use**: computing numerical p-values directly; use a library for that. Do not rely on statistics hallucinated by an LLM.

## ❌ Anti-patterns

- **p-hacking**: cherry-picking after running multiple tests. Pre-registration and multiple-testing correction are required.
- **CI vs PI confusion**: a confidence interval is not a prediction interval. Keep the two clearly distinct.
- **HARKing**: hypothesizing after the results are known. Separate exploratory from confirmatory analysis.
- **Naive default prior**: avoid PyMC `Normal(0, 100)`; use a domain-informed, weakly informative prior.
- **n=30 rule**: a myth. Decide based on the shape of the distribution.

## 🧪 Verification / duplicates

- Verified (Wasserman "All of Statistics", Gelman BDA3, statsmodels docs 0.14+, PyMC 5.x docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: frequentist + Bayesian + causal patterns |