---
id: wiki-2026-0508-statistics-data-analysis
title: "Statistics & Data Analysis"
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [stats, data analysis, applied statistics]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [statistics, data-analysis, ab-testing, ml, observability]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy-scipy-statsmodels-pymc
---

# Statistics & Data Analysis

## One-line summary

> **"Data lies; statistics catches it."** Statistics is the discipline of quantifying uncertainty and separating patterns from noise. The 2026 production standard: Bayesian methods (PyMC, Stan), causal inference (DoWhy, EconML), and CUPED for A/B test variance reduction.

## Core concepts

### Core dichotomy

- **Frequentist**: p-values and confidence intervals, interpreted as long-run frequencies over repeated experiments.
- **Bayesian**: posteriors and credible intervals, interpreted as beliefs updated by data.
- **2026 trend**: Bayesian methods are becoming dominant in production analytics (interpretable, and better behaved under sequential monitoring, though not immune to optional stopping).

### Must-know toolkit

- **Hypothesis tests**: t-test, Mann-Whitney, χ², Fisher exact.
- **Regression**: OLS, GLM (logistic, Poisson), mixed-effects.
- **Causal**: difference-in-differences, IV, RDD, synthetic control.
- **A/B**: CUPED, sequential testing (mSPRT), multi-armed bandits.

### Applications

1. Product A/B testing (CUPED + sequential).
2. SRE: anomaly detection on metrics.
3. Risk scoring of SAST/SCA findings (Bayesian prior).
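
The frequentist/Bayesian split described above can be made concrete on a single conversion rate. The following is a minimal sketch, not a recommended production routine: the counts (`k` conversions out of `n` visitors), the flat Beta(1, 1) prior, and the helper names `frequentist_ci` and `bayesian_ci` are all illustrative assumptions. It computes a Clopper-Pearson confidence interval and an equal-tailed credible interval from the conjugate Beta posterior, both via `scipy.stats.beta`.

```python
from scipy import stats


def frequentist_ci(k, n, level=0.95):
    """Clopper-Pearson (exact) confidence interval for a binomial proportion."""
    alpha = 1.0 - level
    # The interval bounds are quantiles of Beta distributions; the edge
    # cases k == 0 and k == n are handled explicitly.
    lo = stats.beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = stats.beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return float(lo), float(hi)


def bayesian_ci(k, n, prior_a=1.0, prior_b=1.0, level=0.95):
    """Equal-tailed credible interval from the conjugate Beta posterior."""
    alpha = 1.0 - level
    # Beta(prior_a, prior_b) prior + binomial likelihood -> Beta posterior
    post = stats.beta(prior_a + k, prior_b + n - k)
    return float(post.ppf(alpha / 2)), float(post.ppf(1 - alpha / 2))


# Hypothetical data (an assumption for illustration): 48 conversions in 1000 visits
k, n = 48, 1000
freq_lo, freq_hi = frequentist_ci(k, n)
bayes_lo, bayes_hi = bayesian_ci(k, n)
print(f"frequentist 95% CI : [{freq_lo:.4f}, {freq_hi:.4f}]")
print(f"Bayesian 95% CrI   : [{bayes_lo:.4f}, {bayes_hi:.4f}]")
```

With a flat prior and a sample this large, the two intervals are numerically close; the difference lies in interpretation. The confidence interval is a statement about the procedure over repeated experiments, while the credible interval is a probability statement about the rate given this data and prior. An informative prior (for example, a historical base rate for SAST finding severity) would shift the Bayesian interval, which is the mechanism behind the risk-scoring application listed above.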
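
## 💻 Patterns

### Welch t-test (A/B)

```python
import numpy as np
from scipy import stats

control = np.array([...])    # placeholder: per-user metric, control arm
treatment = np.array([...])  # placeholder: per-user metric, treatment arm

# Welch's t-test does not assume equal variances across arms
t, p = stats.ttest_ind(treatment, control, equal_var=False)

# CI for the difference in means: Welch standard error and
# Welch-Satterthwaite degrees of freedom (not the pooled n1+n2-2)
diff = treatment.mean() - control.mean()
v_t = treatment.var(ddof=1) / len(treatment)
v_c = control.var(ddof=1) / len(control)
df = (v_t + v_c) ** 2 / (v_t**2 / (len(treatment) - 1) + v_c**2 / (len(control) - 1))
ci = stats.t.interval(0.95, df, loc=diff, scale=np.sqrt(v_t + v_c))
print(f"Δ={diff:.4f}, p={p:.4f}, 95%CI={ci}")
```

### CUPED variance reduction

```python
import numpy as np

def cuped_adjust(y_pre, y_post, theta, pre_mean):
    # Remove the part of the post-period metric explained by the pre-period
    return y_post - theta * (y_pre - pre_mean)

# pre_c, post_c, pre_t, post_t: per-user pre/post metric arrays for each arm
pre_all = np.concatenate([pre_c, pre_t])
post_all = np.concatenate([post_c, post_t])

# Estimate theta once on pooled data, with ddof=1 to match np.cov
theta = np.cov(pre_all, post_all)[0, 1] / np.var(pre_all, ddof=1)

# Center both arms on the same pooled pre-period mean
y_adj_c = cuped_adjust(pre_c, post_c, theta, pre_all.mean())
y_adj_t = cuped_adjust(pre_t, post_t, theta, pre_all.mean())
```

### Bayesian A/B (PyMC)

```python
import pymc as pm

# n_a, n_b: trials per arm; k_a, k_b: conversions per arm
with pm.Model() as m:
    p_a = pm.Beta('p_a', 1, 1)  # flat prior on each conversion rate
    p_b = pm.Beta('p_b', 1, 1)
    pm.Binomial('obs_a', n=n_a, p=p_a, observed=k_a)
    pm.Binomial('obs_b', n=n_b, p=p_b, observed=k_b)
    pm.Deterministic('lift', (p_b - p_a) / p_a)  # relative lift of B over A
    idata = pm.sample(2000, tune=1000)

print(f"P(B>A) = {(idata.posterior['lift'] > 0).mean().item():.3f}")
```

### Sequential testing (mSPRT)

```python
import numpy as np

def msprt(x, y, sigma2_tau=0.01, alpha=0.05):
    """Return True if the mixture likelihood ratio crosses 1/alpha at this look."""
    x, y = np.asarray(x), np.asarray(y)
    n = min(len(x), len(y))
    delta = y[:n] - x[:n]                # paired differences, in arrival order
    s2 = delta.var(ddof=1)               # variance of the differences
    t = delta.mean() * np.sqrt(n)        # scaled mean difference
    # Normal-mixture likelihood ratio with prior variance sigma2_tau on the effect
    lr = np.sqrt(s2 / (s2 + n * sigma2_tau)) * np.exp(
        n * sigma2_tau * t**2 / (2 * s2 * (s2 + n * sigma2_tau)))
    return lr > 1 / alpha
```

### Causal: difference-in-differences (statsmodels)

```python
import statsmodels.formula.api as smf

# Two-way fixed effects: the main effects of `treated` and `post` are absorbed
# by the unit and time dummies, so only the interaction is included.
# df columns: y, treated (0/1), post (0/1), unit, time
groups = df['unit'].astype('category').cat.codes  # integer cluster labels
m = smf.ols('y ~ treated:post + C(unit) + C(time)', data=df).fit(
    cov_type='cluster', cov_kwds={'groups': groups})
print(m.params['treated:post'])  # DiD estimate
```

### Anomaly: robust z (MAD)

```python
import numpy as np

def mad_z(x):
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    # 0.6745 rescales the MAD to match the standard deviation under normality;
    # the small constant guards against division by zero when MAD is 0
    return 0.6745 * (x - med) / (mad + 1e-9)

# latency_p99: array of p99 latency samples
anomalies = np.abs(mad_z(latency_p99)) > 3.5
```

## Decision criteria

| Situation | Method |
|---|---|
| 2-arm online experiment, fixed N | Welch t-test + CUPED |
| Sequential monitoring / risk of peeking | mSPRT or Bayesian |
| Many arms, exploration has value | Thompson sampling bandit |
| Observational data, treatment effect | DiD / IV / synthetic control |
| Heavy-tailed metric (revenue) | Mann-Whitney + bootstrap CI |

**Defaults**: Welch + CUPED for online A/B; Bayesian for small-N or peeking; bootstrap for non-Gaussian.

## 🔗 Graph

- Parent: [[Probability Theory]]
- Variants: [[Bayesian Statistics]] · [[Causal Inference]]
- Applications: [[Anomaly Detection]] · [[ML Evaluation]]
- Adjacent: [[PyMC]]

## 🤖 LLM usage

**When to use**: experiment design review, interpreting p-values, choosing a test for a given distribution shape, generating PyMC models from descriptions.

**When not to use**: trusting LLM-computed p-values without verification; LLMs make arithmetic mistakes.

## ❌ Anti-patterns

- **Peeking**: checking a fixed-N test daily and stopping at the first significant result inflates the false positive rate from 5% to 30% or more.
- **HARKing**: forming the hypothesis after the results are known.
- **p<0.05 worship**: ignoring effect size.
- **Ignoring multiple testing**: with 20 metrics each tested at α = 0.05, about 1 false positive is expected even when nothing changed.
- **Using a post-treatment covariate in CUPED**: the covariate must be measured before exposure; otherwise the adjustment invalidates the estimate.

## 🧪 Verification / duplicates

- Verified (Microsoft CUPED paper 2013, Optimizely Stats Engine, Gelman BDA3, Wasserman All of Statistics).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: A/B + Bayesian + causal patterns |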