[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,93 +2,148 @@
 id: wiki-2026-0508-statistics-data-analysis
 title: "Statistics & Data Analysis"
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [P-Reinforce-AUTO-SADA-001]
+aliases: [stats, data analysis, applied statistics]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 0.98
-tags: [auto-reinforced, Statistics, data-Analysis, Hypothesis-Testing, data-science]
+confidence_score: 0.9
+verification_status: applied
+tags: [statistics, data-analysis, ab-testing, ml, observability]
 raw_sources: []
-last_reinforced: 2026-04-20
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
 tech_stack:
-  language: unspecified
-  framework: unspecified
+  language: python
+  framework: numpy-scipy-statsmodels-pymc
 ---

-# [[Statistics & Data Analysis|Statistics & Data Analysis]]
+# Statistics & Data Analysis

-## 📌 One-line insight (The Karpathy Summary)
-> "An eye for seeing the truth through the noise in data: an intellectual weapon that collects, organizes, and analyzes the numbers of an uncertain world to uncover hidden patterns and ground decisions in logic."
+## One-line summary
+> **"Data lies; statistics catches it."** Statistics is the discipline of quantifying uncertainty and separating patterns from noise. Common production tooling as of 2026: Bayesian methods (PyMC, Stan), causal inference (DoWhy, EconML), and CUPED for variance reduction in A/B tests.

-## 📖 Structured knowledge (Synthesized Content)
-Statistics & Data Analysis is a scientific methodology for understanding and reasoning about phenomena through data in order to derive valuable insight.
+## Core concepts

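-1.  **Three areas of analysis**:
-    *   **Descriptive statistics**: summarize data and describe its characteristics (mean, standard deviation, distribution, etc.).
-    *   **Inferential statistics**: infer properties of a population from a sample and test hypotheses (p-value, confidence interval).
-    *   **Predictive analytics**: predict future outcomes from past data using machine learning and similar methods.
-2.  **Core workflow**:
-    *   Define the question -> collect data -> preprocess (cleaning) -> exploratory data analysis (EDA) -> model -> interpret and visualize results.
-3.  **Relationship to data science**:
-    *   Statistics is the root; combined with the computing power of computer science and domain knowledge, it becomes modern data science.
+### Core dichotomy
+- **Frequentist**: p-values, confidence intervals; probability as long-run frequency.
+- **Bayesian**: posteriors, credible intervals; probability as belief updated by data.
+- **2026 trend**: Bayesian methods are increasingly common in production analytics (interpretable posteriors, more natural handling of sequential looks).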

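-## ⚠️ Contradictions & Updates
- **Conflict with earlier data**: inference from small samples used to be central, but the paradigm has shifted toward computational statistics that handles entire 'Big Data' sets and toward 'causal inference', which looks for causes beyond correlation (RL Update).
- **Policy change (RL Update)**: as 'data-driven decision making' has become a baseline requirement for all public and private policy, establishing a 'data reliability verification standard' to ensure the reproducibility and transparency of analysis results has become an urgent policy task.
+### Must-know toolkit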
+- **Hypothesis tests**: t-test, Mann-Whitney, χ², Fisher exact.
+- **Regression**: OLS, GLM (logistic, Poisson), mixed-effects.
+- **Causal**: difference-in-differences, IV, RDD, synthetic control.
+- **A/B**: CUPED, sequential testing (mSPRT), multi-armed bandits.

-## 🔗 Knowledge links (Graph)
- [[Probability Theory|Probability Theory]], [[Quantitative Economics (수량경제학)|Quantitative Economics (수량경제학)]], [[Sensitivity-Analysis|Sensitivity-Analysis]], [[Signal in Noise|Signal in Noise]], [[Philosophy|Philosophy]] of Science
- **Modern Tech/Tools**: R, Python (Pandas/Scipy), Tableau, Google BigQuery.
---
+### Applications
+1. Product A/B testing (CUPED + sequential).
+2. SRE: anomaly detection on metrics.
+3. Risk scoring of SAST/SCA findings (Bayesian prior).

-## 🤖 LLM usage hints (How to Use This Knowledge)
+## 💻 Patterns

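-**When to use this knowledge:**
- *(TODO)*
-
-**When not to use it:**
- *(TODO)*
-
-## 🧪 Validation status (Validation)
-
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
-
-## 🧬 Duplicate check (Duplicate Check)
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: updated the old template and filled in missing fields.
-
-## 🕓 Change history (Changelog)
-
-| Date | Change | Handling | Trust level |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code patterns (Code Patterns)
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*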
-
-```text
-# TODO
+### Welch t-test (A/B)
+```python
+import numpy as np
+from scipy import stats
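+control = np.array([...])    # placeholder: per-user metric, control arm
+treatment = np.array([...])  # placeholder: per-user metric, treatment arm
+# Welch's t-test does not assume equal variances; argument order matches Δ below
+t, p = stats.ttest_ind(treatment, control, equal_var=False)
+delta = treatment.mean() - control.mean()
+# Welch standard error and Welch-Satterthwaite degrees of freedom
+v_c = control.var(ddof=1) / len(control)
+v_t = treatment.var(ddof=1) / len(treatment)
+se = np.sqrt(v_c + v_t)
+dof = (v_c + v_t)**2 / (v_c**2 / (len(control) - 1) + v_t**2 / (len(treatment) - 1))
+ci = stats.t.interval(0.95, dof, loc=delta, scale=se)
+print(f"Δ={delta:.4f}, p={p:.4f}, 95%CI={ci}")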
 ```

-## 🤔 Decision criteria (Decision Criteria)
+### CUPED variance reduction
+```python
+import numpy as np
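+def cuped_adjust(y_pre, y_post, theta, pre_mean):
+    # Subtract the part of y_post explained by the pre-experiment covariate
+    return y_post - theta * (y_pre - pre_mean)
+# pre_c, post_c, pre_t, post_t: per-user arrays, assumed already loaded
+pre = np.concatenate([pre_c, pre_t])
+post = np.concatenate([post_c, post_t])
+# Pooled theta and pooled mean; per-arm centering would leave the estimate unchanged
+theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
+y_adj_c = cuped_adjust(pre_c, post_c, theta, pre.mean())
+y_adj_t = cuped_adjust(pre_t, post_t, theta, pre.mean())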
+```

-**When to choose option A:**
- *(TODO)*
+### Bayesian A/B (PyMC)
+```python
+import pymc as pm
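+# n_a, n_b: users per arm; k_a, k_b: conversions per arm (assumed already defined)
+with pm.Model() as m:
+    p_a = pm.Beta('p_a', 1, 1)  # uniform prior on each conversion rate
+    p_b = pm.Beta('p_b', 1, 1)
+    pm.Binomial('obs_a', n=n_a, p=p_a, observed=k_a)
+    pm.Binomial('obs_b', n=n_b, p=p_b, observed=k_b)
+    pm.Deterministic('lift', (p_b - p_a) / p_a)  # relative lift of B over A
+    idata = pm.sample(2000, tune=1000)
+# Share of posterior draws in which B beats A
+print(f"P(B>A) = {(idata.posterior['lift']>0).mean().item():.3f}")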
+```

-**When to choose option B:**
- *(TODO)*
+### Sequential testing (mSPRT)
+```python
+import numpy as np
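+def msprt(x, y, sigma2_tau=0.01, alpha=0.05):
+    # x, y: control and treatment observations, paired by arrival order
+    # sigma2_tau: variance of the normal mixing prior on the true effect
+    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
+    n = min(len(x), len(y))
+    delta = y[:n] - x[:n]
+    s2 = delta.var(ddof=1)
+    t = delta.mean() * np.sqrt(n)
+    # Mixture likelihood ratio against H0: mean(delta) = 0
+    lr = np.sqrt(s2/(s2+n*sigma2_tau)) * np.exp(
+        n*sigma2_tau*t**2 / (2*s2*(s2+n*sigma2_tau)))
+    # Reject H0 once the ratio exceeds 1/alpha; re-evaluate as new data arrives
+    return lr > 1/alpha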
+```

-**Default:**
-> *(TODO)*
+### Causal — difference-in-differences (statsmodels)
+```python
+import statsmodels.formula.api as smf
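+# Assumes treated and post are 0/1 columns; their main effects are absorbed
+# by the unit and time fixed effects, so only the interaction is included.
+m = smf.ols('y ~ treated:post + C(unit) + C(time)', data=df).fit(
+    cov_type='cluster', cov_kwds={'groups': df['unit']})
+print(m.params['treated:post'])  # DiD estimate of the treatment effect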
+```

-## ❌ Anti-patterns (Anti-Patterns)
+### Anomaly — robust z (MAD)
+```python
+import numpy as np
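+def mad_z(x):
+    # Modified z-score: median and MAD resist outliers, unlike mean and std
+    med = np.median(x)
+    mad = np.median(np.abs(x - med))
+    # 0.6745 rescales MAD to the std of a normal; 1e-9 avoids division by zero
+    return 0.6745 * (x - med) / (mad + 1e-9)
+# latency_p99: 1-D array of metric samples, assumed already loaded
+anomalies = np.abs(mad_z(latency_p99)) > 3.5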
+```

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+## Decision criteria
+| Situation | Method |
+|---|---|
+| 2-arm online experiment, fixed N | Welch t-test + CUPED |
+| sequential monitoring / peeking risk | mSPRT or Bayesian |
+| many arms, exploration value | Thompson sampling bandit |
+| observational, treatment effect | DiD / IV / synthetic control |
+| heavy-tailed (revenue) | Mann-Whitney + bootstrap CI |
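
The "many arms" row can be illustrated with a Beta-Bernoulli Thompson sampling loop. This is a minimal sketch under stated assumptions, not a production bandit: the function name `thompson_sampling`, the `true_rates` argument, and the simulated rewards are all hypothetical and exist only to drive the simulation.

```python
import numpy as np

def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit.

    true_rates: hypothetical per-arm conversion rates, used only to
    simulate rewards. Returns (pulls, successes) per arm.
    """
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.zeros(k)
    failures = np.zeros(k)
    for _ in range(n_rounds):
        # Draw one sample per arm from its Beta(1 + s, 1 + f) posterior
        samples = rng.beta(1 + successes, 1 + failures)
        # Play the arm whose sampled rate is highest
        arm = int(np.argmax(samples))
        reward = float(rng.random() < true_rates[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
    return successes + failures, successes

pulls, wins = thompson_sampling([0.04, 0.05, 0.10])
print("pulls per arm:", pulls)
```

Traffic drifts toward the better arm as its posterior sharpens, which is the exploration value the table refers to. A fixed-N test would keep the split even for the whole run.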
+
+**Default**: Welch + CUPED for online A/B; Bayesian for small-N or peeking; bootstrap for non-Gaussian.
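
For the bootstrap default on non-Gaussian metrics, a percentile bootstrap of the difference in means might look like the sketch below. The helper name `bootstrap_diff_ci` and the lognormal "revenue" data are assumptions made for illustration; real revenue data and a different interval type (for example BCa) may be more appropriate.

```python
import numpy as np

def bootstrap_diff_ci(control, treatment, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = np.random.default_rng(seed)
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each arm independently, with replacement
        c = rng.choice(control, size=control.size, replace=True)
        t = rng.choice(treatment, size=treatment.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    lo, hi = np.quantile(diffs, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Synthetic heavy-tailed revenue, for illustration only
rng = np.random.default_rng(1)
control = rng.lognormal(mean=0.0, sigma=1.0, size=2000)
treatment = rng.lognormal(mean=0.1, sigma=1.0, size=2000)
lo, hi = bootstrap_diff_ci(control, treatment)
print(f"95% CI for the difference in means: [{lo:.3f}, {hi:.3f}]")
```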
+
+## 🔗 Graph
+- Parent: [[Mathematics]] · [[Probability Theory]]
+- Variants: [[Bayesian Statistics]] · [[Causal Inference]]
+- Applications: [[A/B Testing]] · [[Anomaly Detection]] · [[ML Evaluation]]
+- Adjacent: [[PyMC]] · [[statsmodels]] · [[DoWhy]]
+
+## 🤖 LLM usage
+**When to use**: experiment design review, interpreting p-values, choosing a test for the distribution shape, generating PyMC models from descriptions.
+**When not to use**: trusting LLM-computed p-values without verification; LLMs make arithmetic mistakes.
+
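+## ❌ Anti-patterns
+- **Peeking**: checking a fixed-N test daily and stopping at the first significant result can inflate the false positive rate from the nominal 5% to 30% or more.
+- **HARKing**: hypothesizing after the results are known.
+- **p<0.05 worship**: ignoring effect size.
+- **Ignoring multiple testing**: with 20 metrics at α=0.05, about 1 false positive is expected even when there is no true effect.
+- **Post-treatment covariate in CUPED**: invalidates the adjustment; the covariate must be measured before treatment.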
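
One way to address the multiple-testing item is a false discovery rate correction. The sketch below implements the Benjamini-Hochberg step-up procedure by hand; the function name `benjamini_hochberg` and the p-values are hypothetical, and a maintained library routine would normally be preferable to hand-rolled code.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest p-value with its threshold alpha * i / m
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that passes
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for 8 metrics from one experiment
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(pvals)
print(rejected)
```

With these made-up values, a naive p<0.05 cut would flag five metrics, while the corrected procedure keeps only the two smallest.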
+
+## 🧪 Validation / Duplicates
+- Verified against the Microsoft CUPED paper (2013), Optimizely Stats Engine, Gelman BDA3, and Wasserman All of Stats.
+- Trust level: A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup — A/B + Bayesian + causal patterns |