2nd/10_Wiki/Topics/Computer_Science_and_Theory/Regression-Analysis-Foundations.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
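Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / incomplete stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link name normalization: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections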

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-regression-analysis-foundations
title: Regression Analysis Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Linear Regression, OLS, GLM Foundations, Regression
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: regression, statistics, ml, foundations
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: statsmodels/scikit-learn

Regression Analysis Foundations

In one line

"Estimating the functional form of the conditional expectation." Starting from Galton and Pearson, the field spans OLS, GLM, regularized, robust, quantile, and mixed-effects models. It serves modern ML as both a baseline and an interpretability tool.

Key points

OLS basics

  • Model: y = Xβ + ε, ε ~ 𝓝(0, σ²I).
  • Closed form: β̂ = (XᵀX)⁻¹Xᵀy.
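  • Gauss-Markov: OLS is BLUE (best linear unbiased estimator) under five assumptions: linearity, exogeneity, no perfect multicollinearity, homoskedasticity, no autocorrelation. Normality of ε is not needed for BLUE; it is needed only for exact finite-sample t and F inference.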
  • Inference: t-tests, F-test, R², adj-R².
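The closed form above can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are illustrative assumptions, not part of the original note). It compares the normal-equations solution with `np.linalg.lstsq`, which is usually the safer route when XᵀX is ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Design matrix with an explicit intercept column (synthetic, illustrative data).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver (SVD-based), preferred when X'X is close to singular.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_normal)
print(beta_lstsq)
```

Both routes should agree to numerical precision on well-conditioned data; they diverge only when columns of X are nearly collinear.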

Extensions

  • GLM: g(E[y]) = Xβ — logistic, Poisson, Gamma, NB.
  • Regularized: Ridge (L2), Lasso (L1), Elastic Net.
  • Robust: Huber, RANSAC; resistant to outliers.
  • Quantile: estimates conditional quantiles rather than the conditional mean.
  • Mixed-effects: random plus fixed effects, for clustered / hierarchical data.
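To make the GLM bullet concrete, here is a minimal sketch of fitting a logistic regression (logit link) with Newton/IRLS updates in plain NumPy. The function name `fit_logistic_irls` and the synthetic data are assumptions made for illustration; in practice a library such as statsmodels would do the fitting.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-10):
    """Fit a logit-link GLM by Newton / IRLS steps (illustrative sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                    # linear predictor X beta
        mu = 1.0 / (1.0 + np.exp(-eta))   # inverse link: E[y | X]
        w = mu * (1.0 - mu)               # Bernoulli variance = IRLS weights
        grad = X.T @ (y - mu)             # score vector
        hess = X.T @ (X * w[:, None])     # Fisher information X' W X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)

beta_hat = fit_logistic_irls(X, y)
print(beta_hat)
```

For the canonical logit link, Newton's method and Fisher scoring coincide, which is why a single weighted least-squares style update is enough here.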

Diagnostics

  • Residual plots (linearity, homosked).
  • QQ-plot (normality).
  • VIF (multicollinearity).
  • Cook's distance (influence).
  • Durbin-Watson (autocorrelation).
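As an illustration of the Durbin-Watson item, the statistic can be computed directly from residuals. This is a sketch on simulated series (the helper name `durbin_watson_stat`, the AR coefficient, and the seed are assumptions); statsmodels also ships `statsmodels.stats.stattools.durbin_watson` for the same quantity.

```python
import numpy as np

def durbin_watson_stat(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(2)
n = 2000
white = rng.normal(size=n)      # uncorrelated residuals

# AR(1) residuals with positive autocorrelation (rho = 0.8, illustrative).
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson_stat(white)   # near 2 when there is no autocorrelation
dw_ar1 = durbin_watson_stat(ar1)       # well below 2 under positive autocorrelation
print(dw_white, dw_ar1)
```

The statistic is roughly 2(1 − ρ), where ρ is the lag-1 autocorrelation of the residuals, so values far from 2 in either direction are the warning sign.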

Applications

  1. Pricing / demand modeling.
  2. A/B test analysis (regression-adjusted).
  3. Causal inference (with assumptions).
  4. ML baseline before deep models.
  5. Ablation in scientific research.

💻 Patterns

Statsmodels OLS with full inference

import statsmodels.api as sm
import pandas as pd

X = sm.add_constant(df[["x1","x2","x3"]])
model = sm.OLS(df["y"], X).fit(cov_type="HC3")   # HC3 heteroskedasticity-robust standard errors
print(model.summary())
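For intuition about what `cov_type="HC3"` changes, here is a NumPy sketch of the HC3 sandwich estimator next to the classical covariance. The helper `ols_hc3` and the heteroskedastic test data are assumptions for illustration, not statsmodels internals.

```python
import numpy as np

def ols_hc3(X, y):
    """Return OLS coefficients, HC3 standard errors, and classical standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverages h_ii = x_i' (X'X)^{-1} x_i (diagonal of the hat matrix).
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    # HC3 rescales each squared residual by 1 / (1 - h_ii)^2.
    omega = (resid / (1.0 - h)) ** 2
    meat = X.T @ (X * omega[:, None])
    cov_hc3 = XtX_inv @ meat @ XtX_inv
    # Classical covariance assumes one common error variance.
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov_classical = sigma2 * XtX_inv
    return beta, np.sqrt(np.diag(cov_hc3)), np.sqrt(np.diag(cov_classical))

rng = np.random.default_rng(6)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with |x|, so homoskedasticity is violated.
y = 1.0 + 2.0 * x + x * rng.normal(size=n)

beta, se_hc3, se_classical = ols_hc3(X, y)
print(beta, se_hc3, se_classical)
```

In this setup the classical slope standard error is too small, while the HC3 one reflects the extra noise at large |x|.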

Sklearn pipeline (Ridge + CV + scaling)

import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),
                     RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5))
pipe.fit(X, y)
print(pipe[-1].alpha_)

Lasso path / feature selection

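from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
# Standardize first: the L1 penalty is scale-dependent. X is assumed to be a DataFrame.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5, max_iter=20000).fit(Xs, y)
selected = X.columns[lasso.coef_ != 0]   # features with non-zero coefficients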

Logistic regression (GLM)

import numpy as np
import statsmodels.api as sm
m = sm.Logit(y, sm.add_constant(X)).fit()
print(m.summary())
print(np.exp(m.params))   # exponentiated coefficients = odds ratios

Poisson / Negative Binomial

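m_poi = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
# The GLM NegativeBinomial family holds alpha fixed (here 1.0) rather than estimating it.
m_nb  = sm.GLM(y, sm.add_constant(X), family=sm.families.NegativeBinomial(alpha=1.0)).fit()
# Dispersion check: Pearson chi²/df ≈ 1 suggests Poisson is adequate; values well above 1 indicate overdispersion.
print(m_poi.pearson_chi2 / m_poi.df_resid)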
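The dispersion check can be illustrated without fitting a full model. The sketch below assumes an intercept-only Poisson model, for which the fitted mean is simply the sample mean; the helper `pearson_dispersion` and the simulated counts are illustrative assumptions.

```python
import numpy as np

def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom (Poisson variance = mu)."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    chi2 = np.sum((y - mu) ** 2 / mu)
    return chi2 / (len(y) - n_params)

rng = np.random.default_rng(3)
n = 5000

# Equidispersed counts: variance equals the mean (4).
y_pois = rng.poisson(lam=4.0, size=n)
# Overdispersed counts: negative binomial with mean 4 and variance 12.
y_nb = rng.negative_binomial(n=2, p=1.0 / 3.0, size=n)

# Intercept-only Poisson fit: the MLE of the mean is the sample mean.
disp_pois = pearson_dispersion(y_pois, np.full(n, y_pois.mean()), n_params=1)
disp_nb = pearson_dispersion(y_nb, np.full(n, y_nb.mean()), n_params=1)
print(disp_pois, disp_nb)
```

A ratio near 1 is consistent with the Poisson variance assumption; a ratio well above 1, as for the negative binomial sample, points toward NB or a quasi-Poisson correction.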

Quantile regression

m50 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)
m90 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.9)
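Quantile regression minimizes the pinball (check) loss instead of squared error. As a minimal sketch of why that targets a quantile, the constant that minimizes the pinball loss at level q is the empirical q-quantile. The helper name `pinball_loss`, the grid, and the simulated data are assumptions for illustration.

```python
import numpy as np

def pinball_loss(y, pred, q):
    """Mean check loss: weight q on under-predictions, (1 - q) on over-predictions."""
    diff = np.asarray(y) - pred
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=4000)   # skewed data, so mean and quantiles differ

q = 0.9
grid = np.linspace(0.0, 5.0, 501)           # candidate constant predictions
losses = [pinball_loss(y, c, q) for c in grid]
best_const = grid[int(np.argmin(losses))]

empirical_q = np.quantile(y, q)
print(best_const, empirical_q)
```

`QuantReg` applies the same loss with a linear predictor in place of the constant, which is why `q=0.5` gives a median regression.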

Mixed-effects (lme4-style via statsmodels)

import statsmodels.formula.api as smf
m = smf.mixedlm("y ~ x1 + x2", df, groups=df["school_id"]).fit()

Diagnostics — VIF + Cook's distance

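import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
# X: design matrix (DataFrame) including the constant; the VIF reported for "const" is not meaningful.
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
infl = model.get_influence()      # model: a fitted OLS result, as in the first pattern
cooks = infl.cooks_distance[0]    # element [0] holds the distances, [1] the p-values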

Bootstrap CI for coefficients

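import numpy as np
def boot(X, y, n=2000, seed=0):
    """Pairs-bootstrap 95% percentile CI for OLS coefficients (X must already include the intercept column)."""
    X, y = np.asarray(X), np.asarray(y)   # allows DataFrame / Series input
    rng = np.random.default_rng(seed)     # created per call, not as a shared default argument
    coefs = []
    for _ in range(n):
        idx = rng.integers(0, len(y), len(y))   # resample rows with replacement
        b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        coefs.append(b)
    return np.percentile(coefs, [2.5, 97.5], axis=0)   # shape (2, n_coefs)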

Decision criteria

| Situation | Approach |
|---|---|
| Small p, inference focus | OLS + statsmodels |
| Large p, multicollinearity | Ridge / Elastic Net |
| Sparse feature selection | Lasso |
| Binary / count outcome | Logistic / Poisson / NB |
| Outliers / heavy tails | Huber / RANSAC / Quantile |
| Grouped / nested data | Mixed-effects |
| Nonlinear smooth | GAM (pyGAM) |

Default: statsmodels OLS with HC3 robust SE → diagnostics → escalate as needed (Ridge/GLM/quantile).

🔗 Graph

🤖 LLM usage

When to use: interpretability and uncertainty quantification are the priority. Try regression before a deep model, and use it to establish a baseline.
When not to use: high-dimensional image, text, or audio data, where learned deep features are superior.

Anti-patterns

  • Unverified assumptions: trusting inference while the Gauss-Markov assumptions are violated.
  • Overfitting by chasing R²: adding predictors never lowers in-sample R²; use adjusted R² or cross-validation instead.
  • Missing standardization in regularized models: the penalty depends on feature units.
  • Ignoring multicollinearity: coefficients become unstable; check VIF.
  • Causal claims from observational regression: confounders remain unless handled with a DAG, IV, RD, or DiD design.
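The R² anti-pattern is easy to reproduce. In this sketch (synthetic data; the helper `r2_scores` is an assumption for illustration), 20 pure-noise predictors are added to a one-variable model. In-sample R² cannot go down, while adjusted R² penalizes the extra parameters.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R², adjusted R²) for an OLS fit; X must include the intercept column."""
    n, k = X.shape                      # k counts the intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return r2, adj

rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
# Append 20 predictors that are pure noise and unrelated to y.
X_big = np.column_stack([X_small, rng.normal(size=(n, 20))])

r2_small, adj_small = r2_scores(X_small, y)
r2_big, adj_big = r2_scores(X_big, y)
print(r2_small, adj_small)
print(r2_big, adj_big)
```

Cross-validated error is the more direct check when the goal is prediction rather than in-sample fit.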

🧪 Verification / duplicates

  • Verified (Hastie ESL 2e Ch3; Wooldridge Econometrics 7e; Gelman & Hill Data Analysis Using Regression).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full OLS/GLM/regularized/robust/quantile/mixed spec |