---
id: wiki-2026-0508-regression-analysis-foundations
title: Regression Analysis Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Linear Regression, OLS, GLM Foundations, Regression]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [regression, statistics, ml, foundations]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: statsmodels/scikit-learn
---

# Regression Analysis Foundations

## In One Line

> **"An estimate of the functional form of the conditional expectation."** A spectrum that originates with Galton and Pearson and spans OLS, GLM, regularized, robust, quantile, and mixed-effects models; the baseline and interpretability tool of modern ML.

## Core Concepts

### OLS basics

- **Model**: y = Xβ + ε, ε ~ 𝓝(0, σ²I).
- **Closed form**: β̂ = (XᵀX)⁻¹Xᵀy.
- **Gauss-Markov**: OLS is BLUE under assumptions 1-5 (linearity, exogeneity, no perfect multicollinearity, homoskedasticity, no autocorrelation).
- **Inference**: t-tests, F-test, R², adjusted R².

### Extensions

- **GLM**: g(E[y]) = Xβ, covering logistic, Poisson, Gamma, and negative binomial (NB) models.
- **Regularized**: Ridge (L2), Lasso (L1), Elastic Net.
- **Robust**: Huber, RANSAC (outlier resistant).
- **Quantile**: estimates a conditional quantile rather than the conditional mean.
- **Mixed-effects**: random + fixed effects, for clustered / hierarchical data.

### Diagnostics

- Residual plots (linearity, homoskedasticity).
- QQ-plot (normality).
- VIF (multicollinearity).
- Cook's distance (influence).
- Durbin-Watson (autocorrelation).

### Applications

1. Pricing / demand modeling.
2. A/B test analysis (regression-adjusted).
3. Causal inference (with assumptions).
4. ML baseline before deep models.
5. Ablation in scientific research.
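
The closed form β̂ = (XᵀX)⁻¹Xᵀy from the OLS basics can be checked numerically. The following is a minimal sketch, not part of the original note: the helper name `ols_closed_form` and the synthetic data (intercept 1.0, slopes 2.0 and -3.0) are illustrative assumptions. It solves the normal equations directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

def ols_closed_form(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # solve() avoids forming the explicit inverse of X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (hypothetical): y = 1.0 + 2.0*x1 - 3.0*x2 + small Gaussian noise
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = ols_closed_form(X, y)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # reference solution

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```

In practice a QR- or SVD-based solver such as `np.linalg.lstsq` is usually preferred over the normal equations, because XᵀX squares the condition number of X; the explicit form is shown here only to mirror the formula.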
## 💻 Patterns

### Statsmodels OLS with full inference

```python
import pandas as pd
import statsmodels.api as sm

# df: a DataFrame with columns x1, x2, x3 and y
X = sm.add_constant(df[["x1", "x2", "x3"]])     # add the intercept column
model = sm.OLS(df["y"], X).fit(cov_type="HC3")  # HC3 heteroskedasticity-robust SEs
print(model.summary())
```

### Sklearn pipeline (Ridge + CV + scaling)

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5),
)
pipe.fit(X, y)
print(pipe[-1].alpha_)  # penalty strength selected by cross-validation
```

### Lasso path / feature selection

```python
from sklearn.linear_model import LassoCV

# X is assumed to be a DataFrame so that X.columns is available
lasso = LassoCV(cv=5, max_iter=20000).fit(X, y)
selected = X.columns[lasso.coef_ != 0]  # features with non-zero coefficients
```

### Logistic regression (GLM)

```python
import numpy as np
import statsmodels.api as sm

m = sm.Logit(y, sm.add_constant(X)).fit()
print(m.summary())
print(np.exp(m.params))  # odds ratios
```

### Poisson / Negative Binomial

```python
import statsmodels.api as sm

m_poi = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
m_nb = sm.GLM(y, sm.add_constant(X), family=sm.families.NegativeBinomial(alpha=1.0)).fit()
# Dispersion check: Pearson chi²/df ≈ 1 → Poisson is adequate; well above 1 points to overdispersion (prefer NB).
print(m_poi.pearson_chi2 / m_poi.df_resid)
```

### Quantile regression

```python
import statsmodels.api as sm

m50 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)  # conditional median
m90 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.9)  # conditional 90th percentile
```

### Mixed-effects (lme4-style via statsmodels)

```python
import statsmodels.formula.api as smf

# Random intercept per school; fixed effects for x1 and x2
m = smf.mixedlm("y ~ x1 + x2", df, groups=df["school_id"]).fit()
```

### Diagnostics: VIF + Cook's distance

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X: design matrix (DataFrame); model: a fitted statsmodels OLS result
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif = pd.Series(vifs, index=X.columns)
infl = model.get_influence()
cooks = infl.cooks_distance[0]  # element 0 holds the distances, element 1 the p-values
```

### Bootstrap CI for coefficients

```python
import numpy as np

def boot(X, y, n=2000, seed=0):
    """Percentile bootstrap 95% CI for OLS coefficients (resampling rows)."""
    X, y = np.asarray(X), np.asarray(y)  # positional indexing needs arrays
    rng = np.random.default_rng(seed)    # fresh generator on every call
    coefs = []
    for _ in range(n):
        idx = rng.integers(0, len(y), len(y))  # sample rows with replacement
        b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        coefs.append(b)
    return np.percentile(coefs, [2.5, 97.5], axis=0)
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Small p, inference focus | OLS + statsmodels |
| Large p, multicollinearity | Ridge / Elastic Net |
| Sparse feature selection | Lasso |
| Binary / count outcome | Logistic / Poisson / NB |
| Outliers / heavy tails | Huber / RANSAC / Quantile |
| Grouped / nested data | Mixed-effects |
| Nonlinear smooth | GAM (pyGAM) |

**Default**: statsmodels OLS + HC3 robust SE → diagnostics → escalate (Ridge/GLM/quantile).

## 🔗 Graph

- Parents: [[Statistics]] · [[Probability Theory]] · [[Linear-Algebra-Foundations]]
- Variants: [[Ridge-Regression]] · [[Logistic Regression]]
- Applications: [[Multivariate-Analysis]] · [[Statistical-Power]] · [[Decision Theory]]
- Adjacent: [[Least-Squares-Methods]] · [[Expectation-Maximization]] · [[Kernel-Density-Estimation-KDE]]

## 🤖 LLM Usage

**When**: interpretability and uncertainty quantification are the priority; try regression first, before a deep model. Also use it to establish a baseline.

**When not**: high-dimensional image / text / audio data, where deep features are superior.

## ❌ Anti-patterns

- **Unverified assumptions**: trusting the inference while the Gauss-Markov assumptions are violated.
- **Over-fitting in pursuit of R²**: throwing in predictors to raise R²; use adjusted R² or cross-validation instead.
- **Missing standardization in regularized models**: the penalty depends on feature units, so scale features first.
- **Ignoring multicollinearity**: coefficients become unstable; check VIF.
- **Causal claim from observational regression**: confounders go unaddressed without a DAG, IV, RD, or DiD design.

## 🧪 Verification / Duplicates

- Verified (Hastie *ESL* 2e Ch3; Wooldridge *Econometrics* 7e; Gelman & Hill *Data Analysis Using Regression*).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: full OLS/GLM/regularized/robust/quantile/mixed spec |