Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

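Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links straight at redirect targets, mapped concepts (~2,400 cases)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections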

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

Regression Analysis Foundations

One-line summary

"Estimating the functional form of the conditional expectation." Originating with Galton and Pearson, regression spans a spectrum of OLS, GLM, regularized, robust, quantile, and mixed-effects models, and serves modern ML as both a baseline and an interpretability tool.

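Core concepts

OLS basics

Model: y = Xβ + ε, ε ~ 𝓝(0, σ²I).
Closed form: β̂ = (XᵀX)⁻¹Xᵀy.
Gauss-Markov: OLS is BLUE under assumptions 1-5 (linearity, exogeneity, no perfect multicollinearity, homoskedasticity, no autocorrelation); normality of ε is not required for BLUE.
Inference: t-tests, F-test, R², adj-R² (the t- and F-tests are exact in finite samples under the normal-error model above).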
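The closed form and the inference quantities above can be sketched directly in NumPy. This is a minimal illustration on synthetic data, not a replacement for statsmodels; the helper name `ols_fit` and the simulated coefficients are assumptions made for the example.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients, classical standard errors, and R^2.

    X must already contain an intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Solve the normal equations (X'X) beta = X'y without an explicit inverse.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)            # unbiased error-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # homoskedastic covariance of beta-hat
    se = np.sqrt(np.diag(cov))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, se, r2

# Synthetic data (illustrative only): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.5, size=n)

beta, se, r2 = ols_fit(X, y)
t_stats = beta / se                             # t-statistics for H0: beta_j = 0
print(beta, se, r2)
```

Solving the normal equations is fine for well-conditioned X; with near-collinear columns, `np.linalg.lstsq` (QR/SVD based) is the numerically safer choice.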

Extensions

GLM: g(E[y]) = Xβ, covering logistic, Poisson, Gamma, NB.
Regularized: Ridge (L2), Lasso (L1), Elastic Net.
Robust: Huber, RANSAC (outlier-resistant).
Quantile: estimates conditional quantiles.
Mixed-effects: random + fixed effects, for clustered / hierarchical data.
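To make the "outlier-resistant" claim for Huber regression concrete, here is a rough sketch of a Huber M-estimator fitted by iteratively reweighted least squares (IRLS). The function name `huber_irls`, the MAD-based scale estimate, the tuning constant 1.345, and the contaminated synthetic data are all assumptions of this example; in practice a library implementation such as scikit-learn's `HuberRegressor` would normally be used.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=100, tol=1e-8):
    """Huber M-estimate of regression coefficients via IRLS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust residual scale: median absolute deviation, rescaled for normal errors.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        # Huber weights: 1 inside the threshold, delta/|u| outside it.
        w = np.minimum(1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data: true line y = 1 + 2x, with the 10 largest-x points shifted upward.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
outliers = np.argsort(x)[-10:]
y[outliers] += 25.0

X = np.column_stack([np.ones(n), x])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = huber_irls(X, y)
print("OLS slope:", beta_ols[1], "Huber slope:", beta_huber[1])
```

The outliers sit at high-leverage positions, so they pull the OLS slope well away from 2, while the Huber weights bound each point's influence on the fit.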

Diagnostics

Residual plots (linearity, homoskedasticity).
QQ-plot (normality).
VIF (multicollinearity).
Cook's distance (influence).
Durbin-Watson (autocorrelation).
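As one worked instance of the diagnostics listed above, the Durbin-Watson statistic can be computed by hand from residuals. The sketch below is an illustration on simulated series, with the AR(1) coefficient 0.8 chosen arbitrarily; statsmodels ships an equivalent `durbin_watson` function in `statsmodels.stats.stattools`.

```python
import numpy as np

def durbin_watson_stat(resid):
    """DW = sum of squared successive differences / sum of squared residuals."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
n = 2000
white = rng.normal(size=n)            # uncorrelated residuals

# AR(1) residuals with positive autocorrelation (rho = 0.8, an arbitrary choice).
ar1 = np.empty(n)
ar1[0] = white[0]
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)
print(dw_white, dw_ar1)
```

Since DW is approximately 2(1 - ρ) for lag-1 autocorrelation ρ, values near 2 suggest no first-order autocorrelation, and values well below 2 suggest positive autocorrelation.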

Applications

Pricing / demand modeling.
A/B test analysis (regression-adjusted).
Causal inference (with assumptions).
ML baseline before deep models.
Ablation in scientific research.
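The regression-adjusted A/B analysis mentioned in the list can be illustrated with a small simulation: including a pre-experiment covariate in the regression leaves the treatment-effect estimate unbiased under randomization while shrinking its standard error. Everything here (the helper `ols_coef_se`, the true effect of 0.5, the covariate strength) is a made-up setup for illustration.

```python
import numpy as np

def ols_coef_se(X, y):
    """OLS coefficients and classical standard errors (X includes an intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    return beta, np.sqrt(np.diag(sigma2 * xtx_inv))

rng = np.random.default_rng(2)
n = 4000
treat = rng.integers(0, 2, size=n).astype(float)   # randomized assignment
pre = rng.normal(size=n)                           # pre-experiment covariate
y = 0.5 * treat + 2.0 * pre + rng.normal(size=n)   # true treatment effect = 0.5

ones = np.ones(n)
b_unadj, se_unadj = ols_coef_se(np.column_stack([ones, treat]), y)
b_adj, se_adj = ols_coef_se(np.column_stack([ones, treat, pre]), y)

effect_unadj, effect_adj = b_unadj[1], b_adj[1]
print("unadjusted:", effect_unadj, "+/-", se_unadj[1])
print("adjusted:  ", effect_adj, "+/-", se_adj[1])
```

The unadjusted regression on the treatment indicator alone is the plain difference in means; the adjusted one removes the outcome variance explained by the covariate, which is where the precision gain comes from.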

💻 Patterns

Statsmodels OLS with full inference

import statsmodels.api as sm
import pandas as pd

X = sm.add_constant(df[["x1","x2","x3"]])
model = sm.OLS(df["y"], X).fit(cov_type="HC3")   # HC3 heteroskedasticity-robust standard errors
print(model.summary())

Sklearn pipeline (Ridge + CV + scaling)

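import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Scale features first so the L2 penalty treats them equally.
pipe = make_pipeline(StandardScaler(),
                     RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5))
pipe.fit(X, y)
print(pipe[-1].alpha_)   # penalty strength selected by cross-validation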

Lasso path / feature selection

from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5, max_iter=20000).fit(X, y)
selected = X.columns[lasso.coef_ != 0]   # X must be a pandas DataFrame; keeps features with nonzero coefficients

Logistic regression (GLM)

import numpy as np
import statsmodels.api as sm
m = sm.Logit(y, sm.add_constant(X)).fit()
print(m.summary())
print(np.exp(m.params))   # odds ratios

Poisson / Negative Binomial

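m_poi = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
m_nb  = sm.GLM(y, sm.add_constant(X), family=sm.families.NegativeBinomial(alpha=1.0)).fit()
# Dispersion check: Pearson chi² / residual df ≈ 1 → Poisson is adequate; well above 1 suggests overdispersion (consider NB).
print(m_poi.pearson_chi2 / m_poi.df_resid)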

Quantile regression

m50 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)
m90 = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.9)

Mixed-effects (lme4-style via statsmodels)

import statsmodels.formula.api as smf
m = smf.mixedlm("y ~ x1 + x2", df, groups=df["school_id"]).fit()

Diagnostics — VIF + Cook's distance

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns)
infl = model.get_influence()
cooks = infl.cooks_distance[0]

Bootstrap CI for coefficients

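import numpy as np

def boot(X, y, n=2000, seed=0):
    """Percentile bootstrap (case resampling) 95% CI for OLS coefficients."""
    X, y = np.asarray(X), np.asarray(y)   # positional row indexing needs arrays
    rng = np.random.default_rng(seed)     # fresh generator per call, not a shared default argument
    coefs = []
    for _ in range(n):
        idx = rng.integers(0, len(y), len(y))   # resample rows with replacement
        b = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        coefs.append(b)
    return np.percentile(coefs, [2.5, 97.5], axis=0)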

Decision criteria

Situation	Approach
Small p, inference focus	OLS + statsmodels
Large p, multicollinearity	Ridge / Elastic Net
Sparse feature selection	Lasso
Binary / count outcome	Logistic / Poisson / NB
Outliers / heavy tails	Huber / RANSAC / Quantile
Grouped / nested data	Mixed-effects
Smooth nonlinear effects	GAM (pyGAM)

Default: statsmodels OLS + HC3 robust SE → diagnostics → escalate (Ridge/GLM/quantile).

🔗 Graph

Parent: Statistics · Probability Theory · Linear-Algebra-Foundations
Variants: Ridge-Regression · Logistic Regression
Applications: Multivariate-Analysis · Statistical-Power · Decision Theory
Adjacent: Least-Squares-Methods · Expectation-Maximization · Kernel-Density-Estimation-KDE

🤖 LLM usage

When to use: interpretability and uncertainty quantification are the priority; try regression first, before a deep model, and use it to establish a baseline.
When not to use: high-dimensional image / text / audio data, where deep features are superior.

❌ Anti-patterns

Unverified assumptions: trusting inference while the Gauss-Markov assumptions are violated.
Over-fitting by chasing R²: throwing in more predictors; use adjusted R² or cross-validation instead.
Skipping standardization in regularized models: the penalty depends on feature units.
Ignoring multicollinearity: coefficients become unstable; check VIF.
Causal claims from observational regression: confounders remain unaddressed without a DAG, IV, RD, or DiD design.
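The standardization point in the list above can be demonstrated numerically. The sketch below uses closed-form ridge with an unpenalized intercept and shows that re-expressing one feature in different units changes the fitted values unless features are standardized first. The helper names, the unit change, and `alpha=10.0` are assumptions of this example.

```python
import numpy as np

def ridge_fitted(X, y, alpha):
    """Fitted values from closed-form ridge with an unpenalized intercept."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return y.mean() + Xc @ beta

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n)

# Hypothetical unit change: second feature re-expressed in units 1000x larger.
X_rescaled = X * np.array([1.0, 0.001])

fit_raw = ridge_fitted(X, y, alpha=10.0)
fit_rescaled = ridge_fitted(X_rescaled, y, alpha=10.0)
fit_std_raw = ridge_fitted(standardize(X), y, alpha=10.0)
fit_std_rescaled = ridge_fitted(standardize(X_rescaled), y, alpha=10.0)

gap_unscaled = np.max(np.abs(fit_raw - fit_rescaled))        # large: penalty is unit-dependent
gap_scaled = np.max(np.abs(fit_std_raw - fit_std_rescaled))  # ~0: standardization removes units
print(gap_unscaled, gap_scaled)
```

After rescaling, the second feature needs a coefficient 1000 times larger to carry the same signal, so a fixed penalty shrinks it almost to zero; standardizing first makes the penalty act on comparable scales.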

🧪 Verification / duplicates

Verified (Hastie ESL 2e Ch3; Wooldridge Econometrics 7e; Gelman & Hill Data Analysis Using Regression).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — full OLS/GLM/regularized/robust/quantile/mixed spec
