---
id: wiki-2026-0508-linear-regression-mastery
title: Linear Regression Mastery
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Linear Regression, OLS, Ordinary Least Squares, Ridge, Lasso]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [machine-learning, regression, statistics, sklearn, ols, ridge, lasso]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn/statsmodels
---
# Linear Regression Mastery
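## In One Line
> **"The starting point of all ML: y = Xβ + ε."** Linear regression models a linear relationship between features and a target and estimates it by OLS. Its simplicity makes it strong on interpretability, speed, and as a baseline, and with regularization (Ridge/Lasso/Elastic Net) it holds up in high dimensions too. Even in 2026, a large share of production tabular ML (by this note's informal estimate, about half) is still linear.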
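## Core Concepts
### OLS Formulation
- Model: $y = X\beta + \varepsilon$.
- Objective: $\min_\beta \|y - X\beta\|^2$.
- Closed form: $\hat\beta = (X^TX)^{-1}X^Ty$ (when X has full column rank).
- Geometrically: the fitted values are the orthogonal projection of y onto the column space of X.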
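
As a minimal sketch of the closed form above, the following solves the normal equations on synthetic data and compares the result with NumPy's least-squares solver. The data, the seed, and the names `beta_true`, `beta_normal`, and `beta_lstsq` are illustrative assumptions, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))                       # synthetic full-rank design matrix
beta_true = np.array([2.0, -1.0, 0.5])            # assumed ground truth for the demo
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form via the normal equations: solve (X^T X) beta = X^T y.
# np.linalg.solve avoids forming the explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: SVD-based least squares, which also copes with rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Geometric view: the residual is orthogonal to every column of X.
residual = y - X @ beta_normal
print("normal equations:", beta_normal)
print("matches lstsq:", np.allclose(beta_normal, beta_lstsq))
print("residual orthogonal to X:", np.allclose(X.T @ residual, 0.0, atol=1e-8))
```

In practice `lstsq` (or a QR/SVD-based solver) is usually the safer choice, because forming $X^TX$ squares the condition number of the problem.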
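### Assumptions (LINE)
- **L**inearity: the relationship between y and X is linear.
- **I**ndependence: errors are independent (i.i.d.).
- **N**ormality: errors ~ N(0, σ²) (needed for exact inference in small samples).
- **E**qual variance (homoscedasticity): the error variance is constant.
- **Additional**: no multicollinearity (low correlation among the features in X).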
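
Two of the assumptions above can be checked numerically from the residuals. The sketch below is a rough illustration on synthetic data, not a formal test: `durbin_watson` and `spread_trend` are helper names introduced here, and the "near 2" and "near 0" readings are rules of thumb.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])              # intercept + one feature

def ols_residuals(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def durbin_watson(resid):
    # Independence check: near 2 when consecutive residuals are uncorrelated.
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def spread_trend(x, resid):
    # Equal-variance check: correlation between |residual| and x, near 0 if variance is constant.
    return np.corrcoef(x, np.abs(resid))[0, 1]

y_ok = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)        # constant noise
y_het = 3.0 + 2.0 * x + rng.normal(scale=0.3 * x, size=n)   # noise grows with x

r_ok, r_het = ols_residuals(X, y_ok), ols_residuals(X, y_het)
print("Durbin-Watson (independent noise):", durbin_watson(r_ok))
print("spread trend, constant noise:", spread_trend(x, r_ok))
print("spread trend, growing noise:", spread_trend(x, r_het))
```

For real analyses, established tests (for example Breusch-Pagan for heteroscedasticity) are preferable to these hand-rolled indicators.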
### Regularized Variants
- **Ridge (L2)**: $\min_\beta \|y-X\beta\|^2 + \lambda\|\beta\|_2^2$, which shrinks all coefficients toward zero.
- **Lasso (L1)**: $\min_\beta \|y-X\beta\|^2 + \lambda\|\beta\|_1$, which produces sparsity (feature selection).
- **Elastic Net**: combines the L1 and L2 penalties and handles groups of correlated features.
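
To make the two penalties concrete, here is a small sketch assuming standardized features and a centered target. `ridge_closed_form` implements the Ridge minimizer $(X^TX + \lambda I)^{-1}X^Ty$, and `soft_threshold` is the shrinkage operator that Lasso solvers apply coordinate by coordinate. Both function names and the synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize: penalties are scale-sensitive
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=n)
y = y - y.mean()                                  # centering removes the need for an intercept

def ridge_closed_form(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||_2^2.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def soft_threshold(z, t):
    # Shrinks z toward zero by t and sets anything within [-t, t] exactly to zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Ridge: the coefficient norm shrinks as lambda grows, but entries stay non-zero.
for lam in (0.0, 10.0, 1000.0):
    b = ridge_closed_form(X, y, lam)
    print(f"lambda={lam:>6}: ||beta||_2 = {np.linalg.norm(b):.3f}")

# Lasso building block: small entries become exactly zero, which is the source of sparsity.
print(soft_threshold(np.array([3.0, 0.4, -2.0, -0.1]), 0.5))
```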
### Diagnostics
- R² / Adjusted R²: explanatory power.
- RMSE / MAE: prediction error.
- VIF > 10: suspect multicollinearity.
- Residual plot: a visible pattern suggests nonlinearity.
- QQ plot: normality check.
- Cook's distance: high-influence outliers.
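
The numeric diagnostics above can be computed directly from their definitions. The following is a NumPy-only sketch on synthetic data in which `x2` is deliberately built as a near copy of `x1`; the helper names (`fit_ols`, `r2`, `adjusted_r2`, `vif`) are assumptions made for this example.

```python
import numpy as np

def fit_ols(X, y):
    # Least-squares fit with an intercept column prepended; returns (beta, fitted values).
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta, A @ beta

def r2(y, pred):
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(y, pred, p):
    # Penalizes R² for the number of features p.
    n = len(y)
    return 1.0 - (1.0 - r2(y, pred)) * (n - 1) / (n - p - 1)

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others.
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, pred_j = fit_ols(others, X[:, j])
        out.append(1.0 / (1.0 - r2(X[:, j], pred_j)))
    return np.array(out)

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)           # nearly a copy of x1, so collinear
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(scale=0.5, size=n)

_, pred = fit_ols(X, y)
print("R2:", r2(y, pred), "adj R2:", adjusted_r2(y, pred, p=X.shape[1]))
print("RMSE:", np.sqrt(np.mean((y - pred) ** 2)), "MAE:", np.mean(np.abs(y - pred)))
print("VIF:", vif(X))                             # x1 and x2 should be far above 10
```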
### Applications
1. Tabular baseline (the first model for any ML task).
2. Interpreting feature effects (coefficients).
3. Time-series trend.
4. A/B test effect size.
5. Backbone of causal inference (DiD, IV).
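
As one illustration of the A/B-test use case, regressing the outcome on an intercept and a 0/1 treatment indicator yields a slope equal to the difference in group means. The sketch below assumes a simulated experiment with a true effect of 1.5; all names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
treated = rng.integers(0, 2, size=n)              # 0 = control, 1 = treatment
outcome = 10.0 + 1.5 * treated + rng.normal(scale=2.0, size=n)

# Regress outcome on [1, treated]: the slope is the estimated treatment effect.
A = np.column_stack([np.ones(n), treated])
(intercept, effect), *_ = np.linalg.lstsq(A, outcome, rcond=None)

diff_in_means = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print("regression effect:", effect)
print("difference in means:", diff_in_means)
```

The regression form becomes useful once covariates are added to the design matrix, which the plain difference in means cannot accommodate.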
## 💻 Patterns
### sklearn: basic OLS
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumes X (features), y (target), and feature_names are already defined.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
print("R²:", r2_score(y_te, pred), "RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("coef:", dict(zip(feature_names, model.coef_)))
```
### Ridge / Lasso / ElasticNet: choosing alpha by CV
```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Standardize first: the L1/L2 penalties depend on feature scale.
# X_tr and y_tr are the training split from the basic OLS example.
ridge = Pipeline([("sc", StandardScaler()),
                  ("m", RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5))]).fit(X_tr, y_tr)
lasso = Pipeline([("sc", StandardScaler()),
                  ("m", LassoCV(cv=5, max_iter=10000))]).fit(X_tr, y_tr)
en = Pipeline([("sc", StandardScaler()),
               ("m", ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .95, 1], cv=5))]).fit(X_tr, y_tr)
print("ridge alpha:", ridge.named_steps["m"].alpha_)
print("lasso non-zero:", (lasso.named_steps["m"].coef_ != 0).sum())
```
### statsmodels: statistical inference (p-values, CIs)
```python
import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add the constant column.
X_const = sm.add_constant(X_tr)
ols = sm.OLS(y_tr, X_const).fit()
print(ols.summary())                      # coef, std err, t, p, [95% CI]
print("Cond no:", ols.condition_number)   # > 30 suggests multicollinearity
```
### Diagnostics: VIF + residual plot
```python
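import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumes X_tr is a pandas DataFrame and `model` is the fitted LinearRegression.
# variance_inflation_factor does not add an intercept, so include one explicitly.
X_vif = sm.add_constant(X_tr)
vif = {col: variance_inflation_factor(X_vif.values, i)
       for i, col in enumerate(X_vif.columns) if col != "const"}
print(vif)  # > 10: consider dropping the feature or using PCA
fitted = model.predict(X_tr)
resid = y_tr - fitted
plt.scatter(fitted, resid, alpha=.4); plt.axhline(0)
plt.xlabel("fitted"); plt.ylabel("residual"); plt.show()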
```
### Polynomial features (handling nonlinearity)
```python
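from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Keep the expansion inside the pipeline so predict() applies it as well.
poly_ridge = Pipeline([
    ("poly", PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)),
    ("sc", StandardScaler()),
    ("m", RidgeCV()),  # always regularize: the expansion blows up the feature count
]).fit(X_tr, y_tr)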
```
### Bayesian linear regression: PyMC
```python
import pymc as pm

# Assumes X_tr is a pandas DataFrame and y_tr is a 1-D array or Series.
with pm.Model() as m:
    β = pm.Normal("β", 0, 10, shape=X_tr.shape[1])   # coefficient priors
    α = pm.Normal("α", 0, 10)                        # intercept prior
    σ = pm.HalfNormal("σ", 5)                        # noise scale prior
    y_obs = pm.Normal("y", α + X_tr.values @ β, σ, observed=y_tr)
    trace = pm.sample(1000, tune=1000, chains=4)
print(pm.summary(trace, hdi_prob=0.95))
```
### From scratch: gradient descent
```python
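import numpy as np

def fit_gd(X, y, lr=1e-2, epochs=2000, l2=0.0):
    """Minimize mean squared error plus l2 * ||w[1:]||^2 by batch gradient descent."""
    n, d = X.shape
    X_ = np.c_[np.ones(n), X]      # prepend an intercept column
    w = np.zeros(d + 1)
    for _ in range(epochs):
        # MSE gradient plus the L2 term; the intercept w[0] is not penalized.
        grad = -2/n * X_.T @ (y - X_ @ w) + 2*l2 * np.r_[0, w[1:]]
        w -= lr * grad
    return w[0], w[1:]             # (intercept, coefficients)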
```
## Decision Criteria
| Situation | Approach |
|---|---|
| Quick baseline | OLS |
| Multicollinearity / p >> n | Ridge |
| Feature selection wanted | Lasso |
| Groups of correlated features | Elastic Net |
| Suspected nonlinearity | Polynomial + Ridge, or move to tree models |
| Statistical inference (p-values) | statsmodels |
**Default**: StandardScaler + RidgeCV (safe, interpretable, fast).
## 🔗 Graph
- Parent: [[Regression]]
- Variants: [[Ridge-Regression]], [[Elastic-Net]], [[Logistic-Regression-Foundations]]
- Applications: [[Time-Series-Analysis|Time-Series-Forecasting]], [[Causal-Inference]]
- Adjacent: [[L1-and-L2-Regularization]], [[Feature Engineering|Feature-Engineering]]
## 🤖 LLM Usage
**When**: feature engineering ideation, interpreting residual plots, explaining statsmodels output.
**When not**: judging outliers in the data itself, which requires domain knowledge.
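## ❌ Anti-patterns
- **No scaling**: Ridge and Lasso are sensitive to feature scale.
- **Ignoring VIF**: collinear features make coefficients unstable, and their signs can flip.
- **Judging by R² alone**: training R² does not catch overfitting; use adjusted R² or cross-validation.
- **Skipping the residual plot**: nonlinearity goes unnoticed.
- **Polynomial degree 5+ on small samples**: the fit blows up and overfits.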
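
The point about R² can be seen in a small simulation: adding features that are pure noise never lowers training R², yet held-out R² typically drops. This is a sketch under assumed settings (60 training rows, 40 irrelevant features); the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_train, n_test = 60, 1000
n_all = n_train + n_test
x = rng.normal(size=(n_all, 1))                   # the one informative feature
y = 2.0 * x[:, 0] + rng.normal(scale=1.0, size=n_all)
noise_feats = rng.normal(size=(n_all, 40))        # unrelated to y

def r2(t, p):
    return 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)

def train_test_r2(X):
    # Fit OLS with an intercept on the first n_train rows, score on both splits.
    A = np.column_stack([np.ones(len(X)), X])
    A_tr, A_te = A[:n_train], A[n_train:]
    y_tr, y_te = y[:n_train], y[n_train:]
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return r2(y_tr, A_tr @ beta), r2(y_te, A_te @ beta)

r2_tr_small, r2_te_small = train_test_r2(x)
r2_tr_big, r2_te_big = train_test_r2(np.hstack([x, noise_feats]))
print("1 feature:   train R2 =", r2_tr_small, " test R2 =", r2_te_small)
print("41 features: train R2 =", r2_tr_big, " test R2 =", r2_te_big)
```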
## 🧪 Verification / Duplicates
- Verified (ESL Hastie/Tibshirani, sklearn 1.5+, statsmodels 0.14).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added Ridge/Lasso/EN, diagnostics, Bayesian |