---
id: wiki-2026-0508-logistic-regression-foundations
title: Logistic Regression Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Logistic Regression, Logit, Softmax Regression, Multinomial Logistic]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [machine-learning, classification, sklearn, mle, calibration]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: scikit-learn/statsmodels
---

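# Logistic Regression Foundations

## In one line

> **"The baseline for classification: linear logit + sigmoid + MLE."** Logistic regression is a classifier that models the log-odds with a linear function and maps it to a probability through the sigmoid. Thanks to its interpretability, calibration, and speed, it remains the first-choice baseline in production. Multi-class problems generalize to softmax (multinomial) regression.
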
## Core concepts

### Formulas

- Model: $P(y=1 \mid x) = \sigma(x^T\beta) = \frac{1}{1+e^{-x^T\beta}}$.
- Logit (log-odds): $\log\frac{p}{1-p} = x^T\beta$.
- Loss (NLL/BCE): $-\sum_i [y_i\log p_i + (1-y_i)\log(1-p_i)]$, where $p_i = \sigma(x_i^T\beta)$.
- Estimation: MLE. There is no closed-form solution, so the model is fit iteratively with IRLS, L-BFGS, or SGD.

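
The loss and its gradient can be checked numerically. The following is a minimal sketch on synthetic data, with hypothetical helper names (`nll`, `nll_grad`): it compares the analytic gradient $X^T(p - y)$ of the NLL above against central finite differences, and confirms that the logit inverts the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(beta, X, y):
    """Negative log-likelihood (binary cross-entropy, summed over samples)."""
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_grad(beta, X, y):
    """Analytic gradient of the NLL: X^T (p - y)."""
    return X.T @ (sigmoid(X @ beta) - y)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
beta = rng.normal(size=3) * 0.1

# Central finite differences along each coordinate direction.
eps = 1e-6
num_grad = np.array([
    (nll(beta + eps * e, X, y) - nll(beta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
max_grad_err = np.max(np.abs(num_grad - nll_grad(beta, X, y)))

# log(p / (1 - p)) should recover the linear predictor x^T beta.
p = sigmoid(X @ beta)
logit_roundtrip_err = np.max(np.abs(np.log(p / (1 - p)) - X @ beta))
```

Because the gradient has this simple form, any first-order optimizer (GD, SGD, L-BFGS) only needs the residual `p - y`.
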
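### Interpretation

- A one-unit increase in $x_j$ raises the log-odds by $\beta_j$, holding the other features fixed.
- $e^{\beta_j}$ is the odds ratio (1.5 means the odds increase by 50%).
- Both the sign and the magnitude are meaningful, but magnitudes are only comparable across features when the features share a common scale.
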
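
A small worked example may make the odds-ratio reading concrete. This sketch uses hypothetical coefficients (intercept -2, slope $\log 1.5$). It shows that a one-unit step in $x$ multiplies the odds by the same factor $e^{\beta_1}$ everywhere, while the change in probability depends on the starting point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: intercept -2, slope log(1.5) so that e^beta1 = 1.5.
beta0, beta1 = -2.0, math.log(1.5)

rows = []
for x in (0.0, 1.0, 5.0):
    p_before = sigmoid(beta0 + beta1 * x)
    p_after = sigmoid(beta0 + beta1 * (x + 1.0))
    # Store: starting x, change in probability, ratio of odds after/before.
    rows.append((x, p_after - p_before, odds(p_after) / odds(p_before)))

for x, dp, ratio in rows:
    print(f"x={x:.0f}: probability change {dp:+.3f}, odds ratio {ratio:.3f}")
```

This is why coefficients are reported as odds ratios rather than as "change in probability per unit".
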
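### Multinomial / softmax

- $P(y=k \mid x) = \frac{e^{x^T\beta_k}}{\sum_j e^{x^T\beta_j}}$.
- sklearn: the multinomial (softmax) formulation is the default for multi-class targets with solvers that support it (e.g. `lbfgs`, `saga`). The explicit `multi_class="multinomial"` argument is deprecated since sklearn 1.5.
- One-vs-Rest can be advantageous when the class distribution is imbalanced.
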
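
The softmax formula above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic inputs, not sklearn's internal code. It subtracts the row maximum for numerical stability and checks that the two-class case reduces to the binary sigmoid model.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 features (synthetic)
B = rng.normal(size=(3, 5))   # one coefficient column per class, K = 5
P = softmax(X @ B)            # shape (4, 5); each row sums to 1

# With K = 2, softmax equals a sigmoid of the difference of the two logits.
B2 = rng.normal(size=(3, 2))
P2 = softmax(X @ B2)
sigmoid_of_diff = 1.0 / (1.0 + np.exp(-(X @ (B2[:, 1] - B2[:, 0]))))

# Large logits stay finite thanks to the max subtraction.
P_large = softmax(np.array([[1000.0, 1001.0]]))
```

The K = 2 check also shows that softmax coefficients are only identified up to a shared offset: only differences between class coefficient vectors matter.
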
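### Regularization

- L2 (default): sklearn parameterizes the strength as $C = 1/\lambda$, so a smaller `C` means stronger regularization.
- L1: drives some coefficients to exactly zero, which acts as feature selection.
- Elastic Net: requires the `saga` solver.
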
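
Why L1 selects features while L2 only shrinks them can be seen from the proximal update of each penalty. The following is a minimal sketch with made-up coefficient values. `prox_l1` and `prox_l2` are hypothetical helper names, not sklearn API, and the sketch makes no claim about how any particular solver is implemented.

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding: proximal step for the penalty t * ||w||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_l2(w, t):
    """Proximal step for the penalty (t / 2) * ||w||_2^2: uniform shrinkage."""
    return w / (1.0 + t)

w = np.array([2.0, -0.3, 0.05, -1.2, 0.4])  # made-up coefficients
t = 0.5                                      # penalty strength times step size

w_l1 = prox_l1(w, t)  # entries with |w| <= t become exactly 0
w_l2 = prox_l2(w, t)  # every entry shrinks but stays nonzero
```

Here the L1 step zeroes the three small coefficients and keeps the two large ones, while the L2 step keeps all five at reduced magnitude.
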
### Calibration

- LR is usually well calibrated, but it can drift when the data are imbalanced and the model is regularized.
- Check with a reliability diagram (`calibration_curve`); correct with `CalibratedClassifierCV`.

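
What a reliability diagram measures can be sketched without any plotting. The helper below (`reliability_bins`, a hypothetical name) groups predictions into equal-width bins, compares the mean predicted probability with the observed positive rate in each bin, and sums the gaps into an expected calibration error. The data are synthetic and chosen so that one set of predictions is calibrated by construction and the other is systematically too low.

```python
import numpy as np

def reliability_bins(y_true, y_prob, n_bins=10):
    """Per-bin (mean predicted, observed rate, count) plus expected calibration error."""
    idx = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)  # equal-width bins
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        pred, obs = y_prob[mask].mean(), y_true[mask].mean()
        rows.append((pred, obs, int(mask.sum())))
        ece += mask.mean() * abs(pred - obs)  # gap weighted by bin share
    return rows, ece

rng = np.random.default_rng(0)
p_true = rng.random(20_000)                     # synthetic true probabilities
y = (rng.random(20_000) < p_true).astype(int)   # labels drawn from them

_, ece_good = reliability_bins(y, p_true)       # predictions match the truth
_, ece_bad = reliability_bins(y, p_true ** 3)   # systematically too low
```

Plotting the `(pred, obs)` pairs from `rows` against the diagonal gives the reliability diagram itself.
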
### Applications

1. CTR / conversion prediction.
2. Credit scoring (a domain where interpretability is mandatory).
3. Medical risk scores.
4. Text classification (TF-IDF + LR is a strong baseline).
5. Effect estimation in A/B tests (treatment dummy).

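
As an illustration of the TF-IDF + LR baseline from the list above, here is a minimal sketch on a made-up six-document corpus. The texts, labels, and the name `text_clf` are all invented for the example, and a real baseline would need a proper train/validation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 1 = positive review, 0 = negative review.
texts = [
    "great product works well", "excellent quality fast shipping",
    "love it highly recommend", "terrible broke after a day",
    "awful quality waste of money", "bad experience do not buy",
]
labels = [1, 1, 1, 0, 0, 0]

# Sparse TF-IDF features feed directly into a regularized linear classifier.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
text_clf.fit(texts, labels)

# Columns of predict_proba follow text_clf.classes_, i.e. [P(y=0), P(y=1)].
proba = text_clf.predict_proba(["excellent product", "awful waste"])
```

TF-IDF rows are already L2-normalized by default, so no `StandardScaler` step is used here.
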
## 💻 Patterns

### sklearn: basics

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# X_tr, y_tr, X_te are assumed to be pre-split feature/label arrays.
clf = Pipeline([
    ("sc", StandardScaler()),  # scale first so the L2 penalty treats features evenly
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),
]).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(y=1) for each test row
```

### Hyper-param search (C, penalty)

```python
from sklearn.model_selection import GridSearchCV

# Parameter names are prefixed with the pipeline step name ("lr__").
grid = {
    "lr__C": [0.01, 0.1, 1, 10],
    "lr__penalty": ["l1", "l2"],
    "lr__solver": ["saga"],  # saga supports both l1 and l2
}
gs = GridSearchCV(clf, grid, cv=5, scoring="roc_auc", n_jobs=-1).fit(X_tr, y_tr)
print(gs.best_params_, gs.best_score_)
```

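### Multinomial (softmax)

```python
from sklearn.linear_model import LogisticRegression

# multi_class="multinomial" is deprecated since sklearn 1.5; lbfgs already
# fits the softmax model for multi-class targets, so the argument is omitted.
multi = LogisticRegression(
    solver="lbfgs",
    C=1.0, max_iter=2000,
).fit(X_tr, y_tr)
print(multi.classes_, multi.coef_.shape)  # (n_classes, n_features) for 3+ classes
```
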
### statsmodels: odds ratio + p-value

```python
import numpy as np
import statsmodels.api as sm

X_const = sm.add_constant(X_tr)  # statsmodels does not add an intercept by default
logit = sm.Logit(y_tr, X_const).fit(disp=False)
print(logit.summary())  # coefficients, standard errors, p-values
print("Odds ratios:")
print(np.exp(logit.params))  # exp(coef) = odds ratio per unit increase
```

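### Imbalanced classes

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# 1) class_weight: reweight the loss inversely to class frequency
LogisticRegression(class_weight="balanced")

# 2) threshold tuning (the default 0.5 is often a poor choice on imbalanced data).
#    Ideally tune on a held-out validation split rather than the final test set.
prec, rec, thr = precision_recall_curve(y_te, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-9)
best_thr = thr[f1[:-1].argmax()]  # prec/rec have one more entry than thr
pred = (proba >= best_thr).astype(int)
```
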
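
For reference, the scikit-learn documentation describes the `class_weight="balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`. The following sketch recomputes those weights by hand on a made-up label vector to show how strongly the minority class is up-weighted.

```python
import numpy as np

y = np.array([0] * 90 + [1] * 10)          # made-up labels: 90% negative, 10% positive

counts = np.bincount(y)                     # samples per class
weights = len(y) / (len(counts) * counts)   # n_samples / (n_classes * count)
class_weight = dict(enumerate(weights))     # class label -> weight
```

An explicit dictionary like this can also be passed as `class_weight=` when the balanced heuristic is too aggressive or too mild for the problem.
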
### Calibration: check + correction

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
import matplotlib.pyplot as plt

# Isotonic calibration with 5-fold CV; method="sigmoid" is the usual choice for small datasets.
cal = CalibratedClassifierCV(clf, method="isotonic", cv=5).fit(X_tr, y_tr)
prob_cal = cal.predict_proba(X_te)[:, 1]
prob_obs, prob_pred = calibration_curve(y_te, prob_cal, n_bins=10)
plt.plot(prob_pred, prob_obs, marker="o")  # reliability curve
plt.plot([0, 1], [0, 1], "--")             # perfect calibration
plt.show()
```

### From scratch: gradient descent

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, epochs=1000, l2=0.01):
    """Batch gradient descent on the mean NLL with an L2 penalty."""
    X = np.c_[np.ones(len(X)), X]  # prepend intercept column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        # gradient of the mean NLL; the intercept w[0] is not penalized
        grad = X.T @ (p - y) / len(y) + l2 * np.r_[0, w[1:]]
        w -= lr * grad
    return w
```

### PyTorch: large-scale logistic

```python
import torch
import torch.nn as nn

class LR(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.lin = nn.Linear(d, k)  # d features -> k class logits

    def forward(self, x):
        return self.lin(x)  # raw logits; CrossEntropyLoss applies log-softmax

# X, n_classes, and loader (a DataLoader of (features, labels)) are assumed to exist.
model = LR(X.shape[1], n_classes).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(model(xb.cuda()), yb.cuda())
        loss.backward()
        opt.step()
```

## Decision criteria

| Situation | Approach |
|---|---|
| Tabular baseline | sklearn LR + StandardScaler |
| Interpretation required (medical/finance) | statsmodels Logit + odds ratio |
| Imbalanced | class_weight + threshold tuning |
| Multi-class | multinomial + lbfgs |
| Sparse / feature selection | L1 + saga |
| Probability used as a score | CalibratedClassifierCV |
| Large scale | PyTorch SGD/Adam |

**Default**: StandardScaler + LR(C=1, L2) + threshold tuning.

## 🔗 Graph

- Variant: [[Linear-Discriminant-Analysis]]
- Adjacent: [[Linear-Regression-Mastery]], [[L1-and-L2-Regularization]]

## 🤖 LLM usage

**When**: interpreting results (explaining odds ratios), feature-importance narratives, commentary on calibration plots.
**When not**: domain cutoffs (e.g. financial risk thresholds), which are decided by business and regulatory requirements.

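## ❌ Anti-patterns

- **Regularizing without scaling**: the penalty then depends on each feature's units, so features on a small numeric scale (which need large coefficients) are shrunk far more than features on a large scale.
- **Leaving the threshold at 0.5**: often the wrong choice on imbalanced data.
- **Trusting probabilities as-is**: skipping the calibration check.
- **Using only OvR for multi-class**: loses information shared across classes.
- **Not raising max_iter when the solver fails to converge**: ignoring the warning leads to inaccurate coefficients.
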
## 🧪 Verification / duplicates

- Verified (ESL Ch.4, sklearn 1.5+, statsmodels 0.14, Kaggle baselines).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added calibration, threshold tuning, multinomial |