id: wiki-2026-0508-logistic-regression-foundations
title: Logistic Regression Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Logistic Regression, Logit, Softmax Regression, Multinomial Logistic
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: machine-learning, classification, sklearn, mle, calibration
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: python (language), scikit-learn/statsmodels (framework)

Logistic Regression Foundations

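One-line summary

"The baseline for every classification task: linear logit + sigmoid + MLE." Logistic regression is a classifier that estimates the log-odds with a linear model and maps them to probabilities with the sigmoid. Its interpretability, calibration, and speed keep it the first-choice baseline in production. Multi-class problems generalize through softmax (multinomial logistic regression).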

Core

Formulas

  • Model: $P(y=1|x) = \sigma(x^T\beta) = \frac{1}{1+e^{-x^T\beta}}$.
  • Logit (log-odds): $\log\frac{p}{1-p} = x^T\beta$.
  • Loss (NLL/BCE): $-\sum_i [y_i\log p_i + (1-y_i)\log(1-p_i)]$, with $p_i = \sigma(x_i^T\beta)$.
  • Estimation: MLE. There is no closed form, so the fit is iterative: IRLS, L-BFGS, or SGD.
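To make the "no closed form, so iterate" point concrete, here is a minimal IRLS (Newton) sketch in NumPy. It is an illustration under assumptions, not how any particular library implements it: the synthetic data, the `true_beta` values, the tiny `ridge` stabilizer, and the helper name `fit_irls` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_irls(X, y, n_iter=25, tol=1e-8, ridge=1e-8):
    """Newton / IRLS for unregularized logistic regression.

    X must already contain an intercept column. `ridge` only stabilizes
    the linear solve; it is not a regularization penalty on the fit.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)                         # IRLS weights (diagonal of W)
        grad = X.T @ (y - p)                      # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)        # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data drawn from a known (hypothetical) coefficient vector.
rng = np.random.default_rng(0)
n = 2000
X = np.c_[np.ones(n), rng.normal(size=(n, 2))]
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_irls(X, y)
p_hat = sigmoid(X @ beta_hat)
nll = -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
score = X.T @ (y - p_hat)                         # should be ~0 at the MLE
print(beta_hat, nll)
```

At the MLE the score vector `X.T @ (y - p_hat)` vanishes, which is a quick sanity check for any solver.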

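Interpretation

  • A 1-unit increase in $x_j$ raises the log-odds by $\beta_j$, holding the other features fixed.
  • $e^{\beta_j}$ is the odds ratio (1.5 means the odds increase by 50%).
  • Both sign and magnitude are meaningful, but magnitudes are comparable across features only when the features share a common scale.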
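The odds-ratio reading can be checked with plain arithmetic. The sketch below uses hypothetical coefficients (`beta0`, `beta_j`) chosen so that $e^{\beta_j} = 1.5$, and shows that a 1-unit step in $x_j$ multiplies the odds by 1.5 regardless of the starting point, while the change in probability depends on the baseline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: intercept and one slope with e^{beta_j} = 1.5.
beta0, beta_j = -1.0, math.log(1.5)

p_before = sigmoid(beta0 + beta_j * 2.0)   # x_j = 2
p_after = sigmoid(beta0 + beta_j * 3.0)    # x_j = 3 (one unit higher)
odds_ratio = odds(p_after) / odds(p_before)

# Same 1-unit step from a different baseline: the odds ratio is unchanged,
# but the change in probability differs.
p_before_hi = sigmoid(beta0 + beta_j * 6.0)
p_after_hi = sigmoid(beta0 + beta_j * 7.0)
odds_ratio_hi = odds(p_after_hi) / odds(p_before_hi)

print(odds_ratio, odds_ratio_hi)
print(p_after - p_before, p_after_hi - p_before_hi)
```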

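Multinomial / softmax

  • $P(y=k|x) = \frac{e^{x^T\beta_k}}{\sum_j e^{x^T\beta_j}}$.
  • scikit-learn fits this multinomial model by default for 3+ classes with solvers that support it (e.g. lbfgs); the multi_class argument is deprecated since 1.5.
  • One-vs-Rest can be advantageous when the class distribution is imbalanced.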
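A small NumPy sketch of the softmax formula above, with the usual max-subtraction for numerical stability. The random `X` and coefficient matrix `B` are placeholders. The last lines check that the two-class case collapses to the binary sigmoid applied to the difference of the two score vectors.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max so exp() cannot overflow; the result is unchanged.
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features (placeholder data)
B = rng.normal(size=(3, 4))        # one coefficient column beta_k per class

P = softmax(X @ B)                 # shape (5, 4), each row sums to 1

# Two-class special case: softmax over (a, b) equals sigmoid(a - b).
B2 = B[:, :2]
P2 = softmax(X @ B2)
sig = 1.0 / (1.0 + np.exp(-(X @ (B2[:, 0] - B2[:, 1]))))
print(P.sum(axis=1), np.abs(P2[:, 0] - sig).max())
```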

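Regularization

  • L2 (default): $C = 1/\lambda$, so a smaller C means stronger regularization.
  • L1: drives some coefficients exactly to zero, which acts as feature selection.
  • Elastic Net: mixes L1 and L2; in scikit-learn it requires the SAGA solver.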
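The L1-versus-L2 difference is easy to see by counting exact zeros in the fitted coefficients. This is a sketch on synthetic data: the dataset sizes and `C=0.05` are arbitrary choices, and it uses the `penalty` argument as written in this note (newer scikit-learn releases may steer you toward other ways of expressing the same penalty).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 30 features, only 5 of which carry signal (synthetic, for illustration).
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=5,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)   # scale before penalizing

def count_zero_coefs(penalty):
    model = LogisticRegression(
        penalty=penalty, C=0.05, solver="saga",
        max_iter=5000, random_state=0,
    )
    model.fit(X, y)
    return int(np.sum(model.coef_ == 0.0))

n_zero_l1 = count_zero_coefs("l1")   # L1 zeroes out many uninformative features
n_zero_l2 = count_zero_coefs("l2")   # L2 shrinks but rarely hits exactly zero
print(n_zero_l1, n_zero_l2)
```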

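Calibration

  • LR is usually well calibrated, but calibration can drift when classes are imbalanced and the model is regularized.
  • Check with a reliability diagram (calibration_curve); correct with CalibratedClassifierCV.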
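A reliability diagram is just a binned comparison of mean predicted probability against observed positive rate. The NumPy sketch below builds that table by hand on simulated data. The helper names `reliability_table` and `max_gap`, the 0.05 tolerance, and the "overconfident" distortion are illustrative assumptions, not part of any library.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Return (mean predicted, observed rate, count) for each non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])       # bin index in 0..n_bins-1
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

def max_gap(rows):
    # Largest |predicted - observed| over the bins.
    return max(abs(pred - obs) for pred, obs, _ in rows)

rng = np.random.default_rng(0)
p_true = rng.uniform(0.02, 0.98, size=20000)
y = (rng.random(p_true.size) < p_true).astype(int)

# Calibrated scores: the predicted probability is the true probability.
gap_good = max_gap(reliability_table(y, p_true))
# Overconfident scores: pushed away from 0.5 toward 0 and 1.
p_over = np.clip(1.5 * p_true - 0.25, 0.0, 1.0)
gap_over = max_gap(reliability_table(y, p_over))
print(gap_good, gap_over)
```

A well-calibrated model keeps every bin close to the diagonal; a large gap in the extreme bins is the typical signature of overconfidence.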

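Applications

  1. CTR / conversion prediction.
  2. Credit scoring (a domain where interpretability is mandatory).
  3. Medical risk scores.
  4. Text classification (TF-IDF + LR is a strong baseline).
  5. Effect estimation in A/B tests (treatment dummy).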
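For the A/B-test use case, a logistic regression with an intercept and a single treatment dummy has an MLE that reproduces the two group conversion rates exactly. The dummy's coefficient is therefore the log odds ratio of the 2x2 table, and its Wald standard error has a closed form. The counts below are hypothetical.

```python
import math

# Hypothetical A/B results: conversions and sample size per arm.
control_conv, control_n = 120, 1000
treat_conv, treat_n = 150, 1000

def log_odds(conv, n):
    return math.log(conv / (n - conv))

beta0 = log_odds(control_conv, control_n)            # intercept: control log-odds
beta1 = log_odds(treat_conv, treat_n) - beta0        # treatment dummy coefficient
odds_ratio = math.exp(beta1)

# Wald standard error of the log odds ratio from the 2x2 cell counts.
se = math.sqrt(
    1 / treat_conv + 1 / (treat_n - treat_conv)
    + 1 / control_conv + 1 / (control_n - control_conv)
)
ci_low = math.exp(beta1 - 1.96 * se)
ci_high = math.exp(beta1 + 1.96 * se)

# The fitted model returns the observed treatment conversion rate.
p_treat = 1.0 / (1.0 + math.exp(-(beta0 + beta1)))
print(odds_ratio, (ci_low, ci_high), p_treat)
```

Adding covariates breaks this shortcut, at which point a fitted model (for example statsmodels Logit) is the practical route.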

💻 Patterns

sklearn: basics

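from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_tr, y_tr, X_te are assumed to come from an earlier train/test split.
clf = Pipeline([
    ("sc", StandardScaler()),                          # scale first so the L2 penalty treats features evenly
    ("lr", LogisticRegression(C=1.0, max_iter=1000)),  # C is the inverse regularization strength
]).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]                  # P(y=1 | x) for the positive class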

Hyper-param search (C, penalty)

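from sklearn.model_selection import GridSearchCV

# Reuses the clf pipeline from the basic pattern; the "lr__" prefix targets its LogisticRegression step.
grid = {
    "lr__C": [0.01, 0.1, 1, 10],
    "lr__penalty": ["l1", "l2"],
    "lr__solver": ["saga"],   # saga supports both the l1 and l2 penalties
}
gs = GridSearchCV(clf, grid, cv=5, scoring="roc_auc", n_jobs=-1).fit(X_tr, y_tr)
print(gs.best_params_, gs.best_score_)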

Multinomial (softmax)

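# With 3+ classes, lbfgs fits the multinomial (softmax) model by default.
# multi_class is deprecated since scikit-learn 1.5, so it is omitted here.
multi = LogisticRegression(
    solver="lbfgs",
    C=1.0, max_iter=2000,
).fit(X_tr, y_tr)
print(multi.classes_, multi.coef_.shape)   # (n_classes, n_features) when n_classes >= 3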

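statsmodels: odds ratio + p-value

import numpy as np
import statsmodels.api as sm

X_const = sm.add_constant(X_tr)                  # statsmodels does not add an intercept on its own
logit = sm.Logit(y_tr, X_const).fit(disp=False)
print(logit.summary())                           # coefficients, standard errors, p-values
print("Odds ratios:")
print(np.exp(logit.params))                      # e^beta_j for each coefficient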

Imbalanced classes

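from sklearn.metrics import precision_recall_curve

# 1) class_weight: reweight classes inversely to their frequency
LogisticRegression(class_weight="balanced")

# 2) threshold tuning (the default 0.5 is rarely the best cut under imbalance)
#    Pick the threshold on a validation split; y_te/proba are reused here only for brevity.
prec, rec, thr = precision_recall_curve(y_te, proba)
f1 = 2*prec*rec / (prec+rec+1e-9)
best_thr = thr[f1[:-1].argmax()]   # prec/rec have one more entry than thr
pred = (proba >= best_thr).astype(int)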

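Calibration: check + correct

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
import matplotlib.pyplot as plt

# Refit clf inside 5-fold CV and learn an isotonic map from scores to probabilities.
cal = CalibratedClassifierCV(clf, method="isotonic", cv=5).fit(X_tr, y_tr)
prob_cal = cal.predict_proba(X_te)[:, 1]
# Reliability diagram: observed positive rate vs. mean predicted probability per bin.
prob_obs, prob_pred = calibration_curve(y_te, prob_cal, n_bins=10)
plt.plot(prob_pred, prob_obs, marker="o")
plt.plot([0, 1], [0, 1], "--")   # perfect-calibration diagonal
plt.show()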

From scratch: gradient descent

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, epochs=1000, l2=0.01):
    X = np.c_[np.ones(len(X)), X]   # prepend the intercept column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)
        # mean NLL gradient plus the L2 term; the intercept w[0] is not penalized
        grad = X.T @ (p - y) / len(y) + l2 * np.r_[0, w[1:]]
        w -= lr * grad
    return w

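PyTorch: large-scale logistic

import torch, torch.nn as nn

class LR(nn.Module):
    def __init__(self, d, k):
        super().__init__()
        self.lin = nn.Linear(d, k)   # d features -> k class logits
    def forward(self, x):
        return self.lin(x)           # raw logits; CrossEntropyLoss applies log-softmax

# X, n_classes, and loader (batches of float features and int64 labels) are assumed defined.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = LR(X.shape[1], n_classes).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)  # decoupled weight decay, similar in spirit to L2
loss_fn = nn.CrossEntropyLoss()
for epoch in range(20):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()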

Decision criteria

| Situation | Approach |
|---|---|
| Tabular baseline | sklearn LR + StandardScaler |
| Interpretation required (healthcare/finance) | statsmodels Logit + odds ratio |
| Imbalanced classes | class_weight + threshold tuning |
| Multi-class | multinomial + lbfgs |
| Sparse / feature selection | L1 + saga |
| Probabilities used as scores | CalibratedClassifierCV |
| Large scale | PyTorch SGD/Adam |

Default: StandardScaler + LR(C=1, L2) + threshold tuning.

🔗 Graph

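🤖 LLM usage

When to use: interpreting results (explaining odds ratios), writing a feature-importance narrative, commenting on a calibration plot.
When not to: setting domain cutoffs (e.g. a financial risk threshold), which are decided by business and regulatory requirements.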

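Anti-patterns

  • Regularizing without scaling: the penalty hits features unevenly depending on their units, so scale decides which coefficients get shrunk.
  • Leaving the threshold at 0.5: almost always the wrong choice when classes are imbalanced.
  • Trusting probabilities as-is: calibration is never checked.
  • Using only OvR for multi-class: information shared between classes is lost.
  • Not raising max_iter when the solver fails to converge: ignoring the warning yields inaccurate coefficients.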
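The convergence and scaling anti-patterns can be surfaced in code by recording scikit-learn's ConvergenceWarning instead of letting it scroll past. The sketch below is illustrative only: the synthetic data, the deliberately inflated feature, the tiny `max_iter=5`, and the helper name `fit_and_check` are all assumptions made for the demonstration.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X[:, 0] *= 1e4   # one feature on a wildly different scale

def fit_and_check(model):
    """Fit the model and report whether a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Unscaled input and a too-small iteration budget: the solver stops early.
warned_unscaled = fit_and_check(LogisticRegression(max_iter=5))

# Scaled input and a generous budget: the fit converges without the warning.
warned_scaled = fit_and_check(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
)
print(warned_unscaled, warned_scaled)
```

In a training script the same check can gate a run, for example by failing fast when the warning appears rather than shipping the coefficients.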

🧪 Verification / Duplicates

  • Verified (ESL Ch.4, sklearn 1.5+, statsmodels 0.14, Kaggle baselines).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added calibration, threshold tuning, multinomial |