[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,62 +2,199 @@
 id: wiki-2026-0508-logistic-regression-foundations
 title: Logistic Regression Foundations
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [ML-LOG-REG-001]
+aliases: [Logistic Regression, Logit, Softmax Regression, Multinomial Logistic]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [machine-learning, Logistic-Regression, classification, Supervised-Learning, sigmoid]
+confidence_score: 0.9
+verification_status: applied
+tags: [machine-learning, classification, sklearn, mle, calibration]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: python
+  framework: scikit-learn/statsmodels
 ---

-# Logistic Regression Foundations (Logistic Regression Basics)
+# Logistic Regression Foundations

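-## 📌 One-Line Insight (The Karpathy Summary)
-> "Convert every question in the world into the probability of 'yes (1)' or 'no (0)', and draw a clear line at the boundary of ambiguity": the most basic yet powerful algorithm, which solves binary classification by passing the output of a linear regression through the sigmoid function to turn it into a probability between 0 and 1.
+## One-line summary
+> **"The classification baseline: linear logit + sigmoid + MLE."** Logistic regression is a classifier that estimates the log-odds with a linear model and converts them to a probability with the sigmoid. Its interpretability, calibration, and speed keep it a first-choice baseline in production. Multi-class problems generalize it with softmax (multinomial).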

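-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Probabilistic Binary Classification": a probability-based classification pattern that maps the linear combination ($z = wx + b$) into probability space ($0 \le p \le 1$) and splits the data into two groups at a threshold.
- **Key elements:**
-    - **Sigmoid Function:** a nonlinear function that squashes any real value into the range between 0 and 1.
-    - **Decision Boundary:** the boundary that separates the classes at probability 0.5.
-    - **Binary [[Cross-Entropy Loss|Cross-Entropy Loss]]:** the loss function that measures the error between predicted probabilities and true labels.
- **Significance:** the standard model for countless binary classification problems such as spam filtering and disease detection, and a key bridge to understanding how a deep-learning neuron works.
+## Core concepts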

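-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** The name says "regression" but the model is used for classification, which confuses beginners. It helps to understand that, because the output is a continuous probability, the model still falls within statistical regression.
- **Policy change:** The Antigravity project gives priority to logistic regression, with its very low compute cost, when designing lightweight decision modules that predict whether an agent's action will succeed or fail.
+### Formulas
+- Model: $P(y=1|x) = \sigma(x^T\beta) = \frac{1}{1+e^{-x^T\beta}}$.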
+- Logit (log-odds): $\log\frac{p}{1-p} = x^T\beta$.
+- Loss (NLL/BCE): $-\sum_i [y_i\log p_i + (1-y_i)\log(1-p_i)]$.
+- Estimation: MLE. There is no closed form, so the fit is iterative: IRLS / L-BFGS / SGD.
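
Since the MLE has no closed form, the IRLS option listed above can be sketched as a Newton iteration on the log-likelihood. This is a minimal NumPy sketch, not how any particular library implements it: the function name `fit_irls`, the tiny `ridge` jitter on the Hessian, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_irls(X, y, n_iter=25, ridge=1e-8, tol=1e-10):
    """Newton / IRLS for unregularized logistic regression.

    X must already contain an intercept column. `ridge` is a small
    jitter added to the Hessian for numerical stability (an assumption
    of this sketch, not part of the textbook algorithm).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)                        # diagonal of the weight matrix W
        grad = X.T @ (y - p)                     # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])            # X^T W X
        hess += ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)       # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data with hypothetical coefficients, for illustration only.
rng = np.random.default_rng(0)
n = 5000
X = np.c_[np.ones(n), rng.normal(size=(n, 2))]
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)

beta_hat = fit_irls(X, y)
score = X.T @ (y - sigmoid(X @ beta_hat))        # should vanish at the MLE
print(beta_hat, np.abs(score).max())
```

At convergence the score vector is numerically zero, which is the first-order condition that defines the MLE; L-BFGS and SGD reach the same optimum without forming the Hessian.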

-## 🔗 Knowledge Links (Graph)
- [[Linear-Regression-Mastery|Linear-Regression-Mastery]], [[Deep-Learning|Deep-Learning]]-Foundations, [[Loss-Functions-Foundations|Loss-Functions-Foundations]], [[Supervised-Learning-Foundations|Supervised-Learning-Foundations]]
- **Raw Source:** 10_Wiki/Topics/AI/Logistic-Regression-Foundations.md
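+### Interpretation
+- A one-unit increase in $x_j$ (other features held fixed) raises the log-odds by $\beta_j$.
+- $e^{\beta_j}$ is the odds ratio (a value of 1.5 means the odds rise by 50%).
+- Both sign and magnitude are meaningful, but magnitudes are comparable across features only when the features share a scale.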
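
The odds-ratio reading can be checked numerically. The sketch below uses made-up coefficients (the `beta` vector and the input `x_a` are hypothetical) and confirms that adding one unit to a feature multiplies the odds by $e^{\beta_j}$, whatever the starting point.

```python
import numpy as np

def predict_proba(x, beta):
    """P(y=1 | x) for a single row x that includes the intercept term."""
    return 1.0 / (1.0 + np.exp(-(x @ beta)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: [intercept, beta_1, beta_2], with exp(beta_1) = 1.5.
beta = np.array([-1.0, np.log(1.5), 0.3])

x_a = np.array([1.0, 2.0, 0.5])      # leading 1.0 is the intercept input
x_b = x_a.copy()
x_b[1] += 1.0                        # raise feature 1 by one unit

ratio = odds(predict_proba(x_b, beta)) / odds(predict_proba(x_a, beta))
print(ratio, np.exp(beta[1]))        # both equal 1.5 up to floating-point error
```

Note that the probability itself does not change by a fixed amount: the same odds ratio moves `p` more near 0.5 than near 0 or 1.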

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### Multinomial / softmax
+- $P(y=k|x) = \frac{e^{x^T\beta_k}}{\sum_j e^{x^T\beta_j}}$.
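+- sklearn: multinomial (softmax) is the default for multi-class targets with solvers that support it; the `multi_class` parameter is deprecated as of sklearn 1.5.
+- One-vs-Rest can be advantageous when the class distribution is imbalanced.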
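
The softmax formula above can be written in a few lines of NumPy. This is an illustrative sketch with random, hypothetical coefficients; subtracting the row maximum before exponentiating is a standard stability trick that leaves the probabilities unchanged. The last lines check that the two-class case with one class pinned to a zero score reduces to the sigmoid model.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; shifting by the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 samples, 3 features
B = rng.normal(size=(3, 5))          # hypothetical coefficients, one column beta_k per class
P = softmax(X @ B)                   # shape (4, 5), each row sums to 1

# Two classes with the reference class fixed at score 0 give the sigmoid.
beta = rng.normal(size=3)
two_class = softmax(np.c_[np.zeros(len(X)), X @ beta])
sigmoid = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(np.allclose(two_class[:, 1], sigmoid))
```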

-**When to use this knowledge:**
- *(TODO)*
+### Regularization
+- L2 (default): $C = 1/\lambda$.
+- L1: feature selection.
+- Elastic Net: needs `solver="saga"` plus an `l1_ratio`.
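
To see why L1 acts as feature selection, here is a small proximal-gradient (ISTA) sketch in NumPy. It is not the SAGA solver that sklearn uses; the function `fit_l1_logistic`, the penalty strength `lam`, the step size, and the synthetic data are assumptions chosen for illustration. The soft-thresholding step is what produces exact zeros.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_l1_logistic(X, y, lam=0.05, lr=0.1, n_iter=2000):
    """Minimize mean NLL + lam * ||w||_1 by proximal gradient descent.

    The intercept b is left unpenalized.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / n
        grad_b = np.mean(p - y)
        w = soft_threshold(w - lr * grad_w, lr * lam)   # gradient step, then shrink
        b -= lr * grad_b
    return w, b

# Synthetic data: only the first 3 of 20 features carry signal (hypothetical setup).
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 1.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

w_hat, b_hat = fit_l1_logistic(X, y)
print("nonzero coefficients:", np.flatnonzero(w_hat))
```

An L2 penalty shrinks the same coefficients smoothly but essentially never sets them to exactly zero, which is why L1 or Elastic Net is the usual choice for sparse models.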

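-**When not to use it:**
- *(TODO)*
+### Calibration
+- LR is usually well calibrated, but calibration drifts when the data are imbalanced and the model is regularized.
+- Check with a reliability diagram (`calibration_curve`); correct with `CalibratedClassifierCV`.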
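
What a reliability diagram measures can be sketched without any plotting: bin the predicted probabilities, then compare each bin's mean prediction with its observed positive rate. The helper `reliability_table` and the expected calibration error (ECE) summary below are illustrative assumptions, not sklearn functions, and the data are synthetic.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Per-bin (mean predicted prob, observed frequency, count) plus ECE."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, edges[1:-1])       # bin index in 0 .. n_bins-1
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue                             # skip empty bins
        conf = y_prob[mask].mean()
        freq = y_true[mask].mean()
        rows.append((conf, freq, int(mask.sum())))
        ece += mask.mean() * abs(conf - freq)    # weight by share of samples
    return rows, ece

# Synthetic check: labels drawn from p_true, so p_true is calibrated by construction.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.01, 0.99, size=20000)
y = (rng.random(20000) < p_true).astype(float)

# Overconfident scores: same ranking, logits stretched by a factor of 3.
logit = np.log(p_true / (1.0 - p_true))
p_over = 1.0 / (1.0 + np.exp(-3.0 * logit))

_, ece_good = reliability_table(y, p_true)
_, ece_over = reliability_table(y, p_over)
print(ece_good, ece_over)
```

Both score vectors rank the samples identically, so ROC AUC cannot tell them apart; only a calibration check exposes the overconfident one.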

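-## 🧪 Validation Status (Validation)
+### Applications
+1. CTR / conversion prediction.
+2. Credit scoring (a domain where interpretability is mandatory).
+3. Medical risk scores.
+4. Text classification (TF-IDF + LR is a strong baseline).
+5. Effect estimation in A/B tests (treatment dummy).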

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+## 💻 Patterns

-## 🧬 Duplicate Check
+### sklearn: basics
+```python
+from sklearn.linear_model import LogisticRegression
+from sklearn.preprocessing import StandardScaler
+from sklearn.pipeline import Pipeline

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization, filling in the old template and missing fields.
+clf = Pipeline([
+    ("sc", StandardScaler()),
+    ("lr", LogisticRegression(C=1.0, max_iter=1000, n_jobs=-1)),
+]).fit(X_tr, y_tr)
+proba = clf.predict_proba(X_te)[:, 1]
+```

-## 🕓 Change History (Changelog)
+### Hyper-param search (C, penalty)
+```python
+from sklearn.model_selection import GridSearchCV
+grid = {
+    "lr__C": [0.01, 0.1, 1, 10],
+    "lr__penalty": ["l1", "l2"],
+    "lr__solver": ["saga"],
+}
+gs = GridSearchCV(clf, grid, cv=5, scoring="roc_auc", n_jobs=-1).fit(X_tr, y_tr)
+print(gs.best_params_, gs.best_score_)
+```

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+### Multinomial (softmax)
+```python
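+# lbfgs fits the multinomial (softmax) model for 3+ classes by default;
+# the multi_class argument is deprecated since sklearn 1.5, so it is omitted
+multi = LogisticRegression(
+    solver="lbfgs",
+    C=1.0, max_iter=2000,
+).fit(X_tr, y_tr)
+print(multi.classes_, multi.coef_.shape)   # (n_classes, n_features) for 3+ classes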
+```
+
+### statsmodels: odds ratio + p-value
+```python
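+import numpy as np
+import statsmodels.api as sm
+
+X_const = sm.add_constant(X_tr)            # statsmodels does not add an intercept itself
+logit = sm.Logit(y_tr, X_const).fit(disp=False)
+print(logit.summary())                     # coefficients, standard errors, p-values
+print("Odds ratios:")
+print(np.exp(logit.params))                # exp(beta_j) for each column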
+```
+
+### Imbalanced classes
+```python
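+# 1) class_weight: reweight the loss inversely to class frequency
+LogisticRegression(class_weight="balanced")
+# 2) threshold tuning (the default 0.5 is rarely the best cut on imbalanced data);
+#    in practice, tune on a validation split rather than the test set used for reporting
+from sklearn.metrics import precision_recall_curve
+prec, rec, thr = precision_recall_curve(y_te, proba)
+f1 = 2*prec*rec / (prec+rec+1e-9)
+best_thr = thr[f1[:-1].argmax()]   # prec/rec have one more entry than thr
+pred = (proba >= best_thr).astype(int)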
+```
+
+### Calibration: check + correct
+```python
+from sklearn.calibration import CalibratedClassifierCV, calibration_curve
+import matplotlib.pyplot as plt
+
+cal = CalibratedClassifierCV(clf, method="isotonic", cv=5).fit(X_tr, y_tr)
+prob_cal = cal.predict_proba(X_te)[:, 1]
+prob_obs, prob_pred = calibration_curve(y_te, prob_cal, n_bins=10)
+plt.plot(prob_pred, prob_obs, marker="o"); plt.plot([0,1],[0,1],"--"); plt.show()
+```
+
+### From scratch: gradient descent
+```python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def fit_lr(X, y, lr=0.1, epochs=1000, l2=0.01):
+    X = np.c_[np.ones(len(X)), X]          # prepend the intercept column
+    w = np.zeros(X.shape[1])
+    for _ in range(epochs):
+        p = sigmoid(X @ w)
+        # mean NLL gradient plus the L2 term; the intercept w[0] is not penalized
+        grad = X.T @ (p - y) / len(y) + l2 * np.r_[0, w[1:]]
+        w -= lr * grad
+    return w
+```
+
+### PyTorch: large-scale logistic
+```python
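+import torch, torch.nn as nn
+
+device = "cuda" if torch.cuda.is_available() else "cpu"   # fall back to CPU when no GPU
+
+class LR(nn.Module):
+    def __init__(self, d, k):
+        super().__init__()
+        self.lin = nn.Linear(d, k)         # a single linear layer = softmax regression
+    def forward(self, x): return self.lin(x)   # raw logits; the loss applies log-softmax

+model = LR(X.shape[1], n_classes).to(device)
+opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
+loss_fn = nn.CrossEntropyLoss()
+for epoch in range(20):
+    for xb, yb in loader:                  # loader yields (features, integer labels)
+        opt.zero_grad()
+        loss = loss_fn(model(xb.to(device)), yb.to(device))
+        loss.backward(); opt.step()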
+```
+
+## Decision criteria
+| Situation | Approach |
+|---|---|
+| Tabular baseline | sklearn LR + StandardScaler |
+| Interpretation required (healthcare/finance) | statsmodels Logit + odds ratio |
+| Imbalanced | class_weight + threshold tuning |
+| Multi-class | multinomial + lbfgs |
+| Sparse / feature selection | L1 + saga |
+| Probability used as a score | CalibratedClassifierCV |
+| Large scale | PyTorch SGD/Adam |
+
+**Default**: StandardScaler + LR(C=1, L2) + threshold tuning.
+
+## 🔗 Graph
+- Parent: [[Supervised-Learning]], [[Classification]]
+- Variants: [[Multinomial-Logistic-Regression]], [[Probit]], [[Linear-Discriminant-Analysis]]
+- Applications: [[CTR-Prediction]], [[Credit-Scoring]], [[Risk-Models]]
+- Adjacent: [[Linear-Regression-Mastery]], [[L1-and-L2-Regularization]], [[Calibration]]
+
+## 🤖 LLM usage
+**When**: interpreting results (explaining odds ratios), feature-importance narratives, commentary on calibration plots.
+**When not**: domain cutoffs (e.g. financial risk thresholds); business and regulation decide those.
+
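+## ❌ Anti-patterns
+- **Regularizing without scaling**: the penalty depends on feature scale, so features on small scales (which need large coefficients) are shrunk hardest.
+- **Leaving the threshold at 0.5**: rarely the right choice for imbalanced data.
+- **Trusting probabilities as-is**: no calibration check was done.
+- **Using only OvR for multi-class**: information shared across classes is lost.
+- **Not raising max_iter when the solver fails to converge**: ignoring the warning yields inaccurate coefficients.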
+
+## 🧪 Validation / duplicates
+- Verified (ESL Ch.4, sklearn 1.5+, statsmodels 0.14, Kaggle baselines).
+- Trust level A.
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: added calibration, threshold tuning, multinomial |