---
id: wiki-2026-0508-l1-and-l2-regularization
title: L1 and L2 Regularization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [L1, L2, Lasso, Ridge, ElasticNet, weight decay, regularization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, regularization, l1, l2, lasso, ridge, weight-decay]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / PyTorch
---

# L1 and L2 Regularization

## One-line summary

> **"Penalize the magnitude of the weights."** L1 (Lasso) → sparsity. L2 (Ridge) → small weights. ElasticNet combines both. Modern practice: weight decay in deep learning, decoupled in AdamW. Dropout is also a regularizer.

## Core concepts

### L1 (Lasso)

- Penalty: λ Σ |wᵢ|.
- Effect: sparse solutions (some weights become exactly zero).
- Use: feature selection.

### L2 (Ridge)

- Penalty: λ Σ wᵢ².
- Effect: small but non-zero weights.
- Use: multicollinearity.

### ElasticNet

- Penalty: α·L1 + (1 − α)·L2, where α ∈ [0, 1] sets the mix (scikit-learn calls this mixing weight `l1_ratio` and uses `alpha` for the overall strength).

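
Why L1 gives exact zeros while L2 only shrinks can be seen in one dimension. The sketch below assumes a single weight whose unpenalized optimum is `b` and a quadratic data term `0.5*(w - b)**2`; the function names and the sample weights are hypothetical, chosen only for illustration.

```python
def soft_threshold(b, lam):
    """Minimizer of 0.5*(w - b)**2 + lam*abs(w): the 1-D L1 (lasso) solution."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # anything inside [-lam, lam] is snapped to exactly zero


def ridge_shrink(b, lam):
    """Minimizer of 0.5*(w - b)**2 + 0.5*lam*w**2: the 1-D L2 (ridge) solution."""
    return b / (1.0 + lam)


def elastic_net_shrink(b, lam1, lam2):
    """Minimizer of 0.5*(w - b)**2 + lam1*abs(w) + 0.5*lam2*w**2."""
    return soft_threshold(b, lam1) / (1.0 + lam2)


unpenalized = [3.0, -1.2, 0.4, -0.1]   # hypothetical least-squares weights
l1 = [soft_threshold(b, 0.5) for b in unpenalized]
l2 = [ridge_shrink(b, 0.5) for b in unpenalized]
print("L1:", l1)   # small entries become exactly 0.0
print("L2:", l2)   # every entry is scaled down, none reaches zero
```

L1 subtracts a fixed amount and clips at zero, so small weights vanish. L2 rescales by a constant factor, so a non-zero weight stays non-zero.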
### Modern DL

- **Weight decay**: equivalent to L2 under plain SGD.
- **AdamW**: decoupled weight decay (Loshchilov 2019).
- **Dropout**: implicit regularization.
- **Batch norm**: implicit regularization.
- **Early stopping**: implicit regularization.

### Applications

1. **Linear regression**: Ridge, Lasso.
2. **Logistic regression**: L2 penalty, optionally with `class_weight`.
3. **DL training**: weight decay.
4. **Feature selection**: Lasso.

## 💻 Patterns

### Ridge (sklearn)

```python
from sklearn.linear_model import Ridge

# alpha is the L2 penalty strength; X, y are the training data
model = Ridge(alpha=1.0).fit(X, y)
```

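
To make the multicollinearity use case concrete, here is a minimal NumPy sketch of the ridge closed form, w = (XᵀX + αI)⁻¹Xᵀy. It assumes no intercept (the objective scikit-learn documents for `Ridge` with `fit_intercept=False`); the helper name `ridge_closed_form` and the synthetic data are hypothetical.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve min_w ||y - X w||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)  # alpha * I makes A invertible
    return np.linalg.solve(A, X.T @ y)


rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, x])           # two identical columns: X.T @ X is singular
y = 3.0 * x + rng.normal(scale=0.1, size=50)

w = ridge_closed_form(X, y, alpha=1.0)
print(w)  # the weight is shared between the duplicated columns
```

Ordinary least squares has no unique solution here because XᵀX is rank-deficient. Adding αI restores a unique solution, which is why Ridge is the usual choice for correlated features.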
### Lasso (sklearn)

```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ == 0).sum(), 'zero coefficients')
```

### ElasticNet

```python
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```

### Logistic + L2

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
# C is the inverse regularization strength (C = 1/alpha): smaller C means a stronger penalty
```

### PyTorch weight decay

```python
import torch

optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# With plain SGD, weight_decay is equivalent to an L2 penalty
```

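
The equivalence between L2 and weight decay under plain SGD is a one-line identity: w − lr·(g + λw) = (1 − lr·λ)·w − lr·g. A minimal sketch for a single scalar weight, assuming the penalty is written as (λ/2)·w² so that its gradient is λ·w; the function names and numbers are hypothetical.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)


def sgd_step_decay(w, grad, lr, lam):
    """Weight decay form: shrink the weight by (1 - lr * lam), then take the step."""
    return (1.0 - lr * lam) * w - lr * grad


w, grad = 2.0, 0.5            # hypothetical weight and loss gradient
a = sgd_step_l2(w, grad, lr=0.01, lam=1e-4)
b = sgd_step_decay(w, grad, lr=0.01, lam=1e-4)
print(abs(a - b) < 1e-12)     # the two forms agree up to floating-point rounding
```

The identity holds only because SGD applies the gradient unmodified; it breaks once the gradient is rescaled per parameter, as adaptive optimizers do.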
### AdamW (decoupled, recommended)

```python
import torch

optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Decoupled decay; generally preferred over Adam(weight_decay=...),
# which folds the L2 term into the adaptively scaled gradient
```

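
What "decoupled" changes can be illustrated with a single hand-written Adam step. This is a simplified sketch for one scalar weight at the first step (moments start at zero), not PyTorch's implementation; `adam_step` and its parameters `l2` and `decoupled_wd` are hypothetical names.

```python
import math


def adam_step(w, grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam step from zero-initialized moments (t = 1) for a scalar weight."""
    g = grad + l2 * w                 # coupled L2: penalty gradient joins the loss gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)           # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w = w * (1 - lr * decoupled_wd)   # decoupled decay: shrink the weight directly
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


w0, g0 = 10.0, 1.0                    # hypothetical weight and loss gradient
plain = adam_step(w0, g0)
coupled = adam_step(w0, g0, l2=0.1)
decoupled = adam_step(w0, g0, decoupled_wd=0.1)
print(plain, coupled, decoupled)
```

In this first step the coupled L2 term is divided out by Adam's normalization, so `coupled` lands almost exactly where `plain` does. The decoupled version shrinks the weight by an additional `lr * decoupled_wd * w0`, independent of the gradient scale.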
### Manual L1 in PyTorch

```python
def l1_penalty(model, lam=1e-5):
    # Sum of absolute values over every parameter tensor, scaled by lam
    return lam * sum(p.abs().sum() for p in model.parameters())

# task_loss is the unregularized training loss for the current batch
loss = task_loss + l1_penalty(model)
```

### CV-tune α (sklearn)

```python
from sklearn.linear_model import RidgeCV

model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)  # selected regularization strength
```

### LassoCV

```python
import numpy as np
from sklearn.linear_model import LassoCV

model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)
```

### Path plot (regularization strength sweep)

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
coefs = []
for a in alphas:
    coefs.append(Lasso(alpha=a).fit(X, y).coef_)
plt.plot(alphas, coefs)  # one curve per feature
plt.xscale('log')
plt.xlabel('alpha'); plt.ylabel('coefficient')
```

### Group L1 (group lasso)

```python
import numpy as np

def group_lasso_penalty(weights, groups, lam):
    # groups: iterable of index lists; each group contributes lam * its L2 norm
    total = 0
    for group in groups:
        total += lam * np.sqrt(sum(weights[i]**2 for i in group))
    return total
```

### Different decay per layer (DL)

```python
import torch

# model.encoder and model.head are submodules of the model being trained
optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])
```

### Exclude bias / norm parameters (best practice)

```python
import torch

def get_param_groups(model, weight_decay):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if p.requires_grad:
            # Name-based heuristic: biases and normalization parameters get no decay
            if 'bias' in name or 'norm' in name:
                no_decay.append(p)
            else:
                decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)
```

### Effect on bias-variance

```python
from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err
# As alpha grows: train error rises; validation error falls up to a point,
# then rises again (over-regularization)
```

### Sparsity via pruning (modern DL)

```python
def magnitude_pruning(model, sparsity=0.5):
    """Zero out the bottom `sparsity` fraction of weights by magnitude in each layer."""
    for name, p in model.named_parameters():
        if 'weight' in name:
            k = int(p.numel() * sparsity)
            if k == 0:
                continue  # kthvalue requires k >= 1; nothing to prune here
            threshold = p.abs().flatten().kthvalue(k).values
            # Strict '<' keeps the k-th smallest value itself
            p.data[p.abs() < threshold] = 0
```

## Decision criteria

| Situation | Method |
|---|---|
| Linear | Ridge / Lasso / ElasticNet |
| Feature selection | Lasso |
| Multicollinearity | Ridge |
| DL | AdamW weight decay |
| Sparsity goal | Lasso / pruning |
| Best DL practice | AdamW + exclude bias/norm |

**Defaults**: DL = AdamW with weight decay of 0.01 to 0.1, excluding bias/norm parameters. Linear = ElasticNet with cross-validation. Sparsity = Lasso.

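
The "ElasticNet with cross-validation" default for linear models could look like the sketch below. It uses scikit-learn's `ElasticNetCV` on synthetic data from `make_regression`; the dataset sizes and the `l1_ratio` grid are assumptions for illustration, not recommended values.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

# Synthetic data, purely illustrative: 10 features, only 3 informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

# Cross-validate over both the L1/L2 mix and the penalty strength.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)
print(model.alpha_, model.l1_ratio_)          # selected strength and mix
print((model.coef_ == 0).sum(), "zero coefficients")
```

Searching over `l1_ratio` as well as the strength lets the data decide how Lasso-like or Ridge-like the penalty should be.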
## 🔗 Graph

- Parent: [[L1-and-L2-Regularization|Regularization]] · [[Optimization]]
- Variants: [[Lasso]] · [[Ridge]] · [[ElasticNet]] · [[Weight-Decay]]
- Adjacent: [[Generalization-in-AI]]

## 🤖 LLM usage

**When**: almost any ML training run.

**When not**: the model is already underfitting, so more regularization is not needed.

## ❌ Anti-patterns

- **Adam + weight_decay**: use AdamW instead.
- **Same decay for bias / norm**: hurts training.
- **No CV for α**: likely the wrong regularization strength.
- **L1 for DL** (without a sparsity goal): unstable.

## 🧪 Verification / duplicates

- Verified (Hastie-Tibshirani-Friedman, Loshchilov AdamW 2019).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: L1/L2 + sklearn / AdamW / param groups / pruning code |