---
id: wiki-2026-0508-l1-and-l2-regularization
title: L1 and L2 Regularization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [L1, L2, Lasso, Ridge, ElasticNet, weight decay, regularization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, regularization, l1, l2, lasso, ridge, weight-decay]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / PyTorch
---

# L1 and L2 Regularization

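## One-line summary
> **"Penalize the magnitude of the weights."** L1 (Lasso) drives some weights to exactly zero (sparsity). L2 (Ridge) shrinks weights toward small values. ElasticNet combines the two. In modern deep learning the usual form is weight decay, applied in decoupled form by AdamW. Dropout also acts as a regularizer.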

## Core concepts

### L1 (Lasso)
- Penalty: λ Σ |wᵢ|.
- Effect: sparse solutions (some weights become exactly zero).
- Use case: feature selection.
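
Why the L1 penalty yields exact zeros is easiest to see in one dimension: minimizing ½(w − z)² + λ|w| has a closed-form solution, the soft-thresholding operator, which maps every |z| ≤ λ to exactly 0. A minimal sketch (the name `soft_threshold` is illustrative, not a library function):

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) over w."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # inputs with |z| <= lam are set exactly to zero

# Unpenalized estimates shrunk with lam = 0.5
estimates = [2.0, 0.3, -0.1, -1.5]
shrunk = [soft_threshold(z, 0.5) for z in estimates]
print(shrunk)  # [1.5, 0.0, 0.0, -1.0]
```

Large estimates are shifted toward zero by λ, while small ones are removed entirely, which is the mechanism behind Lasso's feature selection.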

### L2 (Ridge)
- Penalty: λ Σ wᵢ².
- Effect: weights become small but stay non-zero.
- Use case: multicollinearity (correlated features).
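
The "small but non-zero" behaviour can be illustrated with a single-weight, no-intercept regression, where the ridge solution has the closed form w = Σxy / (Σx² + λ). This is a sketch under that simplifying assumption; `ridge_1d` is a hypothetical helper, not a library call:

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # data generated by y = 2x

# Increasing lam shrinks the weight toward zero without ever reaching it.
for lam in (0.0, 1.0, 14.0, 1e6):
    print(lam, ridge_1d(xs, ys, lam))
```

Because λ only enlarges the denominator, the weight shrinks smoothly for any finite λ, in contrast to the hard zeros produced by L1.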

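### ElasticNet
- Penalty: α · L1 + (1 − α) · L2, where α ∈ [0, 1] is the mixing ratio (`l1_ratio` in scikit-learn, whose own `alpha` is the overall strength).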
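
The combined penalty can be written as a small function. This sketch uses the convex-combination form from this note, with `lam` as an assumed overall strength and `alpha` as the L1 share; scikit-learn's `ElasticNet` expresses the same idea through `alpha` (overall strength) and `l1_ratio` (L1 share), and additionally halves the L2 term.

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) * L2) for a list of weights."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = [1.0, -2.0, 0.0]
print(elastic_net_penalty(w, lam=0.1, alpha=1.0))  # pure L1 (Lasso-style)
print(elastic_net_penalty(w, lam=0.1, alpha=0.0))  # pure L2 (Ridge-style)
print(elastic_net_penalty(w, lam=0.1, alpha=0.5))  # even mix
```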

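### Modern DL
- **Weight decay**: equivalent to L2 for plain SGD, but not for adaptive optimizers such as Adam.
- **AdamW**: decoupled weight decay (Loshchilov & Hutter 2019).
- **Dropout**: implicit regularization.
- **Batch norm**: implicit regularization.
- **Early stopping**: implicit regularization.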
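
The relationship between L2 and weight decay can be checked on a single scalar weight. For plain SGD the two updates coincide. For a sign-normalized update (a deliberately simplified stand-in for Adam's first step, an assumption of this sketch rather than the real optimizer) the coupled L2 term is absorbed by the normalization, while the decoupled form still shrinks the weight. All function names here are illustrative.

```python
def sgd_l2_step(w, grad, lr, wd):
    """SGD on loss + (wd / 2) * w**2: the penalty enters through the gradient."""
    return w - lr * (grad + wd * w)

def sgd_decay_step(w, grad, lr, wd):
    """SGD with explicit weight decay: shrink w, then take the gradient step."""
    return w * (1.0 - lr * wd) - lr * grad

def sign(x):
    return (x > 0) - (x < 0)

def coupled_sign_step(w, grad, lr, wd):
    """Normalized update with L2 folded into the gradient (Adam + L2 style)."""
    return w - lr * sign(grad + wd * w)

def decoupled_sign_step(w, grad, lr, wd):
    """Normalized update with decay applied separately (AdamW style)."""
    return w - lr * sign(grad) - lr * wd * w

w, grad, lr, wd = 2.0, 0.5, 0.1, 0.01
print(sgd_l2_step(w, grad, lr, wd), sgd_decay_step(w, grad, lr, wd))
print(coupled_sign_step(w, grad, lr, wd), decoupled_sign_step(w, grad, lr, wd))
```

In the coupled variant the decay term only matters if it flips the sign of the gradient, so the weight is not shrunk in proportion to its size; that is the intuition behind preferring the decoupled form with adaptive optimizers.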

### Applications
1. **Linear regression**: Ridge, Lasso.
2. **Logistic regression**: L2 penalty, optionally with `class_weight` for imbalanced classes.
3. **DL training**: weight decay.
4. **Feature selection**: Lasso.

## 💻 Patterns

### Ridge (sklearn)
```python
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0).fit(X, y)
```

### Lasso (sklearn)
```python
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ == 0).sum(), 'zero coefficients')
```

### ElasticNet
```python
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```

### Logistic + L2
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
# C is the inverse regularization strength (C = 1/alpha): smaller C means a stronger penalty
```

### PyTorch weight decay
```python
import torch
optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# For plain SGD, weight_decay is equivalent to an L2 penalty
```

### AdamW (decoupled, recommended)
```python
optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Generally preferred over Adam + weight_decay, which couples the decay to the adaptive gradient scaling
```

### Manual L1 in PyTorch
```python
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())

loss = task_loss + l1_penalty(model)
```

### CV-tune α (sklearn)
```python
from sklearn.linear_model import RidgeCV
model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)
```

### LassoCV
```python
import numpy as np
from sklearn.linear_model import LassoCV
model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)
```

### Path plot (regularization strength sweep)
```python
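import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
coefs = []
for a in alphas:
    coefs.append(Lasso(alpha=a).fit(X, y).coef_)  # one coefficient vector per alpha
plt.plot(alphas, coefs)  # one line per feature
plt.xscale('log')
plt.xlabel('alpha'); plt.ylabel('coefficient')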
```

### Group L1 (group lasso)
```python
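import numpy as np

def group_lasso_penalty(weights, groups, lam):
    """Sum of per-group L2 norms; `groups` is a list of index lists into `weights`."""
    total = 0
    for group in groups:
        # The unsquared L2 norm lets a whole group drop to zero together
        total += lam * np.sqrt(sum(weights[i]**2 for i in group))
    return total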
```

### Different decay per layer (DL)
```python
optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])
```

### Exclude bias / norm parameters (best practice)
```python
def get_param_groups(model, weight_decay):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if p.requires_grad:
            if 'bias' in name or 'norm' in name:
                no_decay.append(p)  # biases and normalization parameters: no decay
            else:
                decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0},
    ]

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)
```

### Effect on bias-variance
```python
from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err
# As alpha grows, train error rises; val error typically falls up to a point, then rises again (over-regularization)
```

### Sparsity-induced (modern DL)
```python
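import torch

@torch.no_grad()
def magnitude_pruning(model, sparsity=0.5):
    """Zero out the smallest-magnitude fraction `sparsity` of each weight tensor."""
    for name, p in model.named_parameters():
        if 'weight' in name:
            k = int(p.numel() * sparsity)
            if k < 1:
                continue  # kthvalue needs k >= 1; nothing to prune here
            threshold = p.abs().flatten().kthvalue(k).values
            p.masked_fill_(p.abs() <= threshold, 0.0)  # <= so the k smallest are pruned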
```

## Decision criteria
| Situation | Method |
|---|---|
| Linear | Ridge / Lasso / ElasticNet |
| Feature selection | Lasso |
| Multicollinearity | Ridge |
| DL | AdamW weight decay |
| Sparsity goal | Lasso / pruning |
| Best DL practice | AdamW + exclude bias/norm |

**Defaults**: for DL, AdamW with weight decay of 0.01-0.1 and bias/norm parameters excluded. For linear models, ElasticNet tuned by cross-validation. For sparsity, Lasso.

## 🔗 Graph
- Parents: [[L1-and-L2-Regularization|Regularization]] · [[Optimization]]
- Variants: [[Lasso]] · [[Ridge]] · [[ElasticNet]] · [[Weight-Decay]]
- Adjacent: [[Generalization-in-AI]]

## 🤖 LLM usage
**When**: essentially any ML training run.
**When not**: the model is already underfitting, so more regularization is not needed.

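## ❌ Anti-patterns
- **Adam + `weight_decay`**: the decay is coupled to the adaptive gradient scaling; use AdamW instead.
- **Same decay for bias / norm**: can hurt training; exclude these parameters from weight decay.
- **No cross-validation of α**: the regularization strength is likely to be wrong.
- **L1 for DL** (without a sparsity goal): can make training unstable.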

## 🧪 Verification / duplicates
- Verified (Hastie-Tibshirani-Friedman, Loshchilov AdamW 2019).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: L1/L2 + sklearn / AdamW / param groups / pruning code |