---
id: wiki-2026-0508-l1-and-l2-regularization
title: L1 and L2 Regularization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [L1, L2, Lasso, Ridge, ElasticNet, weight decay, regularization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: [machine-learning, regularization, l1, l2, lasso, ridge, weight-decay]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: scikit-learn / PyTorch
---

# L1 and L2 Regularization

## In one line

> **"Penalize the magnitude of the weights."** L1 (Lasso) drives some weights to exactly zero (sparsity). L2 (Ridge) keeps all weights small. ElasticNet combines the two. In modern deep learning the usual form is weight decay, which AdamW decouples from the gradient update. Dropout also acts as a regularizer.

## Core concepts

### L1 (Lasso)

- Penalty: λ Σ |wᵢ|.
- Effect: sparse solutions (some weights become exactly zero).
- Use: feature selection.

### L2 (Ridge)

- Penalty: λ Σ wᵢ².
- Effect: weights are small but non-zero.
- Use: multicollinearity.

### ElasticNet

- Penalty: α · L1 + (1 - α) · L2, where α is the mixing ratio (`l1_ratio` in scikit-learn).

### Modern DL

- **Weight decay**: equivalent to an L2 penalty under plain SGD, but not under adaptive optimizers such as Adam.
- **AdamW**: decoupled weight decay (Loshchilov & Hutter, 2019).
- **Dropout**: implicit regularization.
- **Batch norm**: implicit regularization.
- **Early stopping**: implicit regularization.

### Applications

1. **Linear regression**: Ridge, Lasso.
2. **Logistic regression**: L2 penalty, optionally with `class_weight` for imbalanced classes.
3. **DL training**: weight decay.
4. **Feature selection**: Lasso.
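
The difference between "sparse" (L1) and "small but non-zero" (L2) can be seen on a one-dimensional toy problem. The sketch below assumes a single weight `w` fitted to an unregularized estimate `z` with squared loss `0.5*(w - z)**2`; under that assumption both penalties have closed-form minimizers. The names `l1_shrink`, `l2_shrink`, `z_values`, and `lam` are illustrative, not part of any library.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any z with |z| <= lam is set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: proportional shrinkage."""
    return z / (1.0 + 2.0 * lam)


# Hypothetical unregularized coefficients and penalty strength.
z_values = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

w_l1 = [l1_shrink(z, lam) for z in z_values]
w_l2 = [l2_shrink(z, lam) for z in z_values]

print("L1:", w_l1)  # small coefficients are removed entirely
print("L2:", w_l2)  # every coefficient is scaled down, none vanish
```

L1 subtracts a constant amount from every coefficient and clips at zero, so weak coefficients disappear. L2 rescales every coefficient by the same factor, so none of them reach zero.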

## 💻 Patterns

### Ridge (sklearn)

```python
from sklearn.linear_model import Ridge

# X: feature matrix, y: target vector
model = Ridge(alpha=1.0).fit(X, y)
```

### Lasso (sklearn)

```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1).fit(X, y)
# Count the coefficients that L1 drove exactly to zero
print((model.coef_ == 0).sum(), 'zero coefficients')
```

### ElasticNet

```python
from sklearn.linear_model import ElasticNet

# l1_ratio=0.5 weights the L1 and L2 terms equally
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
```

### Logistic + L2

```python
from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength (C = 1/alpha): smaller C, stronger penalty
model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
```

### PyTorch weight decay

```python
import torch

# With plain SGD, weight_decay is equivalent to an L2 penalty
optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```

### AdamW (decoupled, recommended)

```python
import torch

# Decoupled decay; generally preferred over Adam(..., weight_decay=...)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```

### Manual L1 in PyTorch

```python
def l1_penalty(model, lam=1e-5):
    # Sum of absolute values over all parameters, scaled by lam
    return lam * sum(p.abs().sum() for p in model.parameters())

# task_loss: the unregularized loss for the current batch
loss = task_loss + l1_penalty(model)
```

### CV-tune α (sklearn)

```python
from sklearn.linear_model import RidgeCV

model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)  # the alpha selected by cross-validation
```

### LassoCV

```python
import numpy as np
from sklearn.linear_model import LassoCV

model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)
```

### Path plot (regularization strength sweep)

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
coefs = []
for a in alphas:
    coefs.append(Lasso(alpha=a).fit(X, y).coef_)

# One curve per feature: rows are alphas, columns are coefficients
plt.plot(alphas, np.array(coefs))
plt.xscale('log')
plt.xlabel('alpha'); plt.ylabel('coefficient')
plt.show()
```

### Group L1 (group lasso)

```python
import numpy as np

def group_lasso_penalty(weights, groups, lam):
    # groups: iterable of index lists; each group is penalized by its L2 norm
    total = 0
    for group in groups:
        total += lam * np.sqrt(sum(weights[i]**2 for i in group))
    return total
```

### Different decay per layer (DL)

```python
import torch

# Assumes the model exposes `encoder` and `head` submodules
optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])
```

### Bias / norm exclude (best practice)

```python
import torch

def get_param_groups(model, weight_decay):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if p.requires_grad:
            # Relies on parameter names containing 'bias' or 'norm'
            if 'bias' in name or 'norm' in name:
                no_decay.append(p)
            else:
                decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)
```

### Effect on bias-variance

```python
from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err

# As alpha grows, training error rises. Validation error typically falls
# at first, then rises again once the model is over-regularized.
```

### Magnitude pruning (sparsity in modern DL)

```python
import torch

@torch.no_grad()
def magnitude_pruning(model, sparsity=0.5):
    """Zero out the smallest-magnitude fraction `sparsity` of each weight tensor."""
    for name, p in model.named_parameters():
        if 'weight' in name and p.dim() > 1:  # skip biases and 1-D norm weights
            k = int(p.numel() * sparsity)
            if k == 0:  # kthvalue requires k >= 1
                continue
            threshold = p.abs().flatten().kthvalue(k).values
            p.masked_fill_(p.abs() <= threshold, 0.0)
```

## Decision criteria

| Situation | Method |
|---|---|
| Linear | Ridge / Lasso / ElasticNet |
| Feature selection | Lasso |
| Multicollinearity | Ridge |
| DL | AdamW weight decay |
| Sparsity goal | Lasso / pruning |
| Best DL practice | AdamW + exclude bias/norm |

**Defaults**: for DL, AdamW with weight decay of 0.01-0.1 and bias/norm parameters excluded. For linear models, ElasticNet tuned by CV. For sparsity, Lasso.

## 🔗 Graph

- Parent: [[L1-and-L2-Regularization|Regularization]] · [[Optimization]]
- Variants: [[Lasso]] · [[Ridge]] · [[ElasticNet]] · [[Weight-Decay]]
- Adjacent: [[Generalization-in-AI]]

## 🤖 LLM usage

**When**: any ML training run.
**When not**: the model is already underfitting (regularization is not needed).

## ❌ Anti-patterns

- **Adam + weight_decay**: use AdamW instead.
- **Same decay for bias / norm**: hurts training.
- **α not tuned by CV**: the regularization strength ends up wrong.
- **L1 for DL** (without a sparsity goal): unstable training.
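
The first anti-pattern can be illustrated with a single optimizer step on one scalar weight. This is a minimal sketch, not PyTorch's implementation: `adam_step` is a hypothetical helper that performs only the first Adam step (t = 1, moments starting at zero), and the values of `w0`, `g`, `lr`, and `wd` are arbitrary.

```python
import math


def adam_step(w, grad, lr=1e-3, wd=0.0, decoupled=False,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step (t=1, zero-initialized moments) on a scalar weight."""
    if decoupled:
        w = w * (1.0 - lr * wd)   # AdamW-style: shrink the weight directly
    else:
        grad = grad + wd * w      # Adam + L2: fold the penalty into the gradient
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)       # bias correction at t=1
    v_hat = v / (1 - beta2)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


w0, g = 10.0, 0.1  # arbitrary weight and task gradient
plain = adam_step(w0, g)
coupled = adam_step(w0, g, wd=0.1)
decoupled = adam_step(w0, g, wd=0.1, decoupled=True)

print("no decay:       ", plain)
print("Adam + L2:      ", coupled)
print("decoupled decay:", decoupled)
```

On this first step the coupled variant moves the weight by almost exactly the same amount as no decay at all, because the penalty gradient passes through Adam's normalization by the second-moment estimate. The decoupled variant applies the shrinkage outside that normalization, so the decay actually takes effect. Later steps differ in detail, but the penalty term in the coupled variant is still rescaled by the adaptive denominator.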

## 🧪 Verification / duplicates

- Verified (Hastie-Tibshirani-Friedman, Loshchilov AdamW 2019).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: L1/L2 + sklearn / AdamW / param groups / pruning code |