| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-l1-and-l2-regularization | L1 and L2 Regularization | 10_Wiki/Topics | verified | self | | none | A | 0.97 | applied | | | 2026-05-10 | pending | |
L1 and L2 Regularization
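One-line summary
Penalize the magnitude of the weights. L1 (Lasso) drives weights to exact zeros (sparsity); L2 (Ridge) keeps them small but non-zero; ElasticNet combines the two. In modern deep learning the common forms are weight decay and AdamW's decoupled weight decay; dropout also acts as a regularizer.
Key points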
L1 (Lasso)
- Penalty: λ Σ |wᵢ|.
- Effect: sparse solutions (some weights become exactly zero).
- Application: feature selection.
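Why an L1 penalty produces exact zeros can be sketched with the soft-thresholding operator, the closed-form minimizer of ½(w − z)² + λ|w| for a single weight. The function name `soft_threshold` and the sample values are illustrative assumptions, not part of the original note.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w) over w (illustrative helper)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # anything with |z| <= lam is set exactly to zero


# Hypothetical unregularized weights
weights = [2.0, 0.3, -0.05, -1.5]
shrunk = [soft_threshold(w, lam=0.5) for w in weights]
```

Weights whose magnitude is at most λ land exactly on zero, which is the mechanism behind Lasso's feature selection; the larger ones are only shifted toward zero by λ.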
L2 (Ridge)
- Penalty: λ Σ wᵢ².
- Effect: weights shrink toward zero but generally stay non-zero.
- Application: stabilizing estimates under multicollinearity.
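A minimal sketch of why L2 helps under multicollinearity, assuming two perfectly collinear features and the ridge normal equations (XᵀX + λI) w = Xᵀy. The helper `ridge_two_features` and the toy data are hypothetical. With λ = 0 the system is singular, while any λ > 0 yields a unique solution that splits the weight evenly between the duplicated features.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for two features via the 2x2 inverse."""
    a = sum(u * u for u in x1) + lam          # (X^T X)[0][0] + lam
    b = sum(u * v for u, v in zip(x1, x2))    # off-diagonal entry
    d = sum(v * v for v in x2) + lam          # (X^T X)[1][1] + lam
    r1 = sum(u * t for u, t in zip(x1, y))    # (X^T y)[0]
    r2 = sum(v * t for v, t in zip(x2, y))    # (X^T y)[1]
    det = a * d - b * b
    if det == 0:
        raise ValueError("singular system: no unique least-squares solution")
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


# Toy data (assumed): the two features are identical and y = 2 * x
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w1, w2 = ridge_two_features(x, x, y, lam=1.0)
```

With these numbers both weights come out as 28/29, so their sum stays close to the true slope of 2 while neither coefficient blows up.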
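ElasticNet
- Penalty: a weighted mix, α · L1 + (1 − α) · L2, with mixing weight α ∈ [0, 1] (sklearn calls this mix l1_ratio and uses alpha for the overall strength).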
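The three penalties can be compared on one weight vector. This sketch follows the α · L1 + (1 − α) · L2 form given in this section; the function names and example weights are assumptions for illustration, and libraries may scale the terms differently (sklearn, for example, puts a factor of ½ on the L2 term).

```python
def l1_term(w):
    # Sum of absolute values
    return sum(abs(v) for v in w)


def l2_term(w):
    # Sum of squares
    return sum(v * v for v in w)


def elastic_net_penalty(w, alpha):
    """alpha in [0, 1]: 1.0 is pure L1, 0.0 is pure L2."""
    return alpha * l1_term(w) + (1 - alpha) * l2_term(w)


w = [3.0, -4.0, 0.0]   # example weights (assumed)
l1 = l1_term(w)
l2 = l2_term(w)
mixed = elastic_net_penalty(w, alpha=0.5)
```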
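Modern DL
- Weight decay: equivalent to L2 for plain SGD, but not for adaptive optimizers such as Adam.
- AdamW: decoupled weight decay (Loshchilov & Hutter, 2019).
- Dropout: implicit regularization.
- Batch norm: implicit regularization.
- Early stopping: implicit regularization.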
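The difference between an L2 term in the gradient and decoupled weight decay can be illustrated on a single scalar weight. The sketch below assumes the standard SGD and Adam update rules with bias correction and covers only the first step; the function names and numbers are hypothetical. For SGD the two forms coincide, while for Adam the L2 term is rescaled by the adaptive denominator and the decoupled form is not.

```python
import math


def sgd_l2_step(w, g, lr, wd):
    # L2 penalty folded into the gradient
    return w - lr * (g + wd * w)


def sgd_decay_step(w, g, lr, wd):
    # Weight decay applied directly to the weight
    return (1 - lr * wd) * w - lr * g


def adam_first_step(w, g, lr, wd, decoupled, b1=0.9, b2=0.999, eps=1e-8):
    """First Adam step from zero moment estimates, with bias correction."""
    if not decoupled:
        g = g + wd * w                     # Adam + L2: decay passes through the adaptive scaling
    m_hat = ((1 - b1) * g) / (1 - b1)      # bias-corrected first moment
    v_hat = ((1 - b2) * g * g) / (1 - b2)  # bias-corrected second moment
    if decoupled:
        w = w * (1 - lr * wd)              # AdamW: decay applied outside the adaptive update
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


w0, grad, lr, wd = 1.0, 0.5, 0.1, 0.1
sgd_a = sgd_l2_step(w0, grad, lr, wd)
sgd_b = sgd_decay_step(w0, grad, lr, wd)
adam_l2 = adam_first_step(w0, grad, lr, wd, decoupled=False)
adamw = adam_first_step(w0, grad, lr, wd, decoupled=True)
```

Here both SGD forms give 0.94, Adam with L2 gives about 0.90, and the decoupled form gives about 0.89, so the two Adam variants are not interchangeable even with identical hyperparameters.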
Applications
- Linear regression: Ridge, Lasso.
- Logistic regression: L2 penalty, with class_weight for imbalanced classes.
- DL training: weight decay.
- Feature selection: Lasso.
💻 Patterns
Ridge (sklearn)
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0).fit(X, y)
Lasso (sklearn)
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ == 0).sum(), 'zero coefficients')
ElasticNet
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
Logistic + L2
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
# C is the inverse of the regularization strength (roughly 1/alpha): smaller C means a stronger penalty
PyTorch weight decay
import torch
optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# For plain SGD, weight_decay is equivalent to an L2 penalty of (weight_decay / 2) * ||w||^2
AdamW (decoupled, recommended)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# Usually preferable to Adam with weight_decay, which folds the decay into the adaptive gradient update
Manual L1 in PyTorch
def l1_penalty(model, lam=1e-5):
    # Sum of absolute values over all parameters, scaled by lam
    return lam * sum(p.abs().sum() for p in model.parameters())

loss = task_loss + l1_penalty(model)
CV-tune α (sklearn)
from sklearn.linear_model import RidgeCV
model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)
LassoCV
import numpy as np
from sklearn.linear_model import LassoCV
model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)
Path plot (regularization strength sweep)
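import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
# One row of coefficients per alpha; each column traces one feature
coefs = np.array([Lasso(alpha=a).fit(X, y).coef_ for a in alphas])
plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('alpha'); plt.ylabel('coefficient')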
Group L1 (group lasso)
import numpy as np

def group_lasso_penalty(weights, groups, lam):
    # Sum of the L2 norms of each group of weights (groups are lists of indices)
    total = 0.0
    for group in groups:
        total += lam * np.sqrt(sum(weights[i]**2 for i in group))
    return total
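What the group penalty does to weights can be sketched with block soft-thresholding, the proximal step for a sum of group L2 norms: a whole group is either shrunk together or zeroed together. The function `group_soft_threshold`, the threshold `t`, and the toy weights are assumptions for illustration. Group-size weighting, which many formulations add, is omitted here to match the penalty function in this section.

```python
import math


def group_soft_threshold(weights, groups, t):
    """Shrink each group's L2 norm by t; groups with norm <= t become all zeros."""
    out = list(weights)
    for group in groups:
        norm = math.sqrt(sum(weights[i] ** 2 for i in group))
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        for i in group:
            out[i] = scale * weights[i]
    return out


weights = [3.0, 4.0, 0.1, 0.1]       # toy weights (assumed)
groups = [[0, 1], [2, 3]]            # two groups of indices
pruned = group_soft_threshold(weights, groups, t=1.0)
```

The first group (norm 5) is scaled by 0.8, while the second group (norm below the threshold) is removed as a unit, which is the structured sparsity that distinguishes group lasso from plain L1.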
Different decay per layer (DL)
optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])
Bias / norm exclude (best practice)
def get_param_groups(model, weight_decay):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Name matching depends on how the model names its layers;
        # the p.ndim check also catches 1-D biases and norm scales.
        if p.ndim <= 1 or 'bias' in name or 'norm' in name:
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)
Effect on bias-variance
from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        # Mean squared error on each split
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err
# As alpha grows, training error rises; validation error typically falls to a minimum, then rises again (over-regularization)
Sparsity-induced (modern DL)
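import torch

def magnitude_pruning(model, sparsity=0.5):
    """In each weight tensor, zero out the bottom `sparsity` fraction of entries by magnitude."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            k = int(p.numel() * sparsity)
            if 'weight' in name and k >= 1:  # kthvalue requires k >= 1
                threshold = p.abs().flatten().kthvalue(k).values
                # <= so that the k-th smallest entry is pruned too
                p[p.abs() <= threshold] = 0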
Decision criteria
| Situation | Method |
|---|---|
| Linear | Ridge / Lasso / ElasticNet |
| Feature selection | Lasso |
| Multicollinearity | Ridge |
| DL | AdamW weight decay |
| Sparsity goal | Lasso / pruning |
| Best DL practice | AdamW + exclude bias/norm |
Defaults: for DL, AdamW with weight decay of 0.01-0.1, excluding bias/norm parameters. For linear models, ElasticNet with cross-validation. For sparsity, Lasso.
🔗 Graph
- Parent: L1-and-L2-Regularization · Optimization
- Variants: Lasso · Ridge · ElasticNet · Weight-Decay
- Adjacent: Generalization-in-AI
🤖 LLM usage
When: nearly all ML training. When not: the model is already underfitting, where extra regularization does not help.
❌ Anti-patterns
- Adam + weight_decay: use AdamW instead.
- Same decay for bias / norm parameters: can hurt training.
- No cross-validation for α: the regularization strength is likely mis-set.
- L1 for DL (without a sparsity goal): training can become unstable.
🧪 Verification / duplicates
- Verified (Hastie-Tibshirani-Friedman, Loshchilov AdamW 2019).
- Trust level A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: L1/L2 + sklearn / AdamW / param groups / pruning code |