id: wiki-2026-0508-l1-and-l2-regularization
title: L1 and L2 Regularization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: L1, L2, Lasso, Ridge, ElasticNet, weight decay, regularization
duplicate_of: none
source_trust_level: A
confidence_score: 0.97
verification_status: applied
tags: machine-learning, regularization, l1, l2, lasso, ridge, weight-decay
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: Python (language); scikit-learn / PyTorch (framework)

L1 and L2 Regularization

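One-line summary

"Penalize the magnitude of the weights." L1 (Lasso) drives weights toward sparsity; L2 (Ridge) keeps them small. ElasticNet combines the two. In modern deep learning the same idea appears as weight decay, which AdamW decouples from the gradient update. Dropout is also a regularizer.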

Key points

L1 (Lasso)

  • Penalty: λ Σ |wᵢ|.
  • Effect: sparse solutions (some coefficients are exactly zero).
  • Application: feature selection.
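
One way to see why the L1 penalty produces exact zeros is soft-thresholding, the proximal operator of the absolute-value penalty. The sketch below is illustrative only: the function name `soft_threshold` and the sample weights are assumptions, not part of the original note.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink every entry toward zero by t,
    and set entries whose magnitude is below t to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Hypothetical weights: two large entries, three small ones.
w = np.array([-2.0, -0.3, 0.0, 0.1, 1.5])
shrunk = soft_threshold(w, t=0.5)
print(shrunk)
print((shrunk == 0).sum(), "zero coefficients")
```

Entries with magnitude below the threshold are removed entirely, while larger entries are only shifted toward zero. This is the mechanism behind Lasso's use for feature selection.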

L2 (Ridge)

  • Penalty: λ Σ wᵢ².
  • Effect: small but non-zero weights.
  • Application: multicollinearity.
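
The multicollinearity use case can be sketched with the closed-form ridge solution w = (XᵀX + λI)⁻¹Xᵀy. This is a minimal NumPy illustration that assumes no intercept and uses synthetic data; `ridge_closed_form` is a hypothetical helper, not a library function.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept term)."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=200)

w_small = ridge_closed_form(X, y, lam=1e-8)  # almost unregularized
w_ridge = ridge_closed_form(X, y, lam=1.0)   # ridge-regularized

print("lam=1e-8:", w_small)
print("lam=1.0: ", w_ridge)
```

With almost no penalty, the two nearly identical columns typically receive unstable coefficients. Adding λI makes the system well conditioned, and the two coefficients settle close to each other.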

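ElasticNet

  • Penalty: α · L1 + (1 − α) · L2, where α ∈ [0, 1] sets the mix. In scikit-learn this mixing weight is called l1_ratio, and alpha is the overall strength.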
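
The combined penalty can be written out directly. The sketch below assumes an overall strength `lam` multiplying the mix; the function name and that parameterization are illustrative assumptions, and scikit-learn's own objective scales the terms slightly differently.

```python
import numpy as np

def elastic_net_penalty(w, lam, alpha):
    """lam * (alpha * sum|w_i| + (1 - alpha) * sum w_i^2).

    alpha = 1 gives a pure L1 penalty, alpha = 0 a pure L2 penalty.
    """
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

w = np.array([1.0, -2.0, 0.0])
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(w, lam=0.1, alpha=alpha))
```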

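Modern deep learning

  • Weight decay: equivalent to an L2 penalty under plain SGD, but not under adaptive optimizers such as Adam.
  • AdamW: decoupled weight decay (Loshchilov & Hutter, 2019).
  • Dropout: implicit regularization.
  • Batch norm: implicit regularization.
  • Early stopping: implicit regularization.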
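
The difference between L2 and decoupled weight decay can be checked numerically on a single update. This is a simplified NumPy sketch, not PyTorch's implementation: `adam_first_step` models only the first Adam step (where the bias-corrected moments reduce to the gradient and its square), and the gradient values are made up.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * ||w||^2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr, wd):
    """SGD with decoupled weight decay: shrink the weights, then take the step."""
    return (1.0 - lr * wd) * w - lr * grad

def adam_first_step(w, grad, lr, eps=1e-8):
    """First Adam step: bias-corrected moments reduce to grad and grad**2."""
    return w - lr * grad / (np.sqrt(grad ** 2) + eps)

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.1])   # hypothetical task-loss gradient
lr, lam = 0.1, 0.01

sgd_l2 = sgd_step_l2(w, g, lr, lam)
sgd_wd = sgd_step_decay(w, g, lr, lam)

# Adam + L2: the penalty gradient passes through the adaptive scaling.
adam_l2 = adam_first_step(w, g + lam * w, lr)
# AdamW-style: decay is applied to the weights directly, outside the scaling.
adamw = adam_first_step((1.0 - lr * lam) * w, g, lr)

print("SGD updates match: ", np.allclose(sgd_l2, sgd_wd))
print("Adam updates match:", np.allclose(adam_l2, adamw))
```

Under plain SGD the two formulations give the same update. Under Adam the L2 term is rescaled by the adaptive denominator, so the updates differ, which is the motivation for the decoupled form.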

Applications

  1. Linear regression: Ridge, Lasso.
  2. Logistic regression: class_weight + L2.
  3. Deep learning training: weight decay.
  4. Feature selection: Lasso.
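
For the logistic regression item, a minimal sketch might combine `class_weight="balanced"` with scikit-learn's default L2 penalty on an imbalanced synthetic dataset. The dataset and the two `C` values are assumptions chosen only to show the direction of the effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced binary problem (roughly 90% / 10%).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Smaller C means stronger L2 regularization (C is the inverse strength).
strong = LogisticRegression(C=0.01, class_weight="balanced", max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, class_weight="balanced", max_iter=1000).fit(X, y)

print("coef norm, C=0.01:", np.linalg.norm(strong.coef_))
print("coef norm, C=100: ", np.linalg.norm(weak.coef_))
```

The class weights rebalance the loss across classes, while `C` controls how strongly the coefficients are shrunk. The two settings are independent and can be tuned separately.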

💻 Patterns

Ridge (sklearn)

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0).fit(X, y)

Lasso (sklearn)

from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ == 0).sum(), 'zero coefficients')

ElasticNet

from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

Logistic + L2

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
# C is the inverse regularization strength (roughly 1/alpha): smaller C = stronger penalty

PyTorch weight decay

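import torch
# Plain SGD: weight_decay adds weight_decay * w to the gradient, i.e. an L2 penalty
optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# AdamW: decoupled weight decay; generally preferred over Adam + weight_decay
optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)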

Manual L1 in PyTorch

def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())

loss = task_loss + l1_penalty(model)

CV-tune α (sklearn)

from sklearn.linear_model import RidgeCV
model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)

LassoCV

import numpy as np
from sklearn.linear_model import LassoCV
# 50 candidate alphas on a log grid, scored with 5-fold cross-validation
model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)

Path plot (regularization strength sweep)

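import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
# One row of coefficients per alpha value
coefs = np.array([Lasso(alpha=a).fit(X, y).coef_ for a in alphas])
plt.plot(alphas, coefs)  # one line per feature
plt.xscale('log')
plt.xlabel('alpha'); plt.ylabel('coefficient')
plt.show()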

Group L1 (group lasso)

import numpy as np

def group_lasso_penalty(weights, groups, lam):
    """Sum of per-group L2 norms; `groups` is a list of index lists."""
    total = 0.0
    for group in groups:
        total += lam * np.sqrt(sum(weights[i]**2 for i in group))
    return total
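
The effect of the group penalty is easiest to see through its proximal operator, block soft-thresholding, which either shrinks a whole group or removes it. The sketch below is illustrative: `group_soft_threshold` and the example groups are assumptions, not part of the original note.

```python
import numpy as np

def group_soft_threshold(weights, groups, t):
    """Shrink each group's L2 norm by t; groups with norm <= t become all zeros."""
    out = np.array(weights, dtype=float)
    for group in groups:
        idx = list(group)
        norm = np.linalg.norm(out[idx])
        # Scale factor is 0 when the whole group falls below the threshold.
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out

w = np.array([0.1, -0.2, 3.0, 4.0])
groups = [[0, 1], [2, 3]]          # hypothetical feature groups
shrunk = group_soft_threshold(w, groups, t=1.0)
print(shrunk)
```

The first group has a small norm and is zeroed as a unit, while the second group is scaled down but keeps its direction. That is the group-level analogue of Lasso's per-coefficient sparsity.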

Different decay per layer (DL)

optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])

Exclude bias / norm parameters from decay (best practice)

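def get_param_groups(model, weight_decay):
    """Split parameters so that biases and norm layers get no weight decay."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Name-based match: assumes norm layers have 'norm' in their attribute name
        if 'bias' in name or 'norm' in name:
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]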

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)
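
Name-based matching depends on how the layers are named. In an `nn.Sequential`, for example, a LayerNorm weight is called `1.weight` and contains no `norm` substring. A common alternative, sketched below as an assumption rather than as the note's method, is to split on tensor dimensionality: biases and norm scales are 1-D, weight matrices are 2-D or higher.

```python
import torch
from torch import nn

# Small hypothetical model; parameter names are "0.weight", "1.weight", ...
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.LayerNorm(16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

decay, no_decay = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    # 1-D tensors (biases, norm scales) are excluded from weight decay.
    (no_decay if p.ndim < 2 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print(len(decay), "decayed tensors,", len(no_decay), "excluded tensors")
```

The dimensionality rule is a heuristic too. Embedding tables, for instance, are 2-D and would be decayed, so it is worth printing the resulting groups for a new architecture.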

Effect on bias-variance

from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    """Return training and validation MSE for each alpha."""
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err
# As alpha grows, training error rises; validation error typically falls at first,
# then rises again once the model is over-regularized.
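
A self-contained run of the same sweep on synthetic data might look like the following. The data-generating setup (30 features, 5 of them informative, noisy targets) and the alpha grid are assumptions chosen only to make the trade-off visible.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
w_true = np.zeros(30)
w_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 informative features
y = X @ w_true + rng.normal(scale=2.0, size=120)

X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:], y[60:]

alphas = np.logspace(-3, 3, 13)
train_err, val_err = [], []
for a in alphas:
    m = Ridge(alpha=a).fit(X_train, y_train)
    train_err.append(np.mean((m.predict(X_train) - y_train) ** 2))
    val_err.append(np.mean((m.predict(X_val) - y_val) ** 2))

best_alpha = alphas[int(np.argmin(val_err))]
print(f"best alpha on validation: {best_alpha:.3g}")
```

Training error can only go up as alpha increases, so it is not useful for choosing alpha. The validation curve is the one to minimize.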

Sparsity-induced (modern DL)

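def magnitude_pruning(model, sparsity=0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of each weight tensor."""
    for name, p in model.named_parameters():
        if 'weight' in name:
            k = int(p.numel() * sparsity)
            if k < 1:  # kthvalue requires k >= 1
                continue
            threshold = p.abs().flatten().kthvalue(k).values
            # <= so that the k smallest entries (and any ties) are zeroed
            p.data[p.abs() <= threshold] = 0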

Decision criteria

| Situation | Method |
|---|---|
| Linear | Ridge / Lasso / ElasticNet |
| Feature selection | Lasso |
| Multicollinearity | Ridge |
| DL | AdamW weight decay |
| Sparsity goal | Lasso / pruning |
| Best DL practice | AdamW + exclude bias/norm |

Defaults: for deep learning, AdamW with weight decay of 0.01-0.1, excluding bias/norm parameters. For linear models, ElasticNet tuned by cross-validation. For sparsity, Lasso.
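
For the linear default, a minimal sketch with scikit-learn's `ElasticNetCV` could look like this. The synthetic data and the `l1_ratio` candidates are assumptions; the alpha grid is left to the estimator's automatic choice.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]              # 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

# Cross-validate over both the L1/L2 mix and the overall strength.
model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X, y)

print("alpha:", model.alpha_, "l1_ratio:", model.l1_ratio_)
print((model.coef_ == 0).sum(), "zero coefficients")
```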

🔗 Graph

🤖 LLM usage

When to use: essentially any ML training run. When not to use: the model already underfits, so extra regularization is not needed.

Anti-patterns

  • Adam + weight_decay: use AdamW instead.
  • Same decay for bias / norm parameters: can hurt training.
  • No cross-validation for α: the regularization strength is likely wrong.
  • L1 for deep learning without a sparsity goal: training can be unstable.

🧪 Verification / duplicates

  • Verified against Hastie, Tibshirani & Friedman and Loshchilov & Hutter (AdamW, 2019).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: L1/L2 plus sklearn / AdamW / param groups / pruning code |