Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

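10_Wiki/Topics large-scale cleanup:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections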

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


L1 and L2 Regularization

One-line summary

"Penalize the magnitude of the weights." L1 (Lasso) drives some weights to exactly zero (sparsity). L2 (Ridge) keeps weights small. ElasticNet combines the two. In modern deep learning the same idea appears as weight decay, which AdamW applies in decoupled form. Dropout also acts as a regularizer.

Key points

L1 (Lasso)

Penalty: λ Σ |wᵢ|.
Effect: sparse solutions (some weights exactly zero).
Use: feature selection.
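
To see where the exact zeros come from, here is a minimal sketch of soft-thresholding, the proximal operator of the L1 penalty. In the special case of an orthonormal design, the Lasso coefficient is the least-squares coefficient passed through this operator. The function name and the sample numbers are illustrative assumptions, not taken from any library.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized coefficients
coefs = [3.0, 0.4, -0.2, -1.5]
shrunk = [soft_threshold(c, 0.5) for c in coefs]
print(shrunk)  # [2.5, 0.0, 0.0, -1.0]
```

Small coefficients are set to zero outright, while large ones are only shifted by λ. That is the mechanism behind using Lasso for feature selection.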

L2 (Ridge)

Penalty: λ Σ wᵢ².
Effect: weights shrink toward zero but stay non-zero.
Use: multicollinearity (stabilizes the coefficients of correlated features).
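
A one-feature sketch makes the "small but non-zero" behaviour concrete. For a model y ≈ w·x with no intercept, minimizing Σ (yᵢ - w xᵢ)² + λ w² gives w = Σ xᵢyᵢ / (Σ xᵢ² + λ). The helper name and the data below are illustrative assumptions.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge solution for a single feature without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for lam in (0.0, 14.0, 126.0):
    print(lam, ridge_1d(xs, ys, lam))
# 0.0 2.0
# 14.0 1.0
# 126.0 0.2
```

The coefficient is divided by a growing denominator, so it shrinks smoothly but never reaches zero for any finite λ, in contrast to the L1 case.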

ElasticNet

Penalty: λ [α Σ |wᵢ| + (1-α) Σ wᵢ²], with mixing weight α ∈ [0, 1].
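
The mixing can be written out directly. This is a sketch of the penalty term only, using λ for the overall strength and α for the L1 share; the function name is an assumption. Note that scikit-learn's `ElasticNet` names these `alpha` (overall strength) and `l1_ratio` (L1 share) and puts an extra factor of 0.5 on the squared term, so values are not interchangeable one-to-one.

```python
def elastic_net_penalty(weights, lam, alpha):
    """lam * (alpha * L1 + (1 - alpha) * squared L2) for a list of weights."""
    l1 = sum(abs(w) for w in weights)
    l2_sq = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) * l2_sq)

w = [1.0, -2.0, 0.0]                       # L1 = 3, squared L2 = 5
mixed = elastic_net_penalty(w, lam=0.1, alpha=0.5)
pure_l1 = elastic_net_penalty(w, lam=0.1, alpha=1.0)
pure_l2 = elastic_net_penalty(w, lam=0.1, alpha=0.0)
```

Setting α = 1 recovers the Lasso penalty and α = 0 the Ridge penalty.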

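Modern DL

Weight decay: equivalent to L2 for plain SGD, but not for adaptive optimizers such as Adam.
AdamW: decoupled weight decay (Loshchilov & Hutter, 2019).
Dropout: implicit regularization.
Batch norm: implicit regularization.
Early stopping: implicit regularization.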
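
Of the implicit regularizers listed above, early stopping is the simplest to sketch without a framework. The helper below is a hypothetical illustration: it scans a sequence of per-epoch validation losses and reports the epoch whose checkpoint would be kept, stopping once the loss has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before patience ran out."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # stop training; keep the checkpoint from best_epoch
    return best_epoch

# Hypothetical validation curve: improves, then starts to overfit
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71]
best = early_stopping_epoch(losses, patience=2)
print(best)  # 2
```

Training halts at epoch 4, after two epochs without improvement, and the weights from epoch 2 are the ones to restore.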

Applications

Linear regression: Ridge, Lasso.
Logistic regression: L2 penalty (plus class_weight when classes are imbalanced).
DL training: weight decay.
Feature selection: Lasso.

💻 Patterns

Ridge (sklearn)

from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0).fit(X, y)

Lasso (sklearn)

from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1).fit(X, y)
print((model.coef_ == 0).sum(), 'zero coefficients')

ElasticNet

from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

Logistic + L2

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l2', C=1.0).fit(X, y)
# C is the inverse of the regularization strength (C = 1/alpha): smaller C means a stronger penalty

PyTorch weight decay

import torch
optim = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
# with plain SGD, weight_decay is equivalent to an L2 penalty on the weights
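
A scalar sketch of that equivalence, assuming the L2 penalty is written as (λ/2)·w² so that its gradient is λ·w (this matches how the `weight_decay` term is added to the gradient in plain SGD). The function names and numbers are illustrative only.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD step on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_step_decay(w, grad, lr, lam):
    """SGD step with explicit weight decay: shrink w first, then apply the gradient."""
    return w * (1.0 - lr * lam) - lr * grad

w, grad, lr, lam = 2.0, 0.5, 0.01, 1e-4
a = sgd_step_l2(w, grad, lr, lam)
b = sgd_step_decay(w, grad, lr, lam)
print(abs(a - b) < 1e-12)  # True: the two updates are algebraically identical
```

The identity relies on the update being a plain multiple of the gradient, which is why it no longer holds once an adaptive optimizer rescales the gradient per parameter.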

AdamW (decoupled, recommended)

optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
# decoupled decay; generally preferred over Adam with weight_decay, which folds the decay into the adaptive gradient
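
Why the distinction matters can be seen on the very first optimizer step, where the bias-corrected Adam moments reduce to m̂ = g and v̂ = g². The sketch below is a simplified scalar model written for illustration, not PyTorch's implementation; the function name and default values are assumptions.

```python
import math

def adam_first_step(w, grad, lr=0.1, wd=0.1, eps=1e-8, decoupled=False):
    """First Adam step for one scalar weight (m_hat = g, v_hat = g**2)."""
    # Coupled (Adam + L2): the decay term is folded into the gradient
    # and then normalized away by the division below.
    g = grad if decoupled else grad + wd * w
    update = g / (math.sqrt(g * g) + eps)   # roughly sign(g)
    w_new = w - lr * update
    if decoupled:
        # Decoupled (AdamW): decay acts on the weight directly, outside the normalization.
        w_new -= lr * wd * w
    return w_new

for w0 in (10.0, 100.0):
    coupled = w0 - adam_first_step(w0, grad=1.0)
    decoupled = w0 - adam_first_step(w0, grad=1.0, decoupled=True)
    print(w0, round(coupled, 4), round(decoupled, 4))
# 10.0 0.1 0.2
# 100.0 0.1 1.1
```

With coupled L2 the step size is about `lr` regardless of how large the weight is, so large weights are not decayed proportionally. With decoupled decay the shrinkage scales with the weight, which is the behaviour weight decay is meant to provide.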

Manual L1 in PyTorch

def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())

loss = task_loss + l1_penalty(model)
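
One caveat with adding an L1 term to the loss: plain gradient steps tend to bounce around zero instead of landing on it, so the weights end up small but rarely exactly zero. The scalar sketch below isolates the penalty (the task gradient is assumed to be zero) and contrasts a subgradient step with a proximal (soft-thresholding) step. All names and numbers are illustrative assumptions.

```python
def sign(w):
    return (w > 0) - (w < 0)

def subgradient_step(w, lr, lam):
    """Gradient-style step on lam * |w|; overshoots zero when |w| < lr * lam."""
    return w - lr * lam * sign(w)

def proximal_step(w, lr, lam):
    """Soft-thresholding step; clips to exactly zero inside the threshold."""
    t = lr * lam
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

w_sub = w_prox = 0.05
history = []
for _ in range(4):
    w_sub = subgradient_step(w_sub, lr=0.1, lam=1.0)
    w_prox = proximal_step(w_prox, lr=0.1, lam=1.0)
    history.append((w_sub, w_prox))

print(history)  # w_sub flips between about -0.05 and 0.05; w_prox stays at 0.0
```

If exact sparsity is the goal, a proximal update or a pruning step after training is usually a better fit than relying on the penalty gradient alone.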

CV-tune α (sklearn)

from sklearn.linear_model import RidgeCV
model = RidgeCV(alphas=[0.01, 0.1, 1, 10, 100]).fit(X, y)
print(model.alpha_)
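
Conceptually, tuning α by cross-validation means fitting on part of the data for each candidate and scoring on the held-out part. The sketch below does this with a hand-rolled k-fold loop and a one-feature ridge model; it is an illustration of the idea, not how `RidgeCV` is implemented (by default `RidgeCV` uses an efficient leave-one-out scheme). Function names and data are assumptions.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form ridge coefficient for one feature, no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def kfold_cv_alpha(xs, ys, alphas, k=3):
    """Return (best_alpha, {alpha: mean held-out squared error})."""
    n = len(xs)
    scores = {}
    for lam in alphas:
        total = 0.0
        for fold in range(k):
            val_idx = set(range(fold, n, k))           # every k-th sample is held out
            train_idx = [i for i in range(n) if i not in val_idx]
            w = ridge_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], lam)
            total += sum((ys[i] - w * xs[i]) ** 2 for i in val_idx)
        scores[lam] = total / n
    best = min(scores, key=scores.get)
    return best, scores

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]                             # noise-free y = 2x
best_alpha, cv_scores = kfold_cv_alpha(xs, ys, alphas=[0.0, 1.0, 10.0])
print(best_alpha)  # 0.0
```

The data here is noise-free, so the smallest α wins. With noisy or collinear data a larger α can give the lowest held-out error, which is the case cross-validation is meant to detect.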

LassoCV

import numpy as np
from sklearn.linear_model import LassoCV
model = LassoCV(alphas=np.logspace(-4, 0, 50), cv=5).fit(X, y)

Path plot (regularization strength sweep)

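import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso

alphas = np.logspace(-4, 1, 50)
# One row of coefficients per alpha; each column traces one feature's path.
coefs = [Lasso(alpha=a).fit(X, y).coef_ for a in alphas]
plt.plot(alphas, coefs)
plt.xscale('log')
plt.xlabel('alpha')
plt.ylabel('coefficient')
plt.show()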

Group L1 (group lasso)

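import numpy as np

def group_lasso_penalty(weights, groups, lam):
    """Sum over groups of lam * (L2 norm of the weights in that group)."""
    total = 0.0
    for group in groups:  # each group is a list of indices into weights
        total += lam * np.sqrt(sum(weights[i] ** 2 for i in group))
    return total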
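
The practical consequence of this penalty is that whole groups are switched off together. A minimal sketch of the corresponding proximal operator (block soft-thresholding) shows it; the function name and the example vectors are illustrative assumptions.

```python
import math

def group_soft_threshold(group, lam):
    """Shrink a group's L2 norm by lam; zero the whole group if its norm is <= lam."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * v for v in group]

strong = group_soft_threshold([3.0, 4.0], lam=1.0)   # norm 5 -> norm 4, direction kept
weak = group_soft_threshold([0.3, 0.4], lam=1.0)     # norm 0.5 -> whole group zeroed
print(strong, weak)
```

Unlike element-wise L1, a group survives or vanishes as a unit, which suits cases where a feature is encoded by several weights (for example, one-hot columns of one categorical variable).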

Different decay per layer (DL)

optim = torch.optim.AdamW([
    {'params': model.encoder.parameters(), 'weight_decay': 0.01},
    {'params': model.head.parameters(), 'weight_decay': 0.001},
])

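Exclude bias / norm parameters from weight decay (best practice)

def get_param_groups(model, weight_decay):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Biases and norm scales are typically 1-D; the name check alone misses
        # norm layers registered under names such as 'bn1' or 'ln_f'.
        if p.ndim <= 1 or 'bias' in name or 'norm' in name:
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {'params': decay, 'weight_decay': weight_decay},
        {'params': no_decay, 'weight_decay': 0.0},
    ]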

optim = torch.optim.AdamW(get_param_groups(model, 0.01), lr=1e-3)

Effect on bias-variance

from sklearn.linear_model import Ridge

def reg_effect(alphas, X_train, y_train, X_val, y_val):
    """Return training and validation MSE for each alpha."""
    train_err, val_err = [], []
    for a in alphas:
        m = Ridge(alpha=a).fit(X_train, y_train)
        train_err.append(((m.predict(X_train) - y_train) ** 2).mean())
        val_err.append(((m.predict(X_val) - y_val) ** 2).mean())
    return train_err, val_err
# As alpha increases, training error rises; validation error usually falls at first,
# then rises again once the model is over-regularized.

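Induced sparsity: magnitude pruning (modern DL)

import torch

def magnitude_pruning(model, sparsity=0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of each weight tensor."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            k = int(p.numel() * sparsity)
            if 'weight' in name and k > 0:  # kthvalue requires k >= 1
                threshold = p.abs().flatten().kthvalue(k).values
                p[p.abs() <= threshold] = 0  # <= so the k-th smallest value is pruned too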

Decision criteria

Situation	Method
Linear	Ridge / Lasso / ElasticNet
Feature selection	Lasso
Multicollinearity	Ridge
DL	AdamW weight decay
Sparsity goal	Lasso / pruning
Best DL practice	AdamW + exclude bias/norm

Defaults: DL = AdamW with weight decay 0.01-0.1 and bias/norm parameters excluded. Linear = ElasticNet tuned by CV. Sparsity = Lasso.

🔗 Graph

Parent: L1-and-L2-Regularization · Optimization
Variants: Lasso · Ridge · ElasticNet · Weight-Decay
Adjacent: Generalization-in-AI

🤖 LLM usage

When to use: essentially any ML training run. When not to: the model is already underfitting (added regularization will not help).

❌ Anti-patterns

Adam + weight_decay: use AdamW instead.
Same decay for bias / norm parameters: hurts training.
No CV for α: the regularization strength ends up wrong.
L1 for DL (without a sparsity goal): unstable.

🧪 Verification / duplicates

Verified (Hastie-Tibshirani-Friedman, Loshchilov AdamW 2019).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: L1/L2 + sklearn / AdamW / param groups / pruning code
