


Stochastic Gradient Descent (SGD)

In one line

"Take every step with the gradient of a single sample (or mini-batch): noisy but cheap, and the noise can help escape local minima." A descendant of the stochastic approximation method of Robbins & Monro (1951). As of 2026 it is the foundation of deep learning: SGD+momentum, AdamW, and Lion are the common default optimizers.

Key points

SGD vs. full-batch

Batch GD: gradient over the entire dataset. Expensive per step, deterministic.
SGD (online): gradient of a single sample. Noisy, but each step is fast.
Mini-batch SGD: gradient over 32–4096 samples. The modern default, because a batch vectorizes well on a GPU.
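
The trade-off between the three regimes can be seen by measuring how much the gradient estimate fluctuates at different batch sizes. The sketch below is only an illustration: the synthetic dataset, the one-parameter model y ≈ w·x, and the helper name `minibatch_grad` are all assumptions made for the example, not part of the original note.

```python
import random
import statistics

def minibatch_grad(w, xs, ys, batch_size, rng):
    """Gradient of the mean of 0.5*(w*x - y)**2 over a random mini-batch."""
    idx = rng.sample(range(len(xs)), batch_size)  # sample without replacement
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

rng = random.Random(0)
# Hypothetical synthetic data: y = 3x + small Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1024)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

w = 0.0
full_grad = minibatch_grad(w, xs, ys, len(xs), rng)  # batch = whole dataset

spread = {}
for b in (1, 32, 1024):
    grads = [minibatch_grad(w, xs, ys, b, rng) for _ in range(200)]
    spread[b] = statistics.pstdev(grads)
    print(f"batch={b:5d}  mean grad={statistics.fmean(grads):+.3f}  std={spread[b]:.3f}")
```

All three batch sizes estimate the same gradient on average; only the spread differs. A batch of 1024 here is the full dataset, so its estimate is deterministic up to floating-point rounding.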

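Update rules

Vanilla SGD: θ ← θ − η ∇L(θ; x_i, y_i).
Momentum: v ← μv + ∇L; θ ← θ − ηv.
Nesterov: lookahead momentum. The gradient is evaluated at the point the momentum step is about to reach, θ − ημv, rather than at θ.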
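
The three update rules can be compared side by side on a toy problem. This is a minimal sketch, assuming the one-dimensional loss L(θ) = 0.5·θ² and illustrative values for η and μ. The Nesterov step uses a common reformulation that evaluates the gradient at the current θ and then steps along ∇L + μv, which avoids computing a gradient at a shifted point; all function names are hypothetical.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2 (minimum at 0)."""
    return theta

def sgd_step(theta, lr):
    return theta - lr * grad(theta)

def momentum_step(theta, v, lr, mu):
    v = mu * v + grad(theta)              # v <- mu*v + grad
    return theta - lr * v, v              # theta <- theta - lr*v

def nesterov_step(theta, v, lr, mu):
    g = grad(theta)
    v = mu * v + g
    return theta - lr * (g + mu * v), v   # step along grad + mu*v (lookahead)

def run(step, n=200, lr=0.1, mu=0.9):
    theta, v = 5.0, 0.0
    for _ in range(n):
        if step is sgd_step:
            theta = step(theta, lr)
        else:
            theta, v = step(theta, v, lr, mu)
    return theta

for fn in (sgd_step, momentum_step, nesterov_step):
    print(f"{fn.__name__:14s} final theta = {run(fn):+.2e}")
```

On a well-conditioned quadratic like this one, momentum brings no benefit; its advantage shows up on ill-conditioned or noisy objectives.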

Modern variants

AdamW (Loshchilov & Hutter 2019): adaptive learning rate plus decoupled weight decay. The default for LLM/transformer training.
Lion (Chen et al. 2023): sign-based momentum. Keeps less optimizer state than Adam, with reported comparable results.
Sophia (2023): second-order (uses a diagonal Hessian estimate). Proposed for LLM pretraining.
Muon (Jordan 2024): orthogonalized momentum. Still emerging.
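
To make "decoupled weight decay" and "sign-based momentum" concrete, here is a scalar sketch of one AdamW step and one Lion step as they are commonly written. The function names and default hyperparameters are assumptions for illustration; real implementations operate on tensors and differ in detail.

```python
import math

def adamw_step(theta, g, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step for a scalar parameter; t is the 1-based step count."""
    theta = theta * (1.0 - lr * wd)        # decoupled weight decay (not added to g)
    m = b1 * m + (1.0 - b1) * g            # first-moment (mean) estimate
    v = b2 * v + (1.0 - b2) * g * g        # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)            # bias correction for zero init
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def lion_step(theta, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion step: the update direction is only the sign of a momentum blend."""
    c = b1 * m + (1.0 - b1) * g
    sign = (c > 0) - (c < 0)               # -1, 0, or +1
    theta = theta * (1.0 - lr * wd) - lr * sign
    m = b2 * m + (1.0 - b2) * g            # the single state value Lion keeps
    return theta, m

theta_a, m_a, v_a = adamw_step(1.0, 2.0, 0.0, 0.0, t=1)
theta_big, _, _ = adamw_step(1.0, 2000.0, 0.0, 0.0, t=1)
theta_l, m_l = lion_step(1.0, 2.0, 0.0)
print(theta_a, theta_big, theta_l)
```

Two properties are visible here: AdamW's step size is nearly independent of the gradient's scale (a gradient of 2 and of 2000 move θ by about the same amount), and Lion stores one state value per parameter where AdamW stores two.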

Applications

Neural network training (all of deep learning).
Logistic regression, linear regression at scale.
Online learning / streaming data.

💻 Patterns

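PyTorch 2.5: SGD with momentum

import torch
from torch import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

# `dataloader` is assumed to be defined elsewhere: any iterable yielding
# (inputs, labels) batches, e.g. a torch.utils.data.DataLoader.
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()        # clear gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()              # backprop: populate .grad on each parameter
        optimizer.step()             # apply the SGD + Nesterov momentum update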

AdamW (transformer default 2026)

optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # fused kernel; parameters must live on a supported device (e.g. CUDA)
)

Cosine LR schedule

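from torch.optim.lr_scheduler import CosineAnnealingLR

# `num_steps` and `train_step()` are placeholders; train_step() is assumed to
# run zero_grad, the forward pass, and loss.backward().
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)
for step in range(num_steps):
    train_step()
    optimizer.step()
    scheduler.step()  # call after optimizer.step(), once per step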
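
For intuition about what the cosine schedule produces, the learning rate can be written in closed form as η_t = η_min + ½(η_max − η_min)(1 + cos(π·t / T_max)). The sketch below evaluates that formula directly in plain Python; `cosine_lr` and the example values are hypothetical names chosen for illustration, not a PyTorch API.

```python
import math

def cosine_lr(step, base_lr, t_max, eta_min=1e-6):
    """Closed-form cosine annealing from base_lr (step 0) to eta_min (step t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))

base_lr, t_max = 0.1, 100
for step in (0, 25, 50, 75, 100):
    print(f"step {step:3d}: lr = {cosine_lr(step, base_lr, t_max):.6f}")
```

The schedule starts at the base learning rate, passes through the midpoint between base and minimum halfway through, and flattens out as it approaches `eta_min`.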

Linear warmup + cosine decay (LLM standard)

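import math

warmup, total = 1_000, 100_000  # example values: warmup steps and total steps

def lr_lambda(step):
    # Multiplier applied to the optimizer's base lr.
    if step < warmup:
        return step / warmup  # linear warmup from 0 to 1
    progress = min(1.0, (step - warmup) / (total - warmup))  # clamp past `total`
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay from 1 to 0

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)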

Gradient clipping (stability)

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
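
Norm-based clipping rescales all gradients together so that their combined L2 norm does not exceed `max_norm`, which preserves the update direction. The following is a plain-Python sketch of that idea, assuming gradients are given as lists of floats; `clip_by_global_norm` is a hypothetical helper, not the PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # two "parameters"; global norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after, clipped)

small, _ = clip_by_global_norm([[0.1]], max_norm=1.0)  # below the limit: untouched
```

Clipping has to happen after the backward pass and before the optimizer step, since it modifies the gradients the step will consume.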

Mixed precision SGD (bf16, H100)

optimizer.zero_grad()
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
# bf16 has the same exponent range as fp32, so no GradScaler is needed.
# Loss scaling (torch.amp.GradScaler("cuda")) is only required for float16.
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

Pure NumPy SGD (linear regression)

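import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch=32):
    """Mini-batch SGD for least-squares linear regression (no intercept term)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))  # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i+batch]
            # gradient of 0.5 * mean((X[b] @ w - y[b])**2) with respect to w
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w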

Lion optimizer (2026 alt)

# pip install lion-pytorch
from lion_pytorch import Lion
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)

Decision criteria

Situation	Approach
Image classification (ResNet, ViT)	SGD + momentum + cosine
LLM / Transformer training	AdamW + linear warmup + cosine
Memory-constrained large model	Lion or 8-bit Adam (bitsandbytes)
Convex optimization, theoretical guarantee	Vanilla SGD with decreasing lr
Online streaming data	Mini-batch SGD, lr ~ 1/sqrt(t)

Defaults: for transformers/LLMs, AdamW with lr 3e-4, 1k warmup steps, and cosine decay. For CNNs, SGD with lr 0.1, momentum 0.9, and cosine decay.
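
The streaming row of the table (a decaying step size, lr ~ 1/sqrt(t)) can be sketched as follows. This is an illustration under stated assumptions: the data stream, the linear model y ≈ w·x + b, the single-sample updates, and the starting rate `lr0` are all made up for the example.

```python
import math
import random

def stream(n, rng):
    """Hypothetical data stream: y = 2x + 1 plus small Gaussian noise."""
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)

def online_sgd(samples, lr0=0.5):
    """Single-pass SGD where each sample is seen once and then discarded."""
    w, b = 0.0, 0.0
    for t, (x, y) in enumerate(samples, start=1):
        lr = lr0 / math.sqrt(t)   # decaying step size, lr ~ 1/sqrt(t)
        err = w * x + b - y       # residual of the current prediction
        w -= lr * err * x
        b -= lr * err
    return w, b

w, b = online_sgd(stream(20_000, random.Random(0)))
print(f"w={w:.3f}, b={b:.3f}")
```

The decaying rate lets early steps move quickly while later steps average out the noise, so the estimates settle near the generating values instead of jittering around them.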

🔗 Graph

Parent: Gradient Descent · Optimization
Variants: Adam · AdamW
Applications: Deep Learning
Adjacent: Gradient Clipping · Data Science and ML Engineering

🤖 LLM usage

When to use: choosing the default optimizer for model training; debugging convergence (loss spikes, plateaus). When not to use: a closed-form solution exists (small linear regression: use the normal equation); a second-order method is the better fit (small classical ML problems).
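
As a small illustration of the closed-form case, the sketch below fits the one-parameter model y ≈ w·x both ways: with the normal equation, which for this model reduces to w = Σxy / Σx², and with single-sample SGD. The data, learning rate, and epoch count are assumptions chosen for the example.

```python
import random

rng = random.Random(0)
# Hypothetical data: y = 1.5x + small Gaussian noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [1.5 * x + rng.gauss(0.0, 0.1) for x in xs]

# Closed form: the least-squares minimizer of sum((w*x - y)**2).
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# SGD on the same objective, one sample per step.
w_sgd = 0.0
for _ in range(50):
    order = list(range(len(xs)))
    rng.shuffle(order)
    for i in order:
        w_sgd -= 0.05 * (w_sgd * xs[i] - ys[i]) * xs[i]

print(f"closed form: {w_closed:.4f}   SGD: {w_sgd:.4f}")
```

The closed form gives the exact minimizer in one pass with nothing to tune, while SGD with a constant learning rate only hovers near it. That is why SGD is the wrong tool when the problem is small enough to solve directly.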

❌ Anti-patterns

lr too high: loss explosion / NaN. Fix with warmup and gradient clipping.
No weight decay: prone to overfitting.
Momentum with lr too high: oscillation.
AdamW lr=1e-3 for LLM: too high. 1e-4 to 3e-4 is standard.
Batch size 1 on GPU: underutilizes the device. Use 32 or more.
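
The "lr too high" failure can be reproduced on the simplest possible problem. For plain gradient descent on L(θ) = ½·a·θ², each step multiplies θ by (1 − lr·a), so the iterates diverge once lr exceeds 2/a. The sketch below assumes a curvature of a = 10, which puts the threshold at lr = 0.2; the function name and values are illustrative.

```python
def run_gd(lr, curvature=10.0, steps=50, theta0=1.0):
    """Gradient descent on L(theta) = 0.5 * curvature * theta**2; returns |theta|."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * curvature * theta  # theta <- (1 - lr*curvature) * theta
    return abs(theta)

for lr in (0.05, 0.19, 0.21):
    print(f"lr={lr:.2f}  |theta| after 50 steps = {run_gd(lr):.3e}")
```

Just below the threshold the iterates oscillate in sign but still shrink; just above it they grow without bound. In a real network this growth is what eventually surfaces as an exploding loss or NaN.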

🧪 Verification / duplicates

Verified (PyTorch docs 2.5; Goodfellow Deep Learning ch.8; Loshchilov AdamW 2019).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — SGD + modern variants (AdamW, Lion, Muon) for 2026
