id: wiki-2026-0508-stochastic-gradient-descent
title: Stochastic Gradient Descent
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: SGD, Mini-batch SGD, Stochastic Gradient Descent
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: machine-learning, optimization, deep-learning, gradient-descent
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python, framework: pytorch

Stochastic Gradient Descent (SGD)

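One-line summary

"Take each step using the gradient of a single sample (or a mini-batch): the estimate is noisy but cheap, and the noise can help escape local minima." SGD descends from the stochastic approximation method of Robbins & Monro (1951). It remains the foundation of deep learning in 2026, where SGD with momentum, AdamW, and Lion are the usual defaults.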

Key points

SGD vs. full-batch

  • Batch GD: gradient over the entire dataset. Expensive but deterministic.
  • SGD (online): gradient of a single sample. Noisy but fast per step.
  • Mini-batch SGD: gradient over 32 to 4096 samples. The modern default, because a batch vectorizes well on GPUs.
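
To make the trade-off above concrete, here is a minimal NumPy sketch on a synthetic least-squares problem. The data, the sizes, and the helper name `grad` are illustrative assumptions, not part of the original page. It compares the full-batch gradient with mini-batch gradients: because the equal-sized batches partition the data, their average equals the full gradient, while each individual batch deviates from it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 5))                 # synthetic features (assumption)
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.normal(size=1024)
w = np.zeros(5)

def grad(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb) ** 2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(Xb)

full = grad(X, y, w)                           # batch GD: uses every sample

batch = 32
idx = rng.permutation(len(X))
mini = np.stack([
    grad(X[idx[i:i + batch]], y[idx[i:i + batch]], w)
    for i in range(0, len(X), batch)
])                                             # one gradient per mini-batch

# Equal-sized batches that partition the data average back to the full gradient.
mean_of_minis = mini.mean(axis=0)
# Average distance of a single mini-batch gradient from the full gradient (the "noise").
noise = np.linalg.norm(mini - full, axis=1).mean()
```

Each mini-batch gradient costs 1/32 of the full pass here, which is the "noisy but cheap" trade stated in the summary.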

Update rule

  • Vanilla SGD: θ ← θ − η ∇L(θ; x_i, y_i).
  • Momentum: v ← μv + ∇L(θ); θ ← θ − ηv.
  • Nesterov: look-ahead momentum. The gradient is evaluated at the point the momentum step is about to reach, θ − ημv.
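
The three rules above can be sketched side by side on a small quadratic. This is an illustrative NumPy sketch, not PyTorch's implementation: the matrix `A`, the function names, and the hyperparameters are assumptions chosen so that all three variants converge. The gradient is exact rather than stochastic to keep the comparison deterministic.

```python
import numpy as np

A = np.diag([1.0, 10.0])        # f(theta) = 0.5 * theta^T A theta, minimum at 0

def grad(theta):
    return A @ theta

def run(rule, lr=0.05, mu=0.9, steps=300):
    """Apply one of the three update rules for a fixed number of steps."""
    theta = np.array([1.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        if rule == "vanilla":
            theta = theta - lr * grad(theta)
        elif rule == "momentum":
            v = mu * v + grad(theta)
            theta = theta - lr * v
        elif rule == "nesterov":
            # Same as momentum, but the gradient is taken at the look-ahead point.
            v = mu * v + grad(theta - lr * mu * v)
            theta = theta - lr * v
        else:
            raise ValueError(f"unknown rule: {rule}")
    return theta

results = {rule: run(rule) for rule in ("vanilla", "momentum", "nesterov")}
```

Deep learning frameworks often implement Nesterov momentum in an algebraically rearranged form, so intermediate values may differ from this textbook version.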

Modern variants

  • AdamW (Loshchilov & Hutter, 2019): adaptive learning rate with decoupled weight decay. The default for LLMs and transformers.
  • Lion (Chen et al., 2023): sign-based momentum. Uses less memory than Adam with comparable results.
  • Sophia (2023): second-order method aimed at LLM pretraining.
  • Muon (Jordan, 2024): orthogonalized momentum. Still emerging.
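
To show what "decoupled weight decay" and "sign-based momentum" mean in practice, here is a minimal NumPy sketch of a single AdamW step and a single Lion step, written from the published update rules. The function names and default hyperparameters are assumptions for illustration. Real implementations add details such as fused kernels and per-parameter groups.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step. Weight decay acts on theta directly, not through the gradient."""
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

def lion_step(theta, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion step. Only the sign of the interpolated momentum is used."""
    update = np.sign(b1 * m + (1 - b1) * g)
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * g              # single state buffer (AdamW keeps two)
    return theta, m

theta0 = np.array([1.0, -2.0])
g0 = np.array([0.5, -3.0])
zeros = np.zeros_like(theta0)

theta_adamw, m_adamw, v_adamw = adamw_step(theta0, g0, zeros, zeros, t=1, wd=0.0)
theta_lion, m_lion = lion_step(theta0, g0, zeros, wd=0.0)
```

With weight decay switched off, the first AdamW step and every Lion step move each coordinate by roughly `lr` in the direction opposite the gradient sign, which is why Lion is typically run with a smaller learning rate than AdamW.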

Applications

  1. Neural network training (all of deep learning).
  2. Logistic regression, linear regression at scale.
  3. Online learning / streaming data.

💻 Patterns

PyTorch 2.5 — SGD with momentum

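import torch
from torch import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

# `dataloader` is assumed to yield (x, y) batches: x of shape (B, 784), y of class indices.
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()                # backpropagate
        optimizer.step()               # apply the Nesterov-momentum update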

AdamW (transformer default 2026)

optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # use the fused GPU kernel
)

Cosine LR schedule

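from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)
for step in range(num_steps):
    train_step()        # placeholder for zero_grad, forward pass, and loss.backward()
    optimizer.step()
    scheduler.step()    # call after optimizer.step(), once per training step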

Linear warmup + cosine decay (LLM standard)

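import math

# `warmup` and `total` are step counts that must be defined beforehand.
def lr_lambda(step):
    if step < warmup:
        return step / warmup                          # linear ramp from 0 to 1
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay from 1 to 0
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)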
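
As a quick sanity check of the schedule shape, the same multiplier can be evaluated in plain Python without PyTorch. This is a sketch under assumptions: the name `warmup_cosine` and the step counts 1000 and 10000 are illustrative, and the `min(1.0, ...)` clamp is an addition that holds the multiplier at 0 if training runs past `total` instead of letting the cosine rise again.

```python
import math

def warmup_cosine(step, warmup=1000, total=10000):
    """LR multiplier: linear ramp 0 -> 1 over `warmup` steps, then cosine decay 1 -> 0."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * progress))

# Multiplier at the start, mid-warmup, the peak, mid-decay, and the end.
multipliers = {s: warmup_cosine(s) for s in (0, 500, 1000, 5500, 10000)}
```

The effective learning rate is the optimizer's base `lr` times this multiplier, so the peak is reached exactly when warmup ends.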

Gradient clipping (stability)

# Call after loss.backward() and before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
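
Conceptually, clipping by norm treats all gradients as one long vector and rescales it when its L2 norm exceeds the threshold. The following NumPy sketch illustrates that idea. It is not PyTorch's implementation, and the helper name `clip_by_global_norm` and the example gradients are assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    return [g * scale for g in grads], total_norm

# Two "parameter" gradients of different shapes; joint norm is sqrt(9 + 16) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
```

Because every array is multiplied by the same factor, the direction of the overall update is preserved and only its length changes.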

Mixed precision SGD (bf16, H100)

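# Loss scaling is needed for float16; with bfloat16 the GradScaler can usually be omitted.
scaler = torch.amp.GradScaler("cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()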

Pure NumPy SGD (linear regression)

import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch=32):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))             # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i+batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of 0.5 * mean squared error
            w -= lr * grad
    return w

Lion optimizer (2026 alt)

# pip install lion-pytorch
from lion_pytorch import Lion
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)

Decision criteria

| Situation | Approach |
|---|---|
| Image classification (ResNet, ViT) | SGD + momentum + cosine |
| LLM / Transformer training | AdamW + linear warmup + cosine |
| Memory-constrained large model | Lion or 8-bit Adam (bitsandbytes) |
| Convex optimization, theoretical guarantee | Vanilla SGD with decreasing lr |
| Online streaming data | Mini-batch SGD, lr ~ 1/sqrt(t) |

Defaults: for transformers/LLMs, AdamW at lr 3e-4 with 1k warmup steps and cosine decay. For CNNs, SGD at lr 0.1 with momentum 0.9 and cosine decay.
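
The "online streaming data" row can be illustrated with a small NumPy sketch of mini-batch SGD whose step size decays as 1/sqrt(t). Everything here is an assumption for illustration: the synthetic stream, the hypothetical weights `w_true`, the batch size, and the base rate `lr0`.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])      # hypothetical ground-truth weights

def stream(n, batch=8):
    """Yield n mini-batches from a synthetic, noisy linear data stream."""
    for _ in range(n):
        X = rng.normal(size=(batch, 3))
        yield X, X @ w_true + 0.1 * rng.normal(size=batch)

w = np.zeros(3)
lr0 = 0.1
for t, (X, y) in enumerate(stream(5000), start=1):
    grad = X.T @ (X @ w - y) / len(X)    # each batch is seen once, then discarded
    w -= lr0 / np.sqrt(t) * grad         # step size decays as 1/sqrt(t)

error = np.linalg.norm(w - w_true)
```

The decaying step size is what lets the iterate settle: early steps are large enough to move quickly, and later steps are small enough to average out the gradient noise.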

🔗 Graph

🤖 LLM usage

When to use: as the default optimizer choice for model training, and when debugging convergence (loss spikes, plateaus).
When not to use: when a closed-form solution exists (for small linear regression, use the normal equation), or when a second-order method is the better fit (small classical ML problems).

Anti-patterns

  • lr too high: loss explosion or NaN. Use warmup and gradient clipping.
  • No weight decay: prone to overfitting.
  • Momentum with lr too high: oscillation.
  • AdamW lr=1e-3 for LLM: too high. 1e-4 to 3e-4 is standard.
  • Batch size 1 on GPU: underutilization. Use 32 or more.
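
The first and third anti-patterns can be reproduced on a one-dimensional quadratic, where the effect of the learning rate is exact. This is a toy sketch with assumed names and values (`final_abs`, `curvature=10.0`). It uses plain deterministic gradient descent to isolate the step-size effect, and the threshold `2 / curvature` applies to this toy problem only.

```python
def final_abs(lr, curvature=10.0, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x          # gradient of f is curvature * x
    return abs(x)

stable = final_abs(lr=0.05)              # factor 0.5 per step: converges quickly
oscillating = final_abs(lr=0.19)         # factor -0.9: sign flips each step, slow decay
diverging = final_abs(lr=0.25)           # factor -1.5: |x| grows each step
```

Each step multiplies `x` by `1 - lr * curvature`, so the iterate diverges once `lr` exceeds `2 / curvature`. In a real network the curvature varies across directions and over time, which is why warmup and clipping are the usual safeguards.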

🧪 Verification / duplicates

  • Verified (PyTorch docs 2.5; Goodfellow Deep Learning ch.8; Loshchilov AdamW 2019).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SGD + modern variants (AdamW, Lion, Muon) for 2026 |