---
id: wiki-2026-0508-stochastic-gradient-descent
title: Stochastic Gradient Descent
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [SGD, Mini-batch SGD, Stochastic Gradient Descent]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [machine-learning, optimization, deep-learning, gradient-descent]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Stochastic Gradient Descent (SGD)

## In one line

> **"Take each step using the gradient of a single sample (or mini-batch): noisy but cheap, and the noise can help escape poor local minima."**

A descendant of the stochastic approximation method of Robbins & Monro (1951). It is the foundation of deep learning in 2026: SGD with momentum, AdamW, and Lion are the default choices.

## Core concepts

### vs. full-batch

- **Batch GD**: gradient over the entire dataset. Expensive per step, deterministic.
- **SGD (online)**: gradient of a single sample. Noisy, but each step is fast.
- **Mini-batch SGD**: gradient over 32–4096 samples. The modern default, because it vectorizes well on GPUs.

### Update rules

- Vanilla SGD: `θ ← θ − η ∇L(θ; x_i, y_i)`.
- Momentum: `v ← μv + ∇L; θ ← θ − ηv`.
- Nesterov: lookahead momentum, where the gradient is evaluated at the point the momentum step is about to reach.

### Modern variants

- **AdamW** (Loshchilov 2019): adaptive learning rate plus decoupled weight decay. The default for LLMs and transformers.
- **Lion** (Chen 2023): sign-based momentum. Uses less optimizer memory with comparable results.
- **Sophia** (2023): approximate second-order method, aimed at LLM pretraining.
- **Muon** (Jordan 2024): orthogonalized momentum. Still emerging.

### Applications

1. Neural network training (all of deep learning).
2. Logistic regression and linear regression at scale.
3. Online learning / streaming data.
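
The three update rules listed under the update-rule heading can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names (`sgd_step`, `momentum_step`, `nesterov_step`) and the toy objective `L(θ) = 0.5·‖θ‖²` are assumptions introduced here, and the Nesterov step is written in the same `v ← μv + ∇L` convention as the momentum rule.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Vanilla SGD: theta <- theta - lr * grad."""
    return theta - lr * grad

def momentum_step(theta, v, grad, lr, mu=0.9):
    """Momentum: v <- mu * v + grad; theta <- theta - lr * v."""
    v = mu * v + grad
    return theta - lr * v, v

def nesterov_step(theta, v, grad_fn, lr, mu=0.9):
    """Nesterov: like momentum, but the gradient is taken at the lookahead point."""
    lookahead = theta - lr * mu * v   # where the momentum alone would carry theta
    v = mu * v + grad_fn(lookahead)
    return theta - lr * v, v

# Hypothetical toy objective L(theta) = 0.5 * ||theta||^2, whose gradient is theta.
def grad_fn(theta):
    return theta

theta0 = np.array([1.0, -2.0])
lr, steps = 0.1, 300

theta_sgd = theta0.copy()
theta_mom, v_mom = theta0.copy(), np.zeros_like(theta0)
theta_nag, v_nag = theta0.copy(), np.zeros_like(theta0)

for _ in range(steps):
    theta_sgd = sgd_step(theta_sgd, grad_fn(theta_sgd), lr)
    theta_mom, v_mom = momentum_step(theta_mom, v_mom, grad_fn(theta_mom), lr)
    theta_nag, v_nag = nesterov_step(theta_nag, v_nag, grad_fn, lr)

print(np.linalg.norm(theta_sgd), np.linalg.norm(theta_mom), np.linalg.norm(theta_nag))
```

All three variants drive `θ` toward the minimizer at the origin on this toy problem. In real training, `grad_fn` would return a mini-batch gradient rather than an exact one.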
## 💻 Patterns

### PyTorch 2.5: SGD with momentum

```python
import torch
from torch import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

# `dataloader` is assumed to yield (inputs, labels) batches,
# e.g. a torch.utils.data.DataLoader.
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(x), y)
        loss.backward()                # backpropagate
        optimizer.step()               # apply the SGD update
```

### AdamW (transformer default 2026)

```python
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # fused kernel; parameters must be on a supported device, e.g. CUDA
)
```

### Cosine LR schedule

```python
from torch.optim.lr_scheduler import CosineAnnealingLR

# `num_steps` is the total number of optimizer steps.
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)
for step in range(num_steps):
    train_step()       # assumed to run zero_grad, forward, and backward only
    optimizer.step()
    scheduler.step()   # always after optimizer.step()
```

### Linear warmup + cosine decay (LLM standard)

```python
import math

# `warmup` and `total` are step counts assumed to be defined by the caller.
def lr_lambda(step):
    if step < warmup:
        return step / warmup  # linear ramp from 0 to 1
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay from 1 to 0

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

### Gradient clipping (stability)

```python
loss.backward()
# Clip after backward() and before step(): caps the global L2 norm of the gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

### Mixed precision SGD (bf16, H100)

```python
# model, x, and y are assumed to be on a CUDA device.
optimizer.zero_grad()
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
# bf16 has the same exponent range as fp32, so no GradScaler is needed;
# loss scaling (torch.amp.GradScaler) is for float16.
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```

### Pure NumPy SGD (linear regression)

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch=32):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))  # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i+batch]
            # gradient of 0.5 * mean squared error on the mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w
```

### Lion optimizer (2026 alt)

```python
# pip install lion-pytorch
from lion_pytorch import Lion

optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)
```

## Decision criteria

| Situation | Approach |
|---|---|
| Image classification (ResNet, ViT) | SGD + momentum + cosine |
| LLM / Transformer training | AdamW + linear warmup + cosine |
| Memory-constrained large model | Lion or 8-bit Adam (bitsandbytes) |
| Convex optimization, theoretical guarantee | Vanilla SGD with decreasing lr |
| Online streaming data | Mini-batch SGD, lr ~ 1/sqrt(t) |

**Defaults**: for transformers/LLMs, AdamW at 3e-4 + 1k warmup steps + cosine. For CNNs, SGD at 0.1 + momentum 0.9 + cosine.

## 🔗 Graph

- Parents: [[Gradient Descent]] · [[Optimization]]
- Variants: [[Adam]] · [[AdamW]]
- Applications: [[Deep Learning]]
- Adjacent: [[Gradient Clipping]] · [[데이터 사이언스 및 ML 엔지니어링|Backpropagation]]

## 🤖 LLM usage

**When**: choosing the default optimizer for model training; debugging convergence (loss spikes, plateaus).

**When not**: a closed-form solution exists (small linear regression: use the normal equation); a second-order method is needed (small classical ML problems).

## ❌ Anti-patterns

- **lr too high**: loss explosion / NaN. Use warmup and gradient clipping.
- **No weight decay**: can lead to overfitting.
- **Momentum with lr too high**: oscillation.
- **AdamW lr=1e-3 for LLM**: too high. 1e-4 to 3e-4 is standard.
- **Batch size 1 on GPU**: underutilizes the device. Use 32 or more.

## 🧪 Verification / duplicates

- Verified (PyTorch docs 2.5; Goodfellow *Deep Learning* ch.8; Loshchilov AdamW 2019).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SGD + modern variants (AdamW, Lion, Muon) for 2026 |