"매 한 sample (or mini-batch) 에 대한 gradient 로 매 step — 매 noisy 하지만 매 cheap, 매 escape from local minima". Robbins & Monro (1951) 의 stochastic approximation 의 후예. 2026 deep learning 의 foundation — 매 SGD+momentum, AdamW, Lion 가 매 default.
Core ideas
SGD vs. full-batch
Batch GD: gradient over the entire dataset; expensive but deterministic.
SGD (online): gradient of a single sample; noisy but fast per step.
Mini-batch SGD: gradient averaged over 32–4096 samples; the modern default, because GPUs vectorize the batch.
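The three regimes above differ only in how many samples feed one gradient estimate. A minimal sketch, assuming a toy 1-D regression with loss 0.5·(w·x − y)² and synthetic data y = 3x (the data, the function name `grad_mse`, and the batch size 64 are illustrative choices, not from the notes):

```python
import random

def grad_mse(w, batch):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, averaged over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

random.seed(0)
# Synthetic dataset: y = 3x, so the optimum is w = 3.
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(1000))]

w = 0.0
full = grad_mse(w, data)                      # batch GD: all 1000 samples, deterministic
single = grad_mse(w, [random.choice(data)])   # online SGD: one sample, high variance
mini = grad_mse(w, random.sample(data, 64))   # mini-batch SGD: 64 samples, lower variance

print(f"full={full:.3f} single={single:.3f} mini={mini:.3f}")
```

All three estimates point in the same direction here (toward larger w); the single-sample one simply scatters much more widely around the full-batch value.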
Update rule
Vanilla SGD: θ ← θ − η ∇L(θ; x_i, y_i).
Momentum: v ← μv + ∇L; θ ← θ − ηv.
Nesterov: lookahead momentum; the gradient is evaluated at the lookahead point: v ← μv + ∇L(θ − ημv); θ ← θ − ηv.
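The three update rules can be sketched in a few lines. This is an illustrative version on a toy loss L(θ) = 0.5·θ² (so ∇L(θ) = θ), using the same convention as above where the velocity accumulates raw gradients; the function names and the values η = 0.1, μ = 0.9 are assumptions made for the example:

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta**2.
    return theta

def sgd_step(theta, lr):
    return theta - lr * grad(theta)

def momentum_step(theta, v, lr, mu):
    v = mu * v + grad(theta)                 # accumulate gradient into the velocity
    return theta - lr * v, v

def nesterov_step(theta, v, lr, mu):
    lookahead = theta - lr * mu * v          # where the current momentum alone would land
    v = mu * v + grad(lookahead)             # gradient taken at the lookahead point
    return theta - lr * v, v

lr, mu, steps = 0.1, 0.9, 200
t_sgd = 5.0
t_mom, v_mom = 5.0, 0.0
t_nag, v_nag = 5.0, 0.0
for _ in range(steps):
    t_sgd = sgd_step(t_sgd, lr)
    t_mom, v_mom = momentum_step(t_mom, v_mom, lr, mu)
    t_nag, v_nag = nesterov_step(t_nag, v_nag, lr, mu)

print(t_sgd, t_mom, t_nag)  # all three end close to the minimum at 0
```

On a well-conditioned 1-D quadratic like this one, momentum offers no real advantage; its benefit shows up on ill-conditioned or noisy objectives, which this sketch does not model.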
Modern variants
AdamW (Loshchilov 2019): adaptive learning rate plus decoupled weight decay; the default for LLMs and transformers.
Lion (Chen 2023): sign-based momentum; less optimizer memory than AdamW, with comparable results.
Sophia (2023): second-order; aimed at LLM pretraining.
Muon (Jordan 2024): orthogonalized momentum; still emerging.
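To make "decoupled weight decay" and "sign-based momentum" concrete, here is a single-parameter sketch of one AdamW step and one Lion step. It follows the update rules as published, but the scalar form, the function names, and the default hyperparameters shown are simplifications chosen for illustration, not a library API:

```python
import math

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step for a scalar parameter; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g                # first-moment estimate
    v = b2 * v + (1 - b2) * g * g            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    # Weight decay is applied to theta directly, not folded into the gradient.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion step for a scalar parameter; only one state variable (m) is kept."""
    update = sign(b1 * m + (1 - b1) * g)     # only the sign of the blend is used
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * g                # momentum is updated after the step
    return theta, m

theta_a, m_a, v_a = adamw_step(1.0, 2.0, 0.0, 0.0, t=1, wd=0.0)
theta_l, m_l = lion_step(1.0, 2.0, 0.0, wd=0.0)
print(theta_a, theta_l)
```

AdamW carries two state values per parameter (m and v) while Lion carries one, which is where Lion's memory saving comes from.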
When to use: as the default optimizer choice for model training; when debugging convergence (loss spikes, plateaus).
When not to use: a closed-form solution exists (small linear regression: use the normal equation); a second-order method is needed (small classical ML).
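As an illustration of the closed-form case, the sketch below solves a small least-squares fit y ≈ a·x + b directly from the 2×2 normal equations, with no iteration and no learning rate. The data and the function name `fit_line` are made up for the example:

```python
def fit_line(xs, ys):
    """Solve the normal equations (X^T X) w = X^T y for a line y = a*x + b."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx                  # determinant of X^T X
    a = (n * sxy - sx * sy) / det            # slope
    b = (sxx * sy - sx * sxy) / det          # intercept
    return a, b

a, b = fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
print(a, b)
```

For problems this small, a direct solve is exact up to floating-point error, so SGD only adds tuning overhead.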
❌ Anti-patterns
lr too high: loss explosion / NaN. Use learning-rate warmup and gradient clipping.
No weight decay: risk of overfitting.
Momentum with lr too high: oscillation.
AdamW lr=1e-3 for LLM: too high; 1e-4 to 3e-4 is the standard range.
Batch size 1 on GPU: underutilization. Use a batch size of 32 or more.
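The two usual remedies for a too-aggressive learning rate, warmup and gradient clipping, can be sketched as follows. This is a framework-free illustration: `warmup_lr` and `clip_by_norm` are hypothetical helper names, and the linear warmup shape and the values 3e-4 and 100 steps are assumptions rather than recommendations from the notes:

```python
import math

def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup: ramp from base_lr / warmup_steps up to base_lr, then hold."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

lr_first = warmup_lr(0, base_lr=3e-4, warmup_steps=100)     # first step: small lr
lr_late = warmup_lr(1000, base_lr=3e-4, warmup_steps=100)   # after warmup: full lr
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)            # norm 5 rescaled to norm 1
print(lr_first, lr_late, clipped)
```

Clipping by global norm preserves the gradient's direction and only caps its length, so a single outlier batch cannot throw the parameters far off.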