---
id: wiki-2026-0508-stochastic-gradient-descent
title: Stochastic Gradient Descent
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [SGD, Mini-batch SGD, Stochastic Gradient Descent]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [machine-learning, optimization, deep-learning, gradient-descent]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Stochastic Gradient Descent (SGD)

## In one line
> **"Take each step along the gradient of a single sample (or mini-batch): noisy but cheap, and the noise can help escape poor local minima."** A descendant of the stochastic approximation method of Robbins & Monro (1951). As of 2026 it is the foundation of deep learning: SGD with momentum, AdamW, and Lion are the usual default choices.

## Key points

### SGD vs. full-batch
- **Batch GD**: gradient over the entire dataset; expensive but deterministic.
- **SGD (online)**: one sample per step; noisy but fast.
- **Mini-batch SGD**: 32–4096 samples per step; the modern default, because it vectorizes well on GPUs.

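The trade-off between these regimes can be checked numerically. The sketch below is a minimal NumPy illustration on synthetic least-squares data; the data, the helper name `minibatch_grad_error`, and the batch sizes are assumptions made for illustration. It measures how far a mini-batch gradient lands from the full-batch gradient on average.

```python
import numpy as np

def minibatch_grad_error(X, y, w, batch, trials=200, seed=0):
    """Mean squared distance between mini-batch and full-batch gradients.

    Hypothetical helper for illustration; the loss is 0.5 * mean squared error.
    """
    rng = np.random.default_rng(seed)
    full = X.T @ (X @ w - y) / len(X)            # deterministic full-batch gradient
    errs = []
    for _ in range(trials):
        b = rng.choice(len(X), size=batch, replace=False)
        g = X[b].T @ (X[b] @ w - y[b]) / batch   # noisy mini-batch estimate
        errs.append(np.sum((g - full) ** 2))
    return float(np.mean(errs))

# Synthetic data (assumed): 4096 samples, 8 features, noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(4096, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=4096)
w = np.zeros(8)

for batch in (1, 32, 1024):
    print(batch, minibatch_grad_error(X, y, w, batch))
```

The error shrinks roughly in proportion to 1/batch, which is why mini-batches buy a much less noisy direction for a modest increase in cost per step.
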
### Update rules
- Vanilla SGD: `θ ← θ − η ∇L(θ; x_i, y_i)`.
- Momentum: `v ← μv + ∇L; θ ← θ − ηv`.
- Nesterov: lookahead momentum (the step also anticipates where the velocity is about to carry the parameters).

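The three rules can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the function names and the toy quadratic objective are assumptions, and the Nesterov variant uses the formulation in which the step is taken along `grad + mu * v`, which is the form PyTorch's `SGD(nesterov=True)` applies.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Vanilla SGD: theta <- theta - lr * grad."""
    return theta - lr * grad

def momentum_step(theta, v, grad, lr, mu=0.9):
    """Heavy-ball momentum: v <- mu*v + grad; theta <- theta - lr*v."""
    v = mu * v + grad
    return theta - lr * v, v

def nesterov_step(theta, v, grad, lr, mu=0.9):
    """Nesterov momentum: update v, then step along grad + mu*v (lookahead form)."""
    v = mu * v + grad
    return theta - lr * (grad + mu * v), v

# Toy objective (assumed): f(theta) = 0.5 * theta^T A theta, with gradient A @ theta.
A = np.diag([1.0, 10.0])

def grad_fn(theta):
    return A @ theta

theta_m = theta_n = np.array([1.0, 1.0])
v_m = v_n = np.zeros(2)
for _ in range(200):
    theta_m, v_m = momentum_step(theta_m, v_m, grad_fn(theta_m), lr=0.01)
    theta_n, v_n = nesterov_step(theta_n, v_n, grad_fn(theta_n), lr=0.01)

# Both runs should end close to the minimizer at the origin.
print(np.linalg.norm(theta_m), np.linalg.norm(theta_n))
```
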
### Modern variants
- **AdamW** (Loshchilov 2019): adaptive learning rates plus decoupled weight decay; the default for LLMs and transformers.
- **Lion** (Chen 2023): sign-based momentum; less optimizer memory with comparable results.
- **Sophia** (2023): second-order, using a diagonal Hessian estimate; aimed at LLM pretraining.
- **Muon** (Jordan 2024): orthogonalized momentum; still emerging.

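To make "sign-based momentum" concrete, here is a minimal NumPy sketch of a single Lion update following the rule published by Chen et al. (2023). The function name `lion_step` and the example values are assumptions for illustration; this is not the API of any library.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update (sketch): sign of an interpolated momentum, decoupled decay."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)      # only the sign is used
    theta = theta - lr * (update + weight_decay * theta)  # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad                    # single momentum buffer
    return theta, m

theta = np.array([0.5, -2.0, 3.0])
m = np.zeros_like(theta)
grad = np.array([10.0, -0.001, 0.0])   # very different magnitudes on purpose

theta_new, m = lion_step(theta, m, grad, lr=0.1)
# Each coordinate moves by exactly lr regardless of gradient magnitude,
# and stays put where the gradient is zero.
print(theta_new)
```

Because only one buffer (`m`) is kept per parameter, versus two for Adam-style optimizers, the optimizer state is roughly half the size.
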
### Applications
1. Neural network training (all of deep learning).
2. Logistic regression, linear regression at scale.
3. Online learning / streaming data.

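## 💻 Patterns

### PyTorch 2.5: SGD with momentum
```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data (assumed): swap in a real dataset such as MNIST.
dataset = TensorDataset(torch.randn(512, 784), torch.randint(0, 10, (512,)))
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()         # clear gradients from the previous step
        loss = loss_fn(model(x), y)   # forward pass on one mini-batch
        loss.backward()               # backpropagate
        optimizer.step()              # Nesterov momentum update
```
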
### AdamW (transformer default 2026)
```python
from torch import optim

optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # fused kernel; parameters must live on a supported device (e.g. CUDA)
)
```

### Cosine LR schedule
```python
from torch.optim.lr_scheduler import CosineAnnealingLR

num_steps = 10_000  # placeholder: total number of optimizer steps
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)

for step in range(num_steps):
    optimizer.zero_grad()
    train_step()       # placeholder: forward pass + loss.backward()
    optimizer.step()
    scheduler.step()   # once per step, after optimizer.step()
```

### Linear warmup + cosine decay (LLM standard)
```python
import math
from torch import optim

warmup = 1_000    # warmup steps
total = 100_000   # placeholder: total training steps

def lr_lambda(step):
    if step < warmup:
        return step / warmup                          # linear ramp from 0 to 1
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay from 1 to 0

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

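The shape of this schedule can be sanity-checked without PyTorch by evaluating the multiplier directly. In the sketch below, `warmup=1000` follows the note's stated default, while `total=10_000` and the function name `warmup_cosine` are assumed for illustration. The sketch clamps progress at 1 so that steps past `total` stay at the floor.

```python
import math

def warmup_cosine(step, warmup=1000, total=10_000):
    """LR multiplier in [0, 1]: linear warmup, then cosine decay to 0."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 3e-4  # AdamW learning rate used elsewhere in this note
for step in (0, 500, 1000, 5500, 10_000):
    print(step, base_lr * warmup_cosine(step))
```

The multiplier is 0 at step 0, 0.5 halfway through warmup, 1 at the end of warmup, about 0.5 at the midpoint of the decay, and about 0 at `total`.
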
### Gradient clipping (stability)
```python
loss.backward()
# Clip between backward() and step(): caps the global L2 norm of all gradients at 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

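What norm clipping computes can be sketched in NumPy. This is an illustration of the idea, with an assumed helper name `clip_by_global_norm`, not PyTorch's implementation: all gradients are scaled by one shared factor, so the update direction is preserved and only its length is capped.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scales gradients up
    return [g * scale for g in grads], float(total_norm)

# Two parameter tensors whose joint norm is sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```
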
### Mixed precision SGD (bf16, H100)
```python
# bf16 keeps the same exponent range as fp32, so no GradScaler is needed.
# (For fp16, wrap backward/step with torch.amp.GradScaler("cuda") instead.)
optimizer.zero_grad()
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```

### Pure NumPy SGD (linear regression)
```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch=32):
    """Mini-batch SGD for least squares, minimizing 0.5 * mean((X @ w - y) ** 2)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))   # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i + batch]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w
```

### Lion optimizer (2026 alt)
```python
# pip install lion-pytorch
from lion_pytorch import Lion

optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)
```

## Decision criteria
| Situation | Approach |
|---|---|
| Image classification (ResNet, ViT) | SGD + momentum + cosine |
| LLM / Transformer training | AdamW + linear warmup + cosine |
| Memory-constrained large model | Lion or 8-bit Adam (bitsandbytes) |
| Convex optimization, theoretical guarantee | Vanilla SGD with decreasing lr |
| Online streaming data | Mini-batch SGD, lr ~ 1/sqrt(t) |

**Defaults**: for transformers/LLMs, AdamW with lr 3e-4, 1k warmup steps, and cosine decay. For CNNs, SGD with lr 0.1, momentum 0.9, and cosine decay.

## 🔗 Graph
- Parents: [[Gradient Descent]] · [[Optimization]]
- Variants: [[Adam]] · [[AdamW]]
- Applications: [[Deep Learning]]
- Adjacent: [[Gradient Clipping]] · [[데이터_사이언스_및_ML_엔지니어링|Backpropagation]]

## 🤖 LLM usage
**When**: choosing the default optimizer for model training; debugging convergence (loss spikes, plateaus).
**When not**: a closed-form solution exists (small linear regression: use the normal equation); second-order methods are needed (small classical ML problems).

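For the closed-form case, a minimal NumPy sketch is shown below. The synthetic data and variable names are assumptions for illustration. `np.linalg.lstsq` returns the least-squares solution, which is the same solution the normal equations `X^T X w = X^T y` define, computed in a numerically stable way.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # small synthetic design matrix (assumed)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=200)

# One direct solve, no learning rate or epochs to tune.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same answer from the normal equations written out explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
print(w_closed, w_normal)
```

At this size the direct solve is both exact (up to floating point) and faster than iterating, which is why SGD is not the right tool here.
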
## ❌ Anti-patterns
- **lr too high**: loss explosion / NaN. Use warmup and gradient clipping.
- **No weight decay**: can lead to overfitting.
- **Momentum with lr too high**: oscillation.
- **AdamW lr=1e-3 for LLM**: too high; 1e-4 to 3e-4 is standard.
- **Batch size 1 on GPU**: underutilizes the device. Use 32 or more.

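The "lr too high" failure is easy to reproduce on a one-dimensional quadratic. The sketch below is a toy illustration with assumed names and values (`run_gd`, `curvature=10.0`); real loss surfaces are not quadratic, but the same mechanism applies along their steepest directions.

```python
def run_gd(lr, curvature=10.0, steps=50, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2; returns |x| after `steps`."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x   # gradient of f is curvature * x
    return abs(x)

# Each step multiplies x by (1 - lr * curvature), so the iterate shrinks only
# when lr < 2 / curvature = 0.2. Just below that bound it oscillates in sign
# while slowly decaying; just above it the oscillation grows without bound.
for lr in (0.05, 0.19, 0.21):
    print(lr, run_gd(lr))
```
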
## 🧪 Verification / duplicates
- Verified (PyTorch docs 2.5; Goodfellow *Deep Learning* ch.8; Loshchilov AdamW 2019).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: SGD + modern variants (AdamW, Lion, Muon) for 2026 |