---
id: wiki-2026-0508-stochastic-gradient-descent
title: Stochastic Gradient Descent
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [SGD, Mini-batch SGD, Stochastic Gradient Descent]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [machine-learning, optimization, deep-learning, gradient-descent]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---
# Stochastic Gradient Descent (SGD)
## One-line summary
> **"Take each step using the gradient of a single sample (or mini-batch): noisy but cheap, and the noise helps escape local minima."** A descendant of the stochastic approximation method of Robbins & Monro (1951). As of 2026 it is the foundation of deep learning: SGD+momentum, AdamW, and Lion are the common defaults.
## Key points
### SGD vs. full-batch
- **Batch GD**: gradient over the entire dataset; expensive, deterministic.
- **SGD (online)**: a single sample per step; noisy, fast.
- **Mini-batch SGD**: 32 to 4096 samples per step; the modern default, since batches vectorize well on GPUs.
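
The trade-off in the list above can be checked numerically. A minimal NumPy sketch, assuming a synthetic linear-regression problem (the data, `mse_grad`, and `mean_error` are illustrative names, not from any library): the mini-batch gradient is a noisy estimate of the full-batch gradient, and the noise shrinks as the batch grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                   # synthetic features (assumption)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)
w = np.zeros(5)                                  # current parameters

def mse_grad(Xb, yb, w):
    """Gradient of 0.5 * mean squared error on the batch (Xb, yb)."""
    return Xb.T @ (Xb @ w - yb) / len(Xb)

full_grad = mse_grad(X, y, w)                    # batch GD: uses every row

def mean_error(batch_size, trials=200):
    """Average distance between a mini-batch gradient and the full gradient."""
    errors = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        errors.append(np.linalg.norm(mse_grad(X[idx], y[idx], w) - full_grad))
    return float(np.mean(errors))

err_1, err_32, err_256 = mean_error(1), mean_error(32), mean_error(256)
print(err_1, err_32, err_256)                    # noise decreases with batch size
```

Each mini-batch gradient costs `batch_size / len(X)` of a full pass, which is why the noisier estimate is usually the better deal per unit of compute.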
### Update rule
- Vanilla SGD: `θ ← θ − η ∇L(θ; x_i, y_i)`.
- Momentum: `v ← μv + ∇L; θ ← θ − ηv`.
- Nesterov: lookahead momentum (the gradient is evaluated at the point the momentum step would reach).
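
The three rules above, written out on a toy loss. This is a sketch under assumptions: the loss `L(θ) = 0.5·θ²` and the `run` helper are made up for illustration, and the Nesterov line uses the textbook lookahead form, which differs slightly from the reformulation PyTorch implements.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = 0.5 * theta**2."""
    return theta

def run(rule, steps=300, lr=0.1, mu=0.9, theta0=5.0):
    theta, v = theta0, 0.0
    for _ in range(steps):
        if rule == "vanilla":
            theta = theta - lr * grad(theta)
        elif rule == "momentum":
            v = mu * v + grad(theta)                  # accumulate gradients
            theta = theta - lr * v
        elif rule == "nesterov":
            # gradient taken at the lookahead point theta - lr * mu * v
            v = mu * v + grad(theta - lr * mu * v)
            theta = theta - lr * v
        else:
            raise ValueError(f"unknown rule: {rule}")
    return theta

for rule in ("vanilla", "momentum", "nesterov"):
    print(rule, run(rule))                            # all approach the minimum at 0
```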
### Modern variants
- **AdamW** (Loshchilov & Hutter 2019): adaptive lr + decoupled weight decay; the LLM/transformer default.
- **Lion** (Chen et al. 2023): sign-based momentum; less optimizer memory, comparable results reported.
- **Sophia** (Liu et al. 2023): second-order; aimed at LLM pretraining.
- **Muon** (Jordan 2024): orthogonalized momentum; emerging.
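
To make "decoupled weight decay" and "sign-based momentum" concrete, here is a rough NumPy sketch of one AdamW step and one Lion step as described in their papers. The function names and default values are illustrative assumptions; real implementations (`torch.optim.AdamW`, `lion_pytorch.Lion`) add details not shown here.

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g              # first-moment estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * wd * theta        # decay applied to weights, not via the gradient
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def lion_step(theta, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion update; keeps a single state buffer m."""
    update = np.sign(b1 * m + (1 - b1) * g)   # only the sign is used
    theta = theta - lr * (update + wd * theta)
    m = b2 * m + (1 - b2) * g
    return theta, m

theta = np.array([1.0, -2.0])
g = np.array([0.5, -0.1])
theta_adamw, m_a, v_a = adamw_step(theta, g, np.zeros(2), np.zeros(2), t=1)
theta_lion, m_l = lion_step(theta, g, np.zeros(2))
```

AdamW carries two state buffers per parameter (`m`, `v`) while Lion carries one, which is where the memory saving comes from.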
### Applications
1. Neural network training (all of deep learning).
2. Logistic regression, linear regression at scale.
3. Online learning / streaming data.
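
As an illustration of item 2, a minimal mini-batch SGD logistic regression in NumPy. The synthetic data, `logreg_sgd`, and its hyperparameters are assumptions chosen for the sketch, not a recommended configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_sgd(X, y, lr=0.1, epochs=20, batch=32, seed=0):
    """Fit logistic-regression weights (no bias term) with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))             # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i + batch]
            p = sigmoid(X[b] @ w)                 # predicted probabilities
            w -= lr * X[b].T @ (p - y[b]) / len(b)  # gradient of mean log-loss
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)  # synthetic, linearly separable labels
w = logreg_sgd(X, y)
accuracy = float(np.mean((sigmoid(X @ w) > 0.5) == (y == 1.0)))
print(accuracy)
```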
## 💻 Patterns
### PyTorch 2.5 — SGD with momentum
```python
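import torch
from torch import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

# `dataloader` is assumed to yield (inputs, labels) batches,
# e.g. a torch.utils.data.DataLoader over flattened 28x28 images.
for epoch in range(10):
    for x, y in dataloader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(x), y)    # forward pass on the mini-batch
        loss.backward()                # backpropagate
        optimizer.step()               # SGD update with Nesterov momentum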
```
### AdamW (transformer default 2026)
```python
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # fused kernel; parameters must be on a supported device (e.g. CUDA)
)
```
### Cosine LR schedule
```python
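from torch.optim.lr_scheduler import CosineAnnealingLR

num_steps = 10_000  # assumed total number of optimizer steps
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)
for step in range(num_steps):
    train_step()       # hypothetical helper: zero_grad, forward pass, loss.backward()
    optimizer.step()
    scheduler.step()   # once per step, after optimizer.step()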
```
### Linear warmup + cosine decay (LLM standard)
```python
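import math

warmup, total = 1_000, 100_000  # assumed step counts; set these for your run

def lr_lambda(step):
    # Linear warmup from 0 to the base lr, then cosine decay to 0.
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)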
```
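
To see the shape of the schedule without building an optimizer, the multiplier can be evaluated directly. A small pure-Python sketch; `lr_multiplier`, the step counts, and `base_lr` are illustrative assumptions.

```python
import math

def lr_multiplier(step, warmup=1000, total=10000):
    """Linear warmup to 1.0, then cosine decay to 0.0 at `total` steps."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

base_lr = 3e-4  # assumed peak learning rate
schedule = [(s, base_lr * lr_multiplier(s)) for s in (0, 500, 1000, 5500, 10000)]
for step, lr in schedule:
    print(step, lr)
```

The multiplier starts at 0, reaches 1 at the end of warmup, passes through one half midway through the decay phase, and ends at 0.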
### Gradient clipping (stability)
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
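
What norm clipping does, as a rough NumPy sketch: compute one global L2 norm over all gradients and rescale them together when it exceeds `max_norm`. `clip_by_global_norm` is a hypothetical helper written for illustration; it mirrors the idea behind `clip_grad_norm_`, not its exact code.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([4.0])]        # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Because every tensor is scaled by the same factor, the gradient direction is preserved; only its length changes.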
### Mixed precision SGD (bf16, H100)
```python
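# GradScaler is required for float16 autocast; with bfloat16 it is usually
# unnecessary, and plain loss.backward() / optimizer.step() also works.
scaler = torch.amp.GradScaler("cuda")
optimizer.zero_grad()
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()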
```
### Pure NumPy SGD (linear regression)
```python
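import numpy as np

def sgd(X, y, lr=0.01, epochs=100, batch=32):
    """Mini-batch SGD for least-squares linear regression (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))   # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i+batch]
            # gradient of 0.5 * mean squared error on the mini-batch
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w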
```
### Lion optimizer (2026 alt)
```python
# pip install lion-pytorch
from lion_pytorch import Lion
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)
```
## Decision criteria
| Situation | Approach |
|---|---|
| Image classification (ResNet, ViT) | SGD + momentum + cosine |
| LLM / Transformer training | AdamW + linear warmup + cosine |
| Memory-constrained large model | Lion or 8-bit Adam (bitsandbytes) |
| Convex optimization, theoretical guarantee | Vanilla SGD with decreasing lr |
| Online streaming data | Mini-batch SGD, lr ~ 1/sqrt(t) |
**Defaults**: transformer/LLM → AdamW with lr 3e-4 + 1k warmup steps + cosine decay. CNN → SGD with lr 0.1 + momentum 0.9 + cosine decay.
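
For the streaming row of the table, a minimal sketch of online SGD with a `1/sqrt(t)` step size. It assumes a synthetic linear stream; `online_sgd`, `synthetic_stream`, and `lr0` are illustrative names and values, and each sample is seen exactly once.

```python
import numpy as np

def online_sgd(stream, dim, lr0=0.1):
    """One pass over a stream of (x, y) pairs with lr_t = lr0 / sqrt(t)."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(stream, start=1):
        lr = lr0 / np.sqrt(t)            # decaying step size
        w -= lr * (x @ w - y) * x        # squared-error gradient for one sample
    return w

def synthetic_stream(n, true_w, rng, noise=0.01):
    """Hypothetical data source: yields n noisy linear samples."""
    for _ in range(n):
        x = rng.normal(size=len(true_w))
        yield x, x @ true_w + noise * rng.normal()

true_w = np.array([2.0, -1.0, 0.5])
rng = np.random.default_rng(0)
w_hat = online_sgd(synthetic_stream(5000, true_w, rng), dim=3)
print(w_hat)
```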
## 🔗 Graph
- Parent: [[Gradient Descent]] · [[Optimization]]
- Variants: [[Adam]] · [[AdamW]]
- Applications: [[Deep Learning]]
- Adjacent: [[Gradient Clipping]] · [[데이터_사이언스_및_ML_엔지니어링|Backpropagation]]
## 🤖 LLM usage
**When**: as the default optimizer choice for model training; when debugging convergence (loss spikes, plateaus).
**When not**: a closed-form solution exists (small linear regression: use the normal equation); a second-order method is needed (small classical ML problems).
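
For the closed-form case, a short NumPy sketch on assumed synthetic data. `np.linalg.lstsq` solves the least-squares problem directly (the same solution the normal equation gives), so no learning rate or epochs are involved.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # small synthetic design matrix (assumption)
true_w = np.array([1.5, -0.5, 2.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

# Minimises ||X w - y||^2 in one call; numerically safer than inverting X.T @ X.
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_closed)
```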
## ❌ Anti-patterns
- **lr too high**: loss explosion / NaN. Use warmup + gradient clipping.
- **No weight decay**: overfitting.
- **Momentum with lr too high**: oscillation.
- **AdamW lr=1e-3 for LLM**: too high; 1e-4 to 3e-4 is standard.
- **Batch size 1 on GPU**: underutilization. Use 32 or more.
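
The first anti-pattern can be reproduced on a toy quadratic. A sketch assuming the loss `0.5 * curvature * x**2` (the helper name is made up): plain gradient steps contract only while `lr < 2 / curvature`, and blow up beyond that.

```python
def gd_on_quadratic(lr, curvature=10.0, steps=50, x0=1.0):
    """Run plain gradient steps on 0.5 * curvature * x**2; return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x          # each step multiplies x by (1 - lr * curvature)
    return abs(x)

for lr in (0.05, 0.19, 0.25):            # stability threshold here is 2 / 10 = 0.2
    print(lr, gd_on_quadratic(lr))
```

In a deep network the curvature varies by direction and over time, which is why warmup and clipping are used instead of a fixed threshold.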
## 🧪 Verification / duplicates
- Verified (PyTorch docs 2.5; Goodfellow *Deep Learning* ch.8; Loshchilov AdamW 2019).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — SGD + modern variants (AdamW, Lion, Muon) for 2026 |