---
id: wiki-2026-0508-stochastic-gradient-descent
title: Stochastic Gradient Descent
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [SGD, Mini-batch SGD, Stochastic Gradient Descent]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [machine-learning, optimization, deep-learning, gradient-descent]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Stochastic Gradient Descent (SGD)

## In one line
> **"Take each step using the gradient of a single sample (or mini-batch): noisy but cheap, and the noise can help escape local minima."** A descendant of Robbins & Monro's (1951) stochastic approximation. The foundation of deep learning in 2026: SGD + momentum, AdamW, and Lion are the common defaults.

## Key points

### SGD vs full-batch
- **Batch GD**: gradient over the entire dataset. Expensive per step, deterministic.
- **SGD (online)**: one sample per step. Noisy, cheap per step.
- **Mini-batch SGD**: typically 32–4096 samples per step. The modern default, because batches vectorize well on GPUs.
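
The trade-off in the list above can be checked numerically. The following is a minimal sketch, assuming a toy one-dimensional least-squares problem; the data and the helper names `grad` and `grad_std` are illustrative and not part of any library. It estimates how noisy the gradient is at a fixed weight for batch sizes 1 and 64.

```python
import random
import statistics

random.seed(0)

# Toy data (assumption): y = 3x + noise, per-sample loss 0.5 * (w*x - y)^2.
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

def grad(w, idx):
    """Mean gradient of the loss w.r.t. w over the samples in idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def grad_std(w, batch_size, trials=500):
    """Empirical standard deviation of the mini-batch gradient at a fixed w."""
    estimates = [
        grad(w, random.sample(range(len(xs)), batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

w = 0.0
full = grad(w, range(len(xs)))   # deterministic full-batch gradient
std_1 = grad_std(w, 1)           # single-sample SGD
std_64 = grad_std(w, 64)         # mini-batch SGD
print(f"full={full:.3f}  std(b=1)={std_1:.3f}  std(b=64)={std_64:.3f}")
```

The mini-batch estimate is unbiased, and its spread shrinks roughly with the square root of the batch size, which is why moderate batches give most of the noise reduction at a fraction of the full-batch cost.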

### Update rule
- Vanilla SGD: `θ ← θ − η ∇L(θ; x_i, y_i)`.
- Momentum: `v ← μv + ∇L(θ); θ ← θ − ηv`.
- Nesterov: lookahead momentum. The gradient is evaluated at the lookahead point: `v ← μv + ∇L(θ − ημv); θ ← θ − ηv`.
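
The three rules above can be compared side by side. This is a minimal sketch on a deterministic toy loss `f(θ) = 5θ²` (an assumption chosen so the result is easy to verify; real training would use noisy mini-batch gradients). The function names `grad` and `run` are illustrative.

```python
def grad(theta):
    """Gradient of the toy loss f(theta) = 0.5 * 10 * theta**2."""
    return 10.0 * theta

def run(method, lr=0.01, mu=0.9, steps=200, theta0=1.0):
    """Apply one of the three update rules for a fixed number of steps."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        if method == "sgd":
            theta -= lr * grad(theta)
        elif method == "momentum":
            v = mu * v + grad(theta)               # accumulate gradients
            theta -= lr * v
        elif method == "nesterov":
            g = grad(theta - lr * mu * v)          # gradient at the lookahead point
            v = mu * v + g
            theta -= lr * v
        else:
            raise ValueError(f"unknown method: {method}")
    return theta

for method in ("sgd", "momentum", "nesterov"):
    print(method, run(method))
```

All three converge toward the minimum at 0 for these settings; the only difference between the momentum and Nesterov branches is where the gradient is evaluated.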

### Modern variants
- **AdamW** (Loshchilov & Hutter 2019): adaptive learning rate plus decoupled weight decay. The default for LLMs and transformers.
- **Lion** (Chen et al. 2023): sign-based momentum update. Keeps one state buffer per parameter instead of Adam's two, with comparable results reported.
- **Sophia** (2023): second-order optimizer aimed at LLM pretraining.
- **Muon** (Jordan 2024): orthogonalized momentum. Still emerging.
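
To make the first two entries concrete, here is a minimal sketch of a single AdamW step and a single Lion step on flat lists of floats. It follows the update rules as published, but it is an illustration only, not the PyTorch or `lion-pytorch` implementation; the function names and the list-based layout are assumptions made for readability.

```python
import math

def adamw_step(theta, grads, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step. `t` is the 1-based step count used for bias correction."""
    out_theta, out_m, out_v = [], [], []
    for p, g, mi, vi in zip(theta, grads, m, v):
        mi = b1 * mi + (1 - b1) * g            # first moment (momentum)
        vi = b2 * vi + (1 - b2) * g * g        # second moment (per-parameter scale)
        m_hat = mi / (1 - b1 ** t)             # bias correction
        v_hat = vi / (1 - b2 ** t)
        p = p * (1 - lr * wd)                  # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        out_theta.append(p)
        out_m.append(mi)
        out_v.append(vi)
    return out_theta, out_m, out_v

def lion_step(theta, grads, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.01):
    """One Lion step. Only one state buffer (m) is kept per parameter."""
    out_theta, out_m = [], []
    for p, g, mi in zip(theta, grads, m):
        c = b1 * mi + (1 - b1) * g
        update = (c > 0) - (c < 0)             # sign(c) in {-1, 0, 1}
        p = p - lr * (update + wd * p)         # sign update plus decoupled decay
        mi = b2 * mi + (1 - b2) * g            # momentum is updated after the step
        out_theta.append(p)
        out_m.append(mi)
    return out_theta, out_m

theta, grads, zeros = [1.0, -2.0], [0.5, -0.1], [0.0, 0.0]
print(adamw_step(theta, grads, zeros, zeros, t=1)[0])
print(lion_step(theta, grads, zeros)[0])
```

The memory difference mentioned above is visible in the signatures: `adamw_step` carries both `m` and `v`, while `lion_step` carries only `m`.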

### Applications
1. Neural network training (all of deep learning).
2. Logistic regression, linear regression at scale.
3. Online learning / streaming data.

## 💻 Patterns

### PyTorch 2.5 — SGD with momentum
```python
import torch
from torch import nn, optim

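model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.CrossEntropyLoss()

# `dataloader` is assumed to be a torch.utils.data.DataLoader yielding
# inputs of shape [B, 784] and integer class labels of shape [B].
for epoch in range(10):
    for x, y in dataloader: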
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

### AdamW (transformer default 2026)
```python
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
    fused=True,  # fused kernel; parameters must be on a supported device (e.g. CUDA)
)
```

### Cosine LR schedule
```python
from torch.optim.lr_scheduler import CosineAnnealingLR
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps, eta_min=1e-6)
for step in range(num_steps):
    train_step()       # assumed helper: zero_grad, forward, and backward only
    optimizer.step()
    scheduler.step()   # call after optimizer.step()
```

### Linear warmup + cosine decay (LLM standard)
```python
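import math

warmup, total = 1_000, 100_000  # example values, not prescriptions

def lr_lambda(step):
    if step < warmup:
        return step / warmup  # linear warmup from 0 to the base lr
    # Cosine decay to 0; clamp so steps past `total` stay at 0.
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * (1 + math.cos(math.pi * progress))
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)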
```
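
Before a long run it can be worth tabulating the schedule without PyTorch. The sketch below is a standalone version of the same multiplier; the name `warmup_cosine`, the step counts, and the base learning rate are illustrative assumptions.

```python
import math

def warmup_cosine(step, warmup=1_000, total=10_000):
    """Multiplier on the base lr: linear warmup, then cosine decay to 0."""
    if step < warmup:
        return step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 3e-4
for step in (0, 500, 1_000, 5_500, 10_000):
    print(step, base_lr * warmup_cosine(step))
```

The multiplier starts at 0, reaches 1 at the end of warmup, passes through 0.5 halfway through the decay phase, and ends at 0.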

### Gradient clipping (stability)
```python
loss.backward()
# Rescale all gradients so their global L2 norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
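
What norm-based clipping computes can be sketched in plain Python. This is an illustration under the assumption that gradients are given as nested lists (one inner list per tensor); `clip_by_global_norm` is a hypothetical helper, not a PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one shared factor so the global L2 norm <= max_norm.

    grads: list of per-tensor gradient lists. Returns (clipped, total_norm).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]          # global norm is 5
clipped, norm = clip_by_global_norm(grads)
print(norm, clipped)
```

Because one factor is applied to every gradient, the update direction is preserved and only its length changes; gradients that are already within the limit pass through untouched.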

### Mixed precision SGD (bf16, H100)
```python
# bf16 keeps fp32's exponent range, so loss scaling (GradScaler) is not needed;
# GradScaler is meant for float16 autocast.
optimizer.zero_grad(set_to_none=True)
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```

### Pure NumPy SGD (linear regression)
```python
import numpy as np
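def sgd(X, y, lr=0.01, epochs=100, batch=32):
    """Mini-batch SGD for least squares; returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(X))  # reshuffle every epoch
        for i in range(0, len(X), batch):
            b = idx[i:i+batch]
            # Gradient of 0.5 * mean squared error on the mini-batch.
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return w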
```

### Lion optimizer (2026 alt)
```python
# pip install lion-pytorch
from lion_pytorch import Lion
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=0.01)
```

## Decision criteria
| Situation | Approach |
|---|---|
| Image classification (ResNet, ViT) | SGD + momentum + cosine |
| LLM / Transformer training | AdamW + linear warmup + cosine |
| Memory-constrained large model | Lion or 8-bit Adam (bitsandbytes) |
| Convex optimization, theoretical guarantee | Vanilla SGD with decreasing lr |
| Online streaming data | Mini-batch SGD, lr ~ 1/sqrt(t) |

**Defaults**: Transformer/LLM → AdamW at lr 3e-4 with 1k warmup steps and cosine decay. CNN → SGD at lr 0.1 with momentum 0.9 and cosine decay.
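
The `lr ~ 1/sqrt(t)` row in the table can be sketched for a streaming setting. This minimal example assumes a synthetic one-dimensional stream and processes one sample at a time for brevity (a mini-batch would average several samples per step); `stream`, `online_sgd`, and `lr0` are illustrative names.

```python
import math
import random

random.seed(0)

def stream(n, w_true=2.0):
    """Hypothetical data stream: yields (x, y) with y = w_true * x + noise."""
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        yield x, w_true * x + random.gauss(0.0, 0.1)

def online_sgd(samples, lr0=0.5):
    """SGD over a stream with step size lr0 / sqrt(t)."""
    w = 0.0
    for t, (x, y) in enumerate(samples, start=1):
        lr = lr0 / math.sqrt(t)          # decaying step size
        w -= lr * (w * x - y) * x        # gradient of 0.5 * (w*x - y)^2
    return w

w = online_sgd(stream(5_000))
print(w)   # close to the true weight of 2.0
```

The decaying step size lets early steps move quickly while later steps average out the sampling noise, and no sample needs to be stored after it has been used.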

## 🔗 Graph
- Parent: [[Gradient Descent]] · [[Optimization]]
- Variants: [[Adam]] · [[AdamW]]
- Applications: [[Deep Learning]]
- Adjacent: [[Gradient Clipping]] · [[데이터_사이언스_및_ML_엔지니어링|Backpropagation]]

## 🤖 LLM usage
**When**: choosing a default optimizer for model training; debugging convergence (loss spikes, plateaus).
**When not**: a closed-form solution exists (small linear regression: use the normal equation); second-order methods are a better fit (small classical ML problems).
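
The closed-form case can be illustrated on a single-feature regression without intercept, where the normal equation reduces to `w = Σxy / Σx²`. This is a sketch on synthetic data (an assumption); for several features a linear solver such as `numpy.linalg.lstsq` would take the place of the one-line formula.

```python
import random

random.seed(0)

# Synthetic data (assumption): y = 1.5 * x + noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(500)]
ys = [1.5 * x + random.gauss(0.0, 0.1) for x in xs]

# Closed form: the normal equation for one feature, no intercept.
w_closed = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# SGD on the same problem: many passes to approach the same answer.
w_sgd = 0.0
for epoch in range(50):
    order = list(range(len(xs)))
    random.shuffle(order)
    for i in order:
        w_sgd -= 0.05 * (w_sgd * xs[i] - ys[i]) * xs[i]

print(w_closed, w_sgd)
```

The closed form gives the exact least-squares solution in one pass, while SGD with a constant step size only hovers near it, which is why SGD is the wrong tool when the problem is this small.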

## ❌ Anti-patterns
- **lr too high**: loss explosion / NaN. Mitigate with warmup and gradient clipping.
- **No weight decay**: higher risk of overfitting.
- **Momentum with lr too high**: oscillation.
- **AdamW lr=1e-3 for LLM**: too high; 1e-4 to 3e-4 is standard.
- **Batch size 1 on GPU**: underutilizes the device. Use 32 or more.
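
The first anti-pattern is easy to reproduce on a one-dimensional quadratic, where plain gradient descent is stable only while `lr < 2 / curvature`. The sketch below uses a toy loss `f(θ) = 0.5 · c · θ²` with `c = 10` (an assumption for illustration), so the threshold sits at `lr = 0.2`; `final_loss` is an illustrative helper.

```python
def final_loss(lr, steps=50, curvature=10.0, theta0=1.0):
    """Run gradient descent on f(theta) = 0.5 * curvature * theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * curvature * theta   # theta is multiplied by (1 - lr * curvature)
    return 0.5 * curvature * theta ** 2

for lr in (0.05, 0.19, 0.21):
    print(lr, final_loss(lr))
```

Just below the threshold the iterate flips sign every step but still shrinks; just above it, each step multiplies the error by a factor larger than 1 in magnitude and the loss grows without bound, which in real training surfaces as a loss spike or NaN.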

## 🧪 Verification / Duplicates
- Verified (PyTorch docs 2.5; Goodfellow *Deep Learning* ch.8; Loshchilov AdamW 2019).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — SGD + modern variants (AdamW, Lion, Muon) for 2026 |