---
id: wiki-2026-0508-optimization-in-ai
title: Optimization in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Optimizers, Gradient Descent Variants, Training Optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [optimization, sgd, adam, adamw, lr-schedule, training]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: pytorch }
---

# Optimization in AI

## In one line
Parameter-update algorithms that minimize the loss (SGD, Adam(W), Lion, second-order methods), combined with lr schedules, warmup, and gradient clipping.

## Key points
- **First-order**: SGD (+Momentum/Nesterov), Adagrad, RMSProp, Adam, **AdamW** (decoupled weight decay), Lion (sign-based).
- **Second-order**: L-BFGS, K-FAC, Shampoo, Sophia (LLM scale).
- **LR schedule**: cosine, linear-warmup-decay, OneCycle, ReduceLROnPlateau.
- **Stabilization and efficiency**: gradient clipping (by norm), gradient checkpointing, mixed precision.
- Default LLM stack (2026): AdamW + cosine decay + warmup over 0.5-3% of steps + gradient clipping at 1.0 + bf16.
- Vision: SGD-momentum or AdamW + OneCycle.
- Large models: Sophia, Shampoo, Adafactor (memory-efficient).

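To make the "decoupled weight decay" distinction concrete, here is a minimal scalar sketch of a single Adam-style step in both modes. The function name `adam_step` and the toy inputs are illustrative assumptions, not a library API; the update follows the standard Adam/AdamW formulas.

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              wd=0.0, decoupled=True):
    """One scalar Adam/AdamW step (illustrative helper); returns (p, m, v)."""
    if not decoupled:
        g = g + wd * p              # Adam + L2: decay is folded into the gradient
    m = b1 * m + (1 - b1) * g       # first-moment estimate
    v = b2 * v + (1 - b2) * g * g   # second-moment estimate
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    if decoupled:
        p = p - lr * wd * p         # AdamW: decay applied directly to the weight
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Toy case: weight 1.0, zero task gradient, wd = 0.1.
p_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, wd=0.1, decoupled=True)
p_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, wd=0.1, decoupled=False)
print(p_adamw, p_l2)
```

In the decoupled case the weight shrinks by exactly `lr * wd * p`. In the L2 case the decay term passes through the adaptive normalization, so the step size is close to `lr` regardless of `wd`.
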
## 💻 Patterns

```python
# 1. AdamW + cosine schedule + warmup (LLM standard)
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup, then cosine decay to 0 at `total`."""
    if step < warmup:
        return step / max(1, warmup)
    # Clamp so the multiplier stays at 0 if training runs past `total`.
    p = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * (1 + math.cos(math.pi * p))

# `model` is assumed to be an existing torch.nn.Module.
opt = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95),
            weight_decay=0.1)
sched = LambdaLR(opt, lambda s: warmup_cosine(s, 1000, 100_000))
```

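```python
# 2. Gradient clipping + mixed precision
# bf16 has the same exponent range as fp32, so loss scaling is only
# needed for fp16; the scaler below is a no-op when disabled.
# (torch.amp.GradScaler needs a recent PyTorch; older releases use
# torch.cuda.amp.GradScaler.)
import torch

use_fp16 = False
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16
scaler = torch.amp.GradScaler("cuda", enabled=use_fp16)

for x, y in loader:
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = model(x, y)  # assumes the model returns the loss
    scaler.scale(loss).backward()
    scaler.unscale_(opt)  # clip on true (unscaled) gradient magnitudes
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(opt)
    scaler.update()
    sched.step()  # step the schedule once per optimizer step
```
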
```python
# 3. SGD + Nesterov + OneCycle (vision baseline)
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR

opt = SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True,
          weight_decay=5e-4)
sched = OneCycleLR(opt, max_lr=0.1, total_steps=epochs * len(loader),
                   pct_start=0.1, anneal_strategy="cos")
```

```python
# 4. Lion (sign-based, lower memory)
# pip install lion-pytorch
from lion_pytorch import Lion

opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
# Relative to AdamW, use a roughly 3-10x smaller lr and a 3-10x larger wd.
```

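The reason Lion wants a smaller learning rate than AdamW can be seen from its update rule. Below is a minimal scalar sketch of one Lion step; `lion_step` and `sign` are illustrative helpers written for this note, not part of `lion-pytorch`.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(p, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One scalar Lion step (illustrative helper); returns (p, m)."""
    update = sign(b1 * m + (1 - b1) * g)   # interpolate, then keep only the sign
    p = p - lr * (update + wd * p)         # decoupled weight decay
    m = b2 * m + (1 - b2) * g              # momentum is the only optimizer state
    return p, m

# A tiny gradient and a huge gradient produce the same step size.
p_small, _ = lion_step(0.0, 1e-6, 0.0)
p_large, _ = lion_step(0.0, 1e3, 0.0)
print(p_small, p_large)
```

Every coordinate moves by exactly `lr` per step (plus decay), so the effective update norm is larger than Adam's at the same `lr`. Only one state tensor is kept per parameter, which is where the memory saving comes from.
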
```python
# 5. Adafactor (lower memory, T5/PaLM family)
from transformers.optimization import Adafactor

opt = Adafactor(model.parameters(),
                lr=None, scale_parameter=True,
                relative_step=True, warmup_init=True)
```

```python
# 6. ReduceLROnPlateau (decay when eval loss plateaus)
from torch.optim.lr_scheduler import ReduceLROnPlateau

sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3,
                          min_lr=1e-6)
for epoch in range(epochs):
    train(...)
    val_loss = evaluate(...)
    sched.step(val_loss)  # this scheduler is stepped per evaluation, with the metric
```

```python
# 7. Parameter groups: exclude bias/LayerNorm from weight decay
def param_groups(model, wd=0.1):
    decay, no_decay = [], []
    for n, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors cover biases and normalization scales/shifts.
        if p.ndim <= 1 or n.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [{"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0}]

opt = torch.optim.AdamW(param_groups(model), lr=3e-4)
```

```python
# 8. Sophia (lightweight second-order for LLMs): diagonal Hessian
# pip install Sophia-Optimizer
from sophia import SophiaG

opt = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
              rho=0.05, weight_decay=0.1)
# The Hessian estimate is refreshed every k steps.
```

## Decision criteria

| Scenario | Optimizer + schedule |
|---|---|
| LLM pretrain/finetune | AdamW + cosine + warmup, clip 1.0 |
| Memory-constrained (LLM) | Adafactor / 8-bit AdamW / Lion |
| Vision CNN | SGD-momentum + OneCycle |
| Vision Transformer | AdamW + cosine |
| GAN | Adam(β1=0.5, β2=0.999) |
| RL | Adam, lr=3e-4 is common |
| Quick experiments | Adam(W) + ReduceLROnPlateau |
| Large batch (experimental) | LAMB / Lion |

## 🔗 Graph
- Related: `[[Loss-Functions-Foundations]]`, `[[데이터_사이언스_및_ML_엔지니어링|Gradient-Descent]]`, `[[Gradient-Clipping]]`, `[[Weight-Decay]]`

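## 🤖 LLM usage
- HF `Trainer` defaults to AdamW with the `linear` schedule (linear decay, with warmup only if configured); switching to `lr_scheduler_type="cosine"` generally improves training stability.
- With DeepSpeed/FSDP, ZeRO-Offload plus 8-bit AdamW can cut GPU memory by roughly 50%; the exact saving depends on model size and configuration.
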
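As a rough back-of-the-envelope check on the optimizer-state part of that memory budget, the sketch below counts state bytes per parameter. The 7B parameter count and the helper `optimizer_state_gib` are hypothetical, and the estimate ignores weights, gradients, activations, and the small metadata overhead of 8-bit quantization, so it does not reproduce a total-GPU-memory figure.

```python
def optimizer_state_gib(n_params, states_per_param, bytes_per_state):
    """Optimizer-state memory only (no weights, gradients, or activations)."""
    return n_params * states_per_param * bytes_per_state / 2**30

n = 7_000_000_000  # hypothetical 7B-parameter model

adamw_fp32 = optimizer_state_gib(n, 2, 4)  # m and v, 4 bytes each
adamw_8bit = optimizer_state_gib(n, 2, 1)  # m and v, 1 byte each
lion_fp32 = optimizer_state_gib(n, 1, 4)   # momentum only

for name, gib in [("AdamW fp32", adamw_fp32),
                  ("AdamW 8-bit", adamw_8bit),
                  ("Lion fp32", lion_fp32)]:
    print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions, 8-bit states take a quarter of the fp32 AdamW state memory and Lion takes half; how much that moves total GPU usage depends on everything else resident on the device.
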
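## ❌ Anti-patterns
- Setting `weight_decay=0` on AdamW (the PyTorch default is 0.01) while assuming weight decay is still being applied.
- Applying weight decay to LayerNorm parameters and biases (hurts performance).
- Using AdamW with a large lr and no warmup, which leads to early divergence.
- Training a transformer without gradient clipping (intermittent NaNs).
- Stepping a step-based LR schedule once per epoch instead of once per step (the warmup loses its meaning).
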
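The last anti-pattern is easy to quantify. The sketch below re-declares a warmup-plus-cosine multiplier so it is self-contained, then compares the multiplier after two epochs under both stepping conventions; `steps_per_epoch = 500` is a hypothetical value.

```python
import math

def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup, then cosine decay."""
    if step < warmup:
        return step / max(1, warmup)
    p = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * (1 + math.cos(math.pi * p))

steps_per_epoch = 500            # hypothetical
warmup, total = 1000, 100_000    # schedule defined in optimizer steps

# Correct: scheduler stepped after every optimizer step (1000 steps in 2 epochs).
per_step = warmup_cosine(2 * steps_per_epoch, warmup, total)
# Anti-pattern: scheduler stepped once per epoch, so its counter is only 2.
per_epoch = warmup_cosine(2, warmup, total)

print(per_step, per_epoch)
```

With per-step stepping the warmup is complete after two epochs; with per-epoch stepping the multiplier is still at 0.2% of the peak, so the run effectively trains at a near-zero learning rate.
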
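## 🧪 Verification
- LR finder (Smith): increase lr exponentially while recording the loss, then read a recommended lr off the curve (the region where loss falls fastest, below the point where it diverges).
- Plot train loss and gradient norm together to check that the clipping threshold is appropriate.
- Compare bf16 and fp32 loss curves to verify numerical stability.
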
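A minimal sketch of the LR range test on a toy problem, assuming plain SGD and the loss `w**2`; `lr_range_test` is an illustrative helper, not a library function. On a real model the same sweep would run over mini-batches with the actual optimizer.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-4, lr_max=10.0, num_steps=50):
    """Sweep lr exponentially, one SGD step per lr; return [(lr, loss), ...]."""
    ratio = (lr_max / lr_min) ** (1 / (num_steps - 1))
    w, lr, history = w0, lr_min, []
    for _ in range(num_steps):
        w = w - lr * grad_fn(w)          # one SGD step at the current lr
        history.append((lr, loss_fn(w)))
        lr *= ratio                      # exponential lr increase
    return history

# Toy problem (assumption): loss = w^2, gradient = 2w.
history = lr_range_test(lambda w: 2 * w, lambda w: w * w, w0=1.0)
lr_at_min_loss, min_loss = min(history, key=lambda t: t[1])
final_lr, final_loss = history[-1]
print(f"loss bottoms out near lr={lr_at_min_loss:.3g}, then grows")
```

The loss keeps falling until the lr passes the stability limit of the problem and then blows up. The lr at the minimum is an upper bound rather than a recommendation; a common heuristic is to pick a value somewhat below it.
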
## 🕓 Changelog
- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: AdamW as the standard; added Sophia/Lion/Adafactor.