---
id: wiki-2026-0508-optimization-in-ai
title: Optimization in AI
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Optimizers, Gradient Descent Variants, Training Optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [optimization, sgd, adam, adamw, lr-schedule, training]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: { language: python, framework: pytorch }
---

# Optimization in AI

## In one line

Parameter-update algorithms that minimize the loss (SGD, Adam(W), Lion, second-order methods), combined with LR schedules, warmup, and gradient clipping.

## Key points

- **First-order**: SGD (+Momentum/Nesterov), Adagrad, RMSProp, Adam, **AdamW** (decoupled weight decay), Lion (sign-based).
- **Second-order / curvature-aware**: L-BFGS, K-FAC, Shampoo, Sophia (LLM scale).
- **LR schedule**: cosine, linear warmup + decay, OneCycle, ReduceLROnPlateau.
- **Stabilization and efficiency**: gradient clipping (by norm), gradient checkpointing (trades compute for activation memory), mixed precision.
- Default LLM stack (2026): AdamW + cosine + warmup over 0.5-3% of steps + clip 1.0 + bf16.
- Vision: SGD with momentum, or AdamW + OneCycle.
- Large models: Adafactor (memory-efficient), Sophia, Shampoo.

## 💻 Patterns

```python
# 1. AdamW + cosine schedule + warmup (LLM standard)
# Assumes `model` is an already constructed torch.nn.Module.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup:
        return step / max(1, warmup)
    p = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * p))


opt = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)
sched = LambdaLR(opt, lambda s: warmup_cosine(s, 1000, 100_000))
```

```python
# 2. Gradient clipping + mixed precision (bf16)
# Assumes `model`, `loader`, `opt`, and `sched` are already defined and that
# `model(x, y)` returns a scalar loss.
import torch

for x, y in loader:
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x, y)
    # bf16 keeps the fp32 exponent range, so no GradScaler is needed.
    # With fp16, use a GradScaler and call scaler.unscale_(opt) before clipping.
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()  # step the schedule once per optimizer step
```

```python
# 3. SGD + Nesterov + OneCycle (vision baseline)
# Assumes `model`, `loader`, and `epochs` are already defined.
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR

opt = SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True,
          weight_decay=5e-4)
# OneCycleLR must be stepped after every batch, not every epoch.
sched = OneCycleLR(opt, max_lr=0.1, total_steps=epochs * len(loader),
                   pct_start=0.1, anneal_strategy="cos")
```

```python
# 4. Lion (sign-based, lower optimizer memory)
# pip install lion-pytorch
from lion_pytorch import Lion

# Relative to AdamW, the Lion authors suggest a 3-10x smaller lr
# and a 3-10x larger weight decay.
opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
```

```python
# 5. Adafactor (lower memory, T5/PaLM family)
from transformers.optimization import Adafactor

# relative_step=True requires lr=None; Adafactor then derives its own step size.
opt = Adafactor(model.parameters(), lr=None, scale_parameter=True,
                relative_step=True, warmup_init=True)
```

```python
# 6. ReduceLROnPlateau (decay when eval loss plateaus)
# `train` and `evaluate` are placeholders for your own loops.
from torch.optim.lr_scheduler import ReduceLROnPlateau

sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3, min_lr=1e-6)
for epoch in range(epochs):
    train(...)
    val_loss = evaluate(...)
    sched.step(val_loss)  # this scheduler is stepped per evaluation, with the metric
```

```python
# 7. Parameter groups: exclude bias/LayerNorm from weight decay
import torch


def param_groups(model, wd=0.1):
    decay, no_decay = [], []
    for n, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors cover biases and normalization scales/shifts.
        if p.ndim <= 1 or n.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [{"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0}]


opt = torch.optim.AdamW(param_groups(model), lr=3e-4)
```

```python
# 8. Sophia (lightweight second-order for LLMs): diagonal Hessian estimate
# pip install Sophia-Optimizer
# The module name may differ between distributions; check the installed package.
from sophia import SophiaG

opt = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
              rho=0.05, weight_decay=0.1)
# The training loop must refresh the Hessian estimate every k steps.
```

## Decision criteria

| Scenario | Optimizer + schedule |
|---|---|
| LLM pretraining/fine-tuning | AdamW + cosine + warmup, clip 1.0 |
| Memory-constrained (LLM) | Adafactor / 8-bit AdamW / Lion |
| Vision CNN | SGD with momentum + OneCycle |
| Vision Transformer | AdamW + cosine |
| GAN | Adam (β1=0.5, β2=0.999) |
| RL | Adam, lr=3e-4 is common |
| Quick experiments | Adam(W) + ReduceLROnPlateau |
| Large batch (experimental) | LAMB / Lion |

## 🔗 Graph

- Related: `[[Loss-Functions-Foundations]]`, `[[데이터 사이언스 및 ML 엔지니어링|Gradient-Descent]]`, `[[Gradient-Clipping]]`, `[[Weight-Decay]]`

## 🤖 Use with LLMs

- HF `Trainer` defaults to AdamW with a linear schedule (`lr_scheduler_type="linear"`); warmup is off unless `warmup_steps` or `warmup_ratio` is set. Switching to `lr_scheduler_type="cosine"` often improves stability.
- With DeepSpeed or FSDP, offloading optimizer state to CPU (ZeRO-Offload in DeepSpeed) combined with 8-bit AdamW can cut GPU memory by roughly 50%; the actual saving depends on model size and configuration.

## ❌ Anti-patterns

- Setting `weight_decay=0` on AdamW (the PyTorch default is 0.01) and still assuming weight decay is being applied.
- Applying weight decay to LayerNorm parameters and biases (hurts performance).
- Using AdamW with a large lr and no warmup, which tends to diverge early in training.
- Training a transformer without gradient clipping (intermittent NaNs).
- Stepping the LR schedule once per epoch instead of once per step (warmup loses its meaning).

## 🧪 Validation

- LR finder (Smith): increase the lr exponentially, record the loss curve, and read the recommended lr off it.
- Plot train loss and gradient norm together to check that the clipping threshold is appropriate.
- Compare bf16 and fp32 loss curves to verify numerical stability.

## 🕓 Changelog

- 2026-05-08 Phase 1: initial draft.
- 2026-05-10 Manual cleanup: AdamW as the standard; added Sophia, Lion, and Adafactor.
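
## ❌ Anti-pattern sketch: stepping the schedule per epoch

The anti-pattern of stepping a step-based schedule once per epoch can be made concrete with a few numbers. The sketch below reuses `warmup_cosine` from pattern 1 with the same warmup and total values; `EPOCHS` and `STEPS_PER_EPOCH` are hypothetical values chosen only for illustration. It compares the peak LR multiplier reached when the scheduler counter advances per optimizer step against the peak reached when it advances per epoch.

```python
import math


def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


WARMUP = 1_000              # same warmup length as pattern 1
EPOCHS = 10                 # hypothetical
STEPS_PER_EPOCH = 10_000    # hypothetical
TOTAL = EPOCHS * STEPS_PER_EPOCH

# Intended use: the scheduler counter advances once per optimizer step.
per_step = [warmup_cosine(s, WARMUP, TOTAL) for s in range(TOTAL)]

# Anti-pattern: the counter advances once per epoch, so it never exceeds
# EPOCHS - 1 and the run stays at the very start of warmup.
per_epoch = [warmup_cosine(e, WARMUP, TOTAL) for e in range(EPOCHS)]

peak_per_step = max(per_step)
peak_per_epoch = max(per_epoch)
print(f"peak multiplier, stepped per step:  {peak_per_step:.3f}")
print(f"peak multiplier, stepped per epoch: {peak_per_epoch:.3f}")
```

With these assumed values the per-epoch variant never gets past the first few warmup increments, so the effective learning rate stays at a small fraction of the configured `lr` for the whole run.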
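
## 🧪 Validation sketch: LR range test

The LR finder check listed under validation can be sketched without any framework. This is a minimal illustration, not a reference implementation: the 1-D quadratic `quadratic` stands in for a real forward/backward pass on one mini-batch, the function name `lr_range_test` and its parameters are hypothetical, and the "lr at minimum loss divided by 10" rule is one common heuristic rather than the only way to read the curve.

```python
import math


def lr_range_test(loss_and_grad, w0, lr_min=1e-4, lr_max=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Plain gradient descent with an exponentially growing learning rate.

    Returns (lrs, losses). Stops early once the loss exceeds
    diverge_factor times the best loss seen so far.
    """
    growth = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    w, lr = w0, lr_min
    lrs, losses = [], []
    best = math.inf
    for _ in range(num_steps):
        loss, grad = loss_and_grad(w)
        if loss > diverge_factor * best:
            break  # the loss has blown up; further points are not useful
        best = min(best, loss)
        lrs.append(lr)
        losses.append(loss)
        w -= lr * grad   # one SGD step at the current lr
        lr *= growth     # exponential lr increase
    return lrs, losses


def quadratic(w, a=1.0):
    """Toy objective 0.5 * a * w**2 and its gradient a * w (stable for lr < 2 / a)."""
    return 0.5 * a * w * w, a * w


lrs, losses = lr_range_test(quadratic, w0=1.0)
best_idx = min(range(len(losses)), key=losses.__getitem__)
lr_at_min = lrs[best_idx]
suggested_lr = lr_at_min / 10.0  # heuristic: back off one order of magnitude
print(f"stopped after {len(lrs)} of 100 steps")
print(f"lr at minimum loss: {lr_at_min:.3f}, suggested lr: {suggested_lr:.3f}")
```

On this toy problem the minimum loss is reached right at the edge of instability, which is why the heuristic backs off from it. In a real run, `loss_and_grad` would be replaced by a training step on successive mini-batches, and the losses are usually smoothed before the curve is read.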