Optimization in AI

In one line

Parameter-update algorithms that minimize a loss (SGD, Adam(W), Lion, second-order methods), combined with LR schedules, warmup, and gradient clipping.

Key points

First-order: SGD(+Momentum/Nesterov), Adagrad, RMSProp, Adam, AdamW (decoupled WD), Lion (sign-based).
Second-order: L-BFGS, K-FAC, Shampoo, Sophia (LLM-scale).
LR schedule: cosine, linear-warmup-decay, OneCycle, ReduceLROnPlateau.
Stabilization and memory: gradient clipping (by norm), mixed precision, gradient checkpointing (saves activation memory rather than stabilizing training).
Default LLM stack (2026): AdamW + cosine + warmup over 0.5-3% of steps + clip 1.0 + bf16.
Vision: SGD-momentum or AdamW + OneCycle.
Large models: Sophia, Shampoo, Adafactor (memory-efficient).
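To make "decoupled WD" and "sign-based" concrete, here is a minimal sketch of one AdamW step and one Lion step on a single scalar weight, in plain Python. The function names and the scalar setting are illustrative assumptions, not any library's API; real optimizers apply the same arithmetic per tensor.

```python
import math

def adamw_step(w, g, m, v, t, lr=3e-4, b1=0.9, b2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update for a scalar weight; t is the 1-based step count."""
    m = b1 * m + (1 - b1) * g          # first-moment EMA
    v = b2 * v + (1 - b2) * g * g      # second-moment EMA
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * wd * w                # decoupled weight decay: not mixed into g
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def sign(x):
    return (x > 0) - (x < 0)

def lion_step(w, g, m, lr=1e-4, b1=0.9, b2=0.99, wd=1e-2):
    """One Lion update for a scalar weight."""
    update = sign(b1 * m + (1 - b1) * g)  # only the sign of the interpolation is used
    w = w - lr * (update + wd * w)        # weight decay is decoupled here as well
    m = b2 * m + (1 - b2) * g             # momentum is refreshed after the step
    return w, m

# Toy values (hypothetical) to compare the two rules.
w_adam, m1, v1 = adamw_step(w=1.0, g=2.0, m=0.0, v=0.0, t=1, lr=0.1, wd=0.0)
w_small, _ = lion_step(w=1.0, g=-0.001, m=0.0, lr=0.1, wd=0.0)
w_large, _ = lion_step(w=1.0, g=-5.0, m=0.0, lr=0.1, wd=0.0)
```

With these toy values, Lion moves the weight by the same amount for a tiny and a large gradient, since only the sign enters the update. Lion also keeps one state value per weight (m) where AdamW keeps two (m and v), which is where its memory saving comes from.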

💻 Patterns

# 1. AdamW + cosine schedule + warmup (LLM standard)
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
import math

def warmup_cosine(step, warmup, total):
    if step < warmup:
        return step / max(1, warmup)
    p = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * p))

opt = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95),
            weight_decay=0.1)
sched = LambdaLR(opt, lambda s: warmup_cosine(s, 1000, 100_000))

# 2. Gradient clipping + mixed precision (bf16)
# bf16 keeps fp32's exponent range, so no GradScaler is needed. With fp16,
# wrap backward/step in a GradScaler and call scaler.unscale_(opt) before clipping.
for x, y in loader:
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x, y)  # assumes the model returns the loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()  # once per optimizer step, not per epoch

# 3. SGD + Nesterov + OneCycle (vision baseline)
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR

opt = SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True,
          weight_decay=5e-4)
sched = OneCycleLR(opt, max_lr=0.1, total_steps=epochs * len(loader),
                   pct_start=0.1, anneal_strategy="cos")

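# 4. Lion (sign-based, lower memory)
# pip install lion-pytorch
from lion_pytorch import Lion

opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
# Relative to AdamW, use roughly 1/3 the lr and 3x the weight decay
# (the Lion authors suggest a 3-10x adjustment in each direction).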

# 5. Adafactor (lower memory, T5/PaLM family)
from transformers.optimization import Adafactor

opt = Adafactor(model.parameters(),
                lr=None, scale_parameter=True,
                relative_step=True, warmup_init=True)

# 6. ReduceLROnPlateau (decay when eval loss plateaus)
from torch.optim.lr_scheduler import ReduceLROnPlateau

sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3,
                          min_lr=1e-6)
for epoch in range(epochs):
    train(...)
    val_loss = evaluate(...)
    sched.step(val_loss)

# 7. Parameter groups: exclude bias/LayerNorm from weight decay
def param_groups(model, wd=0.1):
    decay, no_decay = [], []
    for n, p in model.named_parameters():
        if not p.requires_grad: continue
        if p.ndim <= 1 or n.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [{"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0}]

opt = torch.optim.AdamW(param_groups(model), lr=3e-4)

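# 8. Sophia (lightweight second-order for LLMs): diagonal Hessian estimate
# pip install Sophia-Optimizer
from sophia import SophiaG

opt = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
              rho=0.05, weight_decay=0.1)
# The Hessian estimate is refreshed every k steps; check the package docs
# for whether the training loop has to trigger that refresh itself.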
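Pattern 2 relies on `clip_grad_norm_` with `max_norm=1.0`. As a rough illustration of what norm-based clipping computes, the sketch below rescales a list of gradient vectors so that their joint L2 norm does not exceed the threshold. `clip_by_global_norm` is a hypothetical plain-Python helper written for this note, not the PyTorch function itself.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns (clipped_grads, total_norm_before_clipping)."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    return [[g * scale for g in vec] for vec in grads], total_norm

# Joint norm 5.0 exceeds the threshold, so every entry is scaled by about 0.2.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# Joint norm 0.5 is below the threshold, so the gradients pass through unchanged.
untouched, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
```

All parameters share one scale factor, so the direction of the overall gradient is preserved and only its length is capped. The returned pre-clip norm is the quantity worth logging when tuning the threshold.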

Decision criteria

Scenario	Optimizer + schedule
LLM pretrain/finetune	AdamW + cosine + warmup, clip 1.0
Memory-constrained (LLM)	Adafactor / 8-bit AdamW / Sophia
Vision CNN	SGD-momentum + OneCycle
Vision Transformer	AdamW + cosine
GAN	Adam(β1=0.5, β2=0.999)
RL	Adam, lr=3e-4 is common
Quick experiments	Adam(W) + ReduceLROnPlateau
Experimental large batch	LAMB / Lion

🔗 Graph

Related: [[Loss-Functions-Foundations]], [[데이터 사이언스 및 ML 엔지니어링|Gradient-Descent]], [[Gradient-Clipping]], [[Weight-Decay]]

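🤖 LLM usage

HF Trainer defaults to AdamW with a linear schedule (warmup is 0 steps unless warmup_steps or warmup_ratio is set); switching to lr_scheduler_type="cosine" generally improves stability.
With DeepSpeed/FSDP, ZeRO-Offload plus 8-bit AdamW can cut GPU memory by roughly 50%; the exact saving depends on model size and configuration.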
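A back-of-the-envelope accounting helps show where such savings come from. The sketch below counts GPU bytes per parameter for mixed-precision AdamW under common assumptions: 2-byte weights and gradients, a 4-byte fp32 master copy, and two 4-byte Adam states. It ignores activations, temporary buffers, and quantization constants, so these are illustrative figures, not measurements.

```python
def bytes_per_param(weight=2, grad=2, master=4, adam_states=8):
    """Rough GPU bytes per parameter for mixed-precision AdamW training."""
    return weight + grad + master + adam_states

# bf16 weights and grads + fp32 master copy + fp32 m and v
baseline = bytes_per_param()
# m and v stored in 8 bits each (assumption: 1 byte per state)
eight_bit = bytes_per_param(adam_states=2)
# master weights and optimizer states moved to CPU memory
offloaded = bytes_per_param(master=0, adam_states=0)

def saving(new, old=baseline):
    """Fraction of per-parameter GPU memory saved relative to the baseline."""
    return 1 - new / old

saving_8bit = saving(eight_bit)      # 0.375
saving_offload = saving(offloaded)   # 0.75
```

Under these assumptions the parameter-related footprint drops from 16 to 10 bytes with 8-bit states and to 4 bytes with full offload. Activation memory is not touched by either technique, which is one reason the end-to-end saving on a real run usually lands somewhere between the two.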

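❌ Anti-patterns

Setting weight_decay to 0 (PyTorch's AdamW default is 0.01) while assuming weight decay is still being applied.
Applying weight decay to LayerNorm and bias parameters as well (hurts performance).
Using a large lr with AdamW and no warmup → early divergence.
Training transformers without gradient clipping (intermittent NaNs).
Stepping the LR schedule once per epoch instead of once per step (warmup loses its meaning).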
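The last anti-pattern is easy to quantify. The sketch below reuses the `warmup_cosine` multiplier from pattern 1 with the same warmup and total values, and compares the multiplier after 1000 optimizer steps when the scheduler advances per step versus per epoch. `STEPS_PER_EPOCH` and `multiplier_after` are hypothetical names introduced for illustration.

```python
import math

def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup, then cosine decay to 0."""
    if step < warmup:
        return step / max(1, warmup)
    p = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * p))

WARMUP, TOTAL = 1000, 100_000   # same values as the AdamW pattern
STEPS_PER_EPOCH = 100           # hypothetical loader length

def multiplier_after(optimizer_steps, per_epoch):
    """LR multiplier after a number of optimizer steps, depending on whether
    the scheduler advances once per step or once per epoch."""
    if per_epoch:
        sched_steps = optimizer_steps // STEPS_PER_EPOCH
    else:
        sched_steps = optimizer_steps
    return warmup_cosine(sched_steps, WARMUP, TOTAL)

per_step = multiplier_after(1000, per_epoch=False)   # warmup has just finished
per_epoch = multiplier_after(1000, per_epoch=True)   # scheduler advanced only 10 times
```

After the same 1000 optimizer steps, per-step scheduling has reached the full lr, while per-epoch scheduling is still at 1% of it. A schedule defined in steps only behaves as designed when `sched.step()` is called once per optimizer step.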

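🧪 Verification

LR finder (Smith): increase the lr exponentially while recording the loss curve, then read off a recommended lr.
Plot train loss and grad norm together to check whether the clip threshold is appropriate.
Compare bf16 and fp32 loss curves to verify numerical stability.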
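The LR finder idea can be sketched on a toy problem. The code below runs plain SGD on a one-dimensional quadratic while multiplying the lr by a constant factor each step, stops once the loss blows up, and suggests the lr at which the loss fell fastest. The helper names, the divergence factor of 4, and the "steepest drop" heuristic are assumptions made for this illustration; library implementations differ in their details.

```python
def lr_range_test(loss_fn, grad_fn, w0, lr_min=1e-5, lr_max=10.0,
                  num_steps=100, diverge_factor=4.0):
    """Run SGD while growing lr geometrically from lr_min to lr_max.

    Stops early once the loss exceeds diverge_factor times the best loss seen."""
    gamma = (lr_max / lr_min) ** (1 / (num_steps - 1))
    w, lr, best = w0, lr_min, float("inf")
    lrs, losses = [], []
    for _ in range(num_steps):
        loss = loss_fn(w)
        lrs.append(lr)
        losses.append(loss)
        best = min(best, loss)
        if loss > diverge_factor * best:
            break                      # loss has blown up; stop the sweep
        w = w - lr * grad_fn(w)        # plain SGD step at the current lr
        lr *= gamma
    return lrs, losses

def suggest_lr(lrs, losses):
    """Pick the lr at which the loss dropped most between consecutive steps."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return lrs[drops.index(max(drops))]

# Toy problem: loss(w) = w^2, for which SGD diverges once lr exceeds 1.0.
lrs, losses = lr_range_test(lambda w: w * w, lambda w: 2 * w, w0=1.0)
suggested = suggest_lr(lrs, losses)
```

On this quadratic the suggested lr falls well below the divergence threshold of 1.0. On a real model, the same sweep is run over mini-batches and the loss curve is usually smoothed before reading off a value.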

🕓 Changelog

2026-05-08 Phase 1: initial draft.
2026-05-10 Manual cleanup: AdamW as the standard; added Sophia/Lion/Adafactor.
