[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -2,88 +2,169 @@
 id: wiki-2026-0508-optimization-in-ai
 title: Optimization in AI
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: [AI-OPT-CORE-001]
+aliases: [Optimizers, Gradient Descent Variants, Training Optimization]
 duplicate_of: none
 source_trust_level: A
-confidence_score: 1.0
-tags: [ai, Deep-Learning, Optimization, loss-function, training, convergence]
+confidence_score: 0.92
+verification_status: applied
+tags: [optimization, sgd, adam, adamw, lr-schedule, training]
 raw_sources: []
-last_reinforced: 2026-04-26
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
-tech_stack:
-  language: unspecified
-  framework: unspecified
+tech_stack: { language: python, framework: pytorch }
 ---

-# Optimization in AI (AI에서의 최적화)
+# Optimization in AI

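-## 📌 One-Line Insight (The Karpathy Summary)
-> "From a sea of data, unearth the optimal weights that minimize the model's 'wrong answers', and elevate machine computation into intelligent insight." Optimization is the process of iteratively adjusting a model's parameters to reduce the error (loss) between a neural network's predictions and the ground truth, so as to draw out the best performance.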
+## One line
+Parameter-update algorithms that minimize the loss (SGD, Adam(W), Lion, second-order methods), combined with LR schedules, warmup, and gradient clipping.

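-## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Empirical Risk Minimization and Gradient Flow": a balanced optimization pattern that minimizes risk on the given training data by following the gradient of the loss function, while also securing generalization performance on unseen data.
- **Three elements of AI optimization:**
-    - **Objective Function (Loss):** the target to reduce (e.g. MSE, Cross Entropy).
-    - **Optimizer:** how to reduce it (e.g. SGD, Adam, RMSProp).
-    - **[[Regularization|Regularization]]:** keeps the model from fitting the training data too closely (e.g. Dropout, Weight Decay).
- **Significance:** the practical heart of implementing intelligence, turning an AI model from a mere list of formulas into something that acquires "capability" through learning.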
+## Key points
+- **First-order**: SGD(+Momentum/Nesterov), Adagrad, RMSProp, Adam, **AdamW**(decoupled WD), Lion(sign-based).
+- **Second-order**: L-BFGS, K-FAC, Shampoo, Sophia (LLM-scale).
+- **LR schedule**: cosine, linear-warmup-decay, OneCycle, ReduceLROnPlateau.
+- **Stabilization**: gradient clipping(norm), gradient checkpointing, mixed precision.
+- Default LLM stack (2026): AdamW + cosine + warmup over 0.5~3% of steps + clip 1.0 + bf16.
+- Vision: SGD-momentum or AdamW + OneCycle.
+- Large models: Sophia, Shampoo, Adafactor (the memory-efficient option).

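-## ⚠️ Contradictions & Updates
- **Conflict with earlier data:** The goal used to be simply driving training error to zero. The established view now is that finding flat minima improves generalization, and optimization techniques that encourage this (such as SAM) are drawing attention.
- **Policy change:** For large language model training, the Antigravity project runs a standard pipeline that combines learning rate scheduling with the AdamW optimizer to balance convergence speed and final performance.
+## 💻 Patterns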

-## 🔗 Knowledge Links (Graph)
- [[Optimization-Algorithms|Optimization-Algorithms]], [[Gradient-Descent|Gradient-Descent]]-Foundations, [[Loss-Functions-Foundations|Loss-Functions-Foundations]], HyperParameter-Optimization
- **Raw Source:** 10_Wiki/Topics/AI/Optimization-in-AI.md
+```python
+# 1. AdamW + cosine schedule + warmup (LLM standard)
+import torch
+from torch.optim import AdamW
+from torch.optim.lr_scheduler import LambdaLR
+import math

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+def warmup_cosine(step, warmup, total):
+    if step < warmup:
+        return step / max(1, warmup)
+    p = (step - warmup) / max(1, total - warmup)
+    return 0.5 * (1 + math.cos(math.pi * p))

-**When to use this knowledge:**
- *(TODO)*
-
-**When not to use it:**
- *(TODO)*
-
-## 🧪 Validation
-
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
-
-## 🧬 Duplicate Check
-
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: updated the old template and filled in missing fields.
-
-## 🕓 Changelog
-
-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
-
-## 💻 Code Patterns
-
-**Pattern 1:** *(TODO: structural skeleton reflecting this project's conventions)*
-
-```text
-# TODO
+opt = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95),
+            weight_decay=0.1)
+sched = LambdaLR(opt, lambda s: warmup_cosine(s, 1000, 100_000))
 ```
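
As a quick sanity check on the schedule above, the sketch below repeats `warmup_cosine` in isolation and prints the effective learning rate at a few steps. `LambdaLR` multiplies the optimizer's base lr by the returned factor; the base lr of 3e-4 and the sampled steps are taken from, or chosen to match, the snippet above.

```python
import math


def warmup_cosine(step, warmup, total):
    """LR multiplier: linear warmup to 1.0, then cosine decay to 0.0."""
    if step < warmup:
        return step / max(1, warmup)
    p = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * p))


if __name__ == "__main__":
    base_lr = 3e-4
    # Start of warmup, mid-warmup, peak, midpoint of the decay, final step.
    for step in (0, 500, 1000, 50_500, 100_000):
        print(step, base_lr * warmup_cosine(step, 1000, 100_000))
```

Note that the multiplier is exactly 0 at step 0, so the very first optimizer step makes no update. If that matters, using `(step + 1) / warmup` in the warmup branch is one common adjustment.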

-## 🤔 Decision Criteria
+```python
+# 2. Gradient clipping + mixed precision (bf16)
+import torch

-**When to choose option A:**
- *(TODO)*
+for x, y in loader:
+    opt.zero_grad(set_to_none=True)
+    # bf16 keeps the fp32 exponent range, so no GradScaler is needed;
+    # loss scaling is only required for fp16.
+    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+        loss = model(x, y)
+    loss.backward()
+    # Clip the global gradient norm before the optimizer step.
+    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
+    opt.step()
+    sched.step()
+```
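
For reference, global-norm clipping as used in the loop above can be sketched in a few lines of plain Python. This is an illustrative reimplementation over nested lists, not the library code; the helper name `clip_grad_norm` and the list-of-lists gradient layout are assumptions.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient vectors in place so their global L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter).
    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    for vec in grads:
        for i in range(len(vec)):
            vec[i] *= scale
    return total_norm


if __name__ == "__main__":
    grads = [[3.0], [4.0]]
    print(clip_grad_norm(grads, max_norm=1.0), grads)
```

All parameters share one scale factor, so the direction of the overall gradient is preserved and only its length is capped.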

-**When to choose option B:**
- *(TODO)*
+```python
+# 3. SGD + Nesterov + OneCycle (vision baseline)
+from torch.optim import SGD
+from torch.optim.lr_scheduler import OneCycleLR

-**Default:**
-> *(TODO)*
+opt = SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True,
+          weight_decay=5e-4)
+sched = OneCycleLR(opt, max_lr=0.1, total_steps=epochs * len(loader),
+                   pct_start=0.1, anneal_strategy="cos")
+```

-## ❌ Anti-Patterns
+```python
+# 4. Lion (sign-based, lower optimizer memory)
+# pip install lion-pytorch
+from lion_pytorch import Lion

- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
+opt = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)
+# Relative to AdamW, use an lr roughly 3-10x smaller and a weight decay roughly 3-10x larger.
+```
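
The "sign-based" behaviour of Lion can be seen in a scalar sketch of its update rule. This is a minimal illustration under the assumption of a single float parameter; `lion_step` and `sign` are hypothetical helper names, and the default betas of (0.9, 0.99) follow the commonly published defaults.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(p, g, m, lr=1e-4, betas=(0.9, 0.99), wd=0.0):
    """One Lion update for a scalar parameter p with gradient g and momentum m."""
    b1, b2 = betas
    # Only the sign of the interpolated momentum is used, so every
    # coordinate moves by exactly lr regardless of gradient magnitude.
    update = sign(b1 * m + (1.0 - b1) * g)
    p = p * (1.0 - lr * wd) - lr * update   # decoupled weight decay, then step
    m = b2 * m + (1.0 - b2) * g             # momentum is the only optimizer state
    return p, m


if __name__ == "__main__":
    print(lion_step(1.0, 123.0, 0.0))
    print(lion_step(1.0, -0.001, 0.0))
```

Since each step has magnitude `lr` in every coordinate, Lion typically needs a smaller learning rate than AdamW, and it stores one state value per parameter instead of two.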
+
+```python
+# 5. Adafactor (lower memory; used in the T5/PaLM family)
+from transformers.optimization import Adafactor
+
+opt = Adafactor(model.parameters(),
+                lr=None, scale_parameter=True,
+                relative_step=True, warmup_init=True)
+```
+
+```python
+# 6. ReduceLROnPlateau (decay when eval loss plateaus)
+from torch.optim.lr_scheduler import ReduceLROnPlateau
+
+sched = ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3,
+                          min_lr=1e-6)
+for epoch in range(epochs):
+    train(...)
+    val_loss = evaluate(...)
+    sched.step(val_loss)
+```
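
The plateau logic itself is simple enough to sketch without a framework. The class below is a simplified stand-in, assuming `mode="min"` and omitting the improvement threshold and cooldown options of the real scheduler; `PlateauLR` is a hypothetical name.

```python
class PlateauLR:
    """Simplified plateau scheduler: cut lr when the metric stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce only after more than `patience` epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


if __name__ == "__main__":
    sched = PlateauLR(lr=0.1)
    for val_loss in (1.0, 0.9, 0.9, 0.9, 0.9, 0.9):
        print(val_loss, sched.step(val_loss))
```

With `patience=3`, three stagnant epochs are tolerated and the reduction happens on the fourth.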
+
+```python
+# 7. Parameter groups: exclude bias/LayerNorm from weight decay
+def param_groups(model, wd=0.1):
+    decay, no_decay = [], []
+    for n, p in model.named_parameters():
+        if not p.requires_grad: continue
+        if p.ndim <= 1 or n.endswith(".bias"):
+            no_decay.append(p)
+        else:
+            decay.append(p)
+    return [{"params": decay, "weight_decay": wd},
+            {"params": no_decay, "weight_decay": 0.0}]
+
+opt = torch.optim.AdamW(param_groups(model), lr=3e-4)
+```
+
+```python
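+# 8. Sophia (lightweight second-order for LLMs): diagonal Hessian estimate
+# pip install Sophia-Optimizer
+# NOTE: the package name and import path differ between distributions; check the README of the one you install.
+from sophia import SophiaG

+opt = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
+              rho=0.05, weight_decay=0.1)
+# The Hessian estimate is refreshed only every k steps, not on every step.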
+```
+
+## Decision criteria

+| Scenario | Optimizer + schedule |
+|---|---|
+| LLM pretrain/finetune | AdamW + cosine + warmup, clip 1.0 |
+| Memory-constrained (LLM) | Adafactor / 8-bit AdamW / Lion |
+| Vision CNN | SGD-momentum + OneCycle |
+| Vision Transformer | AdamW + cosine |
+| GAN | Adam(β1=0.5, β2=0.999) |
+| RL | Adam, lr=3e-4 is common |
+| Quick experiments | Adam(W) + ReduceLROnPlateau |
+| Experimental large-batch training | LAMB / Lion |
+
+## 🔗 Graph
+- Related: `[[Loss-Functions-Foundations]]`, `[[Gradient-Descent]]`, `[[Learning-Rate-Schedule]]`, `[[Mixed-Precision-Training]]`, `[[Gradient-Clipping]]`, `[[Weight-Decay]]`
+
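+## 🤖 LLM usage
+- HF `Trainer` defaults to AdamW with a linear-decay schedule and no warmup. Switching to `lr_scheduler_type="cosine"` and adding warmup generally improves stability.
+- With DeepSpeed/FSDP, ZeRO-Offload plus 8-bit AdamW can cut GPU memory by roughly 50%, depending on model size and configuration.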
+
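+## ❌ Anti-patterns
+- Setting `weight_decay=0` (PyTorch's AdamW default is 0.01) and still assuming weight decay is being applied.
+- Applying weight decay to LayerNorm and bias parameters as well (hurts performance).
+- A large lr with AdamW and no warmup → early divergence.
+- Training a transformer without gradient clipping (intermittent NaNs).
+- Stepping a per-step LR schedule once per epoch instead of once per step (the warmup effectively disappears).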
+
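+## 🧪 Validation
+- LR finder (Smith): increase lr exponentially while recording the loss curve, then read off a recommended lr.
+- Plot train loss and grad norm together to check whether the clip threshold is appropriate.
+- Compare bf16 and fp32 loss curves to verify numerical stability.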
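
The LR finder in the first item above can be sketched on a toy problem. This is a minimal illustration, assuming a one-dimensional quadratic loss and a "lowest recorded loss, divided by 10" heuristic for the suggestion; both the toy setup and the helper names `lr_range_test` and `suggest_lr` are assumptions, and other selection heuristics (such as the steepest-descent point) are also in common use.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-4, lr_max=10.0, num_steps=100):
    """Increase lr exponentially each step and record (lr, loss) pairs."""
    w, history, best = w0, [], float("inf")
    for i in range(num_steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (num_steps - 1))
        loss = loss_fn(w)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:          # stop once the loss clearly diverges
            break
        w = w - lr * grad_fn(w)      # plain gradient-descent step at this lr
    return history


def suggest_lr(history):
    """Heuristic: one tenth of the lr at which the recorded loss was lowest."""
    best_lr, _ = min(history, key=lambda pair: pair[1])
    return best_lr / 10.0


# Toy problem: loss(w) = 0.5 * w^2, so the gradient is simply w.
history = lr_range_test(grad_fn=lambda w: w, loss_fn=lambda w: 0.5 * w * w, w0=1.0)
suggested = suggest_lr(history)

if __name__ == "__main__":
    print(len(history), suggested)
```

In a real run `loss_fn` and `grad_fn` would be replaced by a forward/backward pass on successive mini-batches, and the loss curve is usually smoothed before a value is read off.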
+
+## 🕓 Changelog
+- 2026-05-08 Phase 1: initial draft.
+- 2026-05-10 Manual cleanup: AdamW as the standard; added Sophia/Lion/Adafactor.