---
id: wiki-2026-0508-precision-recursion
title: Precision Recursion
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Mixed-Precision Recursive Refinement, Iterative Refinement]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [numerical-methods, mixed-precision, iterative-refinement, ML]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-mlx-cuda
---

# Precision Recursion

## In one line

> **"Compute fast in lower precision → correct the residual in higher precision → recurse."** A modern variant of numerical iterative refinement: it exploits the FP8/FP16 throughput of H100/H200/MI300X while reaching FP64-equivalent accuracy on sufficiently well-conditioned problems. The classical refinement of Higham (1997) has been revived in the era of GPU mixed precision.

## Core

### Basic mechanism
```
1. Solve A x = b in low precision (FP16/FP8): fast
2. Compute residual r = b - A x in high precision (FP32/FP64)
3. Solve A d = r in low precision: fast
4. x ← x + d
5. Repeat from step 2 until ||r|| < tol
```

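### Key invariants
- **Residual computation**: must be done in high precision, because b - A x subtracts nearly equal quantities and low precision loses the result to cancellation.
- **Solve**: low precision is fine; its errors are absorbed by the refinement loop.
- **Convergence**: linear, with the error shrinking by a factor of roughly κ(A)·u_lo per iteration (u_lo is the unit roundoff of the solve precision), so it requires κ(A)·u_lo < 1.
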
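The residual and convergence invariants can be checked with a small experiment. The sketch below is illustrative only: the helper `refine`, the random test matrix, and the use of FP32 as the "low" precision are assumptions made for this demo, not part of any library. It runs the same refinement loop twice, once with an FP64 residual and once with an FP32 residual, and records the true residual norm after each step.

```python
import numpy as np


def refine(A, b, residual_dtype, iters=8):
    """Run iterative refinement; return the FP64 residual norm after each step."""
    A_lo = A.astype(np.float32)          # low-precision copy used for the solves
    x = np.zeros_like(b)
    history = []
    for _ in range(iters):
        # Residual evaluated in the requested precision.
        r = (b.astype(residual_dtype)
             - A.astype(residual_dtype) @ x.astype(residual_dtype))
        d = np.linalg.solve(A_lo, r.astype(np.float32))   # low-precision correction
        x = x + d.astype(np.float64)
        history.append(np.linalg.norm(b - A @ x))         # measured in FP64
    return history


rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

hi = refine(A, b, np.float64)   # high-precision residual
lo = refine(A, b, np.float32)   # low-precision residual
print(f"FP64 residual: {hi[-1]:.1e}   FP32 residual: {lo[-1]:.1e}")
```

With the FP64 residual the history should keep shrinking until it reaches FP64 rounding level, while the FP32 residual run stalls several orders of magnitude higher even though the solves are identical.
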
### Applications
1. **Linear solve**: GMRES-IR (Carson & Higham 2018).
2. **LLM inference**: FP8 forward pass + FP32 residual streams.
3. **Optimization**: Adam in FP16 + FP32 master weights.
4. **Eigensolve**: inverse iteration in mixed precision.

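As a minimal sketch of item 4, the code below runs inverse iteration on a symmetric matrix with the shifted solves done in FP32 and the normalization and Rayleigh quotient done in FP64. The function name `inverse_iteration_mixed`, the fixed shift, and the synthetic test matrix are assumptions chosen for illustration. A production version would factor the shifted matrix once and would likely refine each solve.

```python
import numpy as np


def inverse_iteration_mixed(A, shift, iters=30, seed=0):
    """Approximate the eigenpair of symmetric A whose eigenvalue is nearest `shift`."""
    n = A.shape[0]
    # Shifted matrix rounded to FP32: the expensive solves use this copy.
    M_lo = (A - shift * np.eye(n)).astype(np.float32)
    x = np.random.default_rng(seed).standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = np.linalg.solve(M_lo, x.astype(np.float32))   # low-precision solve
        x = y.astype(np.float64)
        x /= np.linalg.norm(x)                            # normalize in FP64
    lam = x @ (A @ x)                                     # Rayleigh quotient in FP64
    return lam, x


# Test matrix constructed to have eigenvalues 1, 2, ..., 8.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
A = Q @ np.diag(np.arange(1.0, 9.0)) @ Q.T

lam, vec = inverse_iteration_mixed(A, shift=3.3)
print(lam)
```

The eigenvector is only as accurate as the FP32 solves allow, but for a symmetric matrix the Rayleigh quotient error is second order in the eigenvector error, so the eigenvalue estimate is considerably more accurate than the vector.
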
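## 💻 Patterns

### Iterative refinement (linear solve)
```python
import numpy as np


def iterative_refinement(A, b, tol=1e-12, max_iter=10):
    """Mixed-precision linear solve via iterative refinement.

    Returns the solution and the number of correction solves performed.
    """
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Round A to FP16 to emulate low-precision storage, then promote to FP32
    # because np.linalg.solve has no float16 kernel.
    A_lo = A.astype(np.float16).astype(np.float32)
    x = np.zeros_like(b)
    solves = 0
    for _ in range(max_iter):
        r = b - A @ x                      # high-precision (FP64) residual
        if np.linalg.norm(r) < tol:
            break
        d = np.linalg.solve(A_lo, r.astype(np.float32))  # low-precision correction
        x = x + d.astype(np.float64)
        solves += 1
    return x, solves
```
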
### PyTorch AMP (Automatic Mixed Precision)
```python
import torch

# Assumes `model`, `optim`, and `loader` are already defined and on a CUDA device.
scaler = torch.amp.GradScaler("cuda")  # PyTorch >= 2.3; older: torch.cuda.amp.GradScaler()
for batch in loader:
    optim.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).loss   # FP16 forward pass
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients do not underflow
    scaler.step(optim)             # unscale grads, then update the FP32 weights
    scaler.update()                # adjust the scale factor for the next step
```

### FP8 inference + FP32 accumulation (H100)
```python
# Transformer Engine: FP8 on Hopper
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
)

# Assumes `model` is built from Transformer Engine modules (e.g. te.Linear)
# and `x` is an input tensor on the GPU.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)  # FP8 GEMMs, FP32 accumulation
```

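### GMRES with iterative refinement
```python
import numpy as np
from scipy.sparse.linalg import gmres


def gmres_ir(A, b, tol=1e-12, outer=5):
    """Outer refinement loop in FP64, inner GMRES solve in FP32.

    Simplified sketch: the inner GMRES is unpreconditioned, whereas
    Carson & Higham precondition it with low-precision LU factors.
    """
    x = np.zeros_like(b, dtype=np.float64)
    A_lo = A.astype(np.float32)
    for _ in range(outer):
        r = b - A @ x                      # FP64 residual
        norm_r = np.linalg.norm(r)
        if norm_r < tol:
            break
        # Normalize r so the inner tolerance acts relative to the residual;
        # otherwise GMRES returns d = 0 once ||r|| drops below atol.
        d, _ = gmres(A_lo, (r / norm_r).astype(np.float32), atol=1e-6)
        x = x + norm_r * d.astype(np.float64)
    return x
```
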
### Adam with FP32 master weights
```python
import torch


class MixedPrecisionAdam:
    """Adam that updates FP32 master weights and syncs them to FP16 params."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params_fp16 = list(params)  # FP16 weights used by the model
        self.params_fp32 = [p.detach().clone().float() for p in self.params_fp16]
        self.m = [torch.zeros_like(p) for p in self.params_fp32]
        self.v = [torch.zeros_like(p) for p in self.params_fp32]
        self.lr, self.betas, self.eps, self.t = lr, betas, eps, 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        # Bias-correction terms for the first and second moments.
        c1 = 1 - b1 ** self.t
        c2 = 1 - b2 ** self.t
        for p16, p32, m, v in zip(self.params_fp16, self.params_fp32, self.m, self.v):
            if p16.grad is None:
                continue
            g = p16.grad.float()                      # do all math in FP32
            m.mul_(b1).add_(g, alpha=1 - b1)
            v.mul_(b2).addcmul_(g, g, value=1 - b2)
            denom = (v / c2).sqrt().add_(self.eps)
            p32.addcdiv_(m, denom, value=-self.lr / c1)
            p16.copy_(p32)  # sync back; copy_ casts FP32 -> FP16
```

## Decision criteria
| Situation | Strategy |
|---|---|
| Ill-conditioned linear system | GMRES-IR in mixed precision |
| LLM training | AMP (FP16/BF16 + FP32 master weights) |
| Hopper / Blackwell inference | FP8 + FP32 accumulation |
| Well-conditioned system, FP64 accuracy needed | Single-precision solve + FP64 residual refinement is enough |

**Default**: BF16 forward + FP32 master weights (training), FP8 inference (Hopper and later).

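For the linear-solve rows of the table, the choice can be roughed out from the product κ(A)·u. The helper below is a heuristic sketch, not an established rule: the function name `suggest_strategy` and the `safety` threshold of 0.1 are assumptions, and computing `np.linalg.cond` exactly is itself expensive for large matrices, where a condition estimate would be used instead.

```python
import numpy as np

# Unit roundoff for each candidate solve precision (half of machine epsilon).
UNIT_ROUNDOFF = {
    "fp16": np.finfo(np.float16).eps / 2,
    "fp32": np.finfo(np.float32).eps / 2,
}


def suggest_strategy(A, safety=0.1):
    """Pick the cheapest solve precision for which kappa(A) * u stays below `safety`."""
    kappa = np.linalg.cond(A)
    for name in ("fp16", "fp32"):
        if kappa * UNIT_ROUNDOFF[name] < safety:
            return f"{name} solve + FP64 residual refinement"
    # Plain refinement is unlikely to converge; fall back to a stronger method.
    return "GMRES-IR or a full FP64 solve"


print(suggest_strategy(np.eye(4)))
print(suggest_strategy(np.diag([1.0, 1e-4])))
print(suggest_strategy(np.diag([1.0, 1e-9])))
```

The three calls exercise a perfectly conditioned matrix, a moderately ill-conditioned one (κ = 10⁴), and a badly conditioned one (κ = 10⁹).
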
## 🔗 Graph
- Variant: [[Iterative-Refinement]]

## 🤖 LLM usage
**When**: debugging numerical stability, selecting a mixed-precision recipe, analyzing condition numbers.
**When not**: integer / discrete optimization, where floating-point precision is not the relevant concept.

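## ❌ Anti-patterns
- **Low-precision residual**: cancellation error blows up and the refinement accomplishes nothing.
- **Ill-conditioned + low precision**: κ(A) > 10⁶ with plain FP16 solves → divergence, since κ(A)·u_lo is far above 1.
- **No master weights**: FP16 weight updates underflow; any update smaller than the FP16 spacing around the weight is rounded away.
- **Skip warmup**: running FP8 without calibrating the scaling factors can produce NaN.
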
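The "no master weights" failure is easy to reproduce with NumPy scalars. In this sketch the update size of 1e-4 and the 100 steps are arbitrary illustrative values: near 1.0 the FP16 grid spacing is about 5e-4, so each update is rounded away in FP16 but accumulates correctly in an FP32 copy.

```python
import numpy as np

update = np.float16(1e-4)   # lr * grad, smaller than the FP16 spacing near 1.0
w16 = np.float16(1.0)       # weight stored only in FP16
w32 = np.float32(1.0)       # FP32 master copy of the same weight

for _ in range(100):
    w16 = np.float16(w16 - update)              # rounds back to 1.0 every step
    w32 = np.float32(w32 - np.float32(update))  # accumulates the small updates

print(w16, w32)
```

After the loop the FP16 weight has not moved at all, while the FP32 master weight has decreased by roughly 0.01.
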
## 🧪 Verification / duplicates
- Verified (Higham 1997 *Accuracy and Stability*; Carson & Higham 2018 GMRES-IR; NVIDIA Transformer Engine docs 2024).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: iterative refinement + modern AMP/FP8 stack |