| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-precision-recursion | Precision Recursion | 10_Wiki/Topics | verified | self | | none | A | 0.85 | applied | | | 2026-05-10 | pending | |
Precision Recursion
One-line summary
"Compute fast in lower precision → correct the residual in higher precision → recurse." This is the modern variant of numerical iterative refinement: it exploits the FP8/FP16 throughput of H100/H200/MI300X while reaching FP64-equivalent accuracy. The classical refinement of Higham (1997) has been revived in the era of GPU mixed precision.
Core ideas
Basic mechanism
1. Solve A x = b in low precision (FP16/FP8) to get the initial iterate x: fast.
2. Compute the residual r = b - A x in high precision (FP32/FP64).
3. Solve A d = r in low precision: fast.
4. Update x ← x + d.
5. Repeat steps 2-4 until ||r|| < tol.
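Key invariants
- Residual computation: high precision is required; otherwise cancellation error dominates r.
- Solve: low precision is fine (errors are absorbed by the refinement).
- Convergence: linear, with the error shrinking by a factor of roughly κ(A)·u_low per iteration (u_low = unit roundoff of the solve precision), so it is fast when κ(A)·u_low ≪ 1 and breaks down as κ(A)·u_low approaches 1.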
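The first two invariants can be checked numerically. The sketch below is an illustration rather than a reference implementation: the helper name `refine`, the test matrix, and the iteration count are assumptions chosen for the demo. It runs the same FP32-solve refinement twice, once with the residual formed in FP64 and once in FP32, and records the true relative residual after each step.

```python
import numpy as np

def refine(A, b, residual_dtype, iters=5):
    """Refinement with FP32 solves; the residual is formed in `residual_dtype`.

    Returns the true (FP64) relative residual after each iteration.
    """
    A_lo = A.astype(np.float32)
    A_res, b_res = A.astype(residual_dtype), b.astype(residual_dtype)
    x = np.zeros_like(b, dtype=np.float64)
    history = []
    for _ in range(iters):
        r = b_res - A_res @ x.astype(residual_dtype)     # residual precision under test
        d = np.linalg.solve(A_lo, r.astype(np.float32))  # low-precision correction
        x = x + d.astype(np.float64)
        history.append(float(np.linalg.norm(b - A @ x) / np.linalg.norm(b)))
    return history

# Hypothetical test problem: a diagonally shifted random matrix (well-conditioned).
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n)

hist64 = refine(A, b, np.float64)  # high-precision residual
hist32 = refine(A, b, np.float32)  # low-precision residual
print("FP64 residual:", ["%.1e" % h for h in hist64])
print("FP32 residual:", ["%.1e" % h for h in hist32])
```

With the FP64 residual the history should keep dropping until it is near FP64 roundoff, while the FP32-residual run should stall around FP32 roundoff even though the solves are identical.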
Applications
- Linear solve: GMRES-IR (Carson & Higham 2018).
- LLM inference: FP8 forward + FP32 residual streams.
- Optimization: Adam in FP16 + FP32 master weights.
- Eigensolve: inverse iteration in mixed precision.
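As a rough illustration of the eigensolve item, the following sketch runs inverse iteration with the shifted solves in FP32 and the normalization and Rayleigh quotient in FP64. The function name `inverse_iteration_mixed`, the shift, and the test matrix are hypothetical choices for the demo, and a production version would factor the shifted matrix once instead of calling `solve` every step.

```python
import numpy as np

def inverse_iteration_mixed(A, shift, iters=50, seed=0):
    """Inverse iteration: FP32 shifted solves, FP64 normalization and Rayleigh quotient."""
    n = A.shape[0]
    M_lo = (A - shift * np.eye(n)).astype(np.float32)  # shifted matrix in low precision
    x = np.random.default_rng(seed).standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = np.linalg.solve(M_lo, x.astype(np.float32)).astype(np.float64)
        x = y / np.linalg.norm(y)                      # renormalize in FP64
    lam = float(x @ (A @ x))                           # Rayleigh quotient in FP64
    return lam, x

# Hypothetical test problem: symmetric matrix with known eigenvalues 1, 2, ..., 30.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((30, 30)))
A = Q @ np.diag(np.arange(1.0, 31.0)) @ Q.T
A = (A + A.T) / 2                                      # remove rounding asymmetry

lam, v = inverse_iteration_mixed(A, shift=10.3)
print(lam)  # expected to land close to the eigenvalue nearest the shift (10)
```

For a symmetric matrix the Rayleigh quotient error scales with the square of the eigenvector error, which is why an FP32-accurate vector can still yield an eigenvalue estimate well beyond FP32 accuracy.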
💻 Patterns
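Iterative refinement (linear solve)

    import numpy as np

    def iterative_refinement(A, b, tol=1e-12, max_iter=10):
        """Mixed-precision linear solve; returns (x, correction solves used)."""
        # NumPy's linalg has no FP16 solver, so low precision is emulated:
        # round A to float16, then solve with the rounded matrix in float32.
        A_lo = A.astype(np.float16).astype(np.float32)
        x = np.zeros_like(b, dtype=np.float64)
        for k in range(max_iter):
            r = b - A @ x  # high-precision (FP64) residual
            if np.linalg.norm(r) < tol:
                return x, k
            d = np.linalg.solve(A_lo, r.astype(np.float32))  # low-precision correction
            x = x + d.astype(np.float64)
        return x, max_iter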
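PyTorch AMP (Automatic Mixed Precision)

    import torch

    # `model`, `loader`, and `optim` are assumed to be defined elsewhere.
    # torch.amp.GradScaler needs a recent PyTorch; older releases use torch.cuda.amp.
    scaler = torch.amp.GradScaler("cuda")
    for batch in loader:
        optim.zero_grad()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(batch).loss   # forward runs eligible ops in FP16
        scaler.scale(loss).backward()  # scale the loss so FP16 gradients do not underflow
        scaler.step(optim)             # unscale gradients, then update the FP32 weights
        scaler.update()                # adjust the scale factor for the next step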
FP8 inference + FP32 accumulation (H100)

    # Transformer Engine on Hopper (FP8)
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Format, DelayedScaling

    fp8_recipe = DelayedScaling(
        margin=0, interval=1,
        fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
    )
    # `model` is assumed to be built from Transformer Engine modules (e.g. te.Linear).
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(x)  # FP8 GEMMs, FP32 reductions
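To make the E4M3/E5M2 split in the HYBRID recipe more concrete, here is a small sketch that derives the largest finite value of each format from its bit layout. It assumes the OCP FP8 encodings (E4M3 with bias 7 and no infinities, E5M2 with bias 15 and IEEE-style specials); the helper `fp8_max` is hypothetical and not part of Transformer Engine.

```python
def fp8_max(exp_bits, man_bits, bias, top_exponent_is_special):
    """Largest finite value of a small floating-point format (sketch)."""
    if top_exponent_is_special:
        # IEEE-style (E5M2): the all-ones exponent encodes inf/NaN, so the largest
        # finite value uses the next exponent down with a full mantissa.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # E4M3-style: only the all-ones mantissa at the top exponent is NaN.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)

e4m3_max = fp8_max(exp_bits=4, man_bits=3, bias=7, top_exponent_is_special=False)
e5m2_max = fp8_max(exp_bits=5, man_bits=2, bias=15, top_exponent_is_special=True)
print(e4m3_max, e5m2_max)  # 448.0 57344.0
```

E4M3 trades range for one extra mantissa bit, which suits forward activations and weights; E5M2 keeps more range, which suits gradients. That is the usual motivation for a hybrid recipe.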
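GMRES with iterative refinement

    import numpy as np
    from scipy.sparse.linalg import gmres

    def gmres_ir(A, b, tol=1e-12, outer=5):
        """Outer iterative-refinement loop with an inner low-precision GMRES solve."""
        x = np.zeros_like(b, dtype=np.float64)
        A_lo = A.astype(np.float32)
        for _ in range(outer):
            r = b - A @ x  # high-precision residual
            rnorm = np.linalg.norm(r)
            if rnorm < tol:
                return x
            # Solve for the normalized residual so GMRES's absolute tolerance
            # does not stop the refinement once ||r|| drops below atol.
            d, _ = gmres(A_lo, (r / rnorm).astype(np.float32), atol=1e-6)
            x = x + rnorm * d.astype(np.float64)
        return x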
Adam with FP32 master weights

    import torch

    class MixedPrecisionAdam:
        def __init__(self, params, lr=1e-3):
            self.params_fp16 = list(params)  # FP16 storage used by the model
            self.params_fp32 = [p.detach().clone().float() for p in self.params_fp16]
            self.m = [torch.zeros_like(p) for p in self.params_fp32]
            self.v = [torch.zeros_like(p) for p in self.params_fp32]
            self.lr = lr
            self.t = 0

        @torch.no_grad()
        def step(self):
            self.t += 1
            bc1 = 1 - 0.9 ** self.t    # bias correction for the first moment
            bc2 = 1 - 0.999 ** self.t  # bias correction for the second moment
            for p16, p32, m, v in zip(self.params_fp16, self.params_fp32, self.m, self.v):
                g = p16.grad.float()  # do all optimizer math in FP32
                m.mul_(0.9).add_(g, alpha=0.1)
                v.mul_(0.999).addcmul_(g, g, value=0.001)
                denom = (v / bc2).sqrt_().add_(1e-8)
                p32.addcdiv_(m, denom, value=-self.lr / bc1)
                p16.data.copy_(p32.half())  # sync the FP32 master back to FP16
Decision criteria
| Situation | Strategy |
|---|---|
| Ill-conditioned linear system | GMRES-IR mixed precision |
| LLM training | AMP (FP16/BF16 + FP32 master) |
| Hopper / Blackwell inference | FP8 + FP32 accumulate |
| Well-conditioned + FP64 needed | Single-precision solve with refinement is OK |
Default: BF16 forward + FP32 master weights (training), FP8 inference (Hopper+).
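For the linear-solve rows of the table, one possible way to turn the choice into code is to compare κ(A) against the unit roundoff of each candidate solve precision. The helper below is a hypothetical heuristic, not a rule taken from the cited references: it assumes plain refinement is only worth trying when κ(A)·u < 1, and it computes the condition number exactly, which is only practical for small dense matrices.

```python
import numpy as np

def pick_solve_precision(A):
    """Hypothetical heuristic: lowest solve precision with kappa(A) * u < 1."""
    kappa = np.linalg.cond(A)  # 2-norm condition number (O(n^3); fine for a sketch)
    for name, dtype in (("FP16", np.float16), ("FP32", np.float32)):
        u = float(np.finfo(dtype).eps) / 2  # unit roundoff of the solve precision
        if kappa * u < 1:
            return f"{name} solve + FP64 residual refinement"
    return "GMRES-IR or a full FP64 solve"

print(pick_solve_precision(np.eye(4)))                 # kappa = 1
print(pick_solve_precision(np.diag([1.0, 1e-5])))      # kappa = 1e5
print(pick_solve_precision(np.diag([1.0, 1e-12])))     # kappa = 1e12
```

In practice a cheap condition estimate would replace `np.linalg.cond`, and the threshold would be set with some safety margin rather than exactly at 1.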
🔗 Graph
- Parents: Numerical-Methods · Mixed-Precision-Training
- Variants: Iterative-Refinement · GMRES-IR
- Applications: LLM-Training · Scientific-Computing · FP8-Inference
- Adjacent: Floating-Point-Arithmetic · Condition-Number
🤖 LLM usage
When to use: numerical stability debugging, mixed-precision recipe selection, condition number analysis. When not to use: integer / discrete optimization, where precision is not a relevant concept.
❌ Anti-patterns
- Low-precision residual: cancellation error blows up → refinement becomes useless.
- Ill-conditioned + low-prec: κ(A) > 10⁶ + FP16 → divergence.
- No master weights: FP16 weight updates underflow.
- Skip warmup: FP8 without calibration → NaN.
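The "no master weights" item can be reproduced in a few lines. This is a toy sketch with made-up values (a weight of 1.0 and a constant update of 1e-4): in FP16 the spacing between representable numbers just below 1.0 is about 4.9e-4, so each small update rounds away, while an FP32 master copy accumulates it.

```python
import numpy as np

update = np.float16(1e-4)  # small lr * grad step, itself representable in FP16
w16 = np.float16(1.0)      # weight stored and updated in FP16 only
w32 = np.float32(1.0)      # FP32 master copy of the same weight

for _ in range(100):
    w16 = np.float16(w16 - update)              # rounds back to 1.0 every step
    w32 = np.float32(w32 - np.float32(update))  # accumulates the update

print(w16, w32)  # w16 stays at 1.0; w32 ends near 0.99
```

The FP16 weight never moves even though the total update is about 1% of its value, which is the failure mode that FP32 master weights are meant to avoid.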
🧪 Verification / duplicates
- Verified (Higham 1997 Accuracy and Stability; Carson & Higham 2018 GMRES-IR; NVIDIA Transformer Engine docs 2024).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — iterative refinement + modern AMP/FP8 stack |