---
id: wiki-2026-0508-precision-recursion
title: Precision Recursion
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Mixed-Precision Recursive Refinement, Iterative Refinement]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [numerical-methods, mixed-precision, iterative-refinement, ML]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-mlx-cuda
---

# Precision Recursion

## One-liner

> **"Compute fast in lower precision → correct the residual in higher precision → recurse."** A modern variant of numerical iterative refinement: it exploits the FP8/FP16 throughput of H100/H200/MI300X while reaching FP64-equivalent accuracy. The classical refinement of Higham (1997) has been revived in the era of GPU mixed precision.

## Core

### Basic mechanism

```
1. Solve A x = b in low precision (FP16/FP8)                 (fast)
2. Compute residual r = b - A x in high precision (FP32/FP64)
3. Solve A d = r in low precision                            (fast)
4. x ← x + d
5. Repeat from step 2 until ||r|| < tol
```

### Key invariants

- **Residual computation**: must be done in high precision; otherwise cancellation error in `b - A x` swamps the correction.
- **Solve**: low precision is fine, because the errors it introduces are absorbed by later refinement steps.
- **Convergence**: linear. Each step shrinks the error by a factor of roughly κ(A)·u_low, where u_low is the unit roundoff of the solve precision, so refinement converges only when κ(A)·u_low < 1.

### Applications

1. **Linear solve**: GMRES-IR (Carson & Higham 2018).
2. **LLM inference**: FP8 forward pass + FP32 residual streams.
3. **Optimization**: Adam in FP16 + FP32 master weights.
4. **Eigensolve**: inverse iteration in mixed precision.
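The residual-precision invariant above can be checked with a small experiment. The following is a minimal sketch, not a reference implementation: the helper name `refine`, the 50×50 test matrix, and the choice of FP32 as the "low" precision are all assumptions made for illustration (NumPy has no FP16 or FP8 solver). It runs the same refinement loop twice, once with the residual evaluated in FP64 and once in FP32, and records the relative error after each correction.

```python
import numpy as np

def refine(A, b, x_true, residual_dtype, n_iter=4):
    """FP32 solves with the residual evaluated in `residual_dtype`.

    Returns the relative error after each correction step.
    (Hypothetical helper, written only for this illustration.)
    """
    A32 = A.astype(np.float32)                     # low-precision operator
    A_r, b_r = A.astype(residual_dtype), b.astype(residual_dtype)
    x = np.zeros_like(b)                           # iterate is stored in FP64
    errors = []
    for _ in range(n_iter):
        # The precision under test: everything in the residual is cast first.
        r = b_r - A_r @ x.astype(residual_dtype)
        d = np.linalg.solve(A32, r.astype(np.float32))
        x = x + d.astype(np.float64)
        errors.append(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
    return errors

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well conditioned by construction
x_true = rng.standard_normal(n)
b = A @ x_true

err_fp64_residual = refine(A, b, x_true, np.float64)
err_fp32_residual = refine(A, b, x_true, np.float32)
print("FP64 residual:", err_fp64_residual)
print("FP32 residual:", err_fp32_residual)
```

With the FP64 residual the error should keep falling until it reaches roughly FP64 roundoff, whereas with the FP32 residual it should stall near FP32 roundoff, since the residual can no longer resolve the remaining error in `x`. Printing the per-step errors also lets the reader inspect the convergence rate directly.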
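## 💻 Patterns

### Iterative refinement (linear solve)

```python
import numpy as np

def iterative_refinement(A, b, tol=1e-12, max_iter=10):
    """Mixed-precision linear solve: low-precision solves, high-precision residuals."""
    # Round A to FP16 to emulate a low-precision operator. NumPy's LAPACK
    # bindings have no FP16 solver, so the rounded matrix is solved in FP32.
    A_lo = A.astype(np.float16).astype(np.float32)
    x = np.zeros_like(b)
    n_corrections = 0
    for _ in range(max_iter):
        r = b - A @ x                      # residual in the precision of A and b
        if np.linalg.norm(r) < tol:
            break
        d = np.linalg.solve(A_lo, r.astype(np.float32))
        x = x + d.astype(b.dtype)
        n_corrections += 1
    return x, n_corrections                # solution and number of corrections applied
```

### PyTorch AMP (Automatic Mixed Precision)

```python
import torch

# Assumes `model`, `loader`, and `optim` are already defined and on a CUDA device.
scaler = torch.amp.GradScaler("cuda")  # torch >= 2.3; older releases: torch.cuda.amp.GradScaler()

for batch in loader:
    optim.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).loss       # forward pass runs eligible ops in FP16
    scaler.scale(loss).backward()      # scale the loss so FP16 gradients do not underflow
    scaler.step(optim)                 # unscale grads, skip the step on inf/NaN, update FP32 weights
    scaler.update()                    # adjust the scale factor for the next iteration
```

### FP8 inference + FP32 accumulation (H100)

```python
# Transformer Engine: Hopper FP8
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,      # E4M3 forward, E5M2 backward
    amax_history_len=16,
    amax_compute_algo="max",
)

# Assumes `model` is built from Transformer Engine modules (e.g. te.Linear)
# and `x` is a CUDA tensor; other modules are unaffected by fp8_autocast.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)                 # FP8 GEMMs with higher-precision accumulation
```

### GMRES with iterative refinement

```python
import numpy as np
from scipy.sparse.linalg import gmres

def gmres_ir(A, b, tol=1e-12, outer=5):
    """Outer refinement loop in the precision of b, inner GMRES in FP32.

    Simplified sketch: GMRES-IR as described by Carson & Higham additionally
    preconditions the inner GMRES with low-precision LU factors.
    """
    x = np.zeros_like(b)
    A_lo = A.astype(np.float32)
    for _ in range(outer):
        r = b - A @ x                          # high-precision residual
        r_norm = np.linalg.norm(r)
        if r_norm < tol:
            break
        # Solve for the normalized residual: with an unscaled right-hand side,
        # atol=1e-6 would make GMRES return zero once ||r|| drops below 1e-6.
        d, _info = gmres(A_lo, (r / r_norm).astype(np.float32), atol=1e-6)
        x = x + r_norm * d.astype(b.dtype)
    return x
```

### Adam with FP32 master weights

```python
import torch

class MixedPrecisionAdam:
    """Adam whose moments and master weights live in FP32; the model keeps FP16 copies."""

    def __init__(self, params, lr=1e-3):
        self.params_fp16 = list(params)  # materialize: model.parameters() is a generator
        self.params_fp32 = [p.detach().clone().float() for p in self.params_fp16]
        self.m = [torch.zeros_like(p) for p in self.params_fp32]
        self.v = [torch.zeros_like(p) for p in self.params_fp32]
        self.lr = lr
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = 0.9, 0.999
        for p16, p32, m, v in zip(self.params_fp16, self.params_fp32, self.m, self.v):
            g = p16.grad.float()                   # upcast the gradient to FP32
            m.mul_(b1).add_(g, alpha=1 - b1)
            v.mul_(b2).addcmul_(g, g, value=1 - b2)
            m_hat = m / (1 - b1 ** self.t)         # bias correction
            v_hat = v / (1 - b2 ** self.t)
            p32.addcdiv_(m_hat, v_hat.sqrt().add_(1e-8), value=-self.lr)
            p16.copy_(p32)                         # sync back; copy_ casts to FP16
```

## Decision criteria

| Situation | Strategy |
|---|---|
| Ill-conditioned linear system | GMRES-IR in mixed precision |
| LLM training | AMP (FP16/BF16 + FP32 master weights) |
| Hopper / Blackwell inference | FP8 + FP32 accumulation |
| Well-conditioned system, FP64 accuracy needed | Single-precision solve, refined with FP64 residuals |

**Default**: BF16 forward + FP32 master weights (training), FP8 inference (Hopper+).

## 🔗 Graph

- Variant: [[Iterative-Refinement]]

## 🤖 LLM usage

**When**: numerical stability debugging, mixed-precision recipe selection, condition number analysis.

**When not**: integer / discrete optimization, where floating-point precision is not the relevant concept.

## ❌ Anti-patterns

- **Low-precision residual**: cancellation error blows up, which makes refinement useless.
- **Ill-conditioned + low precision**: plain refinement needs roughly κ(A)·u_low < 1, and FP16 has u ≈ 4.9·10⁻⁴, so κ(A) > 10⁶ with an FP16 solve diverges.
- **No master weights**: FP16 weight updates underflow, so small updates are rounded away.
- **Skipping warmup**: running FP8 without calibrating the scaling factors can produce NaN.

## 🧪 Verification / Duplicates

- Verified (Higham 1997 *Accuracy and Stability*; Carson & Higham 2018 GMRES-IR; NVIDIA Transformer Engine docs 2024).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: iterative refinement + modern AMP/FP8 stack |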