2nd/10_Wiki/Topics/Other/Precision-Recursion.md
id: wiki-2026-0508-precision-recursion
title: Precision Recursion
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases:
  - Mixed-Precision Recursive Refinement
  - Iterative Refinement
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags:
  - numerical-methods
  - mixed-precision
  - iterative-refinement
  - ML
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-mlx-cuda

Precision Recursion

One-line summary

"Compute fast in lower precision, correct the residual in higher precision, then recurse." This is a modern variant of numerical iterative refinement: it exploits the FP8/FP16 throughput of H100/H200/MI300X-class accelerators while still reaching FP64-equivalent accuracy. The classical refinement analyzed by Higham (1997) has been revived in the era of GPU mixed precision.

Core ideas

Basic mechanism

1. Solve A x = b in low precision (FP16/FP8): fast
2. Compute the residual r = b - A x in high precision (FP32/FP64)
3. Solve A d = r in low precision: fast
4. x ← x + d
5. Repeat steps 2-4 until ||r|| < tol

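Key invariants

  • Residual computation: high precision is mandatory; otherwise cancellation error in b - A x dominates the result.
  • Solve: low precision is fine (its errors are absorbed by the refinement).
  • Convergence: linear when the condition number κ(A) is moderate. The error shrinks by a factor of roughly κ(A)·u_low per step, where u_low is the unit roundoff of the solve precision, so κ(A)·u_low < 1 is required.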
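
The role of the residual precision can be checked numerically. The following is a minimal sketch, not a reference implementation: the helper name `refine`, the test matrix, and the choice of FP32 as the "low" solve precision are all assumptions made for illustration. It runs the same refinement loop twice, once with an FP64 residual and once with an FP32 residual, and records the relative error after each step.

```python
import numpy as np

def refine(A, b, x_true, residual_dtype, n_iter=6):
    """Refinement with FP32 solves; returns the relative error after each step."""
    A32 = A.astype(np.float32)            # low-precision copy used for every solve
    x = np.zeros_like(b)                  # FP64 iterate
    errors = []
    for _ in range(n_iter):
        # Residual evaluated in the requested precision.
        r = (b.astype(residual_dtype)
             - A.astype(residual_dtype) @ x.astype(residual_dtype))
        d = np.linalg.solve(A32, r.astype(np.float32))   # low-precision correction
        x = x + d.astype(np.float64)
        errors.append(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
    return errors

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

err_hi = refine(A, b, x_true, np.float64)   # high-precision residual
err_lo = refine(A, b, x_true, np.float32)   # low-precision residual

print("FP64 residual:", ["%.1e" % e for e in err_hi])
print("FP32 residual:", ["%.1e" % e for e in err_lo])
```

With the FP64 residual the error should keep dropping until it reaches the FP64 floor, while the FP32-residual run should stall near single-precision accuracy even though the solves are identical.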

Applications

  1. Linear solve: GMRES-IR (Carson & Higham 2018).
  2. LLM inference: FP8 forward + FP32 residual streams.
  3. Optimization: Adam in FP16 + FP32 master weights.
  4. Eigensolve: inverse iteration in mixed precision.
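
The eigensolve item has no pattern elsewhere in this note, so here is a minimal sketch of what mixed-precision inverse iteration could look like. Everything in it is an assumption for illustration: the function name `inverse_iteration_mixed`, FP32 as the low precision, a symmetric matrix, and the 1D Laplacian test case (chosen because its eigenvalues are known in closed form). The linear solves run in FP32, while normalization and the Rayleigh quotient stay in FP64.

```python
import numpy as np

def inverse_iteration_mixed(A, shift=0.0, n_iter=30, seed=0):
    """Inverse iteration with FP32 solves and an FP64 Rayleigh quotient.

    Assumes A is symmetric and (A - shift*I) is nonsingular.
    """
    n = A.shape[0]
    M_lo = (A - shift * np.eye(n)).astype(np.float32)   # low-precision shifted matrix
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        # Low-precision solve, then promote and normalize in FP64.
        w = np.linalg.solve(M_lo, v.astype(np.float32)).astype(np.float64)
        v = w / np.linalg.norm(w)
    lam = v @ (A @ v)                                    # Rayleigh quotient in FP64
    return lam, v

# 1D Laplacian: eigenvalues are 2 - 2*cos(k*pi/(n+1)), k = 1..n.
n = 10
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
lam, v = inverse_iteration_mixed(A, shift=0.0)
exact = 2.0 - 2.0 * np.cos(np.pi / (n + 1))
print("eigenvalue error:", abs(lam - exact))
```

For a symmetric matrix the Rayleigh quotient error is second order in the eigenvector error, which is why an eigenvector obtained from FP32 solves can still yield an eigenvalue well beyond single-precision accuracy.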

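💻 Patterns

Iterative refinement (linear solve)

import numpy as np

def iterative_refinement(A, b, tol=1e-12, max_iter=10):
    """Mixed-precision linear solve: low-precision solves, high-precision residuals.

    Returns the solution and the number of correction steps applied.
    """
    # Round A to FP16 to emulate low-precision storage, then widen to FP32
    # because np.linalg.solve has no FP16 kernels.
    A_lo = A.astype(np.float16).astype(np.float32)
    x = np.zeros_like(b)
    for k in range(max_iter):
        r = b - A @ x  # residual in the (high) precision of A and b
        if np.linalg.norm(r) < tol:
            return x, k
        d = np.linalg.solve(A_lo, r.astype(np.float32))  # low-precision correction
        x = x + d.astype(b.dtype)
    return x, max_iter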

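PyTorch AMP (Automatic Mixed Precision)

import torch
from torch.cuda.amp import autocast, GradScaler

# model, loader, and optim are assumed to be defined elsewhere.
scaler = GradScaler()
for batch in loader:
    optim.zero_grad()
    with autocast(dtype=torch.float16):
        loss = model(batch).loss   # eligible forward ops run in FP16
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients do not underflow
    scaler.step(optim)             # unscale gradients, skip the step on inf/NaN, update the FP32 weights
    scaler.update()                # adjust the loss scale for the next iteration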

FP8 inference + FP32 accumulation (H100)

# Transformer Engine: FP8 on Hopper
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,  # E4M3 forward, E5M2 backward
)

# model is assumed to be built from te modules (e.g. te.Linear); x is its input.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)  # FP8 GEMMs, FP32 reductions
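
Why the accumulator precision matters can be seen without FP8 hardware. The sketch below is a plain NumPy illustration, not Transformer Engine code: FP16 stands in for the low-precision input format (NumPy has no FP8 dtype), and the value 0.01 and the count of 10,000 are arbitrary choices. It sums the same low-precision inputs once with an FP16 accumulator and once with an FP32 accumulator.

```python
import numpy as np

# 10,000 copies of 0.01 stored in FP16 (stand-in for a low-precision input format).
values = np.full(10_000, 0.01, dtype=np.float16)

acc16 = np.float16(0.0)   # low-precision accumulator
acc32 = np.float32(0.0)   # FP32 accumulator, as used in mixed-precision GEMMs
for v in values:
    acc16 = np.float16(acc16 + v)               # rounded to FP16 after every add
    acc32 = np.float32(acc32 + np.float32(v))   # inputs stay FP16, the sum is FP32

print("FP16 accumulator:", float(acc16))
print("FP32 accumulator:", float(acc32))
```

The FP16 accumulator stops growing once each addend is smaller than half the spacing between neighbouring FP16 values at the current sum, so it ends far below the true total of about 100. The FP32 accumulator stays close to it, even though both runs read identical FP16 inputs.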

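GMRES with iterative refinement

import numpy as np
from scipy.sparse.linalg import gmres

def gmres_ir(A, b, tol=1e-12, outer=5):
    """Outer iterative-refinement loop with an inner low-precision GMRES solve."""
    x = np.zeros_like(b)
    A_lo = A.astype(np.float32)
    for _ in range(outer):
        r = b - A @ x  # high-precision residual
        rnorm = np.linalg.norm(r)
        if rnorm < tol:
            return x
        # Solve for the normalized residual so the inner absolute tolerance
        # stays meaningful as r shrinks; info == 0 signals GMRES convergence.
        d, info = gmres(A_lo, (r / rnorm).astype(np.float32), atol=1e-6)
        x = x + rnorm * d.astype(b.dtype)
    return x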

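Adam with FP32 master weights

import torch

class MixedPrecisionAdam:
    def __init__(self, params, lr=1e-3):
        self.params_fp16 = list(params)  # FP16 working copies used by the model
        self.params_fp32 = [p.detach().clone().float() for p in self.params_fp16]  # FP32 master weights
        self.m = [torch.zeros_like(p) for p in self.params_fp32]
        self.v = [torch.zeros_like(p) for p in self.params_fp32]
        self.lr = lr
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        # Bias corrections for the first and second moment estimates.
        bc1 = 1 - 0.9 ** self.t
        bc2 = 1 - 0.999 ** self.t
        for p16, p32, m, v in zip(self.params_fp16, self.params_fp32, self.m, self.v):
            g = p16.grad.float()  # promote the FP16 gradient to FP32
            m.mul_(0.9).add_(g, alpha=0.1)
            v.mul_(0.999).addcmul_(g, g, value=0.001)
            denom = (v / bc2).sqrt_().add_(1e-8)
            p32.addcdiv_(m, denom, value=-self.lr / bc1)  # update the FP32 master
            p16.data.copy_(p32.half())  # sync the FP16 copy back from the master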

Decision criteria

| Situation | Strategy |
| --- | --- |
| Ill-conditioned linear system | GMRES-IR in mixed precision |
| LLM training | AMP (FP16/BF16 compute + FP32 master weights) |
| Hopper / Blackwell inference | FP8 + FP32 accumulation |
| Well-conditioned system, FP64 accuracy needed | Single-precision solve is fine, refined with FP64 residuals |

Default: BF16 forward + FP32 master weights (training), FP8 inference (Hopper+).
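
For the linear-solve rows, the choice between plain refinement and GMRES-IR comes down to how the condition number compares with the unit roundoff of the solve precision. The following is a rough heuristic sketch, not a rule taken from the cited sources: the function names and the `safety` threshold of 0.1 are assumptions, and computing κ(A) exactly is itself expensive, so in practice an estimate would be used.

```python
import numpy as np

def unit_roundoff(dtype):
    """Unit roundoff u = eps / 2 for a NumPy floating-point dtype."""
    return float(np.finfo(dtype).eps) / 2

def plain_refinement_looks_safe(A, solve_dtype, safety=0.1):
    """Heuristic (assumption): require kappa(A) * u_solve < safety."""
    kappa = np.linalg.cond(A)   # 2-norm condition number, computed in FP64
    return bool(kappa * unit_roundoff(solve_dtype) < safety)

A_easy = np.eye(4)                           # kappa = 1
A_hard = np.diag([1.0, 1e-2, 1e-4, 1e-6])    # kappa = 1e6

for name, A in [("easy", A_easy), ("hard", A_hard)]:
    for dt in (np.float16, np.float32):
        print(name, dt.__name__, plain_refinement_looks_safe(A, dt))
```

When the check fails for the intended solve precision, the options are a higher solve precision or a Krylov-based correction step such as GMRES-IR.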

🔗 Graph

🤖 LLM usage

When to use: numerical stability debugging, mixed-precision recipe selection, condition number analysis. When not to use: integer or discrete optimization, where floating-point precision plays no role.

Anti-patterns

  • Low-precision residual: cancellation error swamps the residual, so refinement gains nothing.
  • Ill-conditioned + low precision: with κ(A) > 10⁶ and FP16 solves, refinement diverges.
  • No master weights: small updates applied directly to FP16 weights are rounded away or underflow.
  • Skipping warmup: running FP8 without calibrating the scaling factors leads to NaNs.
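
The "no master weights" failure is easy to reproduce in isolation. This is a small NumPy sketch with made-up numbers (a weight of 1.0 and a constant update of 1e-4), not a training loop: it applies the same update 1,000 times to an FP16 weight and to an FP32 master copy, then shows a tiny gradient underflowing in FP16 unless it is scaled first.

```python
import numpy as np

step = np.float16(1e-4)    # lr * grad, itself representable in FP16

w16 = np.float16(1.0)      # weight kept only in FP16
w32 = np.float32(1.0)      # FP32 master weight
for _ in range(1000):
    w16 = np.float16(w16 + step)               # update is below half the FP16 spacing at 1.0
    w32 = np.float32(w32 + np.float32(step))   # same update accumulated in FP32

print("FP16 weight:", float(w16))
print("FP32 master:", float(w32), "-> cast to FP16:", float(np.float16(w32)))

# Gradient underflow: 1e-8 is below the smallest FP16 subnormal (about 6e-8).
tiny_grad = 1e-8
print("unscaled in FP16:", float(np.float16(tiny_grad)))
print("scaled by 2**16 :", float(np.float16(tiny_grad * 2.0 ** 16)))
```

The FP16-only weight never moves, while the FP32 master drifts to about 1.1 and can be cast back to FP16 at any point. The second half of the output is the reason loss scaling exists: the scaled gradient survives the cast to FP16 and can be unscaled later in FP32.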

🧪 Verification / duplicates

  • Verified (Higham 1997 Accuracy and Stability; Carson & Higham 2018 GMRES-IR; NVIDIA Transformer Engine docs 2024).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: iterative refinement + modern AMP/FP8 stack |