id: wiki-2026-0508-gpu
title: GPU
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: GPU, graphics processing unit, NVIDIA, AMD, H100, A100, B200, accelerator
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: gpu, hardware, ai-infra, cuda, ml-acceleration, hpc
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: CUDA / HIP / Metal / WGSL
  framework: PyTorch / TensorRT / CUDA Toolkit

GPU

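One-line summary

"A massively parallel SIMT processor: the workhorse of ML and graphics." Current flagship parts include the NVIDIA H100/B200, AMD MI300X, and Apple Silicon GPUs; Google's TPU is a competing accelerator rather than a GPU. GPUs dominate ML compute. The key building blocks are the SM, tensor cores, HBM, and NVLink. GPU cost and availability are a strategic concern for ML work.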

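Key points

Architecture (NVIDIA)

  • SM (Streaming Multiprocessor): the basic compute unit that runs thread blocks.
  • CUDA Core: scalar FP32 ALU.
  • Tensor Core: matrix multiply-accumulate unit (FP16/BF16/FP8/INT4, depending on the generation).
  • Memory hierarchy, from largest and slowest to smallest and fastest: HBM → L2 → L1/SMEM → registers.
  • Warp: 32 threads executed in lockstep.
  • Block: a group of threads scheduled onto a single SM.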
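
To make the warp and block relationship concrete, here is a minimal sketch in plain Python that computes how a 1D workload is split into blocks and warps. The function name `launch_geometry` and the default of 256 threads per block are illustrative assumptions, not part of any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs, as listed above


def launch_geometry(n_elements: int, threads_per_block: int = 256) -> dict:
    """Return block and warp counts for a 1D launch covering n_elements.

    Hypothetical helper for illustration only; it performs the same
    ceiling division a CUDA launch configuration normally uses.
    """
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: enough blocks to cover every element.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    # Each block is executed as a set of 32-thread warps.
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
    }


geometry = launch_geometry(1_000_000, threads_per_block=256)
print(geometry)
```

Choosing a block size that is a multiple of 32 avoids partially filled warps.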

Modern GPUs (2024-2026)

  • NVIDIA H100 (Hopper): 80 GB HBM3, Transformer Engine, FP8.
  • NVIDIA B200 (Blackwell): 192 GB HBM3e, FP4, dual-die package.
  • AMD MI300X: 192 GB HBM3, ROCm software stack.
  • Apple Silicon (M3, M4): unified memory, MLX framework.
  • Google TPU v5p: systolic array, programmed through JAX (an accelerator rather than a GPU, listed for comparison).

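Metrics

  • TFLOPS: peak throughput, quoted separately for FP32 / FP16 / FP8.
  • Memory BW: HBM bandwidth.
  • Memory size: determines whether a model fits.
  • NVLink / InfiniBand: interconnect bandwidth for multi-GPU and multi-node scaling.
  • Power (TDP).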
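
The TFLOPS and memory-bandwidth metrics interact: a kernel is limited by whichever it exhausts first. A minimal roofline-style sketch follows. The peak figures (1000 TFLOPS, 3 TB/s) are round hypothetical numbers chosen for easy arithmetic, not vendor specifications, and the function names are illustrative.

```python
def attainable_tflops(flops_per_byte: float, peak_tflops: float, mem_bw_tbs: float) -> float:
    """Roofline bound: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_tflops, mem_bw_tbs * flops_per_byte)


def ridge_point(peak_tflops: float, mem_bw_tbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / mem_bw_tbs


# Hypothetical accelerator (assumed values, not a real spec sheet).
PEAK_TFLOPS = 1000.0
MEM_BW_TBS = 3.0

# Elementwise fp32 add: 1 FLOP per element, 12 bytes moved (2 reads + 1 write).
add_intensity = 1 / 12
# Square bf16 matmul with n = 4096: 2*n^3 FLOPs over 3*n^2 values of 2 bytes each.
n = 4096
matmul_intensity = (2 * n**3) / (3 * n**2 * 2)

add_bound = attainable_tflops(add_intensity, PEAK_TFLOPS, MEM_BW_TBS)
matmul_bound = attainable_tflops(matmul_intensity, PEAK_TFLOPS, MEM_BW_TBS)
print(f"ridge point: {ridge_point(PEAK_TFLOPS, MEM_BW_TBS):.1f} FLOP/byte")
print(f"elementwise add: {add_bound:.2f} TFLOPS (memory-bound)")
print(f"matmul n={n}: {matmul_bound:.0f} TFLOPS (compute-bound)")
```

Under these assumptions the elementwise kernel reaches only a small fraction of peak, which is why memory bandwidth matters as much as TFLOPS for many workloads.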

Applications

  1. ML training: dominated by matrix multiplication.
  2. ML inference.
  3. Graphics: rasterization and ray tracing.
  4. HPC: simulation.
  5. Crypto mining (declining).
  6. Video (encode/decode).

Modern AI infrastructure

  • Multi-GPU (NVLink, NVSwitch).
  • Multi-node (Infiniband, RoCE).
  • Distributed training (FSDP, ZeRO, TP).
  • vLLM / TensorRT-LLM for inference.
  • Quantization (FP8, INT4).
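
As a rough illustration of why sharded training (FSDP, ZeRO) matters, the sketch below estimates per-GPU memory for model state. It assumes the commonly cited mixed-precision Adam accounting of about 16 bytes per parameter (2 for weights, 2 for gradients, 12 for fp32 master weights and optimizer moments). It ignores activations, buffers, and fragmentation, so treat the result as a lower bound. The helper name is hypothetical.

```python
# Assumed bytes per parameter for mixed-precision Adam training.
BYTES_PER_PARAM = {"weights": 2, "grads": 2, "optimizer": 12}


def per_gpu_state_gb(n_params: float, n_gpus: int, shard: bool = True) -> float:
    """Model-state memory per GPU in GB (activations not included).

    shard=False models plain data parallelism, where every GPU holds a
    full replica; shard=True models full sharding (FSDP / ZeRO stage 3).
    """
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    if shard:
        total_bytes /= n_gpus
    return total_bytes / 1e9


replicated_gb = per_gpu_state_gb(7e9, n_gpus=8, shard=False)
sharded_gb = per_gpu_state_gb(7e9, n_gpus=8, shard=True)
print(f"7B params, replicated: {replicated_gb:.0f} GB per GPU")
print(f"7B params, fully sharded over 8 GPUs: {sharded_gb:.0f} GB per GPU")
```

Under these assumptions a 7B-parameter model does not fit on one 80 GB GPU when replicated, but fits comfortably once its state is sharded across eight.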

💻 Patterns

Check GPU (PyTorch)

import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
# Compute capability >= 7.0: tensor cores are available (Volta and later)
# Compute capability >= 9.0: Hopper or newer, with FP8 support

Tensor (move to GPU)

x = torch.randn(1024, 1024).cuda()
# or create the tensor directly on the GPU, which avoids the host-to-device copy
x = torch.randn(1024, 1024, device='cuda')

Mixed precision (autocast)

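import torch

# bf16 has the same exponent range as fp32, so no GradScaler is needed;
# loss scaling is only required for fp16.
optim.zero_grad(set_to_none=True)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):  # H100-friendly
    loss = model(x)  # assumes the model returns a scalar loss
loss.backward()
optim.step()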
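
Loss scaling, which PyTorch's GradScaler automates for fp16 training, exists because small gradients underflow to zero in half precision. The sketch below shows the effect with the standard library alone, using `struct`'s half-precision format. The gradient value is an illustrative assumption, and 65536 is used as the scale because it is a power of two.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


grad = 1e-8        # illustrative tiny gradient, below the fp16 subnormal range
scale = 65536.0    # 2**16

unscaled = to_fp16(grad)           # stored directly in fp16: underflows
scaled = to_fp16(grad * scale)     # scaled before the fp16 round-trip
recovered = scaled / scale         # unscaled again in higher precision

print(f"fp16(grad)             = {unscaled}")
print(f"fp16(grad * scale) / s = {recovered:.3e}")
```

bf16 keeps fp32's 8-bit exponent, so the same gradient stays representable without scaling, at the cost of fewer mantissa bits.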

FP8 (H100+)

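import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

te_linear = te.Linear(1024, 1024)  # layer sizes are illustrative
x = torch.randn(1024, 1024, device='cuda')
# HYBRID: E4M3 for the forward pass, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = te_linear(x)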

Multi-GPU (DDP)

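import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with torchrun, which sets LOCAL_RANK for each process
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
dist.init_process_group('nccl')
model = DDP(model.cuda(), device_ids=[local_rank])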

FSDP (sharded)

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
fsdp_model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=local_rank,
)

CUDA kernel (custom)

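__global__ void vector_add(const float* a, const float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N) c[i] = a[i] + b[i];                  // guard the partial last block
}

// Launch with 256 threads per block; round the block count up to cover all N elements
vector_add<<<(N + 255) / 256, 256>>>(a, b, c, N);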
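
The index arithmetic and bounds guard in the kernel above can be traced on the CPU. This is a plain-Python emulation for illustration only. It runs the threads sequentially, whereas a GPU runs them in parallel, and the function name is hypothetical.

```python
def vector_add_emulated(a, b, threads_per_block=256):
    """Sequentially emulate the grid/block/thread indexing of a 1D CUDA launch."""
    n = len(a)
    c = [0.0] * n
    # Same ceiling division as (N + 255) / 256 in the launch expression.
    blocks = (n + threads_per_block - 1) // threads_per_block
    masked_threads = 0
    for block_idx in range(blocks):                 # blockIdx.x
        for thread_idx in range(threads_per_block):  # threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:                               # the kernel's bounds guard
                c[i] = a[i] + b[i]
            else:
                masked_threads += 1                 # thread exists but does no work
    return c, masked_threads


a = [float(i) for i in range(1000)]
b = [2.0 * value for value in a]
c, masked_threads = vector_add_emulated(a, b)
print(f"last element: {c[-1]}, masked threads in final block: {masked_threads}")
```

With 1000 elements and 256 threads per block, four blocks are launched and the last one is only partially used, which is exactly the case the `i < N` check protects against.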

Triton (Python kernel)

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

MLX (Apple)

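import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = mx.add(a, b)
mx.eval(c)  # MLX is lazy: computation runs only when the result is evaluated
# Arrays live in unified memory, so no explicit CPU/GPU copies are needed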

TensorRT (NVIDIA inference)

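import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()  # must be populated first, e.g. via trt.OnnxParser
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_serialized_network(network, config)
# Often cited as a 2-5x speedup over eager PyTorch; the gain depends on model and batch size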

vLLM (LLM serving)

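from vllm import LLM, SamplingParams
prompts = ['Explain what a GPU warp is.']  # example prompt
# tensor_parallel_size=4 shards the model weights across 4 GPUs
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', tensor_parallel_size=4)
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))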

Memory profiling

torch.cuda.reset_peak_memory_stats()
out = model(x)
print(f'Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
print(f'Reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB')

nvidia-smi (CLI)

nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
nvtop  # interactive, top-like monitor

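Quantization (8-bit + 4-bit)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 4-bit NF4 shown here; for 8-bit, use BitsAndBytesConfig(load_in_8bit=True) instead
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B', quantization_config=bnb)
# At 4 bits the 70B weights take roughly 35 GB, so the model fits on a single 80 GB H100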
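
The claim that a 70B model fits on one H100 after 4-bit quantization follows from simple arithmetic, sketched below. It counts weights only. Quantization metadata, activations, and the KV cache add to the total, so the figures are lower bounds, and the helper name is illustrative.

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9          # 70B-parameter model, as in the example above
GPU_MEMORY_GB = 80.0     # H100 capacity listed earlier in this note

sizes = {
    name: weight_gb(N_PARAMS, bits)
    for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("nf4", 4)]
}
for name, size in sizes.items():
    verdict = "fits" if size <= GPU_MEMORY_GB else "does not fit"
    print(f"{name:>4}: {size:6.1f} GB -> {verdict} in {GPU_MEMORY_GB:.0f} GB")
```

At 8 bits the weights fit only barely, which leaves little room for the KV cache. That is one reason 4-bit formats are popular for single-GPU inference.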

Spot / preemptible

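# Spot/preemptible cloud GPUs are typically 30-70% cheaper than on-demand,
# but they can be reclaimed at any time, so checkpoint frequently.
def train_with_checkpoint(model, loader, train_step, save_checkpoint, every_n_steps=100):
    # loader, train_step and save_checkpoint are supplied by the caller
    for step, batch in enumerate(loader):
        train_step(batch)
        if (step + 1) % every_n_steps == 0:
            save_checkpoint(model, step)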
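
Checkpointing only pays off if training can resume from the last saved step. Below is a minimal, framework-free sketch of that save/resume cycle, assuming a JSON file stands in for a real model checkpoint and a running sum stands in for a training step. All function names and the `preempt_at` parameter are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write atomically so a preemption mid-write cannot corrupt the file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, n_steps: int, every_n_steps: int, preempt_at=None) -> dict:
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if preempt_at is not None and step == preempt_at:
            return state  # simulate the instance being reclaimed
        state["total"] += step      # stand-in for one training step
        state["step"] = step + 1    # next step to run on resume
        if state["step"] % every_n_steps == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "ckpt.json")
    train(ckpt_path, n_steps=10, every_n_steps=3, preempt_at=7)  # interrupted run
    resumed_from = load_checkpoint(ckpt_path)["step"]
    final_state = train(ckpt_path, n_steps=10, every_n_steps=3)  # resumed run

print(f"resumed from step {resumed_from}, final state {final_state}")
```

The work done between the last checkpoint and the preemption is repeated on resume, so the checkpoint interval trades I/O overhead against lost compute.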

Compute capability check

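import torch

def best_dtype(device):
    # get_device_capability returns (major, minor); tuples compare lexicographically
    cap = torch.cuda.get_device_capability(device)
    if cap >= (9, 0): return 'fp8'   # H100, B200
    if cap >= (8, 0): return 'bf16'  # A100, RTX 30 series and later
    if cap >= (7, 0): return 'fp16'  # V100, RTX 20 series and later
    return 'fp32'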

Decision criteria

| Situation | GPU |
|---|---|
| Frontier training | H100 / B200 cluster |
| Cost-aware ML | A100 / L40S |
| LLM inference | H100 + vLLM |
| On-device | M3 Max / RTX 4090 |
| Workstation | RTX 6000 Ada |
| AMD ecosystem | MI300X + ROCm |
| Edge | Jetson Orin |

Default: NVIDIA + CUDA + bf16 + FSDP + vLLM. To control cost, combine quantization, spot instances, and efficient multi-GPU utilization.

🔗 Graph

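🤖 LLM usage

When to use: ML training/inference, graphics, HPC. When not to use: workloads where a CPU alone is sufficient.

Anti-patterns

  • FP32 only: wastes the low-precision throughput of modern hardware.
  • No multi-GPU sharding for large models: leads to OOM.
  • Cloud spot instances without checkpointing: progress is lost on preemption.
  • Same dtype for old and new GPUs: compute capability mismatch.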

🧪 Verification / duplicates

  • Verified (NVIDIA whitepapers, AMD CDNA, Apple MLX, vLLM).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: architecture + PyTorch / Triton / MLX / TRT / vLLM code |