---
id: wiki-2026-0508-gpu
title: GPU
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU, graphics processing unit, NVIDIA, AMD, H100, A100, B200, accelerator]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [gpu, hardware, ai-infra, cuda, ml-acceleration, hpc]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: CUDA / HIP / Metal / WGSL
  framework: PyTorch / TensorRT / CUDA Toolkit
---

# GPU

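## One-line summary
> **"A massively parallel SIMD-style (SIMT) processor: the workhorse of ML and graphics."** Modern examples: NVIDIA H100/B200, AMD MI300X, and Apple Silicon GPUs, with Google TPU as the main non-GPU alternative. GPUs are the dominant platform for ML compute. Key building blocks: SMs, Tensor Cores, HBM, and NVLink. GPU cost and availability are a strategic concern for ML work.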

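## Core concepts

### Architecture (NVIDIA)
- **SM** (Streaming Multiprocessor): the basic compute unit; a GPU contains many SMs.
- **CUDA Core**: scalar FP32 arithmetic unit.
- **Tensor Core**: matrix-multiply unit (FP16/BF16/FP8/INT4; supported formats vary by generation).
- **Memory hierarchy**: HBM → L2 → L1/SMEM (shared memory) → registers.
- **Warp**: 32 threads that execute together.
- **Block**: a group of threads scheduled onto a single SM.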
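
To make the warp and block terms concrete, the sketch below works out how a 1-D launch with one thread per element breaks down into blocks and warps. It is plain Python arithmetic, not a CUDA API; the function name `launch_geometry` and the default of 256 threads per block are illustrative assumptions.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_geometry(n_elements: int, threads_per_block: int = 256) -> dict:
    """Return grid size and warp counts for a 1-D launch, one thread per element."""
    if n_elements <= 0 or threads_per_block <= 0:
        raise ValueError("n_elements and threads_per_block must be positive")
    # Round up so the last, partially filled block still covers the tail.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    # Each block is split into warps; a partial warp still occupies a full warp slot.
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    launched_threads = blocks * threads_per_block
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
        # Threads past the end of the data; a kernel masks them with `if (i < N)`.
        "idle_threads": launched_threads - n_elements,
    }


geo = launch_geometry(1_000_000)
print(geo)
```

Choosing a block size that is a multiple of 32 avoids partially filled warps, which is why values such as 128 or 256 are common starting points.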

### Modern GPUs (2024-2026)
- **NVIDIA H100** (Hopper): 80 GB HBM3, Transformer Engine, FP8.
- **NVIDIA B200** (Blackwell): 192 GB HBM3e, FP4, dual-die package.
- **AMD MI300X**: 192 GB HBM3, ROCm software stack.
- **Apple Silicon** (M3, M4): unified memory, MLX framework.
- **Google TPU v5p**: systolic-array accelerator (not a GPU), programmed mainly through JAX.

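### Metrics
- **TFLOPS**: peak compute throughput, quoted per precision (FP32 / FP16 / FP8).
- **Memory BW**: HBM bandwidth; this is the limit for memory-bound kernels.
- **Memory size**: determines whether a model fits on the device.
- **NVLink** / InfiniBand: interconnect bandwidth for multi-GPU and multi-node scaling.
- **Power** (TDP).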
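
One way to see how TFLOPS and memory bandwidth interact is the roofline model: a kernel's attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below is a rough illustration only. The device numbers are hypothetical round values, not vendor specifications, and the matmul intensity assumes ideal data reuse.

```python
def attainable_tflops(peak_tflops: float, mem_bw_tb_s: float, flops_per_byte: float) -> float:
    """Roofline bound: min(compute roof, memory bandwidth x arithmetic intensity)."""
    return min(peak_tflops, mem_bw_tb_s * flops_per_byte)


# Hypothetical device with round numbers, for illustration only.
PEAK_TFLOPS = 1000.0  # dense low-precision peak
MEM_BW_TB_S = 3.0     # HBM bandwidth

# Elementwise FP32 add: 1 FLOP per element, 12 bytes moved (two 4-byte reads, one 4-byte write).
add_intensity = 1 / 12

# Square BF16 matmul of size n: 2*n^3 FLOPs over three n x n matrices of 2-byte elements.
n = 8192
matmul_intensity = (2 * n**3) / (3 * n**2 * 2)

add_bound = attainable_tflops(PEAK_TFLOPS, MEM_BW_TB_S, add_intensity)
matmul_bound = attainable_tflops(PEAK_TFLOPS, MEM_BW_TB_S, matmul_intensity)
ridge = PEAK_TFLOPS / MEM_BW_TB_S  # intensity above which a kernel becomes compute-bound

print(f"add:    {add_bound:.2f} TFLOPS attainable (memory-bound)")
print(f"matmul: {matmul_bound:.0f} TFLOPS attainable (compute-bound)")
print(f"ridge point: {ridge:.0f} FLOP/byte")
```

The elementwise kernel reaches only a tiny fraction of peak no matter how many TFLOPS the device advertises, which is why memory bandwidth is listed as a metric in its own right.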

### Applications
1. **ML training**: dominated by matrix multiplication.
2. **ML inference**.
3. **Graphics**: rasterization and ray tracing (RT).
4. **HPC**: simulation.
5. **Crypto** (declining).
6. **Video** (encode/decode).

### Modern AI infrastructure
- **Multi-GPU** (NVLink, NVSwitch).
- **Multi-node** (InfiniBand, RoCE).
- **Distributed training** (FSDP, ZeRO, TP (tensor parallelism)).
- **vLLM / TensorRT-LLM** for inference.
- **Quantization** (FP8, INT4).
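
A back-of-the-envelope estimate shows why sharded training (FSDP, ZeRO) matters. The sketch below assumes the commonly cited mixed-precision Adam accounting of 16 bytes per parameter: 2 for bf16 weights, 2 for bf16 gradients, and 12 for the fp32 master weights, momentum, and variance. It also assumes full sharding spreads all three evenly across GPUs. Activations, buffers, and fragmentation are ignored, and the function name is hypothetical.

```python
# Assumed bytes per parameter for mixed-precision training with Adam.
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "grads_bf16": 2,
    "adam_fp32_states": 12,  # fp32 master weights + momentum + variance
}


def training_state_gb(n_params: float, n_gpus: int = 1) -> float:
    """Per-GPU memory (GB) for weights, gradients, and optimizer state under full sharding."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / n_gpus / 1e9


single_gpu = training_state_gb(7e9)            # 7B-parameter model, no sharding
sharded_8 = training_state_gb(7e9, n_gpus=8)   # fully sharded across 8 GPUs

print(f"1 GPU : {single_gpu:.0f} GB of model state")
print(f"8 GPUs: {sharded_8:.0f} GB of model state per GPU")
```

Under these assumptions a 7B-parameter model needs more model state than a single 80 GB device holds, while the 8-way sharded share leaves room for activations.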

## 💻 Patterns

### Check GPU (PyTorch)
```python
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
# Compute capability >= 7.0: Tensor Cores available (Volta and newer).
# Compute capability >= 9.0: Hopper or newer, with FP8 support.
```

### Tensor (move to GPU)
```python
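import torch

# Create on the CPU, then copy to the GPU.
x = torch.randn(1024, 1024).cuda()
# Or allocate directly on the GPU, which avoids the host-to-device copy.
x = torch.randn(1024, 1024, device='cuda')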
```

### Mixed precision (autocast)
```python
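import torch

# Assumes `model`, `optim`, and `x` already exist and that model(x) returns a scalar loss.
# bf16 keeps the fp32 exponent range, so no GradScaler is needed;
# add torch.amp.GradScaler only when autocasting to torch.float16.
optim.zero_grad(set_to_none=True)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):  # well suited to A100 / H100
    loss = model(x)
loss.backward()
optim.step()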
```

### FP8 (H100+)
```python
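import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

te_linear = te.Linear(1024, 1024)  # drop-in FP8-capable replacement for torch.nn.Linear
x = torch.randn(1024, 1024, device='cuda')  # FP8 GEMMs need dimensions divisible by 16
# HYBRID uses E4M3 in the forward pass and E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(
    margin=0, fp8_format=recipe.Format.HYBRID, amax_history_len=16, amax_compute_algo='max',
)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = te_linear(x)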
```

### Multi-GPU (DDP)
```python
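import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun for each process
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(local_rank), device_ids=[local_rank])  # `model` is assumed to exist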
```

### FSDP (sharded)
```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
fsdp_model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=local_rank,
)
```

### CUDA kernel (custom)
```cuda
__global__ void vector_add(float* a, float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

// Launch with 256 threads per block; the grid size is rounded up to cover all N elements.
vector_add<<<(N + 255) / 256, 256>>>(a, b, c, N);
```

### Triton (Python kernel)
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

### MLX (Apple)
```python
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = mx.add(a, b)
# Arrays live in unified memory; evaluation is lazy until mx.eval(c) or the value is read.
```

### TensorRT (NVIDIA inference)
```python
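import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# `network` is assumed to be populated already, e.g. by parsing an ONNX model with trt.OnnxParser.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
engine = builder.build_serialized_network(network, config)
# Speedups of 2-5x over PyTorch are claimed; actual gains depend on model and hardware.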
```

### vLLM (LLM serving)
```python
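from vllm import LLM, SamplingParams

prompts = ['Explain what a GPU is in one sentence.']  # example input
# tensor_parallel_size=4 shards the model across 4 GPUs.
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', tensor_parallel_size=4)
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))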
```

### Memory profiling
```python
torch.cuda.reset_peak_memory_stats()
out = model(x)
print(f'Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
print(f'Reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB')
```

### nvidia-smi (CLI)
```bash
nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
nvtop  # interactive, top-like GPU monitor (installed separately)
```

### Quantization (8-bit + 4-bit)
```python
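import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute; for 8-bit use BitsAndBytesConfig(load_in_8bit=True).
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16)
# 'llama-3-70b' is a placeholder; substitute the full Hugging Face model ID.
model = AutoModelForCausalLM.from_pretrained('llama-3-70b', quantization_config=bnb)
# 70B parameters at 4 bits is about 35 GB of weights, so the model fits on a single 80 GB H100.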
```
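
The "fits on a single H100" claim follows from simple arithmetic on weight storage. The sketch below counts weights only, assuming every parameter is stored at the stated bit width; quantization constants, the KV cache, and activations add overhead on top. The helper name `weight_gb` is hypothetical.

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (weights only, no quantization metadata)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9  # 70B-parameter model
footprint = {name: weight_gb(N_PARAMS, bits)
             for name, bits in [("bf16", 16), ("int8", 8), ("nf4", 4)]}

for name, gb in footprint.items():
    verdict = "fits" if gb < 80 else "does not fit"
    print(f"{name:>5}: {gb:6.1f} GB of weights, {verdict} in 80 GB")
```

At 16 bits the weights alone exceed 80 GB, while 8-bit and 4-bit storage come in under it; in practice the remaining headroom must also cover the KV cache and activations.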

### Spot / preemptible
```python
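# Spot / preemptible cloud GPUs are often 30-70% cheaper than on-demand instances,
# but they can be reclaimed at any time, so checkpoint frequently.
# `train_step` and `save_checkpoint` are placeholders for your own helpers.
def train_with_checkpoint(model, loader, every_n_steps=100):
    for step, batch in enumerate(loader, start=1):
        train_step(model, batch)
        if step % every_n_steps == 0:
            save_checkpoint(model, step)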
```
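
Checkpointing only pays off if the job can resume from the last saved step and if a preemption during a write cannot corrupt the file. The following is a minimal standard-library sketch of that idea under stated assumptions: the training state is a small JSON dictionary standing in for real model and optimizer state, and `stop_after` simulates a preemption. In a PyTorch job the same write-then-rename pattern would wrap `torch.save` of the state dicts.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, then rename atomically so a preemption never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, every_n_steps: int = 100, stop_after=None) -> dict:
    state = load_checkpoint(path)  # resume from the last saved step
    for step in range(state["step"], total_steps):
        if stop_after is not None and step >= stop_after:
            return state  # simulated preemption: progress since the last save is lost
        state["loss_sum"] += 1.0  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % every_n_steps == 0:
            save_checkpoint(state, path)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    train(ckpt, total_steps=1000, stop_after=250)  # preempted partway through
    resumed_from = load_checkpoint(ckpt)["step"]   # last step that reached disk
    final = train(ckpt, total_steps=1000)          # second run picks up from there

print(f"resumed from step {resumed_from}, finished at step {final['step']}")
```

The checkpoint interval is a trade-off: the work lost to a preemption is bounded by `every_n_steps`, while each save costs I/O time.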

### Compute capability check
```python
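import torch

def best_dtype(device):
    # Map compute capability (major, minor) to the lowest precision with hardware support.
    cap = torch.cuda.get_device_capability(device)
    if cap >= (9, 0): return 'fp8'   # Hopper and newer: H100, B200
    if cap >= (8, 0): return 'bf16'  # Ampere and newer: A100, RTX 30 series+
    if cap >= (7, 0): return 'fp16'  # Volta / Turing: V100, RTX 20 series
    return 'fp32'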
```

## Decision criteria
| Situation | GPU |
|---|---|
| Frontier training | H100 / B200 cluster |
| Cost-aware ML | A100 / L40S |
| LLM inference | H100 + vLLM |
| On-device | M3 Max / RTX 4090 |
| Workstation | RTX 6000 Ada |
| AMD ecosystem | MI300X + ROCm |
| Edge | Jetson Orin |

**Default**: NVIDIA + CUDA + bf16 + FSDP + vLLM. To reduce cost: quantization, spot instances, and efficient multi-GPU use.

## 🔗 Graph
- Parent: [[Hardware]]
- Variants: [[GPU|GPU-Architecture]] · [[CUDA]] · [[Tensor-Core]]
- Applications: [[GPU-Programming-with-CUDA]] · [[Flash Attention]] · [[Distributed-Training]]
- Adjacent: [[TPU]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]] · [[Edge-AI-and-Computing]] · [[LLM_Optimization_and_Deployment_Strategies|vLLM]]

## 🤖 LLM usage
**When to use**: ML training and inference, graphics, HPC.
**When not to use**: workloads where a CPU alone is sufficient.

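## ❌ Anti-patterns
- **FP32 only**: wastes the low-precision throughput of modern hardware.
- **No multi-GPU sharding for large models**: leads to out-of-memory (OOM) errors.
- **Cloud spot instances without checkpoints**: a preemption loses all progress.
- **Same dtype on old and new GPUs**: mismatches the hardware's capabilities.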

## 🧪 Verification / duplicates
- Verified (NVIDIA whitepapers, AMD CDNA, Apple MLX, vLLM).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: architecture + PyTorch / Triton / MLX / TRT / vLLM code |