---
id: wiki-2026-0508-gpu
title: GPU
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU, graphics processing unit, NVIDIA, AMD, H100, A100, B200, accelerator]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [gpu, hardware, ai-infra, cuda, ml-acceleration, hpc]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: CUDA / HIP / Metal / WGSL
  framework: PyTorch / TensorRT / CUDA Toolkit
---

# GPU

## One-Line Summary
> **"A massively parallel SIMD-style (SIMT) processor: the workhorse of ML and graphics."** Modern examples: NVIDIA H100/B200, AMD MI300X, Apple Silicon GPUs, and, as a non-GPU accelerator in the same role, Google TPU. GPUs dominate ML compute. Key building blocks: SMs, Tensor Cores, HBM, and NVLink. Cost and availability are a strategic concern for ML teams.

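## Core Concepts

### Architecture (NVIDIA)
- **SM** (Streaming Multiprocessor): the basic compute unit; a GPU contains many SMs.
- **CUDA Core**: scalar FP32 ALU.
- **Tensor Core**: matrix-multiply unit (FP16/BF16/FP8/INT4, depending on generation).
- **Memory hierarchy**: HBM → L2 → L1/SMEM → registers (largest and slowest to smallest and fastest).
- **Warp**: 32 threads executed in lockstep.
- **Block**: a group of threads scheduled onto a single SM.
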
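The warp and block definitions above can be made concrete with a little launch arithmetic. The following is a minimal sketch in plain Python; `launch_geometry` and its field names are hypothetical helpers introduced here for illustration and are not part of CUDA or PyTorch.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_geometry(n_elements: int, block_size: int = 256) -> dict:
    """Return grid size, warps per block, and idle threads for a 1-D launch."""
    if n_elements <= 0 or block_size <= 0:
        raise ValueError("n_elements and block_size must be positive")
    # Blocks needed so that every element gets one thread.
    grid_size = math.ceil(n_elements / block_size)
    # A partially filled warp still occupies a full warp slot on the SM.
    warps_per_block = math.ceil(block_size / WARP_SIZE)
    # Threads in the last block that fall past the end of the data.
    idle_threads = grid_size * block_size - n_elements
    return {
        "grid_size": grid_size,
        "warps_per_block": warps_per_block,
        "idle_threads": idle_threads,
    }


geom = launch_geometry(1_000_000, block_size=256)
print(geom)
```

Choosing a block size that is a multiple of 32 avoids partially filled warps; the idle threads in the tail block are why kernels usually carry a bounds check.
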
### Modern GPUs (2024-2026)
- **NVIDIA H100** (Hopper): 80 GB HBM3, Transformer Engine, FP8.
- **NVIDIA B200** (Blackwell): 192 GB HBM3e, FP4, dual-die package.
- **AMD MI300X**: 192 GB HBM3, ROCm software stack.
- **Apple Silicon** (M3, M4): unified memory, MLX framework.
- **Google TPU v5p**: systolic-array accelerator (not a GPU), programmed mainly through JAX.

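### Metrics
- **TFLOPS**: peak throughput at FP32 / FP16 / FP8.
- **Memory BW**: HBM bandwidth; the limiting factor for memory-bound kernels.
- **Memory size**: determines whether a model fits on one device.
- **NVLink** / InfiniBand: interconnect bandwidth for multi-GPU and multi-node scaling.
- **Power** (TDP).
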
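As a rough illustration of how these metrics interact, the sketch below estimates weight memory for a model and the arithmetic intensity of a matrix multiply. All helper names are hypothetical. The memory estimate counts weights only (no activations, KV cache, or optimizer state), and the peak-throughput and bandwidth values in the usage lines are placeholders, not vendor specifications.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(n_params: float, dtype: str, gpu_memory_gb: float, headroom: float = 0.9) -> bool:
    """True if the weights fit within a fraction `headroom` of device memory."""
    return weight_memory_gb(n_params, dtype) <= headroom * gpu_memory_gb


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], reading A and B and writing C once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Intensity (FLOPs/byte) above which a kernel is compute-bound rather than memory-bound."""
    return peak_tflops / bandwidth_tb_s


print(weight_memory_gb(70e9, "bf16"), fits(70e9, "bf16", 80))
print(weight_memory_gb(70e9, "int4"), fits(70e9, "int4", 80))

# Placeholder hardware numbers, for illustration only.
ridge = ridge_point(peak_tflops=1000.0, bandwidth_tb_s=3.0)
print(matmul_intensity(4096, 4096, 4096) > ridge)  # large square matmul
print(matmul_intensity(1, 4096, 4096) > ridge)     # single-token, GEMV-like matmul
```

Under these assumptions a large square matmul sits well above the ridge point (compute-bound), while a single-row matmul sits far below it (memory-bound), which is why both TFLOPS and memory bandwidth appear in the list above.
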
### Applications
1. **ML training**: dominated by matrix multiplication.
2. **ML inference**.
3. **Graphics**: rasterization and ray tracing.
4. **HPC**: scientific simulation.
5. **Crypto** mining (declining).
6. **Video** (encode/decode).

### Modern AI infrastructure
- **Multi-GPU** (NVLink, NVSwitch).
- **Multi-node** (InfiniBand, RoCE).
- **Distributed training** (FSDP, ZeRO, tensor parallelism).
- **vLLM / TensorRT-LLM** for inference.
- **Quantization** (FP8, INT4).

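To show why sharded training (ZeRO, FSDP) matters, here is a minimal per-GPU memory estimate. It assumes the mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for half-precision weights, 2 bytes for gradients, and 12 bytes for fp32 optimizer state. Activations and buffers are ignored, and `zero_memory_gb` is a hypothetical helper, not a DeepSpeed or PyTorch API.

```python
def zero_memory_gb(n_params: float, n_gpus: int, stage: int) -> float:
    """Approximate per-GPU model-state memory (GB) under ZeRO stages 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2.0 * n_params   # half-precision weights
    grads = 2.0 * n_params    # half-precision gradients
    optim = 12.0 * n_params   # fp32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= n_gpus       # stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus       # stage 2 also shards gradients
    if stage >= 3:
        params /= n_gpus      # stage 3 also shards parameters (comparable to FSDP full sharding)
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 2))
```

The estimate covers model state only, so real peak usage is higher; it is still useful for a first check of whether a given stage can fit a model at all.
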
## 💻 Patterns

### Check GPU (PyTorch)
```python
import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_properties(0))
    print(torch.cuda.get_device_capability(0))  # (major, minor)
# Compute capability >= 7.0 → Tensor Cores (Volta and later)
# Compute capability >= 9.0 → Hopper or newer, FP8 Tensor Cores
```

### Tensor (move to GPU)
```python
import torch

# Create on the CPU, then copy to the GPU
x = torch.randn(1024, 1024).cuda()
# Or allocate directly on the GPU (avoids the host-to-device copy)
x = torch.randn(1024, 1024, device='cuda')
```

### Mixed precision (autocast)
```python
import torch

# Assumes `model`, `optim`, and `x` already exist and `model(x)` returns a scalar loss.
# bf16 keeps the fp32 exponent range, so no GradScaler is required;
# loss scaling is only needed when autocasting to float16.
optim.zero_grad(set_to_none=True)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):  # H100-friendly
    loss = model(x)
loss.backward()
optim.step()
```

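The difference between the two 16-bit formats can be sketched with the standard library alone. This emulation assumes bfloat16 is obtained by truncating a float32 to its top 16 bits (real hardware rounds to nearest), and `to_bf16` / `to_fp16` are hypothetical helper names.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate bfloat16: keep the sign, 8 exponent bits, and top 7 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; overflow becomes +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


for value in (70000.0, 1e-8, 3.14159):
    print(value, to_bf16(value), to_fp16(value))
```

Large values overflow and tiny values underflow in fp16 but survive (with less precision) in bf16, which is the usual reason fp16 training uses loss scaling and bf16 training does not.
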
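### FP8 (H100+)
```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

te_linear = te.Linear(1024, 1024)            # illustrative layer
x = torch.randn(1024, 1024, device='cuda')

# HYBRID: E4M3 in the forward pass, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = te_linear(x)
```
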
### Multi-GPU (DDP)
```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)           # bind this process to its GPU
dist.init_process_group('nccl')
model = DDP(model.cuda(), device_ids=[local_rank])
```

### FSDP (sharded)
```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Assumes the process group is initialized and `local_rank` is set (see DDP above).
fsdp_model = FSDP(
    model,
    # Compute in bf16, reduce gradients in fp32 for numerical stability
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=local_rank,
)
```

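### CUDA kernel (custom)
```cuda
__global__ void vector_add(const float* a, const float* b, float* c, int N) {
    // Global thread index across the whole grid
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Bounds check: the last block may have threads past the end of the arrays
    if (i < N) c[i] = a[i] + b[i];
}

// Launch with 256 threads per block and enough blocks to cover N elements
vector_add<<<(N + 255) / 256, 256>>>(a, b, c, N);
```
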
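For readers without a CUDA toolchain, the indexing scheme of the kernel above can be emulated sequentially in plain Python. This is only a sketch of the index arithmetic, with the hypothetical name `vector_add_emulated`; it runs the "threads" one after another and says nothing about real GPU performance.

```python
def vector_add_emulated(a, b, block_dim=256):
    """Sequentially emulate a 1-D CUDA launch of the vector_add kernel."""
    n = len(a)
    c = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # same ceil-division as the launch line
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global thread index
            if i < n:                               # bounds check for the tail block
                c[i] = a[i] + b[i]
    return c


result = vector_add_emulated([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], block_dim=2)
print(result)
```

With three elements and two threads per block, the second block has one thread whose index falls past the end of the data; the bounds check is what keeps it from writing out of range.
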
### Triton (Python kernel)
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                       # one program per block of elements
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < N                           # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch (x, y, out are CUDA tensors of length N):
# add_kernel[(triton.cdiv(N, 1024),)](x, y, out, N, BLOCK=1024)
```

### MLX (Apple)
```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = mx.add(a, b)
# Arrays live in unified memory; computation is lazy until evaluated
mx.eval(c)
```

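### TensorRT (NVIDIA inference)
```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# `network` is assumed to be a populated INetworkDefinition,
# e.g. created with builder.create_network() and filled by trt.OnnxParser.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
engine = builder.build_serialized_network(network, config)
# Typically cited as a 2-5x speedup over eager PyTorch; the actual gain depends on the model.
```
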
### vLLM (LLM serving)
```python
from vllm import LLM, SamplingParams

prompts = ['Explain NVLink in one sentence.']  # example input
# tensor_parallel_size=4 shards the model across 4 GPUs
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', tensor_parallel_size=4)
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))
```

### Memory profiling
```python
import torch

# Assumes `model` and `x` are already on the GPU.
torch.cuda.reset_peak_memory_stats()
out = model(x)
# Allocated: memory held by live tensors; reserved: memory held by the caching allocator
print(f'Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
print(f'Reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB')
```

### nvidia-smi (CLI)
```bash
nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
nvtop  # interactive monitor (separate tool, installed independently)
```

### Quantization (8-bit + 4-bit)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute; use load_in_8bit=True for the 8-bit variant
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-70B', quantization_config=bnb)
# A 70B model at 4 bits is roughly 35 GB of weights, so it fits on a single 80 GB H100
```

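To illustrate what quantization does to the weights, here is a minimal sketch of symmetric absmax quantization in plain Python. This is a deliberately simplified scheme, not the NF4 format used by bitsandbytes, and `quantize_absmax` / `dequantize` are hypothetical names.

```python
def quantize_absmax(values, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0  # avoid division by zero for all-zero input
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [x * scale for x in q]


weights = [1.27, -0.64, 0.0, 0.32]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, q, max_err)
```

Fewer bits mean a coarser grid and a larger reconstruction error; production schemes reduce that error with per-block scales and non-uniform code books.
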
### Spot / preemptible
```python
# Spot / preemptible cloud GPUs are typically 30-70% cheaper than on-demand,
# but can be reclaimed at any time, so checkpoint frequently.
# `train_step` and `save_checkpoint` are user-defined helpers.
def train_with_checkpoint(model, loader, every_n_steps=100):
    for step, batch in enumerate(loader):
        train_step(batch)
        if step % every_n_steps == 0:
            save_checkpoint(model, step)
```

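A checkpoint is only useful if it survives a preemption that lands mid-write and if training can resume from it. The sketch below shows one way to do both with the standard library: write to a temporary file, then atomically rename it. It uses a JSON dict as a stand-in for real model state, and all function names are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then atomically replace the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())      # make sure the bytes reach disk before the rename
        os.replace(tmp_path, path)    # atomic rename: readers see the old or new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, every_n_steps: int = 100, stop_after=None) -> dict:
    """Resume from `path` if present; `stop_after` simulates a preemption."""
    state = load_checkpoint(path, {"step": 0, "loss_sum": 0.0})
    for step in range(state["step"], total_steps):
        if stop_after is not None and step >= stop_after:
            return state                      # preempted: unsaved progress is lost
        state["loss_sum"] += 1.0              # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % every_n_steps == 0:
            save_checkpoint_atomic(state, path)
    save_checkpoint_atomic(state, path)
    return state
```

In a real job the state would hold model and optimizer tensors (for example saved with `torch.save`), and the checkpoint interval trades lost work against I/O overhead.
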
### Compute capability check
```python
import torch

def best_dtype(device):
    """Pick the lowest-precision dtype the GPU supports natively."""
    cap = torch.cuda.get_device_capability(device)  # (major, minor) tuple
    if cap >= (9, 0): return 'fp8'   # H100, B200
    if cap >= (8, 0): return 'bf16'  # A100, RTX 30+
    if cap >= (7, 0): return 'fp16'  # V100, RTX 20+
    return 'fp32'
```

## Decision Criteria
| Situation | GPU |
|---|---|
| Frontier training | H100 / B200 cluster |
| Cost-aware ML | A100 / L40S |
| LLM inference | H100 + vLLM |
| On-device | M3 Max / RTX 4090 |
| Workstation | RTX 6000 Ada |
| AMD ecosystem | MI300X + ROCm |
| Edge | Jetson Orin |

**Default**: NVIDIA + CUDA + bf16 + FSDP + vLLM. To control cost: quantization, spot instances, and efficient multi-GPU utilization.

## 🔗 Graph
- Parent: [[Hardware]]
- Variants: [[GPU|GPU-Architecture]] · [[CUDA]] · [[Tensor-Core]]
- Applications: [[GPU-Programming-with-CUDA]] · [[Flash Attention]] · [[Distributed-Training]]
- Adjacent: [[TPU]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]] · [[Edge-AI-and-Computing]] · [[LLM_Optimization_and_Deployment_Strategies|vLLM]]

## 🤖 LLM Usage
**When**: ML training and inference, graphics, HPC.
**When not**: workloads for which a CPU alone is sufficient.

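## ❌ Anti-patterns
- **FP32 only**: wastes the low-precision throughput of modern hardware.
- **No multi-GPU sharding for large models**: leads to OOM.
- **Cloud spot instances without checkpointing**: progress is lost on preemption.
- **Same dtype on old and new GPUs**: capability mismatch (for example, bf16 or FP8 on hardware that lacks native support).
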
## 🧪 Verification / Duplicates
- Verified (NVIDIA whitepapers, AMD CDNA, Apple MLX, vLLM).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: architecture + PyTorch / Triton / MLX / TRT / vLLM code |