| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-gpu | GPU | 10_Wiki/Topics | verified | self | | none | A | 0.96 | applied | | | 2026-05-10 | pending | |
GPU
One-line summary
"A SIMD-style parallel processor: the workhorse of ML and graphics." Current examples are the NVIDIA H100/B200, AMD MI300X, and Apple Silicon, alongside Google's TPU (an ASIC accelerator rather than a GPU). GPUs dominate ML compute. Key terms: SM, tensor core, HBM, NVLink. Cost and availability are a strategic concern for ML work.
Key points
Architecture (NVIDIA)
- SM (Streaming Multiprocessor).
- CUDA Core (FP32).
- Tensor Core (matrix mul, FP16/BF16/FP8/INT4).
- Memory hierarchy: HBM → L2 → L1/SMEM → registers.
- Warp: 32 threads.
- Block: a group of threads (one or more warps) scheduled onto a single SM.
Modern GPUs (2024-2026)
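The thread hierarchy above (threads grouped into 32-wide warps, warps grouped into blocks, blocks scheduled onto SMs) can be sketched with plain arithmetic. This is a minimal illustration, not an NVIDIA API: `launch_shape` and its parameters are hypothetical names, and the block size of 256 is only a common choice.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def launch_shape(n_elements: int, block_size: int = 256) -> dict:
    """Hypothetical helper: blocks and warps needed for a 1-D launch."""
    if n_elements < 0 or block_size <= 0:
        raise ValueError("n_elements must be >= 0 and block_size > 0")
    blocks = ceil_div(n_elements, block_size)
    warps_per_block = ceil_div(block_size, WARP_SIZE)
    # Threads in the last block that fall past the end of the data
    idle_threads = blocks * block_size - n_elements
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "idle_threads": idle_threads,
    }


shape = launch_shape(1_000_000)
print(shape)
```

The idle threads are why GPU kernels guard their body with a bounds check such as `if (i < N)`.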
- NVIDIA H100 (Hopper): 80 GB HBM3, Transformer Engine, FP8.
- NVIDIA B200 (Blackwell): 192 GB HBM3e, FP4, dual die.
- AMD MI300X: 192 GB HBM3, ROCm.
- Apple Silicon (M3, M4): unified memory, MLX.
- Google TPU v5p: systolic array, JAX.
Metrics
- TFLOPS: peak throughput at FP32 / FP16 / FP8.
- Memory BW: HBM bandwidth.
- Memory size: determines whether a model fits.
- NVLink / InfiniBand: multi-GPU and multi-node interconnect.
- Power (TDP).
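The TFLOPS and memory-bandwidth metrics above interact: a kernel is limited by whichever resource it exhausts first. A rough roofline-style estimate follows. The peak numbers are placeholders chosen for illustration, not vendor specifications, and the function names are hypothetical.

```python
def matmul_cost(m: int, n: int, k: int, bytes_per_elem: int = 2):
    """FLOPs and minimum bytes moved for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops, bytes_moved


def attainable_tflops(flops: float, bytes_moved: float,
                      peak_tflops: float, mem_bw_tb_s: float) -> float:
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    intensity = flops / bytes_moved  # FLOPs per byte
    return min(peak_tflops, intensity * mem_bw_tb_s)


# Placeholder hardware: 1000 TFLOPS peak, 3 TB/s memory bandwidth (assumed)
PEAK_TFLOPS, MEM_BW_TB_S = 1000.0, 3.0

big = attainable_tflops(*matmul_cost(4096, 4096, 4096), PEAK_TFLOPS, MEM_BW_TB_S)
gemv = attainable_tflops(*matmul_cost(1, 4096, 4096), PEAK_TFLOPS, MEM_BW_TB_S)
print(f"large matmul: {big:.0f} TFLOPS attainable (compute-bound)")
print(f"matrix-vector: {gemv:.1f} TFLOPS attainable (memory-bound)")
```

Under these assumptions the large matmul reaches the compute ceiling, while the matrix-vector product (typical of batch-size-1 LLM decoding) is capped by bandwidth at a small fraction of peak.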
Applications
- ML training: dominated by matrix multiplication.
- ML inference.
- Graphics: rasterization + ray tracing.
- HPC: simulation.
- Crypto (declining).
- Video (encode/decode).
Modern AI infrastructure
- Multi-GPU (NVLink, NVSwitch).
- Multi-node (InfiniBand, RoCE).
- Distributed training (FSDP, ZeRO, TP).
- vLLM / TensorRT-LLM for inference.
- Quantization (FP8, INT4).
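Why sharding (FSDP, ZeRO) matters can be estimated on paper. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam training: 2 bytes of half-precision weights, 2 bytes of gradients, and 12 bytes of fp32 optimizer state per parameter. It ignores activations, buffers, and fragmentation, and `training_state_gb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = 2 + 2 + 12  # weights + gradients + fp32 Adam state


def training_state_gb(n_params: float, n_gpus: int = 1, shard: bool = False) -> float:
    """Per-GPU memory (GB) for model state, with or without full sharding."""
    total_bytes = n_params * BYTES_PER_PARAM
    if shard:
        # Full sharding (ZeRO stage 3 / FSDP) splits all three across GPUs
        total_bytes /= n_gpus
    return total_bytes / 1e9


replicated = training_state_gb(7e9)                   # plain data parallel
sharded = training_state_gb(7e9, n_gpus=8, shard=True)
print(f"7B model, replicated: {replicated:.0f} GB per GPU")
print(f"7B model, sharded over 8 GPUs: {sharded:.0f} GB per GPU")
```

With these assumptions a 7B model needs about 112 GB of state per GPU when replicated, which exceeds a single 80 GB device, but about 14 GB per GPU when sharded eight ways.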
💻 Patterns
Check GPU (PyTorch)
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
# compute capability >= 7.0 → tensor cores available
# compute capability >= 9.0 → Hopper or newer (FP8 via Transformer Engine)
Tensor (move to GPU)
x = torch.randn(1024, 1024).cuda()
# or allocate directly on the GPU, avoiding the host-to-device copy
x = torch.randn(1024, 1024, device='cuda')
Mixed precision (autocast)
import torch
# bf16 has the same exponent range as fp32, so loss scaling (GradScaler)
# is only needed when autocasting to float16.
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):  # H100-friendly
    loss = model(x)
loss.backward()
optim.step()
optim.zero_grad()
FP8 (H100+)
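import transformer_engine.pytorch as te
from transformer_engine.common import recipe  # recipes live in the common package
fp8_recipe = recipe.DelayedScaling(
    margin=0, fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
)
# te_linear is assumed to be a te.Linear module and x a CUDA tensor
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = te_linear(x)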
Multi-GPU (DDP)
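import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group('nccl')
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)
model = DDP(model.cuda(), device_ids=[local_rank])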
FSDP (sharded)
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
# local_rank is assumed to come from the launcher (e.g. the LOCAL_RANK env var)
fsdp_model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=local_rank,
)
CUDA kernel (custom)
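__global__ void vector_add(float* a, float* b, float* c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];  // guard threads past the end of the data
}
// Launch with 256 threads per block; round the grid up so every element is covered
vector_add<<<(N + 255) / 256, 256>>>(a, b, c, N);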
Triton (Python kernel)
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < N  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
MLX (Apple)
import mlx.core as mx
a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = mx.add(a, b)
# Unified memory; evaluation is lazy (call mx.eval(c) to force it)
TensorRT (NVIDIA inference)
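import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# network is assumed to be an already populated network definition (e.g. parsed from ONNX)
engine = builder.build_serialized_network(network, config)
# Often 2-5x faster than eager PyTorch, depending on model and hardware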
vLLM (LLM serving)
from vllm import LLM, SamplingParams
prompts = ['Explain what a GPU is.']  # example prompt, for illustration only
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', tensor_parallel_size=4)  # shard across 4 GPUs
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))
Memory profiling
torch.cuda.reset_peak_memory_stats()
out = model(x)
print(f'Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
print(f'Reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB')
nvidia-smi (CLI)
nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
nvtop  # interactive monitor
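For scripting, the same query can be parsed in Python. A minimal sketch, assuming the `csv,noheader,nounits` format variant of the command above so that each line is plain comma-separated values. `parse_gpu_csv` and `query_gpus` are hypothetical helpers, and the sample text is made up for illustration, not captured from a real machine.

```python
import subprocess

QUERY = "name,memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'name, used, total, util' lines into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        # rsplit keeps any comma inside the device name intact
        name, used, total, util = [f.strip() for f in line.rsplit(",", 3)]
        rows.append({
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
            "util_pct": int(util),
        })
    return rows


def query_gpus() -> list[dict]:
    """Run nvidia-smi (must be on PATH) and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)


# Illustrative input only; real output depends on the machine
sample = "Example GPU, 1024, 81920, 37\nExample GPU, 0, 81920, 0\n"
gpus = parse_gpu_csv(sample)
print(gpus[0])
```

Fields that the driver reports as unavailable would not parse as integers, so production code should handle that case explicitly.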
Quantization (8-bit + 4-bit)
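import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# 4-bit NF4 shown here; for 8-bit use BitsAndBytesConfig(load_in_8bit=True)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16)
# Model id kept as given; substitute a real Hub id or local path
model = AutoModelForCausalLM.from_pretrained('llama-3-70b', quantization_config=bnb)
# At 4 bits, a 70B model fits on a single H100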
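The claim in the comment above can be sanity-checked with arithmetic: weight memory is roughly parameter count times bits per parameter. This rough sketch counts weights only (no KV cache, activations, or quantization metadata). `weight_gb` is a hypothetical helper, and the 80 GB figure is the H100 capacity listed earlier on this page.

```python
def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


H100_GB = 80  # HBM capacity of one H100

sizes = {bits: weight_gb(70e9, bits) for bits in (16, 8, 4)}
for bits, gb in sizes.items():
    verdict = "fits" if gb < H100_GB else "does not fit"
    print(f"70B at {bits:>2}-bit: {gb:.0f} GB, {verdict} on one {H100_GB} GB GPU")
```

At 8 bits the weights alone come close to the limit, so in practice 4-bit is the comfortable single-GPU option once the KV cache is included.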
Spot / preemptible
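# Spot / preemptible cloud GPUs are typically 30-70% cheaper
# Checkpoint frequently so a preemption loses little progress
def train_with_checkpoint(model, every_n_steps=100):
    # loader, train_step and save_checkpoint are assumed to be defined elsewhere
    for step, batch in enumerate(loader):
        train_step(batch)
        if step % every_n_steps == 0:
            save_checkpoint(model, step)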
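Checkpointing only helps if the job can resume after a preemption. Below is a minimal, framework-free sketch of the save/resume cycle, assuming a JSON-serializable state dict and a hypothetical checkpoint path. A real training job would save model and optimizer state (for example with `torch.save`) instead, and the `stop_at` parameter exists only to simulate a preemption.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write atomically so a preemption mid-write cannot corrupt the file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, n_steps: int, every_n_steps: int = 100, stop_at=None) -> dict:
    state = load_checkpoint(path)  # resume from the last saved step
    for step in range(state["step"], n_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated preemption: unsaved progress is lost
        state["total"] += step  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % every_n_steps == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = os.path.join(tmpdir, "ckpt.json")
    train(ckpt, n_steps=500, stop_at=250)         # preempted at step 250
    resumed_from = load_checkpoint(ckpt)["step"]  # last step that reached disk
    final = train(ckpt, n_steps=500)              # resumes and finishes

print(resumed_from, final)
```

The gap between the preemption point and the last saved step is the work that must be redone, which is the trade-off behind choosing `every_n_steps`.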
Compute capability check
import torch

def best_dtype(device):
    cap = torch.cuda.get_device_capability(device)  # (major, minor) tuple
    if cap >= (9, 0): return 'fp8'   # H100, B200
    if cap >= (8, 0): return 'bf16'  # A100, RTX 30+
    if cap >= (7, 0): return 'fp16'  # V100, RTX 20+
    return 'fp32'
Decision criteria
| Scenario | GPU |
|---|---|
| Frontier training | H100 / B200 cluster |
| Cost-aware ML | A100 / L40S |
| LLM inference | H100 + vLLM |
| On-device | M3 Max / RTX 4090 |
| Workstation | RTX 6000 Ada |
| AMD ecosystem | MI300X + ROCm |
| Edge | Jetson Orin |
Default: NVIDIA + CUDA + bf16 + FSDP + vLLM. To control cost: quantization, spot instances, and efficient multi-GPU use.
🔗 Graph
- Parent: Hardware
- Variants: GPU · CUDA · Tensor-Core
- Applications: GPU-Programming-with-CUDA · Flash Attention · Distributed-Training
- Adjacent: TPU · LLM_Optimization_and_Deployment_Strategies · Edge-AI-and-Computing
🤖 LLM usage
When to use: ML training/inference, graphics, HPC. When not to use: workloads where CPU-only is sufficient.
❌ Anti-patterns
- FP32 only: wastes modern hardware.
- No multi-GPU sharding for large models: OOM.
- Cloud spot instances without checkpointing: lost progress.
- Same dtype for old and new GPUs: capability mismatch.
🧪 Verification / duplicates
- Verified (NVIDIA whitepapers, AMD CDNA, Apple MLX, vLLM).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: architecture + PyTorch / Triton / MLX / TRT / vLLM code |