---
id: wiki-2026-0508-gpu
title: GPU
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GPU, graphics processing unit, NVIDIA, AMD, H100, A100, B200, accelerator]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [gpu, hardware, ai-infra, cuda, ml-acceleration, hpc]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: CUDA / HIP / Metal / WGSL
  framework: PyTorch / TensorRT / CUDA Toolkit
---

# GPU

## One-line summary

> **"A SIMT (SIMD-style) parallel processor: the workhorse of ML and graphics."** Modern examples: NVIDIA H100/B200, AMD MI300X, Apple Silicon, and the adjacent Google TPU. GPUs dominate ML compute. Key building blocks: SMs, tensor cores, HBM, NVLink. Cost and availability are a strategic concern for ML.

## Core

### Architecture (NVIDIA)

- **SM** (Streaming Multiprocessor).
- **CUDA Core** (FP32).
- **Tensor Core** (matrix multiply; FP16/BF16/FP8/INT4, depending on generation).
- **Memory hierarchy**: HBM → L2 → L1/SMEM → registers.
- **Warp**: 32 threads executing in lockstep.
- **Block**: a group of threads scheduled onto a single SM.

### Modern GPUs and accelerators (2024-2026)

- **NVIDIA H100** (Hopper): 80GB HBM3, Transformer Engine, FP8.
- **NVIDIA B200** (Blackwell): 192GB HBM3e, FP4, dual die.
- **AMD MI300X**: 192GB HBM3, ROCm.
- **Apple Silicon** (M3, M4): unified memory, MLX.
- **Google TPU v5p**: systolic array, JAX (not a GPU; listed for comparison).

### Metrics

- **TFLOPS**: FP32 / FP16 / FP8 throughput.
- **Memory BW**: HBM bandwidth.
- **Memory size**: determines whether a model fits.
- **NVLink** / InfiniBand: multi-GPU and multi-node interconnect.
- **Power** (TDP).

### Applications

1. **ML training**: dominated by matrix multiplication.
2. **ML inference**.
3. **Graphics**: rasterization + ray tracing.
4. **HPC**: simulation.
5. **Crypto** (declining).
6. **Video** (encode/decode).

### Modern AI infra

- **Multi-GPU** (NVLink, NVSwitch).
- **Multi-node** (InfiniBand, RoCE).
- **Distributed training** (FSDP, ZeRO, TP).
- **vLLM / TensorRT-LLM** for inference.
- **Quantization** (FP8, INT4).
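
The "memory size" metric and the quantization bullet above can be tied together with back-of-the-envelope arithmetic. The following is a minimal sketch, not a sizing tool: the helper names (`weight_memory_gb`, `fits`) and the 20% `overhead_fraction` reserved for activations, KV cache, and allocator overhead are illustrative assumptions, and the estimate covers weights only.

```python
# Bytes needed to store one parameter in each dtype (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params: float, dtype: str, gpu_memory_gb: float,
         overhead_fraction: float = 0.2) -> bool:
    """Rough check: do the weights fit after reserving an assumed overhead?

    overhead_fraction is a hypothetical safety margin, not a measured value.
    """
    usable_gb = gpu_memory_gb * (1.0 - overhead_fraction)
    return weight_memory_gb(num_params, dtype) <= usable_gb


if __name__ == "__main__":
    for dtype in ("bf16", "fp8", "int4"):
        gb = weight_memory_gb(70e9, dtype)
        print(f"70B @ {dtype}: {gb:.0f} GB weights, fits in 80 GB: {fits(70e9, dtype, 80)}")
```

Under these assumptions a 70B-parameter model needs about 140 GB of weights in bf16 (more than one 80GB H100) but about 35 GB in INT4, which is why quantization or multi-GPU sharding is the usual answer for large models.
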
## 💻 Patterns

### Check GPU (PyTorch)

```python
import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
# compute capability >= 7.0 → tensor cores
# compute capability >= 9.0 → Hopper / FP8
```

### Tensor (move to GPU)

```python
import torch

x = torch.randn(1024, 1024).cuda()
# or allocate directly on the GPU and skip the host-to-device copy
x = torch.randn(1024, 1024, device='cuda')
```

### Mixed precision (autocast)

```python
import torch

# bf16 keeps the fp32 exponent range, so no GradScaler is needed;
# use torch.amp.GradScaler only when autocasting to float16.
optim.zero_grad(set_to_none=True)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):  # H100 friendly
    loss = model(x)  # assumes the model returns a scalar loss
loss.backward()
optim.step()
```

### FP8 (H100+)

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
)
te_linear = te.Linear(1024, 1024).cuda()  # example layer; sizes are illustrative
x = torch.randn(16, 1024, device='cuda')
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = te_linear(x)
```

### Multi-GPU (DDP)

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun, which sets LOCAL_RANK for each process
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
dist.init_process_group('nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
```

### FSDP (sharded)

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Assumes the process group is initialized and local_rank is set as in the DDP example
fsdp_model = FSDP(
    model,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.float32),
    device_id=local_rank,
)
```

### CUDA kernel (custom)

```cuda
__global__ void vector_add(float* a, float* b, float* c, int N) {
    // One thread per element; guard against the partial last block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) c[i] = a[i] + b[i];
}

// Launch with 256 threads per block; a, b, c must be device pointers
vector_add<<<(N + 255) / 256, 256>>>(a, b, c, N);
```

### Triton (Python kernel)

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < N  # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

### MLX (Apple)

```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = mx.add(a, b)  # unified memory; evaluation is lazy
mx.eval(c)        # force the computation
```

### TensorRT (NVIDIA inference)

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# `network` is assumed to be an already populated network definition (e.g. parsed from ONNX)
serialized_engine = builder.build_serialized_network(network, config)
# Often cited as roughly 2-5x faster than eager PyTorch; the gain is workload-dependent
```

### vLLM (LLM serving)

```python
from vllm import LLM, SamplingParams

prompts = ['Explain NVLink in one sentence.']  # example prompt
llm = LLM(model='meta-llama/Llama-3.1-70B-Instruct', tensor_parallel_size=4)
outputs = llm.generate(prompts, SamplingParams(max_tokens=200))
```

### Memory profiling

```python
import torch

torch.cuda.reset_peak_memory_stats()
out = model(x)
print(f'Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB')
print(f'Reserved: {torch.cuda.max_memory_reserved() / 1e9:.2f} GB')
```

### nvidia-smi (CLI)

```bash
nvidia-smi --query-gpu=name,memory.used,memory.total,utilization.gpu --format=csv
nvtop  # interactive
```

### Quantization (8-bit + 4-bit)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4; for 8-bit use BitsAndBytesConfig(load_in_8bit=True) instead
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16)
# 'llama-3-70b' is a placeholder; substitute the actual Hub model id
model = AutoModelForCausalLM.from_pretrained('llama-3-70b', quantization_config=bnb)
# A 70B model at 4 bits (about 35 GB of weights) fits on a single 80GB H100
```

### Spot / preemptible

```python
# Spot instances are typically 30-70% cheaper than on-demand,
# but they can be preempted at any time, so checkpoint frequently.
# loader, train_step and save_checkpoint are supplied by the caller.
def train_with_checkpoint(model, loader, train_step, save_checkpoint, every_n_steps=100):
    for step, batch in enumerate(loader):
        train_step(batch)
        if (step + 1) % every_n_steps == 0:
            save_checkpoint(model, step)
```

### Compute capability check

```python
import torch

def best_dtype(device):
    cap = torch.cuda.get_device_capability(device)  # (major, minor)
    if cap >= (9, 0): return 'fp8'   # H100, B200
    if cap >= (8, 0): return 'bf16'  # A100, RTX 30+
    if cap >= (7, 0): return 'fp16'  # V100, RTX 20+
    return 'fp32'
```

## Decision criteria

| Situation | GPU |
|---|---|
| Frontier training | H100 / B200 cluster |
| Cost-aware ML | A100 / L40S |
| LLM inference | H100 + vLLM |
| On-device | M3 Max / RTX 4090 |
| Workstation | RTX 6000 Ada |
| AMD ecosystem | MI300X + ROCm |
| Edge | Jetson Orin |

**Default**: NVIDIA + CUDA + bf16 + FSDP + vLLM. To cut cost: quantization + spot instances + efficient multi-GPU use.

## 🔗 Graph

- Parent: [[Hardware]]
- Variants: [[GPU|GPU-Architecture]] · [[CUDA]] · [[Tensor-Core]]
- Applications: [[GPU-Programming-with-CUDA]] · [[Flash Attention]] · [[Distributed-Training]]
- Adjacent: [[TPU]] · [[LLM_Optimization_and_Deployment_Strategies|Quantization]] · [[Edge-AI-and-Computing]] · [[LLM_Optimization_and_Deployment_Strategies|vLLM]]

## 🤖 LLM usage

**When**: ML training/inference, graphics, HPC.
**When not**: a CPU-only setup is sufficient.

## ❌ Anti-patterns

- **FP32 only**: wastes the low-precision throughput of modern hardware.
- **No multi-GPU sharding for big models**: leads to OOM.
- **Cloud spot instances without checkpoints**: a preemption loses training progress.
- **Same dtype for old + new GPUs**: compute capability mismatch.

## 🧪 Verification / duplicates

- Verified (NVIDIA whitepapers, AMD CDNA, Apple MLX, vLLM).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | Auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: architecture + PyTorch / Triton / MLX / TRT / vLLM code |