---
id: wiki-2026-0508-pytorch-foundations
title: PyTorch Foundations
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [PyTorch Basics, PyTorch Core, torch fundamentals]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [pytorch, deep-learning, tensors, autograd]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch-2.x
---

# PyTorch Foundations

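## One-line summary
> **"Tensor + autograd + nn.Module + DataLoader"**. First released in 2016 by Soumith Chintala and colleagues at Facebook AI Research (now Meta). A NumPy-like tensor library with GPU acceleration and automatic differentiation. As of 2026, PyTorch 2.x (`torch.compile`, FSDP2, the MPS backend, `torch.func`) is the de facto default deep learning framework.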

## Core concepts

### The four pillars
- **Tensor**: N-dimensional array on CPU, CUDA GPU, or MPS, optionally tracked by autograd.
- **Autograd**: reverse-mode automatic differentiation, triggered by `.backward()`.
- **nn.Module**: container for layers and their state (parameters and buffers).
- **DataLoader**: batched, parallel data pipeline.

### Devices
- **CUDA**: NVIDIA GPUs. The production default.
- **MPS**: Apple Silicon. Typical for development machines.
- **ROCm**: AMD GPUs. Growing support.
- **XPU**: Intel GPUs.
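The backends listed above can be probed at runtime so the same script runs on any machine. The helper below is a minimal sketch: `pick_device` is a hypothetical name, not a PyTorch API, and the priority order is an assumption. ROCm builds of PyTorch are reached through the `torch.cuda` namespace, so they fall under the first branch.

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device, falling back to CPU (hypothetical helper)."""
    if torch.cuda.is_available():            # NVIDIA CUDA, and ROCm builds of PyTorch
        return torch.device("cuda")
    if torch.backends.mps.is_available():    # Apple Silicon
        return torch.device("mps")
    # torch.xpu only exists on recent builds, so guard the attribute lookup
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")

device = pick_device()
x = torch.ones(2, 3, device=device)          # allocate directly on the chosen device
print(device, x.device)
```

Passing `device=device` at creation time avoids allocating on CPU first and copying afterwards.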

### Applications
1. Vision (timm, torchvision).
2. NLP / LLM (transformers; also the backend for vLLM).
3. Diffusion (diffusers).
4. RL (cleanrl, torchrl).
5. Scientific ML (PINN, geometric DL).

## 💻 Patterns

### Tensor basics
```python
import torch

x = torch.randn(3, 4, device="cuda", dtype=torch.float32)
y = torch.arange(12).reshape(3, 4).float().cuda()

z = x @ y.T          # matmul
w = x.mean(dim=0)    # reduction
print(x.shape, x.dtype, x.device)
```

### Autograd
```python
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()
print(x.grad)  # tensor([2., 4., 6.])
```

### nn.Module
```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in, d_h, d_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_h), nn.GELU(),
            nn.Linear(d_h, d_h), nn.GELU(),
            nn.Linear(d_h, d_out),
        )
    def forward(self, x):
        return self.net(x)

model = MLP(784, 256, 10).cuda()
```

### Training loop (canonical)
```python
from torch.utils.data import DataLoader

opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)

for epoch in range(10):
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        opt.zero_grad(set_to_none=True)
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()
```
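The loop above only trains. An evaluation pass usually pairs `model.eval()` with `torch.no_grad()`; the sketch below shows one way to write it. The `evaluate` function, the toy dataset, and the toy model are illustrative assumptions, not part of the original note.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()                      # no graph is recorded inside this function
def evaluate(model, loader, device):
    """Return top-1 accuracy of `model` over `loader` (illustrative helper)."""
    model.eval()                      # Dropout off, BatchNorm uses running statistics
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = model(x).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# Toy data and model, only to make the sketch runnable on CPU
torch.manual_seed(0)
toy = TensorDataset(torch.randn(32, 8), torch.randint(0, 3, (32,)))
toy_model = nn.Sequential(nn.Linear(8, 16), nn.Dropout(0.5), nn.Linear(16, 3))
acc = evaluate(toy_model, DataLoader(toy, batch_size=8), torch.device("cpu"))
print(f"accuracy: {acc:.3f}")
```

The helper leaves the model in eval mode, so call `model.train()` before resuming training.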

### torch.compile (2.x, opt-in)
```python
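# Often a sizable speedup for a one-line change; the actual gain depends on
# the model, input shapes, and hardware, so measure before relying on it.
model = torch.compile(model, mode="reduce-overhead")
# mode: "default" | "reduce-overhead" | "max-autotune"
# "reduce-overhead" uses CUDA graphs to cut per-kernel launch overhead.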
```

### Mixed precision (bf16 / amp)
```python
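from torch.amp import autocast

# bf16 has the same exponent range as fp32, so no loss scaling is needed.
# For fp16, wrap backward/step with torch.amp.GradScaler("cuda") instead.
for x, y in loader:
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    opt.zero_grad(set_to_none=True)
    with autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = loss_fn(model(x), y)
    loss.backward()   # backward runs outside the autocast context
    opt.step()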
```
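Autocast changes the dtype in which eligible ops run, not the dtype in which parameters are stored. The sketch below checks this on CPU so it runs without a GPU. The layer and tensor names are illustrative, and the use of `device_type="cpu"` is an assumption made only for portability.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)            # parameters are created in float32
x = torch.randn(3, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = layer(x)                 # the linear op runs in bfloat16 inside the context

out.float().sum().backward()       # backward is called outside the autocast context

# Stored weights and their gradients stay in float32; only the activation is bf16
print(layer.weight.dtype, out.dtype, layer.weight.grad.dtype)
```

This is why the model itself does not need to be cast with `.bfloat16()` when autocast is used.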

### Custom Dataset
```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class CSVDataset(Dataset):
    """Rows of a numeric CSV; the last column is the integer class label."""
    def __init__(self, path, transform=None):
        self.df = pd.read_csv(path)
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        # Use .iloc for positional access; plain row[-1] is label-based on a Series.
        x = torch.tensor(row.iloc[:-1].to_numpy(dtype="float32"))
        y = torch.tensor(int(row.iloc[-1]), dtype=torch.long)
        return (self.transform(x), y) if self.transform else (x, y)
```

### Save / load
```python
# state_dict (recommended)
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt", weights_only=True))

# safetensors (preferred for sharing: no pickle, so no code execution on load)
from safetensors.torch import save_file, load_file
save_file(model.state_dict(), "model.safetensors")
model.load_state_dict(load_file("model.safetensors"))
```
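A `state_dict` holds only tensors, so loading it requires building the architecture first. The round trip below is a minimal sketch of that flow. `make_model` is a hypothetical factory, and an in-memory `io.BytesIO` buffer stands in for a file on disk.

```python
import io
import torch
import torch.nn as nn

def make_model() -> nn.Module:
    """Hypothetical factory; the architecture must match at save and load time."""
    return nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 2))

torch.manual_seed(0)
trained = make_model()

buffer = io.BytesIO()                       # stands in for "model.pt"
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

restored = make_model()                     # fresh instance with different random weights
state = torch.load(buffer, map_location="cpu", weights_only=True)
restored.load_state_dict(state)             # strict=True by default: keys must match

x = torch.randn(5, 4)
same = torch.equal(trained(x), restored(x)) # identical weights give identical outputs
print(same)
```

`map_location="cpu"` lets a checkpoint saved on a GPU machine load on a machine without one.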

### Distributed (FSDP2, the 2026 default for large models)
```python
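import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 API (public path in recent 2.x releases)

# Assumes launch via torchrun, which sets LOCAL_RANK for each process.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group("nccl")
model = MLP(...).cuda()
fully_shard(model)  # shards parameters in place; build the optimizer after this call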
```

### torch.func (functional API)
```python
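from torch.func import vmap, grad

# model_fn(params, x) is a placeholder for a pure function of the parameters
# (e.g. built with torch.func.functional_call); X and Y are batched inputs and targets.
def loss(params, x, y):
    # Under vmap, x and y are single samples without the batch dimension.
    return ((model_fn(params, x) - y) ** 2).mean()

# in_dims=(None, 0, 0): share params, map over dim 0 of X and Y.
per_sample_grads = vmap(grad(loss), in_dims=(None, 0, 0))(params, X, Y)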
```
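A self-contained version of the per-sample gradient pattern can be checked against a plain Python loop. The sketch below assumes a small `nn.Linear` as the model and uses `torch.func.functional_call` to turn it into a function of its parameters. All names (`net`, `sample_loss`, `ref`) are illustrative.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, grad, vmap

torch.manual_seed(0)
net = nn.Linear(3, 1)
params = {k: v.detach() for k, v in net.named_parameters()}
X, Y = torch.randn(5, 3), torch.randn(5, 1)

def sample_loss(params, x, y):
    # x and y are single samples here, so restore the batch dimension for the module
    pred = functional_call(net, params, (x.unsqueeze(0),))
    return ((pred - y.unsqueeze(0)) ** 2).mean()

# One gradient per sample: params are shared, X and Y are mapped over dim 0
per_sample = vmap(grad(sample_loss), in_dims=(None, 0, 0))(params, X, Y)

# Reference: the same gradients computed one sample at a time with .backward()
ref = []
for i in range(X.shape[0]):
    net.zero_grad()
    ((net(X[i:i + 1]) - Y[i:i + 1]) ** 2).mean().backward()
    ref.append(net.weight.grad.clone())
ref = torch.stack(ref)

print(per_sample["weight"].shape)
```

The result is a dict keyed like `named_parameters()`, with a new leading batch dimension on every gradient.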

## Decision criteria
| Situation | Approach |
|---|---|
| Single-GPU training | `model.cuda()` + `torch.compile` |
| Multi-GPU, single node | DDP |
| Model > GPU memory | **FSDP2** |
| Apple Silicon dev | MPS backend |
| LLM-scale inference | vLLM / TensorRT-LLM |
| Quick prototype | Lightning or pure loop |

**Defaults**: PyTorch 2.x + bf16 + torch.compile + AdamW.

## 🔗 Graph
- Parent: [[Deep-Learning]]
- Variants: [[JAX]] · [[TensorFlow]]
- Applications: [[Transformer_Architecture_and_LLM_Foundations|Transformers]] · [[Diffusion-Models]] · [[Reinforcement-Learning]]
- Adjacent: [[Lightning]] · [[Triton]]

## 🤖 Using LLMs
**When**: boilerplate training loops, shape debugging, custom op skeletons.
**When not**: do not trust hot-path numerical code without review. Watch for hallucinated APIs (e.g., an incorrect custom autograd op).

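## ❌ Anti-patterns
- **`backward()` without `zero_grad()`**: gradients accumulate across steps, a silent bug unless accumulation is intended.
- **Forgetting `model.eval()` and `torch.no_grad()` at evaluation**: without `no_grad()` the graph is recorded and memory is wasted; without `eval()`, Dropout stays active and BatchNorm keeps updating its running statistics, so metrics are wrong.
- **CPU↔GPU transfers on every step**: a PCIe bottleneck. Keep data on the device where possible; for loader batches use `pin_memory=True` and `non_blocking=True`.
- **In-place ops on autograd-tracked tensors**: `x += 1` raises on a leaf that requires grad, and on intermediate tensors it can invalidate values saved for backward.
- **`weights_only=False` in `torch.load`**: pickle RCE risk on untrusted files. The default became `weights_only=True` in PyTorch 2.6; on older versions pass it explicitly.
- **`zero_grad(set_to_none=False)`**: zero-filling gradients is wasteful; `set_to_none=True` has been the default since PyTorch 2.0.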
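Two of the pitfalls above, gradient accumulation and in-place updates, can be reproduced in a few lines on CPU. This is a minimal sketch with illustrative variable names.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()           # gradient of sum(w^2) is 2w

(w ** 2).sum().backward()        # no reset in between: the new gradient is added
accumulated = w.grad.clone()

w.grad = None                    # per-parameter effect of zero_grad(set_to_none=True)
(w ** 2).sum().backward()
fresh = w.grad.clone()

# An in-place update of a leaf that requires grad is rejected by autograd
try:
    w += 1
    inplace_error = None
except RuntimeError as err:
    inplace_error = str(err)

# Parameter updates belong under no_grad, which is what optimizers do internally
with torch.no_grad():
    w -= 0.1 * fresh

print(first, accumulated, fresh, inplace_error is not None)
```

The second `backward()` doubles the stored gradient, which is exactly what a missing `zero_grad()` does inside a training loop.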

## 🧪 Verification / duplicates
- Verified (pytorch.org docs, PyTorch 2.x release notes).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — PyTorch 2.x foundations canonical. |