| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-pytorch-lightning | PyTorch Lightning | 10_Wiki/Topics | verified | self | | none | A | 0.9 | applied | | | 2026-05-10 | pending | |
PyTorch Lightning
In one line
"PyTorch boilerplate elimination: a research-style structured trainer." Lightning separates the LightningModule (model + optimizers + step logic) from the Trainer (loop + distributed + logging). As of 2026 it remains strong for research and classical deep learning, while HuggingFace Trainer, Accelerate, and TRL dominate LLM-era training.
Key points
LightningModule lifecycle
- __init__: model + hparams.
- forward(x): inference.
- training_step(batch, idx) -> loss: per-batch training.
- validation_step / test_step: evaluation.
- configure_optimizers() -> optimizer | ([optimizers], [schedulers]): optimizer + scheduler.
- on_*_epoch_end hooks for aggregation.
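The lifecycle above can be sketched as a toy loop in plain Python. This only illustrates the call order; `ToyModule` and `toy_fit` are hypothetical names, not Lightning code, and the sketch leaves out the backward pass, the optimizer step, and the sanity-check validation run.

```python
# Plain-Python sketch of the order in which a Trainer-like loop calls
# LightningModule hooks. ToyModule and toy_fit are hypothetical names.

class ToyModule:
    def __init__(self):
        self.calls = []  # records hook names in call order

    def configure_optimizers(self):
        self.calls.append("configure_optimizers")

    def training_step(self, batch, idx):
        self.calls.append(f"training_step[{idx}]")
        return 0.0  # a real module returns the loss tensor

    def validation_step(self, batch, idx):
        self.calls.append(f"validation_step[{idx}]")

    def on_validation_epoch_end(self):
        self.calls.append("on_validation_epoch_end")

    def on_train_epoch_end(self):
        self.calls.append("on_train_epoch_end")


def toy_fit(module, train_batches, val_batches, max_epochs=1):
    module.configure_optimizers()  # once, before training starts
    for _ in range(max_epochs):
        for idx, batch in enumerate(train_batches):
            module.training_step(batch, idx)  # backward/step omitted
        for idx, batch in enumerate(val_batches):
            module.validation_step(batch, idx)
        module.on_validation_epoch_end()
        module.on_train_epoch_end()  # runs after validation in this sketch
    return module.calls


calls = toy_fit(ToyModule(), train_batches=[1, 2], val_batches=[1])
```

The point of the separation is that the module only describes what happens per batch, while the loop decides when each hook runs.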
Trainer features
- Multi-GPU (DDP, FSDP), TPU, MPS automatic.
- Mixed precision (precision="bf16-mixed").
- Gradient accumulation, clipping built-in.
- Callbacks (EarlyStopping, ModelCheckpoint, LR monitor).
- Loggers (TensorBoard, WandB, MLflow, CSV).
- fast_dev_run, overfit_batches, limit_*_batches for debugging.
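Two of the built-ins listed above come down to simple arithmetic. The sketch below is plain Python, not Lightning internals: `effective_batch_size` and `clip_by_global_norm` are hypothetical helper names, and the clipping formula assumes norm-based clipping with a small epsilon in the denominator.

```python
import math


def effective_batch_size(per_device_batch, devices, accumulate_grad_batches):
    """Samples contributing to one optimizer step under data parallelism."""
    return per_device_batch * devices * accumulate_grad_batches


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scales gradients up
    return [g * scale for g in grads]


# 64 samples per device, 4 devices, 4 accumulated batches per step
batch = effective_batch_size(64, devices=4, accumulate_grad_batches=4)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

When accumulation or the device count changes, the effective batch size changes with it, so the learning rate may need retuning.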
Lightning vs. alternatives (2026)
| Framework | Best for |
|---|---|
| Lightning | research, classical CV/NLP, structured projects |
| HF Trainer | HF-ecosystem (transformers + datasets), LLM SFT |
| HF Accelerate | minimal wrapper, retain raw PyTorch loop |
| TRL | RLHF / DPO / GRPO, LLM post-training |
| MosaicML Composer | streaming, throughput-optimized |
| raw PyTorch | full control, simple scripts |
Applications
- CV training (image classification, segmentation, detection).
- Tabular DL (TabNet, FT-Transformer).
- Audio / speech (W2V2 finetune).
- Mid-size LLM finetune (when not using HF Trainer).
- Self-supervised pretraining (SimCLR, MAE).
💻 Patterns
Minimal LightningModule
import lightning as L
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class LitClassifier(L.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()  # stores lr under self.hparams
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, 10),
        )
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(-1) == y).float().mean()
        # sync_dist=True averages the metric across ranks under DDP
        self.log("val_acc", acc, prog_bar=True, sync_dist=True)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
        return [opt], [sched]  # scheduler steps once per epoch by default
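To see what the scheduler in `configure_optimizers` does to the learning rate, here is the closed-form cosine annealing curve in plain Python. This is a sketch under the assumption that the scheduler steps once per epoch and `eta_min` is 0; `cosine_annealing_lr` is a hypothetical helper, not a PyTorch function.

```python
import math


def cosine_annealing_lr(base_lr, step, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


# lr=1e-3 and T_max=10, matching the module above
schedule = [cosine_annealing_lr(1e-3, epoch, t_max=10) for epoch in range(11)]
```

The curve is periodic, so training for more epochs than `T_max` would raise the rate again. If a single decay is intended, setting `T_max` to the planned epoch count is the usual choice.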
Trainer with callbacks
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor
from lightning.pytorch.loggers import WandbLogger

trainer = L.Trainer(
    max_epochs=20,
    accelerator="auto",  # cuda / mps / cpu
    devices="auto",
    precision="bf16-mixed",
    accumulate_grad_batches=4,
    gradient_clip_val=1.0,
    callbacks=[
        EarlyStopping(monitor="val_acc", mode="max", patience=3),
        ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=2),
        LearningRateMonitor(),
    ],
    logger=WandbLogger(project="lit-mnist"),
)
# train_dl and val_dl are DataLoaders built elsewhere
trainer.fit(LitClassifier(), train_dl, val_dl)
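The `EarlyStopping(monitor="val_acc", mode="max", patience=3)` setting can be read as the following counter logic. This is a plain-Python sketch, not the callback's source: `early_stop_epoch` is a hypothetical helper, and it assumes one validation score per epoch and that only a strict improvement resets the counter.

```python
def early_stop_epoch(scores, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0  # consecutive checks without improvement
    for epoch, score in enumerate(scores):
        if score - min_delta > best:  # mode="max": strictly better counts
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Best score 0.60 at epoch 1, then three checks that fail to beat it
stopped_at = early_stop_epoch([0.50, 0.60, 0.59, 0.58, 0.60])
```

Matching the previous best does not reset the counter in this sketch, which is why the final 0.60 still triggers the stop.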
Multi-GPU DDP
trainer = L.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",  # or "fsdp" for >7B params
    precision="bf16-mixed",
    sync_batchnorm=True,
)
# Launch with `python train.py`; Lightning starts one worker process per device itself
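Under DDP each process sees a different slice of the dataset. The sketch below shows the striding-and-padding scheme that a distributed sampler typically applies. It is plain Python for illustration only: `shard_indices` is a hypothetical helper, and it assumes no shuffling and no dropping of the tail.

```python
import math


def shard_indices(num_samples, world_size, rank):
    """Indices one rank would see: padded to equal length, then strided."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # pad by repeating from the start
    return indices[rank:total:world_size]


# 10 samples across 4 ranks: every rank gets 3 indices, 2 of them are repeats
shards = [shard_indices(10, world_size=4, rank=r) for r in range(4)]
```

Because of the padding, a few samples can be counted twice in per-epoch metrics when the dataset size is not divisible by the number of processes.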
FSDP for large model
from functools import partial

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# MyTransformerBlock is a placeholder for your model's transformer layer class
policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={MyTransformerBlock})
trainer = L.Trainer(
    devices=8,
    strategy=FSDPStrategy(auto_wrap_policy=policy, cpu_offload=False),
    precision="bf16-mixed",
)
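A back-of-envelope estimate shows why sharding matters at the ">7B params" scale mentioned earlier. The sketch assumes 16 bytes per parameter (fp32 weights, fp32 gradients, and two Adam moments), perfectly even sharding, and no activations or buffers. Real usage will differ, and `training_state_gib` is a hypothetical helper.

```python
def training_state_gib(num_params, devices=1, bytes_per_param=16):
    """Rough per-device memory (GiB) for weights, gradients, and Adam state.

    Assumption: 4 B weights + 4 B gradients + 8 B Adam moments per parameter,
    sharded evenly across devices; activations are ignored.
    """
    return num_params * bytes_per_param / devices / 2**30


replicated = training_state_gib(7e9)         # every device holds a full copy (DDP-like)
sharded = training_state_gib(7e9, devices=8)  # state split across 8 devices (FSDP-like)
```

Under these assumptions a 7B model needs roughly 104 GiB of state per device when replicated and about 13 GiB when sharded across eight devices.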
LightningDataModule
import lightning as L
from torch.utils.data import DataLoader


class MNISTDataModule(L.LightningDataModule):
    def __init__(self, batch_size=64):
        super().__init__()
        self.bs = batch_size

    def prepare_data(self):
        # Download only; do not assign state here
        from torchvision.datasets import MNIST
        MNIST(".", train=True, download=True)

    def setup(self, stage=None):
        # Runs on every process; build the datasets here
        from torchvision.datasets import MNIST
        from torchvision import transforms
        t = transforms.ToTensor()
        self.train = MNIST(".", train=True, transform=t)
        self.val = MNIST(".", train=False, transform=t)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=self.bs, num_workers=4, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=self.bs, num_workers=4)
LightningCLI (config-driven)
# train.py
# Assumes LitClassifier and MNISTDataModule are defined or imported in this file
from lightning.pytorch.cli import LightningCLI


def main():
    LightningCLI(LitClassifier, MNISTDataModule)


if __name__ == "__main__":
    main()

# python train.py fit --config config.yaml --trainer.max_epochs=30
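The command above layers a dotted command-line override on top of the values in `config.yaml`. The sketch below mimics that merge with plain dictionaries. The config contents are hypothetical (the source does not show the file), the `trainer` / `model` / `data` sections are an assumption about its layout, and `apply_override` is an illustrative helper rather than the CLI's actual parser.

```python
import copy


def apply_override(config, dotted_key, value):
    """Return a copy of config with a dotted key such as 'trainer.max_epochs' set."""
    result = copy.deepcopy(config)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create missing sections on the way
    node[leaf] = value
    return result


# Hypothetical contents of config.yaml, written as a dict
config = {
    "trainer": {"max_epochs": 20, "precision": "bf16-mixed"},
    "model": {"lr": 1e-3},
    "data": {"batch_size": 64},
}
merged = apply_override(config, "trainer.max_epochs", 30)
```

Keeping defaults in the file and passing only the changed values on the command line makes each experiment's difference from the baseline easy to read from its launch command.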
Resume from checkpoint
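# Resume training: restores model weights, optimizer state, and epoch counters
trainer.fit(model, datamodule=datamodule, ckpt_path="lightning_logs/version_3/checkpoints/last.ckpt")
# or load the model standalone (weights + saved hyperparameters only)
model = LitClassifier.load_from_checkpoint("path.ckpt")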
Manual optimization (GAN, RL)
class LitGAN(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        ...

    def training_step(self, batch, idx):
        opt_g, opt_d = self.optimizers()
        # discriminator step
        opt_d.zero_grad()
        d_loss = ...
        self.manual_backward(d_loss)
        opt_d.step()
        # generator step
        opt_g.zero_grad()
        g_loss = ...
        self.manual_backward(g_loss)
        opt_g.step()
Decision criteria
| Situation | Approach |
|---|---|
| Research, multi-experiment, structured | Lightning |
| HF transformers SFT | HF Trainer (closer to ecosystem) |
| Custom training loop, retain control | Accelerate |
| LLM RLHF / DPO / GRPO | TRL |
| Single-GPU script <100 lines | raw PyTorch |
| Need callbacks + DDP fast | Lightning |
Defaults: for research / non-HF training, Lightning + bf16-mixed + DDP. For HF transformers jobs, HF Trainer. For LLM post-training, TRL.
🔗 Graph
🤖 Using LLMs
When to use: scaffolding a LightningModule from an architecture description, generating callback config, debugging DDP issues. When not to: deep performance tuning (FSDP wrap policy, custom strategy); verify with a profiler, since LLM output often relies on outdated APIs.
❌ Anti-patterns
- .cuda() inside a LightningModule: Lightning manages device placement; use self.device or just rely on the Trainer.
- Manual DDP setup: Lightning handles it; don't double-wrap.
- Logging in DDP without sync_dist=True: only rank 0's value is recorded, so the metric is never aggregated across ranks.
- automatic_optimization=True for a GAN: the loss flow is silently wrong; use manual optimization.
- Pinning to old Lightning 1.x: the 2.x API changed (lightning.pytorch namespace), and 2.x+ is the standard in 2026.
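The `sync_dist` item is easiest to see with numbers. The sketch below contrasts the value that ends up in the log when only rank 0 reports with the mean over all ranks. It is plain Python with hypothetical helper names, and it assumes mean reduction and equal batch sizes on every rank.

```python
def logged_without_sync(per_rank_values):
    """Only rank 0's local value reaches the logger."""
    return per_rank_values[0]


def logged_with_sync(per_rank_values):
    """Mean over all ranks, as a mean reduction would produce."""
    return sum(per_rank_values) / len(per_rank_values)


# Hypothetical per-rank validation accuracies on 4 GPUs
per_rank_acc = [0.9, 0.7, 0.8, 0.6]
unsynced = logged_without_sync(per_rank_acc)
synced = logged_with_sync(per_rank_acc)
```

In this example the unsynced log overstates accuracy by 0.15, which is enough to mislead EarlyStopping or ModelCheckpoint when they monitor that metric.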
🧪 Verification / duplicates
- Verified (lightning.ai docs 2026, Lightning 2.x release notes, Falcon 2019 origin paper, Lightning Studios).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — LightningModule + Trainer + DDP/FSDP patterns, 2026 alt comparison |