---
id: wiki-2026-0508-pytorch-lightning
title: PyTorch Lightning
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Lightning, pl, lightning.pytorch]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [pytorch, training, framework, distributed]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-lightning
---

# PyTorch Lightning

## One-liner

> **"Eliminates PyTorch boilerplate: a research-style structured trainer."** Lightning separates the LightningModule (model + optimizer + step logic) from the Trainer (loop + distributed + logging). As of 2026 it remains strong for research and classical DL, while HuggingFace Trainer / Accelerate / TRL dominate LLM-era workloads.

## Key points

### LightningModule lifecycle

- `__init__`: model + hparams.
- `forward(x)`: inference.
- `training_step(batch, idx) -> loss`: per-batch training.
- `validation_step` / `test_step`: evaluation.
- `configure_optimizers()`: returns an optimizer, two lists `([optim], [sched])`, or a dict with `optimizer` and `lr_scheduler` keys.
- `on_*_epoch_end` hooks (e.g. `on_validation_epoch_end`) for aggregation.

### Trainer features

- Multi-GPU (DDP, FSDP), TPU, and MPS handled automatically.
- Mixed precision (`precision="bf16-mixed"`).
- Gradient accumulation and clipping built in.
- Callbacks (EarlyStopping, ModelCheckpoint, LR monitor).
- Loggers (TensorBoard, WandB, MLflow, CSV).
- `fast_dev_run`, `overfit_batches`, `limit_*_batches` for debugging.

### vs alternatives (2026)

| Framework | Best for |
|---|---|
| Lightning | research, classical CV/NLP, structured projects |
| HF Trainer | HF ecosystem (transformers + datasets), LLM SFT |
| HF Accelerate | minimal wrapper, retain raw PyTorch loop |
| TRL | RLHF / DPO / GRPO, LLM post-training |
| MosaicML Composer | streaming, throughput-optimized |
| raw PyTorch | full control, simple scripts |

### Applications

1. CV training (image classification, segmentation, detection).
2. Tabular DL (TabNet, FT-Transformer).
3. Audio / speech (W2V2 fine-tuning).
4. Mid-size LLM fine-tuning (when not using HF Trainer).
5. Self-supervised pretraining (SimCLR, MAE).

## 💻 Patterns

### Minimal LightningModule

```python
import lightning as L
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class LitClassifier(L.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()  # exposes lr as self.hparams.lr
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(-1) == y).float().mean()
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
        # The scheduler is stepped once per epoch by default, so T_max is in epochs.
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
        return [opt], [sched]
```

### Trainer with callbacks

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor
from lightning.pytorch.loggers import WandbLogger

trainer = L.Trainer(
    max_epochs=20,
    accelerator="auto",  # cuda / mps / cpu
    devices="auto",
    precision="bf16-mixed",
    accumulate_grad_batches=4,
    gradient_clip_val=1.0,
    callbacks=[
        EarlyStopping(monitor="val_acc", mode="max", patience=3),
        ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=2),
        LearningRateMonitor(),
    ],
    logger=WandbLogger(project="lit-mnist"),
)
# train_dl / val_dl are DataLoaders defined elsewhere.
trainer.fit(LitClassifier(), train_dl, val_dl)
```

### Multi-GPU DDP

```python
trainer = L.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",  # or "fsdp" for >7B params
    precision="bf16-mixed",
    sync_batchnorm=True,
)
# Launch with `python train.py`; Lightning starts the worker processes itself.
```

### FSDP for large model

```python
from functools import partial

import lightning as L
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# MyTransformerBlock is a placeholder for your model's transformer layer class.
policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={MyTransformerBlock})

trainer = L.Trainer(
    devices=8,
    strategy=FSDPStrategy(auto_wrap_policy=policy, cpu_offload=False),
    precision="bf16-mixed",
)
```

### LightningDataModule

```python
import lightning as L
from torch.utils.data import DataLoader


class MNISTDataModule(L.LightningDataModule):
    def __init__(self, batch_size=64):
        super().__init__()
        self.bs = batch_size

    def prepare_data(self):
        # Download only; runs once rather than on every process.
        from torchvision.datasets import MNIST
        MNIST(".", train=True, download=True)

    def setup(self, stage=None):
        # Runs on every process; build the datasets here.
        from torchvision.datasets import MNIST
        from torchvision import transforms
        t = transforms.ToTensor()
        self.train = MNIST(".", train=True, transform=t)
        self.val = MNIST(".", train=False, transform=t)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=self.bs, num_workers=4, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=self.bs, num_workers=4)
```

### LightningCLI (config-driven)

```python
# train.py
from lightning.pytorch.cli import LightningCLI

# LitClassifier and MNISTDataModule are the classes defined above.


def main():
    LightningCLI(LitClassifier, MNISTDataModule)


if __name__ == "__main__":
    main()

# Run: python train.py fit --config config.yaml --trainer.max_epochs=30
```

### Resume from checkpoint

```python
# Resume training (restores model, optimizer, and loop state)
trainer.fit(model, datamodule=datamodule, ckpt_path="lightning_logs/version_3/checkpoints/last.ckpt")

# or load the model standalone
model = LitClassifier.load_from_checkpoint("path.ckpt")
```

### Manual optimization (GAN, RL)

```python
class LitGAN(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # take over the optimizer loop
        ...

    def training_step(self, batch, idx):
        # Order follows what configure_optimizers() returns.
        opt_g, opt_d = self.optimizers()

        # discriminator step
        opt_d.zero_grad()
        d_loss = ...
        self.manual_backward(d_loss)
        opt_d.step()

        # generator step
        opt_g.zero_grad()
        g_loss = ...
        self.manual_backward(g_loss)
        opt_g.step()
```

## Decision criteria

| Situation | Approach |
|---|---|
| Research, multi-experiment, structured | Lightning |
| HF transformers SFT | HF Trainer (closer to ecosystem) |
| Custom training loop, retain control | Accelerate |
| LLM RLHF / DPO / GRPO | TRL |
| Single-GPU script <100 lines | raw PyTorch |
| Need callbacks + DDP fast | Lightning |

**Default**: for research / non-HF training, use Lightning + bf16-mixed + DDP. For HF transformers jobs, use HF Trainer. For LLM post-training, use TRL.

## 🔗 Graph

- Applications: [[Distributed-Training]]

## 🤖 LLM usage

**When**: scaffolding a LightningModule from an architecture description, generating callback config, debugging DDP issues.

**When not**: deep performance tuning (FSDP wrap policy, custom strategy). Verify with a profiler, since LLMs commonly suggest outdated APIs.

## ❌ Anti-patterns

- **`.cuda()` inside LightningModule**: Lightning manages device placement; use `self.device` or rely on the Trainer.
- **Manual DDP setup**: Lightning handles it; don't double-wrap.
- **Logging in DDP without `sync_dist=True`**: the logged value reflects rank 0 only, so metrics are not aggregated across ranks.
- **`automatic_optimization=True` for GAN**: alternating generator/discriminator updates need manual mode; Lightning 2.x supports multiple optimizers only with `automatic_optimization = False`.
- **Pinning to old Lightning 1.x**: the 2.x API changed (`lightning.pytorch` namespace), and 2.x+ is the standard in 2026.

## 🧪 Verification / duplicates

- Verified (lightning.ai docs 2026, Lightning 2.x release notes, Falcon 2019 origin paper, Lightning Studios).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: LightningModule + Trainer + DDP/FSDP patterns, 2026 alt comparison |
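
The Trainer settings used in the patterns above interact. Under DDP each process draws its own batches, so the number of samples behind one optimizer step is the per-device batch size multiplied by the device count, the node count, and `accumulate_grad_batches`. A minimal sketch of that arithmetic follows; `effective_batch_size` is a hypothetical helper (not a Lightning API), and the numbers are the ones used in this note's examples.

```python
def effective_batch_size(
    per_device_batch: int,
    devices: int = 1,
    num_nodes: int = 1,
    accumulate_grad_batches: int = 1,
) -> int:
    """Samples contributing to a single optimizer step (hypothetical helper)."""
    factors = (per_device_batch, devices, num_nodes, accumulate_grad_batches)
    if any(f < 1 for f in factors):
        raise ValueError("all arguments must be positive integers")
    return per_device_batch * devices * num_nodes * accumulate_grad_batches


# DataLoader batch_size=64 with accumulate_grad_batches=4 on one device
single_gpu = effective_batch_size(64, accumulate_grad_batches=4)

# Same settings under strategy="ddp" with devices=4
ddp_4gpu = effective_batch_size(64, devices=4, accumulate_grad_batches=4)

print(single_gpu, ddp_4gpu)
```

Moving from one GPU to four quadruples the effective batch even though the DataLoader's `batch_size` is unchanged, which is one reason the learning rate is often revisited when scaling out.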