id: wiki-2026-0508-pytorch-lightning
title: PyTorch Lightning
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Lightning, pl, lightning.pytorch
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: pytorch, training, framework, distributed
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language=python, framework=pytorch-lightning

PyTorch Lightning

One-line summary

PyTorch Lightning removes PyTorch boilerplate by providing a research-style, structured trainer. It separates the LightningModule (model + optimizer + step logic) from the Trainer (loop + distributed execution + logging). As of 2026 it remains strong for research and classical deep learning, while HuggingFace Trainer, Accelerate, and TRL dominate LLM-era training.

Core concepts

LightningModule lifecycle

  • __init__: model + hparams.
  • forward(x): inference.
  • training_step(batch, idx) -> loss: per-batch train.
  • validation_step / test_step: eval.
  • configure_optimizers() -> optim | (optim, sched): opt + scheduler.
  • on_*_epoch_end hooks for aggregation.
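
The epoch-end hooks matter because a mean of per-batch metrics is not the same as a per-epoch metric when batches differ in size. The following is a framework-free sketch of the accumulate-then-compute pattern. `EpochAccuracy` and its method names are hypothetical, chosen only to show where each piece would sit: `update` in `validation_step`, `compute` and `reset` in `on_validation_epoch_end`.

```python
class EpochAccuracy:
    """Accumulates correct/total counts across batches (hypothetical helper)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        # Would be called once per batch, e.g. from validation_step.
        self.correct += sum(p == t for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        # Would be called once per epoch, e.g. from on_validation_epoch_end.
        return self.correct / self.total if self.total else 0.0

    def reset(self):
        # Clear state so the next epoch starts from zero.
        self.correct = 0
        self.total = 0


metric = EpochAccuracy()
metric.update([1, 0, 1, 1], [1, 0, 0, 1])  # batch of 4, 3 correct
metric.update([2], [2])                     # short final batch, 1 correct
epoch_acc = metric.compute()                # 4 correct out of 5 samples
mean_of_batch_means = (3 / 4 + 1 / 1) / 2   # overweights the short batch
```

Here the sample-weighted result and the naive mean of batch means disagree, which is the case the aggregation hooks are meant to handle.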

Trainer features

  • Multi-GPU (DDP, FSDP), TPU, and MPS support through the accelerator / strategy flags.
  • Mixed precision (precision="bf16-mixed").
  • Gradient accumulation, clipping built-in.
  • Callbacks (EarlyStopping, ModelCheckpoint, LR monitor).
  • Loggers (TensorBoard, WandB, MLflow, CSV).
  • fast_dev_run, overfit_batches, limit_*_batches for debug.
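
The debug flags in the last bullet can be collected into presets. This is a minimal sketch that only builds keyword arguments, so it runs without Lightning installed. The preset names and the `trainer_kwargs` helper are hypothetical; the flag names are the ones listed above.

```python
# Keyword-argument presets for Trainer(...); the preset names are hypothetical.
DEBUG_PRESETS = {
    # Push a single batch through the loops to catch crashes early.
    "smoke": {"fast_dev_run": True},
    # Train repeatedly on the same 10 batches; the loss should approach zero.
    "overfit": {"overfit_batches": 10, "max_epochs": 50},
    # Use 10% of each split for a quick but more realistic pass.
    "subset": {"limit_train_batches": 0.1, "limit_val_batches": 0.1, "max_epochs": 1},
}


def trainer_kwargs(preset, **overrides):
    """Return Trainer keyword arguments for a preset, with optional overrides."""
    kwargs = dict(DEBUG_PRESETS[preset])  # copy so the preset itself is not mutated
    kwargs.update(overrides)
    return kwargs

# Intended use (requires Lightning): L.Trainer(**trainer_kwargs("smoke"))
```

Keeping these as plain dictionaries makes it easy to switch a run between a smoke test and a full run without editing the Trainer call.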

Lightning vs alternatives (2026)

| Framework | Best for |
| --- | --- |
| Lightning | research, classical CV/NLP, structured projects |
| HF Trainer | HF ecosystem (transformers + datasets), LLM SFT |
| HF Accelerate | minimal wrapper, retains the raw PyTorch loop |
| TRL | RLHF / DPO / GRPO, LLM post-training |
| MosaicML Composer | streaming, throughput-optimized |
| raw PyTorch | full control, simple scripts |

Applications

  1. CV training (image classification, segmentation, detection).
  2. Tabular DL (TabNet, FT-Transformer).
  3. Audio / speech (wav2vec 2.0 fine-tuning).
  4. Mid-size LLM finetune (when not using HF Trainer).
  5. Self-supervised pretraining (SimCLR, MAE).

💻 Patterns

Minimal LightningModule

import lightning as L
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class LitClassifier(L.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(28*28, 256), nn.ReLU(), nn.Linear(256, 10),
        )
        self.loss = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        loss = self.loss(logits, y)
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(-1) == y).float().mean()
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
        return [opt], [sched]
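
The `T_max=10` in the scheduler is independent of the Trainer's `max_epochs`. As a rough worked example, assuming the scheduler is stepped once per epoch (the default interval for a scheduler returned this way) and using the closed-form cosine expression from the PyTorch documentation, the learning rate can be tabulated as below. `cosine_lr` is a hypothetical helper, and PyTorch's own recursive implementation may differ in floating-point detail.

```python
import math


def cosine_lr(epoch, base_lr=1e-3, t_max=10, eta_min=0.0):
    """Closed-form cosine annealing value after `epoch` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Tabulate 20 epochs to see what happens once training runs past t_max.
schedule = [cosine_lr(e) for e in range(21)]
```

Under this closed form the rate reaches `eta_min` at epoch 10 and then climbs back toward `base_lr` by epoch 20, so when a single monotonic decay is intended, `T_max` would normally be set to match the number of training epochs.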

Trainer with callbacks

from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor
from lightning.pytorch.loggers import WandbLogger

trainer = L.Trainer(
    max_epochs=20,
    accelerator="auto",        # cuda / mps / cpu
    devices="auto",
    precision="bf16-mixed",
    accumulate_grad_batches=4,
    gradient_clip_val=1.0,
    callbacks=[
        EarlyStopping(monitor="val_acc", mode="max", patience=3),
        ModelCheckpoint(monitor="val_acc", mode="max", save_top_k=2),
        LearningRateMonitor(),
    ],
    logger=WandbLogger(project="lit-mnist"),
)
# train_dl / val_dl: DataLoaders assumed to be defined elsewhere
trainer.fit(LitClassifier(), train_dl, val_dl)
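
To make the `patience=3` setting concrete, here is a simplified sketch of the patience rule: stop once the monitored metric has failed to improve for `patience` consecutive checks. `first_stop_epoch` is a hypothetical function, not Lightning's implementation, and it ignores options such as `min_delta`.

```python
def first_stop_epoch(scores, patience=3, mode="max"):
    """Return the index of the check that triggers stopping, or None."""
    best = None
    wait = 0
    for epoch, score in enumerate(scores):
        # Only a strict improvement over the best score resets the counter.
        improved = best is None or (score > best if mode == "max" else score < best)
        if improved:
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


val_acc_history = [0.80, 0.85, 0.84, 0.85, 0.83, 0.90]
stop_at = first_stop_epoch(val_acc_history, patience=3)
```

In this hypothetical history the run stops before the final, best value is ever reached, which is the usual trade-off to weigh when picking a small patience.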

Multi-GPU DDP

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",            # or "fsdp" for >7B params
    precision="bf16-mixed",
    sync_batchnorm=True,
)
# Launch with `python train.py`; Lightning starts the worker processes itself
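
Two pieces of arithmetic follow from the DDP setup above: how the dataset is split across ranks, and how many samples feed one optimizer step. The sketch assumes a distributed sampler is in use (Lightning's default under DDP, as far as this page assumes) and mirrors the non-shuffled, pad-by-wrap-around behaviour of PyTorch's `DistributedSampler`. Both helper functions are hypothetical.

```python
import math


def rank_indices(num_samples, world_size, rank):
    """Indices one rank would see without shuffling, padding by wrap-around."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    # Repeat early samples until every rank gets the same number of items.
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    return indices[rank:total:world_size]


def effective_batch_size(per_device_batch, devices, accumulate_grad_batches=1):
    """Samples contributing to one optimizer step across all ranks."""
    return per_device_batch * devices * accumulate_grad_batches


shards = [rank_indices(10, world_size=4, rank=r) for r in range(4)]
global_batch = effective_batch_size(64, devices=4, accumulate_grad_batches=4)
```

With 10 samples on 4 ranks, two samples are seen twice per epoch, and a per-device batch of 64 on 4 devices with 4 accumulation steps gives 1024 samples per optimizer step. The learning rate may need revisiting when either number changes.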

FSDP for large model

from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from functools import partial

# MyTransformerBlock: the model's transformer layer class (an nn.Module), defined elsewhere
policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={MyTransformerBlock})

trainer = L.Trainer(
    devices=8,
    strategy=FSDPStrategy(auto_wrap_policy=policy, cpu_offload=False),
    precision="bf16-mixed",
)

LightningDataModule

class MNISTDataModule(L.LightningDataModule):
    def __init__(self, batch_size=64):
        super().__init__()
        self.bs = batch_size

    def prepare_data(self):
        from torchvision.datasets import MNIST
        MNIST(".", train=True, download=True)

    def setup(self, stage=None):
        from torchvision.datasets import MNIST
        from torchvision import transforms
        t = transforms.ToTensor()
        self.train = MNIST(".", train=True, transform=t)
        self.val = MNIST(".", train=False, transform=t)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=self.bs, num_workers=4, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=self.bs, num_workers=4)

LightningCLI (config-driven)

# train.py
from lightning.pytorch.cli import LightningCLI

def main():
    LightningCLI(LitClassifier, MNISTDataModule)

if __name__ == "__main__":
    main()
# python train.py fit --config config.yaml --trainer.max_epochs=30
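
The `config.yaml` referenced in the command above is not shown on this page. The sketch below generates a plausible one, assuming the usual top-level `trainer` / `model` / `data` sections and that the loader accepts JSON-style YAML (JSON being a subset of YAML). All values are hypothetical, and `render_config` / `write_config` are illustrative helpers.

```python
import json
from pathlib import Path

# Hypothetical values; keys mirror the constructor arguments used on this page.
config = {
    "trainer": {"max_epochs": 20, "precision": "bf16-mixed"},
    "model": {"lr": 1e-3},          # LitClassifier(lr=...)
    "data": {"batch_size": 64},     # MNISTDataModule(batch_size=...)
}


def render_config(cfg):
    """Serialize the config as JSON text, which a YAML loader can also read."""
    return json.dumps(cfg, indent=2)


def write_config(cfg, path="config.yaml"):
    """Write the rendered config to disk and return the path written."""
    target = Path(path)
    target.write_text(render_config(cfg))
    return target
```

Generating the file from Python keeps the keys next to the classes they configure. Whether the command-line flag overrides the value in the file should be confirmed against the CLI documentation.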

Resume from checkpoint

trainer.fit(model, datamodule=datamodule, ckpt_path="lightning_logs/version_3/checkpoints/last.ckpt")
# or load model standalone
model = LitClassifier.load_from_checkpoint("path.ckpt")

Manual optimization (GAN, RL)

class LitGAN(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        ...

    def training_step(self, batch, idx):
        opt_g, opt_d = self.optimizers()  # same order as returned by configure_optimizers()
        # discriminator step
        opt_d.zero_grad()
        d_loss = ...
        self.manual_backward(d_loss)
        opt_d.step()
        # generator step
        opt_g.zero_grad()
        g_loss = ...
        self.manual_backward(g_loss)
        opt_g.step()

Decision criteria

| Situation | Approach |
| --- | --- |
| Research, multi-experiment, structured | Lightning |
| HF transformers SFT | HF Trainer (closer to the ecosystem) |
| Custom training loop, retain control | Accelerate |
| LLM RLHF / DPO / GRPO | TRL |
| Single-GPU script under 100 lines | raw PyTorch |
| Need callbacks + DDP quickly | Lightning |

Default: for research / non-HF training, use Lightning + bf16-mixed + DDP. For HF transformers jobs, use HF Trainer. For LLM post-training, use TRL.

🔗 Graph

🤖 Using LLMs

When to use: scaffolding a LightningModule from an architecture description, generating callback config, debugging DDP issues. When not to: deep performance tuning (FSDP wrap policy, custom strategy). Verify those with a profiler, since LLMs commonly produce outdated API calls.

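Anti-patterns

  • .cuda() inside a LightningModule: Lightning manages device placement. Use self.device or simply rely on the Trainer.
  • Manual DDP setup: Lightning handles the wrapping, so don't double-wrap.
  • Logging in DDP without sync_dist=True: each rank logs only its local value, so the recorded metric is not aggregated across ranks.
  • automatic_optimization=True for a GAN: automatic mode assumes a single loss/optimizer flow, and Lightning 2.x supports multiple optimizers only under manual optimization. Set automatic_optimization = False.
  • Pinning to old Lightning 1.x: the 2.x API changed (lightning.pytorch namespace), and 2.x+ is the standard in 2026.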
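
The `sync_dist` bullet can be illustrated with plain arithmetic. This is a toy sketch with hypothetical per-rank values and helper names: without synchronization the logger effectively records one rank's local number, while `sync_dist=True` reduces across ranks (mean is the default reduction for `self.log`).

```python
def logged_without_sync(per_rank_values):
    """What a rank-0-only logger records: just rank 0's local value."""
    return per_rank_values[0]


def logged_with_sync(per_rank_values):
    """Mean across ranks, i.e. the value after a cross-rank mean reduction."""
    return sum(per_rank_values) / len(per_rank_values)


# Hypothetical per-rank validation accuracy on 4 GPUs.
val_acc_by_rank = [0.5, 0.75, 1.0, 0.25]
unsynced = logged_without_sync(val_acc_by_rank)
synced = logged_with_sync(val_acc_by_rank)
```

The gap between the two numbers depends entirely on which shard rank 0 happened to receive, which is why unsynchronized validation metrics can look noisy between runs.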

🧪 Verification / duplicates

  • Verified (lightning.ai docs 2026, Lightning 2.x release notes, Falcon 2019 origin paper, Lightning Studios).
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: LightningModule + Trainer + DDP/FSDP patterns, 2026 alternatives comparison |