---
id: wiki-2026-0508-layer-normalization
title: Layer Normalization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [LayerNorm, LN, RMSNorm]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [deep-learning, transformer, normalization, llm-internals]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Layer Normalization

## One-liner

> **"Normalize each sample along its feature axis."** Unlike BatchNorm, it does not depend on the batch, which makes it the standard for RNNs and Transformers. As of 2026, most LLMs use **RMSNorm** in a **pre-norm** layout (LLaMA, Mistral, Qwen).

## Core

### Formulas

- **LayerNorm**: `y = γ * (x - μ) / sqrt(σ² + ε) + β`, where μ and σ² are computed over the **last axis**.
- **RMSNorm**: skips the mean subtraction. `y = γ * x / sqrt(mean(x²) + ε)`. Roughly 7-15% faster with nearly equivalent quality.
- γ and β are learnable.

### vs BatchNorm

| | LayerNorm | BatchNorm |
|---|---|---|
| Normalization axis | feature (per sample) | batch (per feature) |
| Works with batch size 1 | Yes | Statistics are meaningless |
| Train/eval difference | None | Eval uses running mean/variance |
| Variable sequence length | Yes | Not well suited |
| GPU memory | Low | Sometimes lower |

### Pre-norm vs Post-norm

- **Post-norm** (original Transformer): `LN(x + Sublayer(x))`. Training becomes unstable as depth grows.
- **Pre-norm** (GPT-2 onward): `x + Sublayer(LN(x))`. More stable gradient flow; needs less warmup.
- 2026: nearly every large LLM uses pre-norm + RMSNorm.

### Variants

- **GroupNorm**: splits the features into G groups and normalizes each group. Used in CNNs and diffusion models.
- **InstanceNorm**: per sample, per channel. Used in style transfer.
- **DeepNorm** (Microsoft): makes post-norm trainable at 1000 layers.
- **ScaleNorm**: a single scalar γ, with β removed.
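
The relationship between the variants listed above can be checked numerically. The sketch below is illustrative only: the tensor shape, group counts, and variable names are assumptions chosen for the example. It uses `nn.GroupNorm` with one group (which behaves like LayerNorm over all of `C, H, W`) and with one group per channel (which behaves like InstanceNorm), both without affine parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 8, 16, 16)  # assumed shape: (batch, channels, height, width)

# One group per channel: statistics per sample, per channel (InstanceNorm-like).
gn_per_channel = nn.GroupNorm(num_groups=8, num_channels=8, affine=False)
instance_norm = nn.InstanceNorm2d(8)  # affine=False by default

# A single group: statistics per sample over all of (C, H, W) (LayerNorm-like).
gn_single_group = nn.GroupNorm(num_groups=1, num_channels=8, affine=False)
layer_norm = nn.LayerNorm([8, 16, 16], elementwise_affine=False)

same_as_instance = torch.allclose(gn_per_channel(x), instance_norm(x), atol=1e-4)
same_as_layer = torch.allclose(gn_single_group(x), layer_norm(x), atol=1e-4)

# None of these depend on the batch: normalizing one sample alone
# gives the same result as normalizing it inside a larger batch.
gn = nn.GroupNorm(num_groups=2, num_channels=8)
batch_independent = torch.allclose(gn(x)[:1], gn(x[:1]), atol=1e-5)
```

All three layers are per-sample normalizations; they differ only in how the features are grouped before the mean and variance are taken.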
## 💻 Patterns

### PyTorch LayerNorm

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(normalized_shape=768)  # normalizes over the last dimension
x = torch.randn(2, 16, 768)              # x: (B, T, 768)
y = ln(x)
```

### Hand-written RMSNorm

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # γ only; RMSNorm has no β
        self.eps = eps

    def forward(self, x):
        # 1 / sqrt(mean(x²) + ε), computed over the last axis
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * inv_rms
```

### Pre-norm Transformer block

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)  # normalize once, reuse as query/key/value
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```

### Fused LN (Apex/Triton)

```python
# NVIDIA Apex (installed separately); torch.compile can also fuse LayerNorm automatically
from apex.normalization import FusedLayerNorm
ln = FusedLayerNorm(768)  # ~20% faster on A100
```

### Manual LN (for understanding)

```python
def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdim=True)
    var = x.var(-1, unbiased=False, keepdim=True)  # biased variance, as in nn.LayerNorm
    return gamma * (x - mu) / (var + eps).sqrt() + beta
```

### LayerNorm vs RMSNorm benchmark memo

```python
# A100, hidden=4096, seq=2048
# LayerNorm: 0.42ms
# RMSNorm:   0.35ms (-17%)
# Quality difference: < 0.1pp on perplexity
```

## Decision criteria

| Model | Normalization |
|---|---|
| Standard Transformer/BERT | LayerNorm, post-norm or pre-norm |
| Large LLM (decoder-only) | RMSNorm + pre-norm |
| CNN | BatchNorm or GroupNorm |
| Diffusion U-Net | GroupNorm |
| Style transfer | InstanceNorm |
| 1000+ layers | DeepNorm |

**Default**: when building an LLM, use RMSNorm + pre-norm.
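
The decision table above could be encoded as a small factory. This is only a sketch: `make_norm`, its `family` keys, and the default `groups=8` are hypothetical names and values invented for illustration, and DeepNorm is left out because it changes the residual scaling rather than swapping in a single layer. The RMSNorm class is repeated so the block runs on its own.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Minimal RMSNorm: scale by 1 / sqrt(mean(x^2) + eps), gain only."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * inv_rms


def make_norm(family: str, dim: int, groups: int = 8) -> nn.Module:
    """Hypothetical helper: pick a normalization layer by model family."""
    if family == "llm":          # large decoder-only LLM
        return RMSNorm(dim)
    if family == "bert":         # standard Transformer / BERT
        return nn.LayerNorm(dim)
    if family == "cnn":
        return nn.BatchNorm2d(dim)
    if family == "diffusion":    # diffusion U-Net
        return nn.GroupNorm(groups, dim)
    if family == "style":        # style transfer
        return nn.InstanceNorm2d(dim)
    raise ValueError(f"unknown model family: {family!r}")


tokens = torch.randn(2, 5, 64)     # (batch, seq, hidden)
images = torch.randn(2, 64, 8, 8)  # (batch, channels, height, width)
llm_out = make_norm("llm", 64)(tokens)
unet_out = make_norm("diffusion", 64)(images)
```

Token-shaped inputs go through the last-axis layers (`"llm"`, `"bert"`), while image-shaped inputs go through the channel-based ones; mixing the two would normalize over the wrong axis.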
## 🔗 Graph

- Parent: [[Normalization]], [[Transformer]]
- Variants: [[RMSNorm]], [[GroupNorm]], [[BatchNorm]], [[DeepNorm]]
- Applications: [[BERT]], [[GPT]], [[LLaMA]]
- Adjacent: [[Residual-Connection]], [[Weight-Initialization]], [[Gradient-Stability]]

## 🤖 LLM usage

**When to use**: Transformer-style models, RNNs, variable-length sequences, batch=1 inference.
**When not to use**: large-batch CNNs, where BatchNorm is typically faster and more accurate.

## ❌ Anti-patterns

- **Post-norm in a deep Transformer**: training diverges. Use pre-norm.
- **eps too small (1e-12)**: underflows in fp16. 1e-5 to 1e-6 is recommended.
- **Learning β in RMSNorm**: the RMSNorm definition has no β, only γ.
- **Normalizing over an axis other than the last**: LayerNorm conventionally normalizes the last dimension.

## 🧪 Verification / duplicates

- Ba et al. 2016 (LayerNorm), Zhang & Sennrich 2019 (RMSNorm).
- LLaMA paper (Touvron 2023), Xiong et al. 2020 (pre-norm analysis).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: reflected RMSNorm/pre-norm standardization, added BatchNorm comparison table |