---
id: wiki-2026-0508-kullback-leibler-divergence
title: Kullback-Leibler Divergence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [KL Divergence, Relative Entropy, KL-D]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [information-theory, divergence, ml, vae, rlhf]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: torch, scipy
---

# Kullback-Leibler Divergence

## In One Line

> **"The directed information loss between distributions."** The KL divergence $D_{\text{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$ is the expected number of extra bits (nats with the natural log) needed to encode samples from $P$ using a code optimized for the reference distribution $Q$. It was defined by Kullback & Leibler (1951), and in ML as of 2026 it is a core loss term in the VAE ELBO, RLHF (PPO/DPO), variational inference, and distillation.

## Core

### Definition

- discrete: $D_{\text{KL}}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
- continuous: $D_{\text{KL}}(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
- always $\ge 0$ (Gibbs inequality), $=0$ iff $P=Q$
- **NOT symmetric**, **NOT a metric** (no triangle inequality)
- $D_{\text{KL}}(P\|Q) = H(P, Q) - H(P)$: cross-entropy minus entropy

### Mode behavior

- **Forward $D_{\text{KL}}(P\|Q)$**: $Q$ must cover all the mass of $P$, because any region with $P(x) > 0$ and $Q(x) \approx 0$ is penalized heavily → "mode-covering"
- **Reverse $D_{\text{KL}}(Q\|P)$**: $Q$ is penalized for placing mass where $P$ has little, so it concentrates where $P$ has mass, often on a single mode → "mode-seeking"
- VAEs use reverse KL; EP (expectation propagation) uses forward KL

### Applications

1. **VAE ELBO**: $\mathbb{E}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z))$.
2. **RLHF PPO**: $\beta \cdot D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ penalty.
3. **Knowledge distillation**: $D_{\text{KL}}(p_T \| p_S)$ with temperature.
4. **Variational inference**: $\arg\min_q D_{\text{KL}}(q \| p)$.
5. **Mutual information**: $I(X;Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$.

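The properties listed under Definition and the mutual-information identity can be checked numerically. The following is a minimal sketch, assuming strictly positive discrete distributions; the helper names (`entropy`, `cross_entropy`, `kl`) and the example distributions are illustrative choices, not part of any library.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats (assumes every p[i] > 0)."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats (assumes every q[i] > 0)."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """D_KL(p || q) in nats (assumes every p[i] > 0 and q[i] > 0)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

forward = kl(p, q)   # D_KL(P || Q)
reverse = kl(q, p)   # D_KL(Q || P), generally a different number

# Identity: D_KL(P || Q) = H(P, Q) - H(P)
identity_gap = abs(forward - (cross_entropy(p, q) - entropy(p)))

# Mutual information as KL between the joint and the product of marginals
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (2, 1)
py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, 2)
mi = kl(joint.ravel(), (px * py).ravel())

print(f"forward={forward:.4f} reverse={reverse:.4f} gap={identity_gap:.2e} MI={mi:.4f}")
```

Both directions come out non-negative but unequal, which is the asymmetry noted above, and the mutual information is positive because the example joint is not the product of its marginals.
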
## 💻 Patterns

### Discrete KL

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """D_KL(p || q) in nats for discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # eps guards against log(0); terms with p = 0 contribute 0
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_div(p, q))  # ≈ 0.0253 nats
```

### PyTorch KL (numerically stable)

```python
import torch
import torch.nn.functional as F

# Placeholder logits of shape (batch, classes) so the snippet runs standalone
model_logits = torch.randn(8, 10)
target_logits = torch.randn(8, 10)

# F.kl_div(input, target) computes D_KL(target || input):
# `input` must be log-probs, `target` must be probs (unless log_target=True)
log_q_model = F.log_softmax(model_logits, dim=-1)
p_target = F.softmax(target_logits, dim=-1)
loss = F.kl_div(log_q_model, p_target, reduction="batchmean")
```

### KL between Gaussians (closed form)

```python
import torch

def kl_gaussian(mu1, var1, mu2, var2):
    """D_KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians, summed over dims."""
    return 0.5 * (
        torch.log(var2 / var1)
        + (var1 + (mu1 - mu2) ** 2) / var2
        - 1
    ).sum()

# VAE: q = N(mu, sigma^2) with log_var = log(sigma^2), prior N(0, 1)
def kl_to_standard_normal(mu, log_var):
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```

### Distillation loss with temperature

```python
import torch.nn.functional as F

def distill_kl(student_logits, teacher_logits, T=4.0):
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # D_KL(p_T || p_S); the T^2 factor keeps gradient scale comparable across temperatures
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```

### RLHF PPO KL penalty (per-token)

```python
def ppo_kl_penalty(logp_new, logp_ref, beta=0.05):
    # Single-sample per-token KL estimate: log pi(a|s) - log pi_ref(a|s),
    # valid for tokens sampled from the current policy pi.
    # The result is subtracted from the per-token reward (reward shaping).
    return beta * (logp_new - logp_ref)
```

### Forward vs reverse comparison

```python
# Approximate q (Gaussian) to bimodal p
# - reverse KL D(q||p): q picks one mode (mode-seeking)
# - forward KL D(p||q): q spans both modes (mode-covering, broader)
```

## Decision Criteria

| Need | Form |
|---|---|
| variational posterior fit | reverse $D_{\text{KL}}(q\|p)$ |
| spread (cover all modes) | forward $D_{\text{KL}}(p\|q)$ |
| symmetric, bounded | JS divergence |
| true metric | Hellinger (bounded), Wasserstein |
| RLHF stability | per-token reverse KL with $\beta$ schedule |

**Default**: depends on the problem. Use reverse KL for a VAE, forward KL for EP.

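The mode-covering versus mode-seeking distinction can be reproduced with a brute-force fit. This is a minimal sketch under stated assumptions: the target is a hypothetical equal-weight mixture of $\mathcal{N}(-4, 1)$ and $\mathcal{N}(4, 1)$, the approximation is a single Gaussian, integrals are approximated by a Riemann sum on a fixed grid, and the fit is a plain grid search over $(\mu, \sigma)$. The names `log_normal` and `fit` are illustrative.

```python
import numpy as np

# Integration grid (assumed wide enough to hold almost all mass of p and q)
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2), computed analytically to avoid log(0)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

# Bimodal target p: 0.5 * N(-4, 1) + 0.5 * N(4, 1), built in log space
log_p = np.logaddexp(log_normal(x, -4.0, 1.0), log_normal(x, 4.0, 1.0)) + np.log(0.5)
p = np.exp(log_p)

def fit(direction):
    """Grid-search the Gaussian q minimizing forward D(p||q) or reverse D(q||p)."""
    best = (np.inf, None, None)
    for mu in np.linspace(-6.0, 6.0, 49):          # step 0.25
        for sigma in np.linspace(0.5, 6.0, 56):    # step 0.1
            log_q = log_normal(x, mu, sigma)
            if direction == "forward":
                value = np.sum(p * (log_p - log_q)) * dx
            else:
                value = np.sum(np.exp(log_q) * (log_q - log_p)) * dx
            if value < best[0]:
                best = (value, mu, sigma)
    return best

kl_fwd, mu_fwd, sigma_fwd = fit("forward")
kl_rev, mu_rev, sigma_rev = fit("reverse")
print(f"forward: mu={mu_fwd:.2f} sigma={sigma_fwd:.2f} KL={kl_fwd:.3f}")
print(f"reverse: mu={mu_rev:.2f} sigma={sigma_rev:.2f} KL={kl_rev:.3f}")
```

Under these assumptions the forward fit should land near the moment-matched Gaussian (mean 0, standard deviation close to $\sqrt{17} \approx 4.1$), while the reverse fit should settle on one of the two modes with a standard deviation near 1. Which mode it picks is arbitrary because the target is symmetric.
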
## 🔗 Graph

- Parent: [[Information_Theory|Information-Theory]]
- Applications: [[VAE]] · [[RLHF]] · [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]] · [[Variational-Inference]]
- Adjacent: [[Cross-Entropy]] · [[Mutual-Information]]

## 🤖 LLM Usage

**When**: defining a distribution-level loss, anchoring a policy to a reference model in RLHF, distillation.

**When not**: when a distance metric is required. KL is not a metric; use Wasserstein instead.

## ❌ Anti-patterns

- **Assuming symmetry**: $D_{\text{KL}}(P\|Q) \ne D_{\text{KL}}(Q\|P)$.
- **Disjoint support**: if $Q(x)=0$ where $P(x)>0$, the divergence is $\infty$. Smooth the distributions or use JS.
- **Confusing the argument order of `F.kl_div`**: the first argument is log-probabilities, the second is the target distribution, and the result is $D_{\text{KL}}(\text{target} \| \text{input})$.
- **Ignoring the distillation temperature**: using sharp distributions without temperature $T$ gives a poor training signal.
- **KL collapse in RLHF**: if $\beta$ is too small, the policy drifts from the reference and reward hacking follows.

## 🧪 Verification / Duplicates

- Verified (Cover & Thomas 2006 textbook ch 2, MacKay 2003 ch 2, Kingma VAE 2013).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KL definition, mode behavior, VAE/RLHF/distillation patterns |