2nd/10_Wiki/Topics/Computer_Science_and_Theory/Kullback-Leibler-Divergence.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
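Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections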

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-kullback-leibler-divergence
title: Kullback-Leibler Divergence
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: KL Divergence, Relative Entropy, KL-D
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: information-theory, divergence, ml, vae, rlhf
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python; framework: torch, scipy

Kullback-Leibler Divergence

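One-line summary

"The directed information loss between two distributions." The KL divergence $D_{\text{KL}}(P \| Q) = \mathbb{E}_P[\log (P/Q)]$ is the expected number of extra bits needed to encode samples from $P$ with a code optimized for the reference distribution $Q$ (bits for $\log_2$, nats for the natural log). It was defined by Kullback & Leibler (1951), and in ML as of 2026 it is a core loss term in the VAE ELBO, RLHF (PPO/DPO), variational inference, and distillation.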

Core

Definition

  • discrete: $D_{\text{KL}}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
  • continuous: $D_{\text{KL}}(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$
  • always $\ge 0$ (Gibbs' inequality), with equality iff $P = Q$ (almost everywhere)
  • NOT symmetric, NOT a metric (no triangle inequality)
  • $D_{\text{KL}}(P\|Q) = H(P, Q) - H(P)$, i.e. cross-entropy minus entropy
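The properties listed above can be checked numerically on a small example. The following is a minimal sketch in NumPy, assuming strictly positive discrete distributions; the helper names `kl`, `entropy`, and `cross_entropy` are illustrative and not part of any library.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats, assuming all entries of p and q are > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    """Shannon entropy H(P) in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl(p, q), kl(q, p))                 # two different non-negative values (asymmetry)
print(kl(p, p))                           # 0.0
print(cross_entropy(p, q) - entropy(p))   # agrees with kl(p, q) up to rounding
```

The two directions differ only slightly here because `p` and `q` are close; the gap grows as the distributions move apart.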

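Mode behavior

  • Forward $D_{\text{KL}}(P\|Q)$: Q must cover all of P's mass, since Q(x) ≈ 0 where P(x) > 0 is penalized heavily → "mode-covering"
  • Reverse $D_{\text{KL}}(Q\|P)$: Q places mass only where P has mass, so it tends to lock onto a single mode → "mode-seeking"
  • VAEs (variational inference) minimize the reverse KL; expectation propagation (EP) minimizes the forward KL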

Applications

  1. VAE ELBO: $\mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z))$.
  2. RLHF PPO: $\beta \cdot D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ penalty.
  3. Knowledge distillation: $D_{\text{KL}}(p_T \| p_S)$ with temperature.
  4. Variational inference: $\arg\min_q D_{\text{KL}}(q \| p)$.
  5. Mutual information: $I(X;Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$.
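As one concrete instance of the list above, the mutual-information identity can be sketched directly from a joint probability table. This is an illustrative NumPy example; the function name `mutual_information` and the two toy joint tables are assumptions made for the demonstration.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) = D_KL(p(x,y) || p(x)p(y)) in nats for a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, m)
    indep = px * py                         # product of marginals
    mask = joint > 0                        # 0 * log(0) is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / indep[mask])))

independent = np.outer([0.5, 0.5], [0.3, 0.7])   # X and Y independent
correlated = np.array([[0.5, 0.0], [0.0, 0.5]])  # Y is a copy of X

print(mutual_information(independent))  # approximately 0
print(mutual_information(correlated))   # approximately 0.693 (log 2)
```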

💻 Patterns

Discrete KL

import numpy as np

def kl_div(p, q, eps=1e-12):
    """D_KL(p || q) in nats; eps guards against log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_div(p, q))  # about 0.0253 nats

PyTorch KL (numerically stable)

import torch
import torch.nn.functional as F

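# F.kl_div(input, target): input MUST be log-probs; target is probs (default log_target=False).
# The result is D_KL(target || input), so the target distribution comes first.
# model_logits / target_logits: tensors of shape (batch, classes) defined elsewhere.
log_q = F.log_softmax(model_logits, dim=-1)  # model (approximating) distribution
p = F.softmax(target_logits, dim=-1)         # target distribution
loss = F.kl_div(log_q, p, reduction="batchmean")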

KL between Gaussians (closed form)

# D_KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians, summed over all elements
def kl_gaussian(mu1, var1, mu2, var2):
    return 0.5 * (
        torch.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1
    ).sum()

# VAE: q ~ N(mu, sigma^2), prior N(0, 1); log_var = log(sigma^2)
def kl_to_standard_normal(mu, log_var):
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
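One way to sanity-check the closed form is to compare it against a Monte Carlo estimate of $\mathbb{E}_{q}[\log q(x) - \log p(x)]$. The sketch below uses NumPy instead of torch so that it runs without a deep-learning dependency; the 1-D setting, the sample size, and the helper names `kl_gaussian_1d` and `log_normal_pdf` are assumptions made for illustration.

```python
import numpy as np

def kl_gaussian_1d(mu1, var1, mu2, var2):
    """Closed-form D_KL(N(mu1, var1) || N(mu2, var2)) for scalars."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def log_normal_pdf(x, mu, var):
    """Log-density of N(mu, var) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

rng = np.random.default_rng(0)
mu1, var1 = 1.0, 0.5   # q
mu2, var2 = 0.0, 1.0   # p: standard normal prior

# Monte Carlo: draw from q and average the log-density ratio.
x = rng.normal(mu1, np.sqrt(var1), size=200_000)
mc_estimate = float(np.mean(log_normal_pdf(x, mu1, var1) - log_normal_pdf(x, mu2, var2)))

closed_form = float(kl_gaussian_1d(mu1, var1, mu2, var2))
# Same quantity via the VAE shortcut formula with log_var = log(var1).
vae_form = float(-0.5 * (1 + np.log(var1) - mu1**2 - var1))

print(closed_form, vae_form, mc_estimate)  # first two are equal; the third is close
```

With a standard normal as the second argument, the general formula reduces exactly to the VAE expression, which is why the first two printed values agree.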

Distillation loss with temperature

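def distill_kl(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with temperature T; computes D_KL(p_T || p_S).
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    # T*T compensates for the 1/T^2 scaling of the soft-target gradients.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T*T)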
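To see why the temperature matters, it can help to look at what dividing logits by `T` does to the teacher distribution. This is a small NumPy sketch; the `softmax` helper and the teacher logits are hypothetical values chosen for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-np.sum(p * np.log(p)))

teacher_logits = [8.0, 2.0, 1.0, 0.0]   # hypothetical teacher output

sharp = softmax(teacher_logits, T=1.0)  # nearly one-hot
soft = softmax(teacher_logits, T=4.0)   # relative ranking of the other classes is visible

print(sharp.round(4))
print(soft.round(4))
print(entropy(sharp), entropy(soft))    # entropy rises with temperature
```

At `T=1` almost all mass sits on the top class, so the student learns little beyond the hard label; at `T=4` the non-top classes carry usable signal.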

RLHF PPO KL penalty (per-token)

def ppo_kl_penalty(logp_new, logp_ref, beta=0.05):
    # Per-token log-ratio for tokens sampled from the new policy; its mean
    # estimates the reverse KL D_KL(pi || pi_ref).
    return beta * (logp_new - logp_ref)  # subtract from the per-token reward
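The log-ratio penalty can be related back to the exact KL with a toy categorical policy. The sketch below is an illustration under stated assumptions: the two 3-token distributions, the zero `task_reward`, and the sample size are all made up for the example, and NumPy stands in for torch.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_new = np.array([0.7, 0.2, 0.1])   # hypothetical current policy over 3 tokens
pi_ref = np.array([0.4, 0.4, 0.2])   # hypothetical reference policy

# Exact reverse KL D_KL(pi_new || pi_ref).
exact_kl = float(np.sum(pi_new * np.log(pi_new / pi_ref)))

# Sample tokens from the NEW policy and take the per-token log-ratio.
tokens = rng.choice(3, size=100_000, p=pi_new)
log_ratio = np.log(pi_new[tokens]) - np.log(pi_ref[tokens])
mc_kl = float(log_ratio.mean())

# Reward shaping: the penalty is subtracted from the per-token reward.
beta = 0.05
task_reward = np.zeros_like(log_ratio)        # placeholder reward signal
shaped_reward = task_reward - beta * log_ratio

print(exact_kl, mc_kl)       # the sample mean approximates the exact KL
print(log_ratio.min())       # individual terms can be negative
```

Individual log-ratios are negative whenever the sampled token is less likely under the new policy than under the reference, so only the average behaves like a KL.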

Forward vs reverse comparison

# Fit a single Gaussian q to a bimodal target p
# - reverse KL D(q||p): q picks one mode (mode-seeking)
# - forward KL D(p||q): q spans both modes (mode-covering, broader)
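The behavior described in the comments above can be reproduced with a brute-force fit. The following sketch is one possible illustration, not a recommended fitting method: it assumes a 1-D target made of two equal Gaussians at ±3, integrates on a fixed grid, and searches a coarse grid of means and standard deviations. All names (`log_normal`, `fit`) are hypothetical.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 4801)   # integration grid
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Bimodal target p: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2), kept in log space.
log_p = np.logaddexp(log_normal(x, -3.0, 0.5), log_normal(x, 3.0, 0.5)) + np.log(0.5)
p = np.exp(log_p)

def fit(direction):
    """Grid-search the Gaussian q minimizing forward D(p||q) or reverse D(q||p)."""
    best = (np.inf, None, None)
    for mu in np.linspace(-4.0, 4.0, 17):
        for sigma in np.linspace(0.25, 4.0, 16):
            log_q = log_normal(x, mu, sigma)
            if direction == "forward":
                kl = np.sum(p * (log_p - log_q)) * dx
            else:
                kl = np.sum(np.exp(log_q) * (log_q - log_p)) * dx
            if kl < best[0]:
                best = (float(kl), float(mu), float(sigma))
    return best

fwd = fit("forward")
rev = fit("reverse")
print("forward KL fit (kl, mu, sigma):", fwd)  # centered between the modes, wide
print("reverse KL fit (kl, mu, sigma):", rev)  # sits on one mode, narrow
```

Working with log-densities avoids `0 * log(0)` problems where one density underflows on the grid. By symmetry the reverse fit may land on either mode.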

Decision criteria

| Need | Form |
|---|---|
| Variational posterior fit | reverse $D_{\text{KL}}(q \Vert p)$ |
| Spread (cover all modes) | forward $D_{\text{KL}}(p \Vert q)$ |
| Symmetric | JS divergence |
| True metric | Wasserstein, Hellinger (Hellinger is also bounded) |
| RLHF stability | per-token reverse KL with a $\beta$ schedule |

Default: depends on the problem. Use reverse KL for a VAE, forward KL for EP.

🔗 Graph

🤖 LLM usage

When to use: defining distribution-level losses, anchoring to the reference model in RLHF, distillation. When not to use: when a true distance metric is required. KL is not a metric; use Wasserstein instead.

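Anti-patterns

  • Assuming symmetry: $D_{\text{KL}}(P\|Q) \ne D_{\text{KL}}(Q\|P)$ in general.
  • Disjoint support: if $Q(x)=0$ where $P(x)>0$, the divergence is $\infty$. Smooth Q or use JS.
  • Confusing the argument order of F.kl_div: the first argument is the model's log-probabilities, the second is the target.
  • Ignoring the distillation temperature: using sharp distributions without temperature T → poor training signal.
  • KL collapse in RLHF: if $\beta$ is too small, the policy drifts from the reference and reward hacking follows.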
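The disjoint-support pitfall in the list above is easy to reproduce. The sketch below shows the infinite KL, the bounded Jensen-Shannon alternative, and a simple smoothing workaround; the helper names and the mixing weight `eps` are illustrative assumptions rather than a prescribed recipe.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats; returns inf when Q is 0 somewhere P is positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                   # 0 * log(0) is treated as 0
    with np.errstate(divide="ignore"):             # silence the log(0) warning
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def smooth(q, eps=1e-3):
    """Mix q with the uniform distribution so that every entry is positive."""
    q = np.asarray(q, dtype=float)
    return (1.0 - eps) * q + eps / q.size

p = [1.0, 0.0]
q = [0.0, 1.0]   # support disjoint from p

print(kl(p, q))           # inf
print(js(p, q))           # approximately 0.693 (log 2), the maximum
print(kl(p, smooth(q)))   # finite but large
```

Smoothing keeps the KL finite, but the value then depends strongly on `eps`, so it is worth reporting the choice alongside the result.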

🧪 Verification / Duplicates

  • Verified (Cover & Thomas 2006 textbook ch 2, MacKay 2003 ch 2, Kingma VAE 2013).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: KL definition, mode behavior, VAE/RLHF/distillation patterns |