---
id: wiki-2026-0508-cross-entropy-loss
title: Cross-Entropy Loss
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [cross-entropy, NLL, log loss, focal loss, label smoothing, KL divergence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [loss-function, cross-entropy, classification, deep-learning, focal-loss, label-smoothing, llm-pretraining]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / TensorFlow / JAX
---

# Cross-Entropy Loss

## One-line summary

> **"The distance between a prediction and the truth."** Built on entropy and KL divergence. It is the standard loss for classification and the same loss used in LLM pretraining (next-token prediction). Modern variants: focal loss, label smoothing, soft targets.

## Core concepts

### Formula

$$H(p, q) = -\sum_x p(x) \log q(x)$$

- p = the ground-truth distribution (one-hot for classification).
- q = the model's predicted distribution.

### Binary case

$$L = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$

### Multi-class

$$L = -\sum_c y_c \log \hat{y}_c$$

→ With one-hot targets this reduces to the negative log-likelihood (NLL) of the true class.

### vs MSE

- MSE with a sigmoid output: the gradient carries a $\sigma'(z)$ factor, so it vanishes when the unit saturates.
- Cross-entropy with a sigmoid output: the gradient with respect to the logit is $\hat{y} - y$, linear in the error.
- This is why cross-entropy is the standard for classification.

### Information theory connection

- H(p) = the entropy of p.
- KL(p || q) = H(p, q) - H(p).
- Minimizing cross-entropy over q ≡ minimizing KL (with p fixed, H(p) is a constant).

### Variants

#### Focal Loss (Lin et al., 2017)

- Focuses training on hard examples under class imbalance.
- Weights each example's cross-entropy by $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class.

#### Label Smoothing

- Turns the one-hot target into a soft one (e.g., 0.9 / 0.025 / ...).
- Mitigates over-confidence.
- Tends to improve calibration.

#### Class weight

- Per-class weights for imbalanced data.
- Up-weights the minority classes.

#### Soft target (knowledge distillation)

- The teacher's output distribution serves as the student's target.

#### Soft cross-entropy

- The target p is a full distribution rather than a one-hot vector (e.g., entropy regularization of LLM token prediction).

### Numerical stability

- log_softmax + NLL is more stable than softmax followed by log.
- PyTorch `F.cross_entropy` takes raw logits and integer labels and fuses both steps.

### LLM pretraining

- Standard objective: next-token cross-entropy.
- Perplexity = exp(loss), where loss is the mean per-token cross-entropy in nats.
- Every token position gets its own distribution over the vocabulary.

## 💻 Patterns

### Binary CE (PyTorch)

```python
import torch
import torch.nn.functional as F

# Logits (raw scores) + labels (0 / 1)
logits = torch.randn(32, 1)  # batch of 32, binary
labels = torch.randint(0, 2, (32, 1)).float()

loss = F.binary_cross_entropy_with_logits(logits, labels)  # numerically stable
```

### Multi-class CE

```python
# Logits (B, C) + integer labels (B,)
logits = torch.randn(32, 10)  # 10 classes
labels = torch.randint(0, 10, (32,))

loss = F.cross_entropy(logits, labels)
# Internally: log_softmax + NLL
```

### With class weight (imbalanced)

```python
logits, labels = torch.randn(32, 3), torch.randint(0, 3, (32,))  # 3 classes
class_weights = torch.tensor([1.0, 5.0, 2.0])  # length must equal C; minority class weighted 5x
loss = F.cross_entropy(logits, labels, weight=class_weights)
```

### Focal loss

```python
def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    # Per-example CE, kept unreduced so each example can be re-weighted
    ce_loss = F.cross_entropy(logits, labels, reduction='none')
    pt = torch.exp(-ce_loss)  # predicted probability of the correct class
    # alpha is a single scalar here, so it only rescales the overall loss
    focal = alpha * (1 - pt) ** gamma * ce_loss
    return focal.mean()
```

### Label smoothing

```python
# PyTorch 1.10+
loss = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Manual version (the number of classes is taken from logits.size(-1))
def cross_entropy_smooth(logits, labels, smoothing=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    nll_loss = -log_probs.gather(dim=-1, index=labels.unsqueeze(1)).squeeze(1)
    smooth_loss = -log_probs.mean(dim=-1)  # CE against the uniform distribution
    return ((1 - smoothing) * nll_loss + smoothing * smooth_loss).mean()
```

### LLM pretraining (next-token)

```python
def next_token_loss(model, input_ids):
    # Assumes a model whose output exposes `.logits` (Hugging Face style)
    logits = model(input_ids).logits  # (B, T, V)
    # Shift: position t predicts token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return loss

# Perplexity (model and input_ids are assumed to be defined by the caller)
loss = next_token_loss(model, input_ids)
perplexity = torch.exp(loss).item()
```

### Knowledge distillation (soft target)

```python
def distillation_loss(student_logits, teacher_logits, labels, T=4, alpha=0.7):
    # Soft target (KL divergence with temperature T)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * T * T  # T^2 keeps the soft-target gradients on the same scale
    # Hard target (regular CE)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

### Calibration check

```python
def expected_calibration_error(logits, labels, n_bins=10):
    probs = F.softmax(logits, dim=-1)
    confidences, predictions = probs.max(-1)
    accuracies = (predictions == labels).float()

    bin_boundaries = torch.linspace(0, 1, n_bins + 1)
    # Start from a tensor so .item() works even if every bin is empty
    ece = torch.zeros((), device=logits.device)
    for i in range(n_bins):
        in_bin = (confidences > bin_boundaries[i]) & (confidences <= bin_boundaries[i + 1])
        if in_bin.sum() > 0:
            avg_conf = confidences[in_bin].mean()
            avg_acc = accuracies[in_bin].mean()
            # |confidence - accuracy| weighted by the fraction of samples in the bin
            ece += (avg_conf - avg_acc).abs() * in_bin.float().mean()
    return ece.item()
```

### Soft cross-entropy (for distribution target)

```python
def soft_cross_entropy(logits, target_probs):
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()
```

### Mixup (regularization with soft label)

```python
import numpy as np

def mixup(x, y, alpha=0.2):
    lam = float(np.random.beta(alpha, alpha))  # mixing coefficient in [0, 1]
    idx = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[idx]
    return mixed_x, y, y[idx], lam

def mixup_loss(logits, y_a, y_b, lam):
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
```

## Decision criteria

| Situation | Loss |
|---|---|
| Standard classification | Cross-entropy |
| Imbalanced | Focal loss / class weight |
| Calibration | + label smoothing |
| Distillation | Soft target + KL |
| Long-tail | Focal + class-balanced |
| Hard examples | Focal (γ=2) |
| LLM pretrain | Next-token CE |

**Default**: `F.cross_entropy` with `label_smoothing=0.1` (suits most cases).
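
As a rough check on what that default computes, the sketch below uses plain Python (no PyTorch, so it runs anywhere) and made-up logits. It confirms that label-smoothed cross-entropy is a mix of the hard NLL term and a uniform term, as in the manual `cross_entropy_smooth` pattern, and that KL(p || q) = H(p, q) - H(p). The helper names (`log_softmax`, `smoothed_target`, and so on) are illustrative and not part of any library. The smoothing convention assumed here spreads eps over all C classes, so the true class ends up at 1 - eps + eps/C.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(p, log_q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 are skipped."""
    return -sum(pi * lq for pi, lq in zip(p, log_q) if pi > 0)

def entropy(p):
    """H(p) = -sum_x p(x) log p(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def smoothed_target(label, num_classes, eps=0.1):
    """(1 - eps) * one_hot + eps / C (assumed convention: eps over all classes)."""
    return [(1 - eps) * (i == label) + eps / num_classes for i in range(num_classes)]

logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical logits for a 4-class example
label, eps = 0, 0.1
log_q = log_softmax(logits)

hard = cross_entropy([1.0, 0.0, 0.0, 0.0], log_q)  # plain NLL of the true class
uniform = -sum(log_q) / len(log_q)                  # CE against the uniform distribution
p = smoothed_target(label, len(logits), eps)
smooth = cross_entropy(p, log_q)                    # CE against the smoothed target

# Label-smoothed CE decomposes into the hard term and the uniform term.
assert math.isclose(smooth, (1 - eps) * hard + eps * uniform)

# KL(p || q) = H(p, q) - H(p), and KL is never negative.
kl = sum(pi * (math.log(pi) - lq) for pi, lq in zip(p, log_q))
assert math.isclose(kl, smooth - entropy(p))
assert kl >= 0
```

Because H(p) does not depend on the model, the second identity is also why minimizing cross-entropy and minimizing KL give the same optimum for a fixed target.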
## 🔗 Graph

- Parent: [[Loss-Function]] · [[Information_Theory|Information-Theory]] · [[Deep-Learning]]
- Variants: [[Focal-Loss]] · [[Label-Smoothing]] · [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]] · [[KL-Divergence]]
- Applications: [[Image-Classification-Mastery]]
- Adjacent: [[Bias-vs-Variance]] · [[Bias-Correction-Algorithm]] · [[Cognitive-Biases]]

## 🤖 LLM usage

**When to use**: classification models, LLM training, distillation.

**When not to use**: regression (use MSE / Huber), ranking (use ListNet, etc.).

## ❌ Anti-patterns

- **MSE for classification**: vanishing gradients.
- **Hard one-hot labels + small data**: over-confidence.
- **No class weight** (imbalanced data): the majority class dominates.
- **Softmax + log (separate)**: numerical instability; use log_softmax instead.
- **Label smoothing set too high**: over-corrects calibration, leaving the model under-confident.

## 🧪 Verification / duplicates

- Verified (Bishop "Pattern Recognition", Lin Focal Loss, Hinton distillation).
- Trust level: A.
- Related: [[Information_Theory|Information-Theory]] · [[Bias-vs-Variance]] · [[Bias-Correction-Algorithm]] · [[Best-of-N_Sampling]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: formulas, variants, and binary / multi-class / focal / smoothing / distillation code |