---
id: wiki-2026-0508-cross-entropy-loss
title: Cross-Entropy Loss
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [cross-entropy, NLL, log loss, focal loss, label smoothing, KL divergence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [loss-function, cross-entropy, classification, deep-learning, focal-loss, label-smoothing, llm-pretraining]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / TensorFlow / JAX
---
# Cross-Entropy Loss
## In one line
> **"The distance between a prediction and the truth."** Cross-entropy builds on entropy and KL divergence, is the standard loss for classification, and is the same objective used in LLM pretraining (next-token prediction). Modern variants include focal loss, label smoothing, and soft targets.
## Core concepts
### Formula
$$H(p, q) = -\sum_x p(x) \log q(x)$$
- p = the ground-truth distribution (one-hot for classification).
- q = the model's predicted distribution.
### Binary case
$$L = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$
### Multi-class
$$L = -\sum_c y_c \log \hat{y}_c$$
→ With a one-hot target, only the true-class term survives, so the loss reduces to the negative log likelihood (NLL) of the true class.
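
As a small numeric check of this reduction, the sketch below evaluates the multi-class sum for a one-hot target in plain Python and compares it with the negative log of the true-class probability. The probability values are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x); terms with p(x) == 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

y_true = [0.0, 1.0, 0.0]   # one-hot target: class 1 is correct (illustrative)
y_pred = [0.2, 0.7, 0.1]   # model probabilities (illustrative)

loss = cross_entropy(y_true, y_pred)
nll = -math.log(y_pred[1])  # negative log likelihood of the true class

print(loss, nll)  # the two values coincide
```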
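### vs MSE
- MSE on a sigmoid output suffers from vanishing gradients: the gradient with respect to the logit carries the factor ŷ(1 - ŷ), which shrinks toward zero when the sigmoid saturates.
- Cross-entropy on a sigmoid output cancels that factor, leaving a gradient that is linear in the error (ŷ - y).
- Cross-entropy is therefore the standard for classification.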
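
To make the gradient argument concrete, here is a minimal sketch comparing the two logit gradients for a single sigmoid unit. It assumes MSE is written as 0.5 * (ŷ - y)², and the logit value is an arbitrary illustrative choice of a confidently wrong prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2."""
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_bce(z, y):
    """d/dz of binary cross-entropy applied to sigmoid(z)."""
    return sigmoid(z) - y

# Confidently wrong: the true label is 1 but the logit is strongly negative.
z, y = -8.0, 1.0
g_mse = grad_mse(z, y)
g_bce = grad_bce(z, y)

print(f"MSE gradient: {g_mse:.6f}")
print(f"BCE gradient: {g_bce:.6f}")
```

With MSE the saturated sigmoid almost erases the learning signal exactly where the model is most wrong, while the cross-entropy gradient stays close to -1.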
### Information theory connection
- H(p) = the entropy of p.
- KL(p || q) = H(p, q) - H(p).
- Minimizing cross-entropy over q is equivalent to minimizing KL(p || q), because H(p) is a constant when p is fixed.
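
The identity can be checked numerically. The sketch below uses two arbitrary candidate distributions q1 and q2 (illustrative values only) against a fixed p, and shows that the cross-entropy gap equals the KL divergence, so ranking candidates by either quantity gives the same order.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]    # fixed target distribution (illustrative)
q1 = [0.4, 0.4, 0.2]   # candidate close to p
q2 = [0.1, 0.1, 0.8]   # candidate far from p

for q in (q1, q2):
    gap = cross_entropy(p, q) - entropy(p)
    print(f"H(p,q) - H(p) = {gap:.6f}, KL(p||q) = {kl_divergence(p, q):.6f}")
```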
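### Variants
#### Focal Loss (Lin 2017)
- Focuses training on hard examples under class imbalance.
- Scales the per-example loss by (1 - p_t)^γ, where p_t is the predicted probability of the true class.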
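
A quick way to see the effect of the (1 - p_t)^γ factor is to compare an easy and a hard example. This is a minimal single-example sketch without the optional alpha term; the two probabilities are illustrative.

```python
import math

def focal_term(p_t, gamma=2.0):
    """Focal loss for one example: (1 - p_t)**gamma * (-log p_t)."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy, hard = 0.9, 0.1  # p_t for a well-classified and a misclassified example

# How much more the hard example weighs than the easy one
ce_ratio = math.log(hard) / math.log(easy)          # under plain CE
focal_ratio = focal_term(hard) / focal_term(easy)   # under focal loss, gamma = 2

print(f"plain CE ratio: {ce_ratio:.1f}")
print(f"focal ratio:    {focal_ratio:.1f}")
```

Setting `gamma=0` recovers plain cross-entropy, so γ acts as a dial for how strongly easy examples are down-weighted.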
#### Label Smoothing
- Replaces the one-hot target with a soft one (e.g., 0.9 / 0.025 / ...).
- Mitigates over-confidence.
- Improves calibration.
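
Two conventions for building the smoothed target are in common use, and they give slightly different numbers. The sketch below is a hypothetical helper (`smooth_targets` and its `spread_over_others` flag are illustrative names, not a library API). Spreading ε over the other K - 1 classes reproduces the 0.9 / 0.025 example for K = 5, while mixing with a uniform distribution over all K classes matches the form used by the manual PyTorch snippet in this article.

```python
def smooth_targets(true_class, num_classes, eps=0.1, spread_over_others=False):
    """Return a smoothed target distribution as a list of probabilities."""
    if spread_over_others:
        # eps is shared among the K - 1 wrong classes
        off = eps / (num_classes - 1)
        on = 1.0 - eps
    else:
        # (1 - eps) * one_hot + eps * uniform over all K classes
        off = eps / num_classes
        on = 1.0 - eps + off
    return [on if c == true_class else off for c in range(num_classes)]

others = smooth_targets(2, 5, eps=0.1, spread_over_others=True)
uniform = smooth_targets(2, 5, eps=0.1)

print(others)   # true class 0.9, each other class 0.025
print(uniform)  # true class 0.92, each other class 0.02
```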
#### Class weight
- Assigns a per-class weight to the loss for imbalanced data.
- Strengthens the contribution of minority classes.
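
The following sketch shows how a class weight shifts the batch loss toward a poorly predicted minority example. Normalizing by the summed weights is an assumption here, chosen to mirror the documented behaviour of PyTorch's weighted mean reduction; the probabilities, labels, and weights are illustrative.

```python
import math

def weighted_ce(probs, labels, weights):
    """Weighted mean of per-example NLL, normalized by the sum of the weights used."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Three majority-class examples predicted well, one minority example predicted poorly
probs = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.6, 0.4]]
labels = [0, 0, 0, 1]

unweighted = weighted_ce(probs, labels, weights=[1.0, 1.0])
weighted = weighted_ce(probs, labels, weights=[1.0, 5.0])  # minority class at 5x

print(f"unweighted: {unweighted:.4f}")
print(f"weighted:   {weighted:.4f}")
```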
#### Soft target (knowledge distillation)
- The teacher's output distribution serves as the student's target.
#### Soft cross-entropy
- The target p is a full distribution rather than a one-hot label (e.g., entropy regularization for LLM token prediction).
### Numerical stability
- Prefer log-softmax + NLL over a separate softmax followed by log.
- PyTorch `F.cross_entropy` takes raw logits and integer labels and combines both steps.
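
The reason for preferring the fused form shows up as soon as a logit is large. The sketch below contrasts a naive softmax-then-log with a log-sum-exp version in plain Python; both function names are illustrative, and the extreme logit is chosen to force the failure.

```python
import math

def naive_log_softmax(logits):
    exps = [math.exp(z) for z in logits]  # math.exp overflows for large z
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(logits):
    # Subtract the max before exponentiating (log-sum-exp trick)
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

logits = [1000.0, 0.0]
stable = stable_log_softmax(logits)
try:
    naive = naive_log_softmax(logits)
except OverflowError:
    naive = None  # the naive version cannot even be evaluated

print("stable:", stable)
print("naive: ", naive)
```

For moderate logits the two versions agree; the difference only matters in the tails, which is exactly where training tends to diverge.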
### LLM pretraining
- Standard objective: next-token cross-entropy.
- Perplexity = exp(loss), where loss is the mean per-token cross-entropy in nats.
- Each token position predicts a distribution over the vocabulary.
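
One way to read perplexity is as an effective number of equally likely choices per token. The sketch below, with a hypothetical vocabulary size and made-up token probabilities, shows that a model guessing uniformly over V tokens has perplexity V.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

vocab_size = 50  # hypothetical vocabulary size
uniform_ppl = perplexity([1.0 / vocab_size] * 10)  # uniform guess at every position
mixed_ppl = perplexity([0.5, 0.25])                # geometric mean of 2 and 4

print(f"uniform model: {uniform_ppl:.2f}")
print(f"mixed example: {mixed_ppl:.4f}")
```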
## 💻 Patterns
### Binary CE (PyTorch)
```python
import torch
import torch.nn.functional as F
# Logits (raw scores) + labels (0 / 1)
logits = torch.randn(32, 1)  # batch of 32, binary
labels = torch.randint(0, 2, (32, 1)).float()
loss = F.binary_cross_entropy_with_logits(logits, labels)
# Numerically stable: the sigmoid is fused into the loss
```
### Multi-class CE
```python
# Logits (B, C) + integer labels (B,)
logits = torch.randn(32, 10)  # 10 classes
labels = torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, labels)
# Internally: log-softmax + NLL
```
### With class weight (imbalanced)
```python
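# One weight per class: the tensor length must equal the number of classes C
class_weights = torch.tensor([1.0, 5.0, 2.0])  # 3 classes; the minority class gets 5x weight
loss = F.cross_entropy(logits, labels, weight=class_weights)  # logits: (B, 3), labels: (B,)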
```
### Focal loss
```python
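def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    # Per-example CE; reduction='none' keeps one loss value per sample
    ce_loss = F.cross_entropy(logits, labels, reduction='none')
    pt = torch.exp(-ce_loss)  # probability of the correct class
    # alpha is applied here as one scalar to every example, not per class
    focal = alpha * (1 - pt) ** gamma * ce_loss
    return focal.mean()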
```
### Label smoothing
```python
# Built in since PyTorch 1.10
loss = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Manual equivalent (the number of classes is taken from logits)
def cross_entropy_smooth(logits, labels, smoothing=0.1):
    log_probs = F.log_softmax(logits, dim=-1)
    nll_loss = -log_probs.gather(dim=-1, index=labels.unsqueeze(1)).squeeze(1)
    # Mean over classes = cross-entropy against a uniform target
    smooth_loss = -log_probs.mean(dim=-1)
    return ((1 - smoothing) * nll_loss + smoothing * smooth_loss).mean()
```
### LLM pretraining (next-token)
```python
def next_token_loss(model, input_ids):
    # Assumes a model whose output exposes .logits (Hugging Face style)
    logits = model(input_ids).logits  # (B, T, V)
    # Shift so that position t predicts token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return loss

# Perplexity = exp(mean next-token loss)
loss = next_token_loss(model, input_ids)
perplexity = torch.exp(loss).item()
```
### Knowledge distillation (soft target)
```python
def distillation_loss(student_logits, teacher_logits, labels, T=4, alpha=0.7):
    # Soft target: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * T * T  # T^2 keeps the soft-loss gradient scale comparable across temperatures
    # Hard target: regular CE against the labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
### Calibration check
```python
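def expected_calibration_error(logits, labels, n_bins=10):
    probs = F.softmax(logits, dim=-1)
    confidences, predictions = probs.max(-1)
    accuracies = (predictions == labels).float()
    bin_boundaries = torch.linspace(0, 1, n_bins + 1, device=logits.device)
    # Tensor accumulator so .item() works even if no bin is populated
    ece = torch.zeros((), device=logits.device)
    for i in range(n_bins):
        # Samples whose confidence falls in (lower, upper]
        in_bin = (confidences > bin_boundaries[i]) & (confidences <= bin_boundaries[i + 1])
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()
            avg_acc = accuracies[in_bin].mean()
            # Weight the |confidence - accuracy| gap by the fraction of samples in the bin
            ece += (avg_conf - avg_acc).abs() * in_bin.float().mean()
    return ece.item()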
```
### Soft cross-entropy (for distribution target)
```python
def soft_cross_entropy(logits, target_probs):
    # target_probs: (B, C), each row a probability distribution
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()
```
### Mixup (regularization with soft label)
```python
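import numpy as np
import torch
import torch.nn.functional as F

def mixup(x, y, alpha=0.2):
    lam = float(np.random.beta(alpha, alpha))  # mixing coefficient in [0, 1]
    idx = torch.randperm(x.size(0))            # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[idx]
    return mixed_x, y, y[idx], lam

def mixup_loss(logits, y_a, y_b, lam):
    # Equivalent to CE against the soft label lam * y_a + (1 - lam) * y_b
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)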
```
## Decision criteria
| Situation | Loss |
|---|---|
| Standard classification | Cross-entropy |
| Imbalanced | Focal loss / class weight |
| Calibration | + label smoothing |
| Distillation | Soft target + KL |
| Long-tail | Focal + class-balanced |
| Hard examples | Focal (γ=2) |
| LLM pretrain | Next-token CE |
**Default**: `F.cross_entropy` with `label_smoothing=0.1` (suits most cases).
## 🔗 Graph
- Parents: [[Loss-Function]] · [[Information_Theory|Information-Theory]] · [[Deep Learning]]
- Variants: [[Focal-Loss]] · [[Label-Smoothing]] · [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]] · [[KL-Divergence]]
- Applications: [[Image-Classification-Mastery]]
- Adjacent: [[Bias vs Variance Trade-off]] · [[Bias-Correction-Algorithm]] · [[Cognitive Biases]]
## 🤖 LLM usage
**When to use**: classification models, LLM training, distillation.
**When not to use**: regression (use MSE / Huber), ranking (use ListNet, etc.).
## ❌ Anti-patterns
- **MSE for classification**: vanishing gradients.
- **Hard one-hot labels** + small data: over-confidence.
- **No class weight** (imbalanced): the majority class dominates.
- **Softmax + log (separate)**: numerical instability; use log-softmax instead.
- **Label smoothing set too high**: over-corrects calibration, leaving the model under-confident.
## 🧪 Verification / duplicates
- Verified (Bishop "Pattern Recognition", Lin Focal Loss, Hinton distillation).
- Trust level A.
- Related: [[Information_Theory|Information-Theory]] · [[Bias vs Variance Trade-off]] · [[Bias-Correction-Algorithm]] · [[Best-of-N_Sampling]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: formula + variants + binary / multi / focal / smoothing / distill code |