---
id: wiki-2026-0508-cross-entropy-loss
title: Cross-Entropy Loss
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [cross-entropy, NLL, log loss, focal loss, label smoothing, KL divergence]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [loss-function, cross-entropy, classification, deep-learning, focal-loss, label-smoothing, llm-pretraining]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / TensorFlow / JAX
---
# Cross-Entropy Loss
## One-Line Summary
> **"The distance between a prediction and the truth."** Cross-entropy is built on entropy and KL divergence. It is the standard loss for classification, and LLM pretraining (next-token prediction) uses the same objective. Modern variants: focal loss, label smoothing, soft targets.
## Core Concepts
### Formula
$$H(p, q) = -\sum_x p(x) \log q(x)$$
- p = the ground-truth distribution (one-hot for classification).
- q = the model's predicted distribution.
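
As a rough illustration of the formula, the sketch below evaluates H(p, q) directly in plain Python. The helper name `cross_entropy` and the example distributions are made up for this note; natural logarithms are assumed, so the result is in nats.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), in nats. Terms with p(x) == 0 contribute nothing."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [0.0, 1.0, 0.0]        # one-hot ground truth (class 1)
q_good = [0.1, 0.8, 0.1]   # prediction that favors the correct class
q_bad = [0.6, 0.2, 0.2]    # prediction that favors a wrong class

loss_good = cross_entropy(p, q_good)  # equals -log(0.8)
loss_bad = cross_entropy(p, q_bad)    # equals -log(0.2)
```

With a one-hot p, only the true class's term survives, so the loss grows as the model puts less probability on that class.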
### Binary case
$$L = -[y \log \hat{y} + (1-y) \log(1-\hat{y})]$$
### Multi-class
$$L = -\sum_c y_c \log \hat{y}_c$$
→ With one-hot targets this is the negative log-likelihood (NLL) of the true class.
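### vs. MSE
- MSE with a sigmoid output: the gradient with respect to the logit carries a factor of σ′(z), so it vanishes when the sigmoid saturates.
- Cross-entropy with a sigmoid output: the gradient with respect to the logit is ŷ − y, linear in the error.
- Cross-entropy is therefore the standard choice for classification.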
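
The gradient difference can be checked numerically. This is a minimal sketch for a single sigmoid unit; the function names are hypothetical, and the MSE loss is assumed to be 0.5 · (ŷ − y)² so the constants stay simple.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    # L = 0.5 * (y_hat - y)^2 with y_hat = sigmoid(z)
    # dL/dz = (y_hat - y) * y_hat * (1 - y_hat)
    y_hat = sigmoid(z)
    return (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_ce_wrt_logit(z, y):
    # L = -[y log y_hat + (1 - y) log(1 - y_hat)]
    # dL/dz = y_hat - y
    return sigmoid(z) - y

# A confidently wrong prediction: the label is 1 but the logit is very negative.
z, y = -8.0, 1.0
g_mse = grad_mse_wrt_logit(z, y)
g_ce = grad_ce_wrt_logit(z, y)
```

For this saturated, wrong prediction the cross-entropy gradient stays close to −1, while the MSE gradient is scaled down by the sigmoid derivative and is nearly zero.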
### Information theory connection
- H(p) = entropy of p.
- KL(p || q) = H(p, q) - H(p).
- Minimizing cross-entropy ≡ minimizing KL divergence (p is fixed, so H(p) is a constant).
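
The identity above can be verified on a small example. The helpers and the two distributions below are illustrative only, with natural logarithms assumed.

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # fixed target distribution
q = [0.5, 0.3, 0.2]  # model distribution

lhs = kl_divergence(p, q)
rhs = cross_entropy(p, q) - entropy(p)  # should match lhs up to rounding
```

Because H(p) does not depend on q, any change to q moves H(p, q) and KL(p || q) by the same amount.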
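### Variants
#### Focal Loss (Lin 2017)
- Focuses training on hard examples under class imbalance.
- Weights each example's loss by (1 - p_t)^γ, where p_t is the predicted probability of the true class.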
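
To see how strongly the (1 - p_t)^γ factor down-weights easy examples, the sketch below compares the focal term with plain cross-entropy for one easy and one hard example. The helper names are hypothetical and the α factor is left out for clarity.

```python
import math

def ce_term(p_t):
    """Plain cross-entropy for one example: -log(p_t)."""
    return -math.log(p_t)

def focal_term(p_t, gamma=2.0):
    """Focal loss for one example, without alpha: (1 - p_t)^gamma * -log(p_t)."""
    return (1.0 - p_t) ** gamma * ce_term(p_t)

easy, hard = 0.9, 0.1  # predicted probability of the true class

easy_ratio = focal_term(easy) / ce_term(easy)  # (1 - 0.9)^2 = 0.01
hard_ratio = focal_term(hard) / ce_term(hard)  # (1 - 0.1)^2 = 0.81
```

With γ = 2 the easy example keeps about 1% of its cross-entropy loss and the hard example keeps about 81%, so hard examples dominate the batch loss. Setting γ = 0 recovers plain cross-entropy.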
#### Label Smoothing
- One-hot → soft targets (e.g., 0.9 / 0.025 / ...).
- Mitigates over-confidence.
- Improves calibration.
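
One way to build such a soft target is sketched below. It assumes the convention in which the smoothing mass is spread uniformly over all classes, including the true one; other formulations spread it only over the non-true classes. The helper `smooth_one_hot` and the values 5 classes / smoothing 0.125 are chosen for illustration so that the result matches the 0.9 / 0.025 pattern.

```python
def smooth_one_hot(true_class, num_classes, smoothing=0.1):
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = smoothing / num_classes
    return [
        (1.0 - smoothing) + uniform if c == true_class else uniform
        for c in range(num_classes)
    ]

# 5 classes, smoothing 0.125: the true class gets 0.875 + 0.025, the others 0.025 each.
target = smooth_one_hot(true_class=2, num_classes=5, smoothing=0.125)
```

The smoothed target still sums to 1, so it can be used directly as the distribution p in the cross-entropy formula.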
#### Class weight
- Per-class weights for imbalanced data.
- Boosts the contribution of minority classes.
#### Soft target (knowledge distillation)
- The teacher's output distribution serves as the student's target.
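
In distillation the teacher distribution is usually softened with a temperature before it is used as a target. The sketch below shows that effect in plain Python; the helper name and the example logits are made up, and T = 4 is just one plausible setting.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed with the max subtracted for stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]  # hypothetical teacher outputs
sharp = softmax_with_temperature(teacher_logits, T=1.0)
soft = softmax_with_temperature(teacher_logits, T=4.0)
```

A higher temperature moves probability mass from the top class to the runners-up, which exposes the teacher's relative ranking of the wrong classes to the student.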
#### Soft cross-entropy
- The target p is a full distribution rather than a one-hot vector (e.g., entropy regularization in LLM token prediction).
### Numerical stability
- Prefer log-softmax + NLL over a separate softmax followed by log.
- PyTorch `F.cross_entropy` takes raw logits and integer labels and combines both steps.
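
The reason the fused form is preferred shows up with large-magnitude logits. The following plain-Python sketch contrasts a naive softmax-then-log with the log-sum-exp trick; both helper names are hypothetical, and this illustrates the idea rather than how any particular library implements it.

```python
import math

def log_softmax_stable(logits):
    """log-softmax via the log-sum-exp trick: subtract the max before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def log_softmax_naive(logits):
    """softmax followed by log; breaks down for large-magnitude logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]

logits = [1000.0, 0.0, -1000.0]
stable = log_softmax_stable(logits)

try:
    log_softmax_naive(logits)
    naive_failed = False
except (OverflowError, ValueError):
    # math.exp overflows on the large logit before the log is ever taken
    naive_failed = True
```

The stable version never exponentiates anything larger than 0, so it returns finite log-probabilities where the naive version fails.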
### LLM pretraining
- Standard objective: next-token cross-entropy.
- Perplexity = exp(loss).
- Each token position is predicted as a distribution over the vocabulary.
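
The relation between loss and perplexity can be sketched in a few lines. The helper below is illustrative and takes the probabilities a model assigned to the tokens that actually occurred; the vocabulary size of 50 is an arbitrary example value.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    mean_loss = sum(nll) / len(nll)
    return math.exp(mean_loss)

vocab_size = 50
# A model that guesses uniformly over the vocabulary at every position.
uniform_ppl = perplexity([1.0 / vocab_size] * 8)
# A model that is always certain and always right.
perfect_ppl = perplexity([1.0] * 4)
```

A uniform guesser has perplexity equal to the vocabulary size, and a perfect predictor has perplexity 1, which is why perplexity is often read as an effective branching factor.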
## 💻 Patterns
### Binary CE (PyTorch)
```python
import torch
import torch.nn.functional as F
# Logits (raw scores) + labels (0 / 1)
logits = torch.randn(32, 1)  # batch of 32, binary
labels = torch.randint(0, 2, (32, 1)).float()
loss = F.binary_cross_entropy_with_logits(logits, labels)
# Numerically stable: the sigmoid is applied inside the loss
```
### Multi-class CE
```python
# Logits (B, C) + integer labels (B,)
logits = torch.randn(32, 10)  # 10 classes
labels = torch.randint(0, 10, (32,))
loss = F.cross_entropy(logits, labels)
# Internally: log-softmax + NLL
```
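
The "log-softmax + NLL" decomposition can be checked directly. This is a small sanity check rather than a statement about PyTorch internals: it only confirms that the two routes give the same number on random inputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(32, 10)          # (B, C) raw scores
labels = torch.randint(0, 10, (32,))  # (B,) integer class indices

combined = F.cross_entropy(logits, labels)
two_step = F.nll_loss(F.log_softmax(logits, dim=-1), labels)
```

Both values are scalar means over the batch and should agree to floating-point tolerance.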
### With class weight (imbalanced)
```python
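logits, labels = torch.randn(32, 3), torch.randint(0, 3, (32,))  # 3 classes to match the weights
class_weights = torch.tensor([1.0, 5.0, 2.0])  # 5× weight on the minority class
loss = F.cross_entropy(logits, labels, weight=class_weights)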
```
### Focal loss
```python
def focal_loss(logits, labels, gamma=2.0, alpha=0.25):
    ce_loss = F.cross_entropy(logits, labels, reduction='none')  # per-sample -log(p_t)
    pt = torch.exp(-ce_loss)  # probability of the correct class
    # alpha is a single scalar weight here, not a per-class alpha_t
    focal = alpha * (1 - pt) ** gamma * ce_loss
    return focal.mean()
```
### Label smoothing
```python
# Built in (PyTorch 1.10+)
loss = F.cross_entropy(logits, labels, label_smoothing=0.1)

# Manual version
def cross_entropy_smooth(logits, labels, smoothing=0.1, num_classes=10):
    # num_classes is unused: the mean over the last dim already divides by the class count
    log_probs = F.log_softmax(logits, dim=-1)
    nll_loss = -log_probs.gather(dim=-1, index=labels.unsqueeze(1)).squeeze(1)
    smooth_loss = -log_probs.mean(dim=-1)  # CE against a uniform target
    return ((1 - smoothing) * nll_loss + smoothing * smooth_loss).mean()
```
### LLM pretraining (next-token)
```python
def next_token_loss(model, input_ids):
    logits = model(input_ids).logits  # (B, T, V)
    # Shift so that position t predicts token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
    return loss

# Perplexity from the mean per-token loss
loss = next_token_loss(model, input_ids)
perplexity = torch.exp(loss).item()
```
### Knowledge distillation (soft target)
```python
def distillation_loss(student_logits, teacher_logits, labels, T=4, alpha=0.7):
    # Soft target: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * T * T  # rescale so gradient magnitudes stay comparable across T
    # Hard target: regular CE against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
### Calibration check
```python
def expected_calibration_error(logits, labels, n_bins=10):
    probs = F.softmax(logits, dim=-1)
    confidences, predictions = probs.max(-1)
    accuracies = (predictions == labels).float()
    bin_boundaries = torch.linspace(0, 1, n_bins + 1, device=logits.device)
    ece = torch.zeros((), device=logits.device)  # a tensor, so .item() always works
    for i in range(n_bins):
        in_bin = (confidences > bin_boundaries[i]) & (confidences <= bin_boundaries[i + 1])
        if in_bin.any():
            avg_conf = confidences[in_bin].mean()
            avg_acc = accuracies[in_bin].mean()
            # Weight each bin's |confidence - accuracy| gap by its share of samples
            ece = ece + (avg_conf - avg_acc).abs() * in_bin.float().mean()
    return ece.item()
```
### Soft cross-entropy (for distribution target)
```python
def soft_cross_entropy(logits, target_probs):
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()
```
### Mixup (regularization with soft label)
```python
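import numpy as np

def mixup(x, y, alpha=0.2):
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    idx = torch.randperm(x.size(0), device=x.device)  # random pairing within the batch
    mixed_x = lam * x + (1 - lam) * x[idx]
    return mixed_x, y, y[idx], lam

def mixup_loss(logits, y_a, y_b, lam):
    # Equivalent to CE against the soft label lam * y_a + (1 - lam) * y_b
    return lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)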
```
## Decision Guide
| Situation | Loss |
|---|---|
| Standard classification | Cross-entropy |
| Imbalanced | Focal loss / class weight |
| Calibration | + label smoothing |
| Distillation | Soft target + KL |
| Long-tail | Focal + class-balanced |
| Hard examples | Focal (γ=2) |
| LLM pretrain | Next-token CE |
**Default**: `F.cross_entropy` with `label_smoothing=0.1` (suits most cases).
## 🔗 Graph
- Parents: [[Loss-Function]] · [[Information_Theory|Information-Theory]] · [[Deep-Learning]]
- Variants: [[Focal-Loss]] · [[Label-Smoothing]] · [[LLM_Optimization_and_Deployment_Strategies|Knowledge-Distillation]] · [[KL-Divergence]]
- Applications: [[Image-Classification-Mastery]]
- Adjacent: [[Bias-vs-Variance]] · [[Bias-Correction-Algorithm]] · [[Cognitive-Biases]]
## 🤖 LLM Usage
**When to use**: classification models, LLM training, distillation.
**When not to use**: regression (use MSE / Huber); ranking (use ListNet, etc.).
## ❌ Anti-patterns
- **MSE for classification**: vanishing gradients.
- **Hard one-hot labels** + small data: over-confidence.
- **No class weights** (imbalanced data): the majority class dominates.
- **Softmax + log (separate)**: numerically unstable; use log-softmax instead.
- **Label smoothing set too high**: over-corrects calibration.
## 🧪 Verification / Duplicates
- Verified (Bishop "Pattern Recognition", Lin Focal Loss, Hinton distillation).
- Trust level: A.
- Related: [[Information_Theory|Information-Theory]] · [[Bias-vs-Variance]] · [[Bias-Correction-Algorithm]] · [[Best-of-N_Sampling]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: formula + variants + binary / multi / focal / smoothing / distill code |