---
id: wiki-2026-0508-scaling-laws-for-llms
title: Scaling Laws for LLMs
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Chinchilla Law, Kaplan Scaling, Compute-Optimal Scaling]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [llm, scaling-laws, training, compute, chinchilla]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Scaling Laws for LLMs

## One-liner
> **"Loss follows a power law in compute, parameters, and data, so extrapolation is predictable."** Kaplan 2020 → Chinchilla 2022 → modern over-train regime (Llama 3.1 70B trained on 15T tokens, about 214 tokens/param, roughly 10× Chinchilla-optimal). When inference cost is the dominant cost, a small model trained on far more data wins.

## Key points

### Kaplan 2020 (Original)
- L(N, D, C) follows a power law: loss decreases predictably with params N, data D, and compute C.
- Compute-optimal allocation favors parameters: roughly N_opt ∝ C^0.73 and D_opt ∝ C^0.27, so most extra compute goes to model size.
- Chinchilla later contradicted this allocation.
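
To make the difference in allocation concrete, here is a minimal sketch of how N and D grow when compute grows 10×. It assumes the exponents commonly quoted for Kaplan et al. (0.73 for N, 0.27 for D) and the rounded Chinchilla exponents (0.5, 0.5); the helper name `growth_factors` is hypothetical.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """Return the factors by which N and D grow when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

# Assumed exponents: Kaplan-style (0.73, 0.27) vs Chinchilla-style (0.5, 0.5).
kaplan_n, kaplan_d = growth_factors(10, 0.73, 0.27)
chinchilla_n, chinchilla_d = growth_factors(10, 0.5, 0.5)

print(f"Kaplan:     N x{kaplan_n:.2f}, D x{kaplan_d:.2f}")
print(f"Chinchilla: N x{chinchilla_n:.2f}, D x{chinchilla_d:.2f}")
```

Under these assumptions, a Kaplan-style allocation puts most of a 10× budget into parameters, while a Chinchilla-style allocation splits it evenly between parameters and tokens.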

### Chinchilla 2022 (DeepMind)
- Chinchilla (70B params, 1.4T tokens) outperforms Gopher (280B params, 300B tokens) at a similar compute budget.
- Optimal: D/N ≈ 20 tokens/param (compute-optimal frontier).
- N_opt ∝ C^0.5, D_opt ∝ C^0.5 (1:1 scaling).
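
The Chinchilla vs Gopher comparison can be checked with the usual approximation C ≈ 6·N·D for training FLOPs. This is a rough sketch; the helper name `train_flops` is hypothetical and the approximation ignores attention and embedding costs.

```python
def train_flops(params, tokens):
    """Approximate training compute with the common C ≈ 6 * N * D rule."""
    return 6 * params * tokens

# (params, tokens) as quoted in the bullets above.
models = {"Chinchilla": (70e9, 1.4e12), "Gopher": (280e9, 300e9)}

for name, (n, d) in models.items():
    print(f"{name}: {train_flops(n, d):.2e} FLOPs, {d / n:.1f} tokens/param")
```

Both models land within the same order of magnitude of compute, but Chinchilla spends it at about 20 tokens/param and Gopher at about 1 token/param.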

### Modern Over-train (2024-2026)
- Llama 3.1 8B: 15T tokens (1875 tok/param, about 94× Chinchilla).
- Inference-cost dominance: train once, serve billions of requests, so a small model is cheap to run.
- DeepSeek-V3 671B MoE: 14.8T tokens (sparse activation).
- Claude Opus 4.7 / GPT-5: details are closed, but the trend is clear.
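
A small sketch for computing the over-train factor of a model relative to the 20 tokens/param rule of thumb. The function name `overtrain_factor` is hypothetical; the DeepSeek-V3 row uses its 37B active parameters, which is one possible convention for MoE models rather than an established one.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb from the Chinchilla section

def overtrain_factor(params, tokens):
    """Return (tokens per param, multiple of the Chinchilla rule of thumb)."""
    tokens_per_param = tokens / params
    return tokens_per_param, tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM

rows = [
    ("Llama 3.1 8B", 8e9, 15e12),
    ("Llama 3.1 70B", 70e9, 15e12),
    ("DeepSeek-V3 (37B active)", 37e9, 14.8e12),
]
for name, n, d in rows:
    tpp, factor = overtrain_factor(n, d)
    print(f"{name}: {tpp:.0f} tok/param, {factor:.1f}x Chinchilla")
```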

### Applications
1. Pre-train budget allocation: param vs data vs context length tradeoff.
2. Distillation target sizing: teacher → student compute curve.
3. Architecture search: compare isoFLOP curves.

## 💻 Patterns

### Chinchilla optimal point
```python
def chinchilla_optimal(compute_flops):
    # Hoffmann et al. 2022: N_opt ∝ C^0.5, D_opt ∝ C^0.5
    # With C ≈ 6*N*D and D ≈ 20*N: C ≈ 120*N^2, so N_opt ≈ sqrt(C / 120)
    G = (1 / 120) ** 0.5  # ≈ 0.091, follows from the 20 tokens/param rule
    N_opt = G * (compute_flops ** 0.5)
    D_opt = compute_flops / (6 * N_opt)
    return {"params": N_opt, "tokens": D_opt, "ratio": D_opt / N_opt}

# 1e24 FLOPs budget (1 yottaFLOP)
print(chinchilla_optimal(1e24))
# params ≈ 9.1e10, tokens ≈ 1.8e12, ratio ≈ 20
```

### Loss prediction
```python
import numpy as np

def chinchilla_loss(N, D):
    # L(N,D) = E + A/N^alpha + B/D^beta
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / (N ** alpha) + B / (D ** beta)

# 7B model, 2T tokens
L = chinchilla_loss(7e9, 2e12)
print(f"predicted loss: {L:.3f}")
```
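
The parametric loss also gives a closed-form compute-optimal point. Substituting D = C/(6N) into L(N, D) and setting the derivative with respect to N to zero yields N_opt = G·(C/6)^(β/(α+β)) with G = (αA/(βB))^(1/(α+β)). The sketch below assumes the same fitted constants as above; `parametric_optimum` is a hypothetical name, and the result need not match the 20 tokens/param rule of thumb exactly.

```python
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def parametric_loss(n, d):
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n**ALPHA + B / d**BETA

def parametric_optimum(compute_flops):
    """Minimize parametric_loss subject to compute_flops = 6 * N * D."""
    k = compute_flops / 6  # the product N * D fixed by the budget
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    n_opt = g * k ** (BETA / (ALPHA + BETA))
    d_opt = k / n_opt
    return n_opt, d_opt

n_opt, d_opt = parametric_optimum(1e22)
print(f"N_opt ≈ {n_opt:.2e}, D_opt ≈ {d_opt:.2e}, tokens/param ≈ {d_opt / n_opt:.0f}")
```

With these constants the implied exponents are about 0.45 for N and 0.55 for D, close to but not identical to the rounded 0.5/0.5 scaling.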

### IsoFLOP curve
```python
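import matplotlib.pyplot as plt
import numpy as np

# Reuses chinchilla_loss from the "Loss prediction" block above.
C = 1e22  # fixed compute
Ns = np.logspace(8, 11, 50)
Ds = C / (6 * Ns)  # tokens implied by C ≈ 6*N*D
losses = [chinchilla_loss(n, d) for n, d in zip(Ns, Ds)]

plt.loglog(Ns, losses)
plt.xlabel("Params (N)")
plt.ylabel("Loss")
plt.title("IsoFLOP at C=1e22")
plt.show()
# The minimum of the curve marks the optimal N.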
```
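
Sweeping several budgets and reading off each curve's minimum shows how the optimal N shifts with compute. This is a self-contained sketch under the same parametric fit; the names `isoflop_minimum` and `parametric_loss`, the grid range, and the three budgets are illustrative assumptions.

```python
import numpy as np

def parametric_loss(n, d):
    """Same functional form and constants as the loss-prediction pattern."""
    e, a, b = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return e + a / n**alpha + b / d**beta

def isoflop_minimum(compute_flops, grid=None):
    """Return (N, D, loss) at the grid point with the lowest loss for a fixed budget."""
    if grid is None:
        grid = np.logspace(8, 12, 401)  # candidate model sizes
    tokens = compute_flops / (6 * grid)
    losses = parametric_loss(grid, tokens)
    best = int(np.argmin(losses))
    return grid[best], tokens[best], losses[best]

budgets = np.array([1e20, 1e22, 1e24])
optimal_n = np.array([isoflop_minimum(c)[0] for c in budgets])

# Slope of log N_opt vs log C estimates the scaling exponent for N.
slope = np.polyfit(np.log10(budgets), np.log10(optimal_n), 1)[0]
print(f"optimal N per budget: {optimal_n}")
print(f"fitted exponent for N_opt: {slope:.2f}")
```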

### Over-train economics
```python
def total_cost(N, D, requests_per_year, years=3):
    train_flops = 6 * N * D
    inference_flops_per_req = 2 * N * 1024  # 1k token gen
    inference_total = inference_flops_per_req * requests_per_year * years
    # Assumed price of ~3e-19 $/FLOP on H100 (rough; depends on rental price and utilization)
    return (train_flops + inference_total) * 3e-19

# 70B Chinchilla-optimal vs 8B over-trained
print(total_cost(70e9, 1.4e12, 1e10))     # large model, ≈ 1.47e6
print(total_cost(8e9, 15e12, 1e10))       # small over-train, ≈ 3.63e5
# The over-trained 8B is cheaper when the workload is inference-heavy.
```
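
The same cost model implies a break-even request volume: the point where the extra training compute of the over-trained small model is paid back by cheaper inference. A minimal sketch, assuming 6·N·D training FLOPs and 2·N FLOPs per generated token as above; `break_even_requests` is a hypothetical helper, and it compares compute only, not model quality.

```python
def break_even_requests(n_small, d_small, n_large, d_large, tokens_per_request=1024):
    """Total requests after which the small over-trained model uses less compute overall."""
    extra_training = 6 * n_small * d_small - 6 * n_large * d_large
    saved_per_request = 2 * (n_large - n_small) * tokens_per_request
    if saved_per_request <= 0:
        raise ValueError("the 'small' model must have fewer parameters than the 'large' one")
    # If the small model is also cheaper to train, it wins from the first request.
    return max(extra_training, 0) / saved_per_request

requests = break_even_requests(8e9, 15e12, 70e9, 1.4e12)
print(f"break-even after about {requests:.2e} requests")
```

For the 8B vs 70B example this comes out at roughly one billion requests in total, well below the 1e10 requests per year assumed above.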

### MoE scaling adjustment
```python
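def moe_effective_params(total, active):
    # DeepSeek-V3: total=671B, active=37B
    # Heuristic: sparse models follow different scaling exponents, so use the
    # geometric mean of total and active params as a rough dense-equivalent size.
    geometric_mean = (total * active) ** 0.5
    return geometric_mean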

print(moe_effective_params(671e9, 37e9))  # ~1.6e11
```
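
For training compute, a common approximation for MoE models is to apply the 6·N·D rule to the active parameters only. The sketch below uses that assumption with the DeepSeek-V3 numbers quoted above; it ignores attention and routing overhead and is not the figure reported by DeepSeek. The name `approx_train_flops` is hypothetical.

```python
def approx_train_flops(params_used_per_token, tokens):
    """Approximate training FLOPs as 6 * (params touched per token) * tokens."""
    return 6 * params_used_per_token * tokens

tokens = 14.8e12
sparse = approx_train_flops(37e9, tokens)    # only active params count per token
dense = approx_train_flops(671e9, tokens)    # hypothetical dense model of the same total size

print(f"MoE (37B active): {sparse:.2e} FLOPs")
print(f"Dense 671B:       {dense:.2e} FLOPs")
print(f"dense / sparse:   {dense / sparse:.1f}x")
```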

### Test-time compute scaling (o1, Claude reasoning)
```python
import numpy as np

# New axis: spend more thinking tokens at inference time.
def reasoning_quality(thinking_tokens):
    # Toy model of the log-linear improvement reported for OpenAI o1;
    # the coefficients are illustrative, not fitted.
    return 0.4 + 0.05 * np.log10(thinking_tokens + 1)

print(reasoning_quality(100))    # 0.5
print(reasoning_quality(100000)) # 0.65
```
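
Inverting the toy curve shows why log-linear gains get expensive: each fixed step in quality costs a constant multiple of thinking tokens. The coefficients below are the same illustrative values as in the pattern above, not measured results, and `tokens_for_quality` is a hypothetical helper.

```python
def tokens_for_quality(target, base=0.4, slope=0.05):
    """Invert quality = base + slope * log10(tokens + 1) for tokens."""
    return 10 ** ((target - base) / slope) - 1

for target in (0.50, 0.55, 0.60, 0.65):
    print(f"quality {target:.2f}: about {tokens_for_quality(target):,.0f} thinking tokens")
```

Under this toy model, every +0.05 in quality requires roughly 10× more thinking tokens.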

## Decision criteria
| Situation | Approach |
|---|---|
| Inference-heavy (chat product) | Over-train a small model (Llama 3.1 8B) |
| Research / benchmark | Chinchilla-optimal |
| Latency-critical | Distillation + over-train |
| Reasoning workloads | Test-time compute scaling (o1-style) |
| Multi-modal | Separate scaling curves per modality |

**Default**: inference cost usually dominates training cost, so over-train (50-200 tok/param).

## 🔗 Graph
- Adjacent: [[Test-Time Compute]] · [[LLM_Optimization_and_Deployment_Strategies|Distillation]] · [[Mixture of Experts]]

## 🤖 LLM usage
**When**: budget allocation, isoFLOP comparison, model sizing decisions.
**When not**: fine-tuning budgets (different dynamics), RL post-training (separate scaling laws).

## ❌ Anti-patterns
- **Train-only optimization**: ignoring inference cost leads to an oversized model.
- **Naive Kaplan**: outdated; Chinchilla supersedes it for dense models.
- **Extrapolation past data**: the power law may break down at extreme scale.
- **Ignoring data quality**: token count alone is misleading (FineWeb-Edu vs CommonCrawl).

## 🧪 Verification / Duplicates
- Verified (Hoffmann et al. 2022 Chinchilla, Kaplan et al. 2020).
- Modern: Llama 3.1 paper 2024, DeepSeek-V3 tech report 2024.
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added Chinchilla, over-train regime, test-time scaling |