---
id: wiki-2026-0508-scaling-laws-for-llms
title: Scaling Laws for LLMs
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Chinchilla Law, Kaplan Scaling, Compute-Optimal Scaling]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [llm, scaling-laws, training, compute, chinchilla]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Scaling Laws for LLMs

## One-liner

> **"Loss follows a power law in compute, parameters, and data, so it can be extrapolated predictably."** Kaplan 2020 → Chinchilla 2022 → modern over-train regime (Llama 3.1 70B trained on 15T tokens, about 214 tokens/param, or roughly 11× the Chinchilla-optimal ratio). When inference cost is the dominant cost, a small model trained on a very large dataset wins.

## Core ideas

### Kaplan 2020 (Original)

- L(N, D, C) follows a power law: loss decreases predictably with parameters N, data D, and compute C.
- Compute-optimal recipe: grow N much faster than D (roughly N_opt ∝ C^0.73, D_opt ∝ C^0.27), so parameters dominate.
- Later shown to conflict with the Chinchilla results.

### Chinchilla 2022 (DeepMind)

- Chinchilla (70B params, 1.4T tokens) outperforms Gopher (280B params, 300B tokens) at a comparable training compute budget.
- Optimal ratio: D/N ≈ 20 tokens/param (the compute-optimal frontier).
- N_opt ∝ C^0.5, D_opt ∝ C^0.5 (parameters and data scale in equal proportion).

### Modern Over-train (2024-2026)

- Llama 3.1 8B: 15T tokens (1875 tok/param, about 94× the Chinchilla ratio).
- Inference-cost dominance: a model is trained once but serves billions of requests, so a small model is cheaper over its lifetime.
- DeepSeek-V3 671B MoE: 14.8T tokens (sparse activation).
- Claude Opus 4.7 / GPT-5: training details are not public, but the trend is clear.

### Applications

1. Pre-training budget allocation: parameter vs data vs context-length tradeoff.
2. Distillation target sizing: teacher → student compute curve.
3. Architecture search: comparing isoFLOP curves.

## 💻 Patterns

### Chinchilla optimal point

```python
def chinchilla_optimal(compute_flops):
    # Hoffmann et al.
    # (2022): exponents a = b = 0.5, so N_opt ≈ G * C^0.5 and D_opt ≈ C / (6 * N_opt).
    # G is derived here from C ≈ 6*N*D with D = 20*N, i.e. C = 120*N^2.
    # It is a rule-of-thumb constant, not a coefficient fitted in the paper.
    G = (1 / 120) ** 0.5  # ≈ 0.0913
    N_opt = G * (compute_flops ** 0.5)
    D_opt = compute_flops / (6 * N_opt)
    return {"params": N_opt, "tokens": D_opt, "ratio": D_opt / N_opt}

# 1e24 FLOPs budget (1 yottaFLOP)
print(chinchilla_optimal(1e24))
# {params: ~9.1e10, tokens: ~1.8e12, ratio: ~20}
```

### Loss prediction

```python
import numpy as np

def chinchilla_loss(N, D):
    # Parametric fit from Hoffmann et al. 2022: L(N, D) = E + A / N^alpha + B / D^beta
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / (N ** alpha) + B / (D ** beta)

# 7B model, 2T tokens
L = chinchilla_loss(7e9, 2e12)
print(f"predicted loss: {L:.3f}")
```

### IsoFLOP curve

```python
import numpy as np
import matplotlib.pyplot as plt

# Reuses chinchilla_loss from the previous block.
C = 1e22  # fixed compute budget in FLOPs
Ns = np.logspace(8, 11, 50)  # candidate parameter counts
Ds = C / (6 * Ns)  # tokens implied by C ≈ 6 * N * D
losses = [chinchilla_loss(n, d) for n, d in zip(Ns, Ds)]
plt.loglog(Ns, losses)
plt.xlabel("Params (N)")
plt.ylabel("Loss")
plt.title("IsoFLOP at C=1e22")
plt.show()
# The minimum of the curve marks the compute-optimal N.
```

### Over-train economics

```python
def total_cost(N, D, requests_per_year, years=3):
    train_flops = 6 * N * D  # forward + backward pass
    inference_flops_per_req = 2 * N * 1024  # forward pass, 1k generated tokens
    inference_total = inference_flops_per_req * requests_per_year * years
    # Assumed rough price of ~3e-19 $/FLOP on H100
    return (train_flops + inference_total) * 3e-19

# 70B Chinchilla-optimal vs 8B over-trained
print(total_cost(70e9, 1.4e12, 1e10))  # large model
print(total_cost(8e9, 15e12, 1e10))  # small over-trained model
# The over-trained 8B model is cheaper when the workload is inference-heavy.
```

### MoE scaling adjustment

```python
def moe_effective_params(total, active):
    # DeepSeek-V3: total=671B, active=37B
    # Sparse models follow a different scaling exponent; the geometric mean of
    # total and active parameters is used here as a rough heuristic.
    geometric_mean = (total * active) ** 0.5
    return geometric_mean

print(moe_effective_params(671e9, 37e9))  # ~1.6e11
```

### Test-time compute scaling (o1, Claude reasoning)

```python
import numpy as np

# A new scaling axis: extend thinking tokens at inference time.
def reasoning_quality(thinking_tokens):
    # OpenAI reported roughly log-linear improvement for o1;
    # the coefficients below are illustrative, not fitted values.
    return 0.4 + 0.05 * np.log10(thinking_tokens + 1)

print(reasoning_quality(100))     # ~0.50
print(reasoning_quality(100000))  # ~0.65
```

## Decision criteria

| Situation | Approach |
|---|---|
| Inference-heavy (chat product) | Over-train a small model (Llama 3.1 8B) |
| Research / benchmark | Chinchilla-optimal |
| Latency-critical | Distillation + over-train |
| Reasoning workloads | Test-time compute scaling (o1-style) |
| Multi-modal | Separate scaling curves per modality |

**Default**: inference cost dominates training cost, so over-train (50-200 tok/param).

## 🔗 Graph

- Adjacent: [[Test-Time Compute]] · [[LLM_Optimization_and_Deployment_Strategies|Distillation]] · [[Mixture of Experts]]

## 🤖 LLM usage

**When to use**: budget allocation, isoFLOP comparison, model sizing decisions.
**When not to use**: fine-tuning budgets (different dynamics), RL post-training (separate scaling laws).

## ❌ Anti-patterns

- **Train-only optimization**: ignoring inference cost leads to an over-large model.
- **Naive Kaplan**: outdated; Chinchilla supersedes it for dense models.
- **Extrapolation past data**: the power law can break at extreme scale.
- **Ignoring data quality**: token count alone is misleading (FineWeb-Edu vs CommonCrawl).

## 🧪 Verification / Duplicates

- Verified against Hoffmann et al. 2022 (Chinchilla) and Kaplan et al. 2020.
- Modern sources: Llama 3.1 paper (2024), DeepSeek-V3 technical report (2024).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added Chinchilla, over-train regime, test-time scaling |
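
### Appendix: over-train multiple (worked check)

The tokens-per-parameter figures quoted in this note follow from simple division, so they are easy to re-derive. The sketch below assumes the 20 tokens/param rule of thumb as the Chinchilla reference; `overtrain_multiple` and the `models` table are illustrative names, not part of any library, and the parameter and token counts are the rounded values used in this note.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # rule of thumb attributed to Hoffmann et al. 2022


def overtrain_multiple(params, tokens, optimal_ratio=CHINCHILLA_TOKENS_PER_PARAM):
    """Return (tokens per parameter, multiple of the reference ratio)."""
    ratio = tokens / params
    return ratio, ratio / optimal_ratio


# Rounded (params, tokens) pairs taken from this note.
models = {
    "Chinchilla 70B": (70e9, 1.4e12),
    "Llama 3.1 8B": (8e9, 15e12),
    "Llama 3.1 70B": (70e9, 15e12),
}

for name, (n, d) in models.items():
    ratio, multiple = overtrain_multiple(n, d)
    print(f"{name}: {ratio:.0f} tok/param, {multiple:.1f}x Chinchilla")
```

Under these assumptions, the 8B model sits near 94× the reference ratio and the 70B model near 11×, which shows how strongly the multiple depends on model size for a fixed token budget.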