"매 loss 가 power-law in compute, params, data — predictable extrapolation". Kaplan 2020 → Chinchilla 2022 → modern over-train regime (Llama 3.1 70B trained on 15T tokens, 200× Chinchilla-optimal). 매 inference cost 가 dominant cost 일 때 small model + huge data 가 win.
Key points
Kaplan 2020 (Original)
L(N, D, C) follows a power law: loss decreases predictably with params N, data D, and compute C.
Compute-optimal allocation: params dominate. Kaplan's fits put most additional compute into model size (roughly N ∝ C^0.73 and D ∝ C^0.27).
Chinchilla later contradicted this allocation.
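A power law is a straight line in log-log space, which is what makes extrapolation tractable. The sketch below fits y = a · x^(-b) by ordinary least squares on the logs. The data points are synthetic and noise-free (hypothetical values, not measurements), and the function name `fit_power_law` is an assumption for illustration. Real loss curves also carry an irreducible-loss term that this simple form ignores.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

# Synthetic, noise-free points (hypothetical, for illustration only)
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
predicted = a * 1e23 ** -b  # extrapolate two orders of magnitude past the data
print(f"a={a:.3f}, b={b:.3f}, predicted loss at 1e23 FLOPs={predicted:.3f}")
```

With noisy real measurements the fitted exponent carries uncertainty, and that uncertainty grows the further the prediction sits from the fitted range.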
Chinchilla 2022 (DeepMind)
70B params + 1.4T tokens outperforms Gopher 280B (300B tokens) at a similar compute budget.
Optimal: D/N ≈ 20 tokens/param (compute-optimal frontier).
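
    import numpy as np
    import matplotlib.pyplot as plt

    def chinchilla_loss(N, D):
        # Parametric fit reported by Hoffmann et al. 2022:
        # L = E + A / N**alpha + B / D**beta
        return 1.69 + 406.4 / N**0.34 + 410.7 / D**0.28

    C = 1e22  # fixed compute budget (FLOPs)
    Ns = np.logspace(8, 11, 50)  # candidate parameter counts
    Ds = C / (6 * Ns)  # tokens affordable at each N, from C ≈ 6·N·D
    losses = [chinchilla_loss(n, d) for n, d in zip(Ns, Ds)]

    plt.loglog(Ns, losses)
    plt.xlabel("Params (N)")
    plt.ylabel("Loss")
    plt.title("IsoFLOP at C=1e22")
    plt.show()
    # The minimum of the curve marks the optimal N for this budget.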
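The 20 tokens/param rule combined with C ≈ 6·N·D gives a closed form for the compute-optimal split: N = sqrt(C / 120) and D = 20·N. A minimal sketch, assuming both approximations hold exactly (the function name `compute_optimal` is hypothetical):

```python
import math

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above

def compute_optimal(C, ratio=TOKENS_PER_PARAM):
    """Split a FLOP budget C into (N, D), assuming C = 6*N*D and D = ratio*N."""
    N = math.sqrt(C / (6 * ratio))
    D = ratio * N
    return N, D

# Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens
N, D = compute_optimal(6 * 70e9 * 1.4e12)
print(f"N = {N:.3g} params, D = {D:.3g} tokens")
```

Feeding in Chinchilla's own budget recovers about 70B params and 1.4T tokens, which is a quick sanity check on the rule.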
Over-train economics
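
    def total_cost(N, D, requests_per_year, years=3):
        """Rough lifetime cost in USD: training plus inference."""
        train_flops = 6 * N * D  # forward + backward pass over D tokens
        inference_flops_per_req = 2 * N * 1024  # forward pass, 1k generated tokens
        inference_total = inference_flops_per_req * requests_per_year * years
        usd_per_flop = 3e-19  # assumed rough $/FLOP on H100
        return (train_flops + inference_total) * usd_per_flop

    # 70B Chinchilla-optimal vs 8B over-trained, at 1e10 requests/year
    print(total_cost(70e9, 1.4e12, 1e10))  # large model, ~1.47e6
    print(total_cost(8e9, 15e12, 1e10))  # small over-trained model, ~3.6e5
    # The over-trained 8B model is cheaper when the workload is inference-heavy.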
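The same cost model implies a break-even request volume: below it, the smaller model's extra training FLOPs are not yet repaid by cheaper inference. A sketch under the same assumptions (6·N·D training FLOPs, 2·N FLOPs per generated token, 1024 tokens per request, 3-year horizon). The function name `break_even_requests` is hypothetical, and the $/FLOP constant cancels out of the comparison.

```python
def break_even_requests(N_big, D_big, N_small, D_small,
                        tokens_per_req=1024, years=3):
    """Requests/year at which both models have equal lifetime FLOPs."""
    # Extra training FLOPs spent on the over-trained small model
    extra_train = 6 * (N_small * D_small - N_big * D_big)
    # Inference FLOPs saved on every request by serving the small model
    saved_per_req = 2 * (N_big - N_small) * tokens_per_req
    return extra_train / (saved_per_req * years)

r = break_even_requests(70e9, 1.4e12, 8e9, 15e12)
print(f"break-even at about {r:.2e} requests/year")
```

This compares cost only. It says nothing about whether the two models reach the same quality, which has to be checked separately.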
MoE scaling adjustment

    def moe_effective_params(total, active):
        # DeepSeek-V3: total=671B, active=37B
        # Sparse models follow a different scaling exponent than dense ones;
        # the geometric mean of total and active params is a rough heuristic.
        geometric_mean = (total * active) ** 0.5
        return geometric_mean

    print(moe_effective_params(671e9, 37e9))  # ~1.6e11
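For cost planning, the two parameter counts of a sparse model play different roles: per-token compute tracks the active parameters, while weight memory tracks the total. A rough sketch, where `moe_footprint` is a hypothetical helper and `bytes_per_param=1` is an assumption (8-bit weights):

```python
def moe_footprint(total, active, bytes_per_param=1):
    """Per-token inference FLOPs and weight memory for a sparse model.

    bytes_per_param=1 is an assumption (8-bit weights); adjust for other formats.
    """
    flops_per_token = 2 * active            # only active parameters participate
    weight_bytes = total * bytes_per_param  # all experts must stay resident
    return flops_per_token, weight_bytes

flops, mem = moe_footprint(671e9, 37e9)  # DeepSeek-V3 sizes quoted above
dense_flops = 2 * 671e9                  # a dense model of the same total size
print(f"FLOPs/token: {flops:.2e} (dense of same size: {dense_flops:.2e})")
print(f"weights: {mem / 1e9:.0f} GB")
```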
Test-time compute scaling (o1, Claude reasoning)

    import numpy as np

    # New axis: extend thinking tokens at inference time
    def reasoning_quality(thinking_tokens):
        # OpenAI o1 reported roughly log-linear improvement;
        # the coefficients here are illustrative only.
        return 0.4 + 0.05 * np.log10(thinking_tokens + 1)

    print(reasoning_quality(100))  # ~0.5
    print(reasoning_quality(100000))  # ~0.65
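Inverting the toy log-linear model shows what the relationship costs in practice: every fixed step in quality needs ten times more thinking tokens. The coefficients are the same illustrative ones used above, not fitted values, and `tokens_for_quality` is a hypothetical helper.

```python
def tokens_for_quality(target, base=0.4, slope=0.05):
    """Invert quality = base + slope * log10(tokens + 1) for the token count."""
    return 10 ** ((target - base) / slope) - 1

for q in (0.5, 0.55, 0.6):
    needed = tokens_for_quality(q)
    print(f"quality {q:.2f} needs about {needed:,.0f} thinking tokens")
```

Since inference FLOPs grow linearly with generated tokens, a log-linear quality curve means each extra increment of quality is ten times more expensive than the last.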
Decision criteria
| Situation | Approach |
| --- | --- |
| Inference-heavy (chat product) | Over-train a small model (Llama 3.1 8B) |
| Research / benchmark | Chinchilla-optimal |
| Latency-critical | Distillation + over-train |
| Reasoning workloads | Test-time compute scaling (o1-style) |
| Multi-modal | Separate scaling curves per modality |
Default: for deployed models, inference cost typically dominates training cost, so over-train (50-200 tok/param).
When to use: budget allocation, IsoFLOP comparison, model sizing decisions.
When not to use: fine-tuning budgets (different dynamics), RL post-training (separate scaling laws).
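To place a training run relative to the default range above, it helps to compute tokens per parameter and the multiple of the Chinchilla ratio directly. A small sketch using the model sizes and token counts quoted in this note (the helper name `overtrain_factor` is hypothetical):

```python
CHINCHILLA_RATIO = 20  # tokens per parameter

def overtrain_factor(params, tokens, optimal_ratio=CHINCHILLA_RATIO):
    """Return (tokens per parameter, multiple of the Chinchilla-optimal ratio)."""
    tok_per_param = tokens / params
    return tok_per_param, tok_per_param / optimal_ratio

# Sizes and token counts as quoted in this note
runs = [("Chinchilla 70B", 70e9, 1.4e12),
        ("Llama 3.1 70B", 70e9, 15e12),
        ("Llama 3.1 8B", 8e9, 15e12)]
for name, n, d in runs:
    tpp, factor = overtrain_factor(n, d)
    print(f"{name}: {tpp:.0f} tok/param, {factor:.1f}x Chinchilla")
```

On this scale, the 50-200 tok/param default corresponds to 2.5× to 10× the Chinchilla ratio.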
❌ Anti-patterns
Train-only optimization: ignoring inference cost leads to an oversized model.
Naive Kaplan: outdated; Chinchilla supersedes it for dense models.
Extrapolation past data: power-law fits can break down far beyond the fitted range.
Ignoring data quality: token count alone is misleading (FineWeb-Edu vs CommonCrawl).
🧪 Verification / Duplicates
Verified (Hoffmann et al. 2022 Chinchilla, Kaplan et al. 2020).
Modern: Llama 3.1 paper 2024, DeepSeek-V3 tech report 2024.
Confidence: A.
🕓 Changelog
| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added Chinchilla, over-train regime, test-time scaling |