2nd/10_Wiki/Topics/AI_and_ML/Scaling-Laws-for-LLMs.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
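Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed links directly at redirect targets, and mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections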

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

id: wiki-2026-0508-scaling-laws-for-llms
title: Scaling Laws for LLMs
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Chinchilla Law, Kaplan Scaling, Compute-Optimal Scaling
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: llm, scaling-laws, training, compute, chinchilla
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: python, framework: pytorch

Scaling Laws for LLMs

In one line

"Loss follows a power law in compute, params, and data, so it can be extrapolated predictably." The field moved from Kaplan 2020 to Chinchilla 2022 to the modern over-train regime (Llama 3.1 70B trained on 15T tokens, about 214 tokens/param, roughly 11× the Chinchilla-optimal ratio). When inference cost is the dominant cost, a small model trained on a very large dataset wins.

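Key points

Kaplan 2020 (Original)

  • L(N, D, C) follows a power law: loss decreases predictably with params N, data D, and compute C.
  • Compute-optimal: N_opt ∝ C^0.73, D_opt ∝ C^0.27, so most of any extra compute goes to parameters (roughly 5× more params but under 2× more tokens per 10× compute).
  • Chinchilla later contradicted this allocation.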
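
To make the difference between the two allocations concrete, the sketch below splits a 10× compute increase between params and tokens. It assumes the exponents commonly attributed to Kaplan et al. (0.73 for params, 0.27 for tokens) and the roughly equal 0.5/0.5 split reported for Chinchilla; the helper name `growth_factors` is illustrative and does not come from either paper.

```python
def growth_factors(compute_multiplier, n_exponent, d_exponent):
    """Return (param growth, token growth) when compute grows by compute_multiplier."""
    return compute_multiplier ** n_exponent, compute_multiplier ** d_exponent

# Assumed exponents: Kaplan-style split vs Chinchilla-style split.
kaplan_n, kaplan_d = growth_factors(10, 0.73, 0.27)
chinchilla_n, chinchilla_d = growth_factors(10, 0.5, 0.5)

print(f"Kaplan:     params x{kaplan_n:.2f}, tokens x{kaplan_d:.2f}")
print(f"Chinchilla: params x{chinchilla_n:.2f}, tokens x{chinchilla_d:.2f}")
```

Under the Kaplan-style exponents most of the extra compute is spent on a larger model, while the Chinchilla-style split grows model and data at the same rate.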

Chinchilla 2022 (DeepMind)

  • 70B params + 1.4T tokens outperforms Gopher 280B (300B tokens).
  • Optimal: D/N ≈ 20 tokens/param (compute-optimal frontier).
  • N_opt ∝ C^0.5, D_opt ∝ C^0.5 (1:1 scaling).
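
The Chinchilla vs Gopher comparison can be checked with simple arithmetic. The sketch below assumes the common approximation of about 6 FLOPs per parameter per training token (C ≈ 6·N·D); the helper name `train_flops` is illustrative.

```python
def train_flops(params, tokens):
    # Approximation: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

# (params, tokens) as quoted in the bullets above.
models = {
    "Chinchilla": (70e9, 1.4e12),
    "Gopher": (280e9, 300e9),
}

for name, (n, d) in models.items():
    print(f"{name}: {d / n:.1f} tokens/param, {train_flops(n, d):.2e} training FLOPs")
```

Both models land at a similar training budget under this approximation, but Chinchilla spends it at about 20 tokens/param while Gopher sits near 1 token/param.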

Modern Over-train (2024-2026)

  • Llama 3.1 8B: 15T tokens (1875 tokens/param, about 94× Chinchilla).
  • Inference-cost dominance: train once, serve billions of requests, so a small model is cheap to serve.
  • DeepSeek-V3 671B MoE: 14.8T tokens (sparse activation, 37B active params per token).
  • Claude Opus 4.7 / GPT-5: details are undisclosed, but the trend is clear.
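
The over-training multiples can be reproduced with a small helper. This is a sketch that takes 20 tokens/param as the Chinchilla reference point and uses the token counts quoted in this note; `overtrain_factor` is an illustrative name, and counting a MoE by active or by total params is a modelling choice, not an established convention.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def overtrain_factor(params, tokens):
    """How many times more tokens per param than the Chinchilla reference."""
    return (tokens / params) / CHINCHILLA_TOKENS_PER_PARAM

runs = {
    "Llama 3.1 8B": (8e9, 15e12),
    "Llama 3.1 70B": (70e9, 15e12),
    "DeepSeek-V3 (37B active)": (37e9, 14.8e12),
    "DeepSeek-V3 (671B total)": (671e9, 14.8e12),
}

for name, (n, d) in runs.items():
    print(f"{name}: {d / n:.0f} tokens/param, {overtrain_factor(n, d):.1f}x Chinchilla")
```

The same token budget gives very different multiples depending on model size, which is why the 8B and 70B figures should not be quoted interchangeably.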

Applications

  1. Pre-training budget allocation: param vs data vs context length tradeoff.
  2. Distillation target sizing: teacher → student compute curve.
  3. Architecture search: compare isoFLOP curves.

💻 Patterns

Chinchilla optimal point

def chinchilla_optimal(compute_flops):
    # Hoffmann et al. 2022: N_opt ∝ C^0.5, D_opt ∝ C^0.5
    # With C ≈ 6*N*D and D ≈ 20*N, C ≈ 120*N^2, so N_opt ≈ (C/120)^0.5
    G = (1 / 120) ** 0.5  # ≈ 0.091, implied by the 20 tokens/param rule
    N_opt = G * (compute_flops ** 0.5)
    D_opt = compute_flops / (6 * N_opt)
    return {"params": N_opt, "tokens": D_opt, "ratio": D_opt / N_opt}

# 1e24 FLOPs budget (1 yotta-flop)
print(chinchilla_optimal(1e24))
# {params: ~9.1e10, tokens: ~1.8e12, ratio: ~20}

Loss prediction

import numpy as np

def chinchilla_loss(N, D):
    # L(N,D) = E + A/N^alpha + B/D^beta
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / (N ** alpha) + B / (D ** beta)

# 7B model, 2T tokens
L = chinchilla_loss(7e9, 2e12)
print(f"predicted loss: {L:.3f}")
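
It can help to see which term of the fit dominates for a given run. The sketch below reuses the constants quoted above and splits the prediction into its irreducible, parameter-limited, and data-limited parts; the function name `chinchilla_loss_terms` is illustrative.

```python
def chinchilla_loss_terms(n_params, n_tokens):
    # Same constants as the parametric fit L(N, D) = E + A/N^alpha + B/D^beta.
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return {
        "irreducible": E,
        "param_term": A / n_params ** alpha,
        "data_term": B / n_tokens ** beta,
    }

# 7B model, 2T tokens
terms = chinchilla_loss_terms(7e9, 2e12)
total = sum(terms.values())
for name, value in terms.items():
    print(f"{name}: {value:.3f} ({value / total:.0%} of predicted loss)")
```

Whichever of `param_term` and `data_term` is larger indicates where, under this fit, additional budget would reduce loss more.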

IsoFLOP curve

import matplotlib.pyplot as plt

C = 1e22  # fixed compute
Ns = np.logspace(8, 11, 50)
Ds = C / (6 * Ns)
losses = [chinchilla_loss(n, d) for n, d in zip(Ns, Ds)]

plt.loglog(Ns, losses)
plt.xlabel("Params (N)")
plt.ylabel("Loss")
plt.title("IsoFLOP at C=1e22")
# The minimum of the curve marks the compute-optimal N.
plt.show()
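
The minimum of an isoFLOP curve can also be located numerically and compared with the closed form obtained by minimizing the parametric loss under the constraint C = 6·N·D. This is a sketch under those assumptions, using the fit constants from the loss-prediction pattern; the function names are illustrative.

```python
import numpy as np

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n, d):
    return E + A / n ** ALPHA + B / d ** BETA

def grid_optimum(compute, num=2000):
    # Scan model sizes at fixed compute; tokens follow from C = 6*N*D.
    ns = np.logspace(8, 11, num)
    ds = compute / (6 * ns)
    return ns[np.argmin(loss(ns, ds))]

def closed_form_optimum(compute):
    # Setting dL/dN = 0 with D = C/(6N) gives
    # N_opt = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta)).
    g = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    return g * (compute / 6) ** (BETA / (ALPHA + BETA))

C = 1e22
n_grid = grid_optimum(C)
n_exact = closed_form_optimum(C)
print(f"grid: {n_grid:.3e} params, closed form: {n_exact:.3e} params")
print(f"tokens/param at optimum: {C / (6 * n_exact) / n_exact:.1f}")
```

With these particular constants the exponent on C is β/(α+β) ≈ 0.45 rather than exactly 0.5, so the implied optimum need not coincide with the 20 tokens/param rule of thumb.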

Over-train economics

def total_cost(N, D, requests_per_year, years=3):
    # Training: ~6 FLOPs per param per token
    train_flops = 6 * N * D
    # Inference: ~2 FLOPs per param per generated token, 1k tokens per request
    inference_flops_per_req = 2 * N * 1024
    inference_total = inference_flops_per_req * requests_per_year * years
    # Rough assumption: ~3e-19 $/FLOP on H100 (order of magnitude only)
    return (train_flops + inference_total) * 3e-19

# 70B Chinchilla-optimal vs 8B over-trained, 1e10 requests/year
print(total_cost(70e9, 1.4e12, 1e10))     # large model
print(total_cost(8e9, 15e12, 1e10))       # small over-train
# The over-trained 8B is cheaper when the workload is inference-heavy.
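
The comparison above depends on request volume, so it is useful to know where the two options break even. The sketch below solves for the yearly request count at which total FLOPs are equal, under the same assumptions (6·N·D training, 2·N FLOPs per generated token, 1024 tokens per request, 3 years); the $/FLOP factor cancels out. The function name is illustrative.

```python
def break_even_requests_per_year(n_small, d_small, n_large, d_large,
                                 tokens_per_request=1024, years=3):
    # Extra training FLOPs paid by the over-trained small model.
    extra_train = 6 * (n_small * d_small - n_large * d_large)
    # Inference FLOPs saved per request by serving the smaller model.
    saved_per_request = 2 * tokens_per_request * (n_large - n_small)
    return extra_train / (saved_per_request * years)

r = break_even_requests_per_year(8e9, 15e12, 70e9, 1.4e12)
print(f"break-even: {r:.2e} requests/year")
```

Above this volume the over-trained 8B is cheaper in total; below it, the extra training tokens are not paid back within the chosen horizon.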

MoE scaling adjustment

def moe_effective_params(total, active):
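    # DeepSeek-V3: total=671B, active=37B
    # Sparse models follow different scaling exponents than dense ones.
    # The geometric mean of total and active params is a rough heuristic
    # for a dense-equivalent size, not a fitted law.
    geometric_mean = (total * active) ** 0.5
    return geometric_mean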

print(moe_effective_params(671e9, 37e9))  # ~1.6e11
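
For training compute, a common simplification is to count only the parameters active per token. The sketch below applies the 6·N·D approximation with N set to active params and compares it with a hypothetical dense model of the same total size; it ignores router and attention overhead, so treat the numbers as order-of-magnitude estimates.

```python
def train_flops(params_per_token, tokens):
    # ~6 FLOPs per parameter touched per training token.
    return 6 * params_per_token * tokens

TOTAL, ACTIVE, TOKENS = 671e9, 37e9, 14.8e12  # DeepSeek-V3 figures quoted in this note

sparse = train_flops(ACTIVE, TOKENS)
dense_equivalent = train_flops(TOTAL, TOKENS)

print(f"MoE (active params): {sparse:.2e} FLOPs")
print(f"Dense at total size: {dense_equivalent:.2e} FLOPs")
print(f"Compute saving:      {dense_equivalent / sparse:.1f}x")
```

The saving equals the ratio of total to active params, which is why sparse models can carry far more parameters at a given training budget.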

Test-time compute scaling (o1, Claude reasoning)

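# A new scaling axis: spend more thinking tokens at inference time
def reasoning_quality(thinking_tokens):
    # OpenAI reported roughly log-linear gains with test-time compute for o1.
    # The constants 0.4 and 0.05 are illustrative, not fitted values.
    return 0.4 + 0.05 * np.log10(thinking_tokens + 1)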

print(reasoning_quality(100))    # 0.5
print(reasoning_quality(100000)) # 0.65

Decision criteria

| Situation | Approach |
|---|---|
| Inference-heavy (chat product) | Over-train a small model (Llama 3.1 8B) |
| Research / benchmark | Chinchilla-optimal |
| Latency-critical | Distillation + over-train |
| Reasoning workloads | Test-time compute scaling (o1-style) |
| Multi-modal | Separate scaling curves per modality |

Default: because inference cost dominates training cost, over-train (50-200 tokens/param).

🔗 Graph

🤖 LLM usage

When to use: budget allocation, isoFLOP comparison, model sizing decisions. When not to use: fine-tuning budgets (different dynamics), RL post-training (separate scaling laws).

Anti-patterns

  • Train-only optimization: ignoring inference cost leads to an oversized model.
  • Naive Kaplan: outdated; Chinchilla supersedes it for dense models.
  • Extrapolation past data: power laws can break down at extreme scale.
  • Ignoring data quality: token count alone is misleading (FineWeb-Edu vs CommonCrawl).

🧪 Verification / duplicates

  • Verified (Hoffmann et al. 2022 Chinchilla, Kaplan et al. 2020).
  • Modern: Llama 3.1 paper 2024, DeepSeek-V3 tech report 2024.
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added Chinchilla, over-train regime, test-time scaling |