id: wiki-2026-0508-test-time-compute-scaling-추론-시간-
title: Test Time Compute Scaling (추론 시간 계산 스케일링)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Test-Time Compute, Inference-Time Scaling, Reasoning Models
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: llm, reasoning, scaling, test-time-compute
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language = Python; framework = vLLM / Anthropic SDK / OpenAI SDK

Test-Time Compute Scaling (Inference-Time Compute Scaling)

In one line

"Think longer, get smarter." Test-time compute scaling spends more compute at inference time (longer chain-of-thought, sampling, search) in exchange for higher output quality. The paradigm runs from OpenAI o1 (2024-09) through o3 and DeepSeek-R1 (2025-01) to Claude 4.x extended thinking (2025+). It complements training-time scaling laws.

Key points

Two axes

  • More thinking (long CoT): a longer reasoning trace within a single sample. o1, R1, Claude extended thinking.
  • Search / sampling: multiple samples plus a verifier (best-of-N, MCTS, beam search). AlphaCode, ReST, Math-Shepherd.

Modern developments (2025-2026)

  • RL on reasoning: RLHF combined with RL on verifiable rewards (math, code) makes long CoT emerge. R1-Zero, R1.
  • Extended thinking budgets: Claude's thinking parameter with budget_tokens, and OpenAI's reasoning effort setting (low / medium / high).
  • Scaling law: accuracy rises roughly linearly with the log of test-time compute (Snell et al. 2024, OpenAI o-series chart).
  • Cost shift: training is paid once (1x) while inference cost scales with usage (Nx), which reshapes the economics.
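The log-linear relation and the cost shift above can be made concrete with a small sketch. The coefficients `a` and `b`, the token counts, and both function names are illustrative assumptions; none of the numbers come from Snell et al. or from any vendor chart.

```python
import math


def log_linear_accuracy(compute: float, a: float, b: float) -> float:
    """Model accuracy as a + b * log10(compute), clipped to [0, 1]."""
    return max(0.0, min(1.0, a + b * math.log10(compute)))


def inference_cost_multiplier(n_samples: int, thinking_tokens: int,
                              answer_tokens: int, baseline_tokens: int) -> float:
    """Output tokens spent, relative to one short baseline answer."""
    return n_samples * (thinking_tokens + answer_tokens) / baseline_tokens


# Hypothetical coefficients: every 10x of compute adds 0.1 accuracy.
gain = log_linear_accuracy(100, a=0.3, b=0.1) - log_linear_accuracy(10, a=0.3, b=0.1)

# Hypothetical workload: 8 samples, 2000 thinking + 500 answer tokens each,
# compared with a single 500-token answer.
mult = inference_cost_multiplier(8, 2000, 500, 500)
```

Under these assumed numbers, a constant accuracy gain per decade of compute is paid for with a 40x increase in output tokens, which is why per-query routing matters.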

Applications

  1. Math (AIME, IMO).
  2. Code (SWE-bench, competition).
  3. Agentic planning (deep tool-use chains).
  4. Scientific reasoning (GPQA).

💻 Patterns

Claude extended thinking

from anthropic import Anthropic
client = Anthropic()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=20000,  # must be larger than budget_tokens: thinking counts toward max_tokens
    thinking={"type": "enabled", "budget_tokens": 16000},
    messages=[{"role": "user", "content": "Solve: ..."}],
)
for block in resp.content:
    if block.type == "thinking":
        print("THINK:", block.thinking[:200])
    elif block.type == "text":
        print("ANS:", block.text)

OpenAI reasoning effort

from openai import OpenAI
client = OpenAI()
resp = client.responses.create(
    model="o3",
    input="Prove the AM-GM inequality.",
    reasoning={"effort": "high"},   # low / medium / high
)
print(resp.output_text)

Best-of-N + verifier

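def best_of_n(prompt, n=8, verifier=None):
    # verifier maps a sample string to a score, e.g. unit-test pass count
    if verifier is None:
        raise ValueError("best_of_n needs a verifier to rank samples")
    samples = [client.messages.create(
        model="claude-opus-4-7",
        max_tokens=2000,
        temperature=0.8,  # sampling diversity across the n candidates
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text for _ in range(n)]
    return max(samples, key=verifier)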
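As one possible verifier for the best-of-N pattern above, here is a minimal sketch that scores a code sample by how many unit tests it passes. The helper name `make_unit_test_verifier`, the test-case format, and the candidate strings are assumptions made for illustration; `exec` on model output is unsafe outside a sandbox.

```python
from typing import Callable, Sequence, Tuple


def make_unit_test_verifier(tests: Sequence[Tuple[tuple, object]],
                            func_name: str) -> Callable[[str], int]:
    """Build verifier(code) -> number of test cases the code passes."""
    def verifier(code: str) -> int:
        namespace: dict = {}
        try:
            # WARNING: run untrusted model output only inside a sandbox.
            exec(code, namespace)
            func = namespace[func_name]
        except Exception:
            return 0  # code that does not load scores zero
        passed = 0
        for args, expected in tests:
            try:
                if func(*args) == expected:
                    passed += 1
            except Exception:
                pass  # a crashing test case counts as a failure
        return passed
    return verifier


# Stand-ins for sampled completions (hypothetical).
candidates = [
    "def add(a, b): return a - b",
    "def add(a, b): return a + b",
    "def add(a, b) return a + b",  # syntax error
]
verifier = make_unit_test_verifier([((1, 2), 3), ((0, 0), 0)], "add")
best = max(candidates, key=verifier)
```

The returned callable has the shape `best_of_n` expects for its `verifier` argument, so ranking stays independent of how the samples were produced.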

Self-consistency (majority vote)

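from collections import Counter

# samples: completions for the same prompt (e.g. the list built inside best_of_n)
# extract_answer: task-specific parser that returns the final answer as a string
answers = [extract_answer(s) for s in samples]
final = Counter(answers).most_common(1)[0][0]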
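The snippet above leaves `extract_answer` undefined. A minimal sketch follows, assuming each sample states its result as "Answer: <number>" (or "answer = <number>"); that output format, the regex, and the `majority_vote` wrapper are assumptions for illustration and would need adapting to the real prompt format.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed convention: the sample contains "answer: 42" or "answer = 42".
_ANSWER_RE = re.compile(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(sample: str) -> Optional[str]:
    """Return the last numeric answer stated in the sample, or None."""
    matches = _ANSWER_RE.findall(sample)
    return matches[-1] if matches else None


def majority_vote(samples: List[str]) -> Optional[str]:
    """Self-consistency: most common extracted answer, ignoring unparsable samples."""
    answers = [a for a in map(extract_answer, samples) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


demo = ["... so Answer: 42", "I get answer = 41", "Thus the answer: 42", "no final line"]
final = majority_vote(demo)
```

Dropping unparsable samples before voting keeps a malformed completion from becoming a spurious "None" majority.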

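Tree search (beam-style sketch, not full MCTS)

from dataclasses import dataclass

@dataclass
class Node:
    partial: str        # reasoning trace so far
    score: float = 0.0  # verifier score for this partial trace

def expand(node, k=4):
    # llm.continue_from and verifier are placeholders for a sampling call and a scorer
    children = [llm.continue_from(node.partial, temp=0.9) for _ in range(k)]
    return [Node(c, score=verifier(c)) for c in children]

def search(root, depth=4, beam=2):
    frontier = [root]
    for _ in range(depth):
        candidates = [child for n in frontier for child in expand(n)]
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.score)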
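To see the search loop above run end to end without a model, here is a toy version in which the LLM continuation and the verifier are replaced by deterministic stand-ins. `toy_propose`, `toy_score`, and the target string are hypothetical; a real setup would plug in a sampling call and a process reward model or test harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    partial: str
    score: float = 0.0


def beam_search(root: Node, propose: Callable[[str], List[str]],
                score: Callable[[str], float], depth: int, beam: int) -> Node:
    """Keep the `beam` best partial solutions at each of `depth` expansion steps."""
    frontier = [root]
    for _ in range(depth):
        candidates = [Node(p, score(p)) for n in frontier for p in propose(n.partial)]
        if not candidates:
            break
        frontier = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.score)


TARGET = "2+2=4"  # stand-in for "a correct solution"


def toy_propose(partial: str) -> List[str]:
    # Stand-in for sampling k continuations from a model.
    return [partial + ch for ch in "24+="]


def toy_score(partial: str) -> float:
    # Stand-in verifier: number of positions that agree with the target.
    return sum(1 for a, b in zip(partial, TARGET) if a == b)


result = beam_search(Node(""), toy_propose, toy_score, depth=len(TARGET), beam=2)
```

With a beam of 2 the search discards most branches at every step, so compute grows linearly with depth rather than exponentially.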

Budget controller

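def adaptive_thinking(prompt, easy_budget=2000, hard_budget=32000):
    # Classify difficulty first with a small model (call arguments are elided in the source)
    diff = client.messages.create(model="claude-haiku-4", ...).content[0].text
    budget = hard_budget if "hard" in diff.lower() else easy_budget
    return client.messages.create(
        model="claude-opus-4-7",
        # max_tokens is required and must exceed budget_tokens; the 4000-token
        # headroom is illustrative, and large values may need a streaming call
        max_tokens=budget + 4000,
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": prompt}],
    )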
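The routing logic in the budget controller can be separated from the API call so it is testable offline. The sketch below assumes a pluggable `classify` callable; `keyword_classifier` is a deliberately crude hypothetical stand-in for the small-model classifier, and the 4000-token headroom is an arbitrary illustrative value.

```python
from typing import Callable


def pick_budget(label: str, easy_budget: int = 2000, hard_budget: int = 32000) -> int:
    """Map a classifier label to a thinking budget; anything not 'hard' is treated as easy."""
    return hard_budget if "hard" in label.strip().lower() else easy_budget


def build_request(prompt: str, classify: Callable[[str], str],
                  headroom: int = 4000) -> dict:
    """Assemble request keyword arguments with max_tokens above the thinking budget."""
    budget = pick_budget(classify(prompt))
    return {
        "max_tokens": budget + headroom,
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }


def keyword_classifier(prompt: str) -> str:
    # Hypothetical stand-in for an LLM difficulty classifier.
    hard_words = ("prove", "optimize", "debug")
    return "hard" if any(w in prompt.lower() for w in hard_words) else "easy"


easy_req = build_request("What is the capital of France?", keyword_classifier)
hard_req = build_request("Prove the AM-GM inequality.", keyword_classifier)
```

The resulting dictionary could be passed to the messages call alongside a model name; keeping the classifier injectable makes it easy to swap the keyword heuristic for a model-based one.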

Decision criteria

| Situation | Approach |
| --- | --- |
| Math / code with verifier | RL-trained reasoning model (o3, R1) + search |
| Open-ended reasoning | Extended thinking (Claude 4.x) |
| Latency-critical | Skip test-time scaling; use a small, fast model |
| Cost-critical batch | Self-consistency with 4-8 samples |
| Search exploitable | Best-of-N + verifier |
| Fuzzy quality | Reasoning model > base model |

Default: split traffic with a difficulty router, sending hard tasks to a reasoning model (o3 / Claude extended thinking) and easy tasks to a base model.

🔗 Graph

🤖 LLM usage

When to use: hard reasoning tasks, verifiable output (math/code), agent planning, or any case where quality matters more than latency. When not to use: simple lookup or chat, where thinking tokens are wasted cost.

Anti-patterns

  • Always max thinking budget: a 32k thinking budget on an easy task burns cost; use a router.
  • No verifier in best-of-N: picking among samples at random only adds noise; a verifier (unit tests, math check) is essential.
  • Stream thinking to user: thinking content is internal; show only the final text in the user-facing UI.
  • Caching invalidation: changing the thinking budget causes cache misses; keep the budget stable.
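For the "stream thinking to user" anti-pattern, a minimal sketch of filtering a response down to user-visible text. It assumes content blocks expose a `type` attribute and, for text blocks, a `text` attribute, as in the extended thinking example earlier; `visible_text` is a hypothetical helper and the blocks here are faked with `SimpleNamespace`.

```python
from types import SimpleNamespace
from typing import Iterable


def visible_text(blocks: Iterable) -> str:
    """Join only `text` blocks; drop `thinking` and any other block types."""
    return "".join(b.text for b in blocks if getattr(b, "type", None) == "text")


# Fake response content standing in for resp.content.
fake_content = [
    SimpleNamespace(type="thinking", thinking="scratch work"),
    SimpleNamespace(type="text", text="The answer is 4."),
]
shown = visible_text(fake_content)
```

Filtering on an allow-list (`type == "text"`) rather than excluding `thinking` means any block type added later stays hidden by default.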

🧪 Verification / duplicates

  • Verified against: OpenAI o1/o3 system cards, DeepSeek-R1 paper (2025-01), Anthropic extended thinking docs, Snell et al. 2024.
  • Trust level: A.

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: o-series / R1 / Claude extended thinking patterns |