"Think longer, get smarter." Test-time compute scaling trades extra compute at inference time (longer chain-of-thought, sampling, search) for higher output quality. It is the paradigm behind OpenAI o1 (2024-09) → o3 / DeepSeek-R1 (2025-01) → Claude 4.x extended thinking (2025+), and it complements training-time scaling laws.
Core idea
Two axes
More thinking (long CoT): a longer reasoning trace within a single sample. Examples: o1, R1, Claude extended thinking.

    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="o3",
        input="Prove the AM-GM inequality.",
        reasoning={"effort": "high"},  # low / medium / high
    )
    print(resp.output_text)

More samples (best-of-N + verifier): draw N independent samples and keep the one a verifier scores highest.
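
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def best_of_n(prompt, n=8, verifier=None):
        # Without a verifier, max() would fall back to string ordering.
        if verifier is None:
            raise ValueError("best_of_n needs a verifier to rank samples")
        # Sample n candidates at a non-zero temperature so they differ.
        samples = [
            client.messages.create(
                model="claude-opus-4-7",
                max_tokens=2000,
                temperature=0.8,
                messages=[{"role": "user", "content": prompt}],
            ).content[0].text
            for _ in range(n)
        ]
        # verifier maps a sample to a score: unit test pass count, etc.
        return max(samples, key=verifier)
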
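To see why best-of-N helps without calling any API, here is a minimal offline sketch. The sampler `make_noisy_adder` and the `exact_match_verifier` are hypothetical stand-ins (not part of any library): the "model" answers an addition correctly only 30% of the time, and the verifier plays the role that unit tests would play for code. With a reliable verifier and independent samples, the chance that at least one of N samples is correct is 1 - (1 - p)^N.

```python
import random
from typing import Callable, List


def best_of_n(sample: Callable[[], str], verifier: Callable[[str], float], n: int = 8) -> str:
    """Draw n candidates and return the one with the highest verifier score."""
    candidates: List[str] = [sample() for _ in range(n)]
    return max(candidates, key=verifier)


def make_noisy_adder(a: int, b: int, p_correct: float, rng: random.Random) -> Callable[[], str]:
    """Hypothetical stand-in for a model: returns a + b with probability p_correct."""
    def sample() -> str:
        if rng.random() < p_correct:
            return str(a + b)
        return str(a + b + rng.choice([-2, -1, 1, 2]))  # an off-by-a-bit wrong answer
    return sample


def exact_match_verifier(expected: int) -> Callable[[str], float]:
    """Toy verifier: 1.0 if the candidate matches the known answer, else 0.0."""
    return lambda candidate: 1.0 if candidate.strip() == str(expected) else 0.0


rng = random.Random(0)
sampler = make_noisy_adder(17, 25, p_correct=0.3, rng=rng)
verifier = exact_match_verifier(42)

trials = 2000
single = sum(verifier(sampler()) for _ in range(trials)) / trials
best8 = sum(verifier(best_of_n(sampler, verifier, n=8)) for _ in range(trials)) / trials
print(f"single-sample accuracy: {single:.2f}")
print(f"best-of-8 accuracy:     {best8:.2f}")
```

The gain depends entirely on the verifier: if it cannot tell right from wrong, extra samples only add cost.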
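Adaptive budget: classify the difficulty first, then set the thinking budget to match.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def adaptive_thinking(prompt, easy_budget=2000, hard_budget=32000):
        # Run a cheap difficulty classifier first (its remaining arguments are elided here).
        diff = client.messages.create(model="claude-haiku-4", ...).content[0].text
        budget = hard_budget if "hard" in diff.lower() else easy_budget
        return client.messages.create(
            model="claude-opus-4-7",
            # max_tokens is required and must exceed budget_tokens;
            # the 4000 tokens of headroom for the answer is an assumed value.
            max_tokens=budget + 4000,
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
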
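Since `adaptive_thinking` returns the full message rather than a string, the caller still has to separate the reasoning trace from the final answer. The sketch below assumes the response content is a list of blocks whose `type` is `"thinking"` (with a `thinking` attribute) or `"text"` (with a `text` attribute); `split_thinking` is a hypothetical helper name, and the blocks are faked with `SimpleNamespace` so the example runs offline.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Tuple


def split_thinking(content_blocks: Iterable[Any]) -> Tuple[str, str]:
    """Return (reasoning, answer) from a list of response content blocks."""
    reasoning, answer = [], []
    for block in content_blocks:
        kind = getattr(block, "type", None)
        if kind == "thinking":
            reasoning.append(block.thinking)
        elif kind == "text":
            answer.append(block.text)
        # Any other block type (e.g. redacted reasoning) is skipped.
    return "".join(reasoning), "".join(answer)


# Fake blocks standing in for `message.content`; an assumption for illustration only.
fake_content = [
    SimpleNamespace(type="thinking", thinking="2x = 8, so x = 4."),
    SimpleNamespace(type="text", text="x = 4"),
]
reasoning, answer = split_thinking(fake_content)
print(answer)
```

Reading `content[0].text` directly, as the non-thinking examples do, is not safe here because the first block may be a thinking block.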
Decision criteria
| Situation | Approach |
| --- | --- |
| Math / code with verifier | RL-trained reasoning model (o3, R1) + search |
| Open-ended reasoning | Extended thinking (Claude 4.x) |
| Latency-critical | Skip test-time scaling; use a small, fast model |
| Cost-critical batch | Self-consistency, 4-8 samples |
| Search exploitable | Best-of-N + verifier |
| Fuzzy quality | Prefer a reasoning model over a base model |
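The self-consistency entry needs no verifier: sample several answers and take the majority vote over the final answers. A minimal sketch, where `self_consistency` is an illustrative name and the canned answer list stands in for real model samples:

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency(sample_answer: Callable[[], str], n: int = 5) -> Tuple[str, float]:
    """Sample n final answers and return (majority answer, share of votes)."""
    answers = [sample_answer().strip() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Canned answers standing in for five sampled reasoning paths (hypothetical data).
canned = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistency(lambda: next(canned), n=5)
print(answer, agreement)  # → 42 0.6
```

The vote share doubles as a rough confidence signal: low agreement suggests the task may deserve a larger budget or a verifier.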
Default: use a reasoning model (o3 / Claude extended thinking) for hard tasks and a base model for easy tasks, with a difficulty router splitting traffic between them.
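A difficulty router can be as small as a classifier plus two routes. The sketch below is one possible shape, not a prescribed design: `Route`, `route`, and `keyword_classifier` are hypothetical names, the model labels are placeholders, and the keyword heuristic merely stands in for the small-model classifier a real system would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str            # placeholder label, not a real model ID
    thinking_budget: int  # 0 means no extended reasoning


HARD = Route(model="reasoning-model", thinking_budget=32000)
EASY = Route(model="base-model", thinking_budget=0)


def keyword_classifier(prompt: str) -> str:
    """Hypothetical heuristic: flags prompts that look like multi-step reasoning."""
    hard_markers = ("prove", "derive", "debug", "optimize")
    return "hard" if any(m in prompt.lower() for m in hard_markers) else "easy"


def route(prompt: str, classify: Callable[[str], str] = keyword_classifier) -> Route:
    """Send hard prompts to the reasoning route and everything else to the base route."""
    return HARD if classify(prompt) == "hard" else EASY


print(route("Prove that sqrt(2) is irrational."))
print(route("What is the capital of France?"))
```

Keeping the classifier behind a plain callable makes it easy to swap the heuristic for a model call later without touching the routing logic.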