"많이 뽑고 best 의 select". 매 N response 의 generate + RM 의 score → best 1 의 output. 매 inference-time compute 의 가장 simple form. 매 OpenAI o1 / DeepSeek R1 의 underlying principle 의 base case.
📖 Key points
Algorithm
1. Prompt → N responses (temperature > 0).
2. Score each response (reward model / verifier / majority vote).
3. Select the best.
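The three steps above can be sketched as a small function with a pluggable generator and scorer. This is a minimal sketch, not a reference implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer only stand in for a real model and a real reward model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidates for `prompt` and return (best_response, best_score)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: draw n candidates (the generator is expected to be stochastic).
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Step 2: score every candidate.
    scored = [(score(prompt, c), c) for c in candidates]
    # Step 3: keep the highest-scoring one.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-ins: a "model" that guesses an integer and a
# "reward model" that prefers guesses close to 42.
_rng = random.Random(0)  # fixed seed so the toy example is repeatable


def toy_generate(prompt: str) -> str:
    return str(_rng.randint(0, 100))


def toy_score(prompt: str, response: str) -> float:
    return -abs(int(response) - 42)
```

In practice `generate` would wrap a sampling call to the model and `score` would wrap the RM or verifier; the selection logic itself stays the same.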
Selection methods

| Method | Use case |
| --- | --- |
| Reward Model | General (RLHF reward) |
| Verifier | Math, code (correctness) |
| Majority Vote (Self-Consistency) | Final answer of a reasoning task |
| Process Reward Model (PRM) | Step-by-step scoring |
| LLM-as-judge | Subjective (creative) |
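Inference-time compute
- Instead of scaling model size, scale the number of samples N at inference.
- On some tasks, a small model with N=64 can outperform a single sample from a larger model.
- It is an alternative to RL.
- OpenAI o1 / o3 follow the same idea of spending more inference compute on chain-of-thought; how they sample internally is not public.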
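One way to see why scaling N can stand in for model size is an idealized coverage calculation. The sketch below assumes independent samples, a per-sample success probability `p`, and a perfect verifier; none of these assumptions come from the note itself, and with an imperfect RM the result is only an upper bound. The function names are hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n for which coverage(p, n) reaches `target`."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    n = 0
    while coverage(p, n) < target:
        n += 1
    return n
```

Under these assumptions, a model that solves a problem only 10% of the time per sample would have a correct answer somewhere among 64 samples more than 99% of the time; whether BoN actually finds it depends on the quality of the scorer.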
Self-Consistency (Wang et al. 2022)
- Generate N chain-of-thought responses.
- Take a majority vote over the final answers.
- Reported improvement of about 17 percentage points (%p) on GSM8K.
Economics

| N | Quality | Cost |
| --- | --- | --- |
| 1 | baseline | 1× |
| 4 | +5-10%p | 4× |
| 16 | +10-15%p | 16× |
| 64 | +15-20%p | 64× |
| 256 | diminishing | 256× |

→ These gains are rough; the sweet spot is task-dependent.
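Because cost grows linearly in N while quality flattens, one simple way to pick N is to measure accuracy at a few values of N on a dev set and stop where the gain per extra sample drops below a threshold. The sketch below is one possible heuristic, not a standard method; `pick_n` is a hypothetical name and the numbers in `EXAMPLE_ACCURACY` are made-up placeholders, not measurements.

```python
from typing import Dict


def pick_n(accuracy_by_n: Dict[int, float], min_gain_per_sample: float) -> int:
    """Return the largest N whose step up from the previous N still pays off.

    accuracy_by_n maps a sample count N to accuracy measured on a dev set.
    Walking N in increasing order, stop at the first step whose accuracy
    gain per additional sample falls below min_gain_per_sample.
    """
    ns = sorted(accuracy_by_n)
    if not ns:
        raise ValueError("accuracy_by_n must not be empty")
    chosen = ns[0]
    for prev, cur in zip(ns, ns[1:]):
        gain = (accuracy_by_n[cur] - accuracy_by_n[prev]) / (cur - prev)
        if gain < min_gain_per_sample:
            break
        chosen = cur
    return chosen


# Placeholder values for illustration only (NOT real measurements).
EXAMPLE_ACCURACY = {1: 0.50, 4: 0.58, 16: 0.63, 64: 0.67, 256: 0.68}
```

A stricter threshold selects a smaller N; the threshold itself encodes how much one extra sample of compute is worth for the task.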
Variants
Rejection sampling fine-tuning (RFT)
- Generate N responses → keep those that pass the verifier → use them for SFT.
- Used in LLaMA-3 and DeepSeek training.
Iterative refinement
- Sample N → pick the best → sample N again → ... → until convergence.
Tree-of-Thought (ToT)
- BoN combined with search.
- Backtracking is allowed.
Beam search
- N parallel candidates with step-wise pruning.
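The beam-search variant (parallel candidates with step-wise pruning) can be sketched as follows. This is a toy illustration: `expand` plays the role of the model proposing next steps and `score` plays the role of a step-level scorer such as a PRM; both, along with `beam_search` and the toy functions, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]  # a partial solution: the steps taken so far


def beam_search(
    expand: Callable[[Path], Sequence[str]],
    score: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the `beam_width` best partial paths after every step."""
    beams: List[Path] = [()]
    for _ in range(max_steps):
        # Extend every surviving path by every proposed next step.
        candidates = [path + (step,) for path in beams for step in expand(path)]
        if not candidates:
            break
        # Step-wise prune: keep only the top-scoring partial paths.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=score)


# Toy problem: choose three digits from {1, 2, 3} whose sum is as close to 7
# as possible.
def toy_expand(path: Path) -> Sequence[str]:
    return ["1", "2", "3"]


def toy_score(path: Path) -> float:
    return -abs(sum(int(s) for s in path) - 7)
```

Plain BoN corresponds to scoring only complete paths; pruning at every step spends the same sample budget on the more promising prefixes.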
Weaknesses
- Reward hacking: selection exploits spurious features of the RM.
- Diversity collapse: if the temperature is not high enough, the N samples come out nearly identical.
- Cost: N× compute.
- Latency: a poor fit for user-facing paths.
→ Tune N with cost in mind.
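Diversity collapse is easy to check before paying for a large N. The sketch below computes the fraction of distinct samples after a light normalization; `distinct_ratio` is a hypothetical helper, and exact-match distinctness is only a crude proxy for semantic diversity.

```python
from typing import Sequence


def distinct_ratio(samples: Sequence[str]) -> float:
    """Fraction of samples that are unique after case/whitespace normalization."""
    if not samples:
        raise ValueError("samples must be non-empty")
    normalized = {" ".join(s.lower().split()) for s in samples}
    return len(normalized) / len(samples)
```

A ratio near 1/N means the samples are effectively one response repeated N times, so raising N adds cost without adding candidates to choose from.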
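💻 Patterns
Self-consistency (vote)

    import collections
    import re

    from vllm import LLM, SamplingParams

    # Any model that vLLM can load works here.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B")
    # n=8 samples per prompt; temperature > 0 so the samples differ.
    sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

    prompt = (
        "What is 1234 * 5678? Show your reasoning step by step. "
        "End with 'Answer: <number>'."
    )
    outputs = llm.generate([prompt], sampling)

    # Extract the final answer from each sampled response.
    answers = []
    for o in outputs[0].outputs:
        match = re.search(r"Answer:\s*(\d+)", o.text)
        if match:
            answers.append(int(match.group(1)))

    # Majority vote; None if no sample produced a parsable answer.
    final = collections.Counter(answers).most_common(1)[0][0] if answers else None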
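Rejection sampling fine-tuning (RFT)

    def generate_rft_dataset(prompts, verifier, n=8):
        """Build an SFT dataset from samples that pass the verifier."""
        dataset = []
        for prompt in prompts:
            # generate_n is an assumed helper (not defined here) that returns
            # n candidates, each with .text and .score attributes.
            candidates = generate_n(prompt, n=n)
            passing = [c for c in candidates if verifier(prompt, c)]
            if passing:
                # Among verified candidates, keep the highest-scoring one.
                best = max(passing, key=lambda c: c.score)
                dataset.append({"prompt": prompt, "response": best.text})
        return dataset

    # Then run SFT on the returned dataset.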
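LLM-as-judge

    import re

    def llm_judge(prompt, candidates):
        """Ask the model to pick the best candidate; None if the reply is unusable."""
        # Number candidates from 0 so the reply maps directly to a list index.
        numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        judge_prompt = f"""Given the prompt:
    {prompt}
    Rate each response 1-10. Pick the best.
    {numbered}
    Reply with: BEST=<index>"""
        # llm and SamplingParams as in the self-consistency pattern above.
        # llm.generate returns a list of request outputs, so unwrap the text.
        params = SamplingParams(temperature=0, max_tokens=16)
        judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
        match = re.search(r"BEST=(\d+)", judgment)
        if not match or int(match.group(1)) >= len(candidates):
            return None
        return candidates[int(match.group(1))]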
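For code tasks, the verifier is usually "run the tests". A minimal sketch of that pattern follows. `passes_tests`, `first_passing`, `check_add`, and `CANDIDATES` are hypothetical, and the sketch runs candidates in-process with `exec`, which is unsafe for untrusted model output and does not guard against infinite loops; a real setup would use a sandboxed subprocess with a timeout.

```python
from typing import Callable, List, Optional


def passes_tests(source: str, test: Callable[[dict], None]) -> bool:
    """Execute candidate source in a fresh namespace, then run `test` on it."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: unsandboxed; illustration only
        test(namespace)
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as fail.
        return False
    return True


def first_passing(candidates: List[str], test: Callable[[dict], None]) -> Optional[str]:
    """Return the first candidate that passes the test, or None."""
    for source in candidates:
        if passes_tests(source, test):
            return source
    return None


# Toy task: implement add(a, b).
def check_add(ns: dict) -> None:
    assert ns["add"](2, 3) == 5
    assert ns["add"](-1, 1) == 0


CANDIDATES = [
    "def add(a, b): return a - b",   # wrong result
    "def add(a, b) return a + b",    # syntax error
    "def add(a, b): return a + b",   # correct
]
```

With a binary verifier, every passing candidate is equally good, so returning the first one is enough; an RM score can break ties when a ranking among passing candidates is needed.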
🤔 Decision criteria

| Situation | Method |
| --- | --- |
| Math / verifiable | Self-consistency (vote) |
| Code | Verifier (run tests) |
| General quality | RM-based BoN |
| Subjective | LLM-as-judge |
| Self-improvement | RFT |
| Deep reasoning | Tree-of-Thought / o1-style |
Default: self-consistency with N = 8-16 as the baseline. If an RM is available, use RM-based BoN.
When to use: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.
When not to use: strict latency budgets; neither an RM nor a verifier is available; streaming responses.
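❌ Anti-patterns
- N=1 with temperature=0: this is greedy decoding, not BoN.
- Same temperature for every sample: can limit diversity.
- Ignoring reward hacking: selection ends up exploiting the RM.
- N → ∞: cost keeps rising while quality plateaus.
- No verifier and no RM: BoN has no signal to rank candidates.
- BoN on a latency-critical path: the wrong tool.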
🧪 Verification / duplicates
Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).