---
id: wiki-2026-0508-best-of-n-sampling
title: Best-of-N Sampling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Best-of-N, BoN, rejection sampling, inference-time compute, majority voting, self-consistency]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [llm, inference, reasoning, reward-model, rejection-sampling, test-time-compute, o1, self-consistency]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Transformers / vLLM / TRL
---

# Best-of-N Sampling

## 📌 One-Line Insight

> **"Sample many, select the best."** Generate N responses, score each with a reward model (RM), and output the single best one. This is the simplest form of inference-time compute, and the base case of the principle behind OpenAI o1 / DeepSeek R1.

## 📖 Core Concepts

### Algorithm

1. For a prompt, sample N responses (temperature > 0).
2. Score each response (reward model / verifier / majority vote).
3. Select the best one.

### Selection methods

| Method | Use case |
|---|---|
| Reward Model | General quality (RLHF reward) |
| Verifier | Math, code (correctness) |
| Majority Vote (Self-Consistency) | Final answer of a reasoning task |
| Process Reward Model (PRM) | Step-by-step scoring |
| LLM-as-judge | Subjective tasks (creative) |

### Inference-time compute

- Instead of scaling up model size, scale up N at inference time.
- A small model with N=64 can outperform a single sample from a large model.
- An alternative to RL.
- OpenAI o1 / o3: internal sampling over chain-of-thought (internal details are not published).

### Self-Consistency (Wang et al. 2022)

- Generate N chain-of-thought responses.
- Take a majority vote over the final answers.
- About +17%p improvement on GSM8K.

### Economics

| N | Quality (vs. N=1) | Cost (compute) |
|---|---|---|
| 1 | baseline | 1× |
| 4 | +5-10%p | 4× |
| 16 | +10-15%p | 16× |
| 64 | +15-20%p | 64× |
| 256 | diminishing | 256× |

→ The sweet spot is task-dependent.

### Variants

#### Rejection sampling fine-tune (RFT)

- Sample N responses → keep those that pass the verifier → use them for SFT.
- Used in the LLaMA-3 / DeepSeek training pipelines.
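The verifier-filtering step that RFT relies on can be sketched in isolation. This is a minimal sketch, assuming each response ends with an `Answer:` line containing an integer and that a reference answer is known. The helper names `extract_answer`, `make_exact_match_verifier`, and `filter_passing` are hypothetical, and the candidate strings are hand-written stand-ins rather than real model output.

```python
import re
from typing import Callable, Optional


def extract_answer(response: str) -> Optional[int]:
    """Return the integer after 'Answer:' in a response, or None if absent."""
    match = re.search(r"Answer:\s*(-?\d[\d,]*)", response)
    if match is None:
        return None
    # Drop thousands separators such as "1,234" before converting.
    return int(match.group(1).replace(",", ""))


def make_exact_match_verifier(reference: int) -> Callable[[str], bool]:
    """Build a verifier that passes a response only if its answer equals `reference`."""
    def verifier(response: str) -> bool:
        return extract_answer(response) == reference
    return verifier


def filter_passing(candidates: list[str], verifier: Callable[[str], bool]) -> list[str]:
    """Keep only the candidates the verifier accepts (the RFT selection step)."""
    return [c for c in candidates if verifier(c)]


# Hand-written stand-ins for N sampled responses to "What is 12 * 12?" (assumption).
candidates = [
    "12 * 12 = 144. Answer: 144",
    "12 * 12 = 124. Answer: 124",
    "Twelve squared is 144. Answer: 144",
    "I am not sure.",
]
passing = filter_passing(candidates, make_exact_match_verifier(144))
print(len(passing), "of", len(candidates), "candidates pass")
```

The surviving responses would then become SFT targets. Tasks without a known reference answer need a different kind of verifier, for example running unit tests on generated code.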
#### Iterative refinement

- Sample N → pick the best → sample N again from it → ... → repeat until convergence.

#### Tree-of-Thought (ToT)

- BoN combined with search.
- Backtracking is allowed.

#### Beam search

- N parallel candidates with step-wise pruning.

### Weaknesses

1. **Reward hacking**: the selected response exploits spurious features of the RM.
2. **Diversity collapse**: if the temperature is not high enough, the N samples come out nearly identical.
3. **Cost**: N× compute.
4. **Latency**: a poor fit for user-facing paths.

→ Tune N with cost in mind.

## 💻 Patterns

### Self-consistency (vote)

```python
import collections
import re

from vllm import LLM, SamplingParams

# Any vLLM-compatible causal LM works; this model ID is only an example.
llm = LLM(model='meta-llama/Meta-Llama-3-8B')
sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

prompt = ("What is 1234 * 5678? Show your reasoning step by step. "
          "End with 'Answer: ' followed by the final number.")
outputs = llm.generate([prompt], sampling)

# Extract the final answer from each of the N sampled completions.
answers = []
for o in outputs[0].outputs:
    match = re.search(r'Answer:\s*(\d[\d,]*)', o.text)
    if match:
        answers.append(int(match.group(1).replace(',', '')))

# Majority vote; None if no completion produced a parseable answer.
final = collections.Counter(answers).most_common(1)[0][0] if answers else None
```

### Best-of-N with Reward Model

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 'reward-model' is a placeholder; substitute a real sequence-classification RM checkpoint.
rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')
rm_model.eval()

def score(prompt, response):
    # Plain concatenation; many RMs expect their own chat template instead.
    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
    with torch.no_grad():
        # Assumes a single-logit reward head (num_labels=1).
        return rm_model(**inputs).logits[0, 0].item()

def best_of_n(prompt, n=16, T=0.8):
    # Reuses `llm` and `SamplingParams` from the self-consistency example.
    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
    outputs = llm.generate([prompt], sampling)[0].outputs
    scored = [(o.text, score(prompt, o.text)) for o in outputs]
    return max(scored, key=lambda x: x[1])[0]
```

### Rejection sampling for fine-tune

```python
def generate_rft_dataset(prompts, verifier, n=8):
    # `generate_n` is an assumed helper (not defined here) that returns
    # candidate objects exposing `.text` and `.score`.
    dataset = []
    for prompt in prompts:
        candidates = generate_n(prompt, n=n)
        # Keep only the candidates the verifier accepts.
        passing = [c for c in candidates if verifier(prompt, c)]
        if passing:
            best = max(passing, key=lambda c: c.score)
            dataset.append({'prompt': prompt, 'response': best.text})
    return dataset

# Then run SFT on the resulting dataset.
```

→ This forms a self-improvement loop.

### Tree-of-Thought (simplified)

```python
def tot_search(prompt, depth=3, breadth=4):
    # Simplified: a beam-style expansion without backtracking.
    # `generate_n` and `evaluate` are assumed helpers (not defined here).
    state = [prompt]
    for _ in range(depth):
        candidates = []
        for s in state:
            children = generate_n(s, n=breadth)
            for c in children:
                score = evaluate(c)
                candidates.append((s + '\n' + c.text, score))
        # Keep the `breadth` highest-scoring partial solutions.
        candidates.sort(key=lambda x: -x[1])
        state = [c[0] for c in candidates[:breadth]]
    return state[0]
```

### LLM-as-judge selection

```python
def llm_judge(prompt, candidates):
    # `format_candidates` is an assumed helper that lists candidates numbered from 0.
    judge_prompt = f"""Given the prompt: {prompt}
Rate each response 1-10. Pick the best.
{format_candidates(candidates)}
Reply with BEST= followed by the index of the best response."""
    # Greedy decoding keeps the judge deterministic.
    params = SamplingParams(temperature=0.0, max_tokens=256)
    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
    match = re.search(r'BEST=(\d+)', judgment)
    idx = int(match.group(1)) if match else 0  # fall back to the first candidate
    return candidates[idx] if idx < len(candidates) else candidates[0]
```

## 🤔 Decision Criteria

| Situation | Method |
|---|---|
| Math / verifiable | Self-consistency (vote) |
| Code | Verifier (run tests) |
| General quality | RM-based BoN |
| Subjective | LLM-as-judge |
| Self-improve | RFT |
| Deep reasoning | Tree-of-Thought / o1-style |

**Default**: Self-consistency with N = 8-16 as the baseline. If an RM is available, use RM-based BoN.

## 🔗 Graph

- Parent: [[Test-Time-Compute]]
- Variants: [[Self-Consistency]] · [[Rejection-Sampling]]
- Applications: [[DeepSeek-R1]]
- Adjacent: [[Reward-Model]] · [[RLHF]] · [[Chain-of-Thought]] · [[LLM-as-Judge]]

## 🤖 LLM Usage

**When to use**: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.

**When not to use**: strict latency budgets; neither an RM nor a verifier is available; streaming responses.

## ❌ Anti-Patterns

- **N=1 + temperature=0**: this is not BoN at all.
- **Same temperature for every sample**: limits diversity.
- **Ignoring reward hacking**: the RM gets exploited.
- **N → ∞**: cost keeps climbing while quality plateaus.
- **No verifier and no RM**: BoN has nothing to rank candidates with.
- **BoN on latency-critical paths**: the wrong tool.

## 🧪 Verification / Duplicates

- Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).
- Trust level: A.
- Related: [[Self-Consistency]] · [[Tree-of-Thought]] · [[RLHF]] · [[Chain-of-Thought]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods + economics + RFT + ToT + vLLM code |