---
id: wiki-2026-0508-best-of-n-sampling
title: Best-of-N Sampling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Best-of-N, BoN, rejection sampling, inference-time compute, majority voting, self-consistency]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [llm, inference, reasoning, reward-model, rejection-sampling, test-time-compute, o1, self-consistency]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: Transformers / vLLM / TRL
---

# Best-of-N Sampling

## 📌 One-line insight
> **"Sample many, keep the best."** Generate N responses, score them with a reward model (RM), and output the top one. This is the simplest form of inference-time compute, and the base case of the test-time scaling idea associated with OpenAI o1 and DeepSeek R1.

## 📖 Core

### Algorithm
1. Sample N responses for the prompt (temperature > 0).
2. Score each response (reward model, verifier, or majority vote).
3. Select the best-scoring response.
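
The three steps can be sketched as one small function. This is a minimal illustration, not a reference implementation: `generate` and `score` are hypothetical callables standing in for a sampler and a scorer, and the toy versions below exist only so the snippet runs on its own.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n responses and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be >= 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]  # step 1
    return max(candidates, key=lambda r: score(prompt, r))        # steps 2 and 3


# Toy stand-ins (hypothetical): a "model" that guesses a digit and a
# "reward model" that prefers guesses close to 7.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return str(rng.randint(0, 9))


def toy_score(prompt: str, response: str) -> float:
    return -abs(int(response) - 7)


best = best_of_n("guess a digit", toy_generate, toy_score, n=8)
```

Keeping the sampler and scorer as parameters means the same skeleton covers RM scoring, verifiers, and LLM judges; only `score` changes.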

### Selection methods
| Method | Use case |
|---|---|
| Reward Model | General-purpose quality (RLHF reward) |
| Verifier | Math, code (correctness) |
| Majority Vote (Self-Consistency) | Final answer of a reasoning task |
| Process Reward Model (PRM) | Step-by-step scoring |
| LLM-as-judge | Subjective tasks (creative) |
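
For the verifier row, a minimal sketch of test-based selection for code candidates follows. The helper names (`passes_tests`, `first_passing`) and the toy `add` task are assumptions made for illustration, and `exec` is used only to keep the example short; untrusted model output should run in a sandbox.

```python
from typing import Callable, List, Optional


def passes_tests(source: str, tests: Callable[[dict], None]) -> bool:
    """Execute candidate source, then the tests; any exception is a failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # WARNING: sandbox untrusted code in real use
        tests(namespace)
        return True
    except Exception:
        return False


def first_passing(candidates: List[str],
                  tests: Callable[[dict], None]) -> Optional[str]:
    """Return the first candidate that passes, or None if none do."""
    for source in candidates:
        if passes_tests(source, tests):
            return source
    return None


def add_tests(ns: dict) -> None:
    assert ns["add"](2, 3) == 5
    assert ns["add"](-1, 1) == 0


candidates = [
    "def add(a, b): return a - b",  # wrong result
    "def add(a, b) return a + b",   # syntax error
    "def add(a, b): return a + b",  # correct
]
chosen = first_passing(candidates, add_tests)
```

Unlike a reward model, a verifier of this kind gives a binary signal, so any passing candidate is acceptable and sampling can stop early.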

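### Inference-time compute
- Instead of scaling model size, scale N at inference.
- A small model with N=64 can outperform a single sample from a larger model on some tasks, given a reliable scorer.
- An alternative to RL: no weight updates are needed.
- OpenAI o1 / o3 are described as scaling test-time compute through long chain-of-thought; whether they sample internally is not publicly documented.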
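
A back-of-the-envelope model shows why more samples can substitute for a stronger model. Assume (as a simplification) that each sample is independently correct with probability `p` and that a perfect verifier picks a correct sample whenever one exists; the chance of success is then `1 - (1 - p)^N`. The function name `coverage` is introduced here for illustration.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return 1.0 - (1.0 - p) ** n


# A weak sampler (p = 0.1) at increasing N, compared with one draw at p = 0.5.
for n in (1, 4, 16, 64):
    print(n, round(coverage(0.1, n), 3))
```

Under these assumptions a sampler with p = 0.1 passes 0.5 well before N = 16. Real scorers are imperfect and samples are correlated, so measured gains are smaller than this upper bound.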

### Self-Consistency (Wang et al. 2022)
- Sample N chain-of-thought responses.
- Take a majority vote over the final answers.
- Reported gain on GSM8K: about 17 percentage points (+17.9% in the paper) over greedy chain-of-thought decoding.

### Economics
| N | Quality gain (rough, illustrative) | Relative cost |
|---|---|---|
| 1 | baseline | 1× |
| 4 | +5-10%p | 4× |
| 16 | +10-15%p | 16× |
| 64 | +15-20%p | 64× |
| 256 | diminishing returns | 256× |

→ The sweet spot is task-dependent.
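
One way to locate the sweet spot is to measure accuracy at a few values of N on a held-out set and stop where the gain per extra sample drops below a threshold. The sketch below assumes that setup; `pick_n`, the threshold, and the accuracy numbers are made up for illustration and are not measurements.

```python
from typing import Dict


def pick_n(accuracy_by_n: Dict[int, float], min_gain_per_sample: float) -> int:
    """Largest N reached before the gain per extra sample falls below the threshold."""
    ns = sorted(accuracy_by_n)
    chosen = ns[0]
    for prev, cur in zip(ns, ns[1:]):
        gain = (accuracy_by_n[cur] - accuracy_by_n[prev]) / (cur - prev)
        if gain < min_gain_per_sample:
            break
        chosen = cur
    return chosen


# Hypothetical held-out accuracies; replace with your own measurements.
measured = {1: 0.50, 4: 0.58, 16: 0.63, 64: 0.67, 256: 0.68}
n = pick_n(measured, min_gain_per_sample=0.002)
```

The threshold encodes how much one extra sample is worth for the task, which is why the chosen N differs between a cheap batch job and an expensive interactive one.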

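### Variants

#### Rejection sampling fine-tune (RFT)
- Sample N responses, keep those that pass the verifier, then run SFT on them.
- Used in the Llama 3 and DeepSeek post-training pipelines.

#### Iterative refinement
- Sample N, keep the best, sample N again conditioned on it, and repeat until the score stops improving.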
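
The refinement loop can be sketched as follows. `propose` and `score` are hypothetical callables (a sampler that revises the current best, and a scorer); the toy versions only demonstrate the stopping rule.

```python
from typing import Callable, List


def iterative_refine(prompt: str,
                     propose: Callable[[str, str], List[str]],
                     score: Callable[[str], float],
                     max_rounds: int = 5) -> str:
    """Repeat best-of-N, feeding the current best back in, until no improvement."""
    best, best_score = "", float("-inf")
    for _ in range(max_rounds):
        candidates = propose(prompt, best)
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score <= best_score:  # converged: this round did not improve
            break
        best, best_score = top, top_score
    return best


# Toy stand-ins: each round appends one character; the score saturates at 3.
def toy_propose(prompt: str, current: str) -> List[str]:
    return [current + "a", current + "b"]


def toy_score(text: str) -> float:
    return min(text.count("a"), 3)


result = iterative_refine("start", toy_propose, toy_score, max_rounds=10)
```

The total cost is roughly N times the number of rounds, so `max_rounds` acts as a hard budget even when the score keeps creeping upward.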

#### Tree-of-Thought (ToT)
- BoN combined with search over intermediate thoughts.
- Backtracking is allowed.

#### Beam search
- Keep N partial candidates in parallel and prune step by step.

### Weaknesses
1. **Reward hacking**: the selected sample may exploit spurious features of the RM.
2. **Diversity collapse**: if the temperature is too low, the N samples are nearly identical.
3. **Cost**: N× compute.
4. **Latency**: poorly suited to user-facing paths.

→ Tune N with cost in mind.
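
Diversity collapse is cheap to detect before paying for scoring. A rough check, assuming exact match after whitespace and case normalization is an acceptable proxy for "same sample" (`distinct_ratio` is a name introduced here):

```python
from typing import List


def distinct_ratio(samples: List[str]) -> float:
    """Fraction of samples that are unique after light normalization."""
    if not samples:
        return 0.0
    normalized = [" ".join(s.lower().split()) for s in samples]
    return len(set(normalized)) / len(normalized)


ratio = distinct_ratio(["Answer: 42", "answer:   42", "Answer: 41", "Answer: 42"])
```

A ratio near `1/N` suggests the extra samples add cost without adding choice; raising the temperature or lowering N are the obvious responses.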

## 💻 Patterns

### Self-consistency (vote)
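```python
import collections
import re

from vllm import LLM, SamplingParams

# Hugging Face id of the Llama 3 8B base checkpoint; swap in any model you can access.
llm = LLM(model='meta-llama/Meta-Llama-3-8B')
sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

prompt = "What is 1234 * 5678? Show your reasoning step by step. End with 'Answer: <number>'."
outputs = llm.generate([prompt], sampling)

# Extract the final answer from each of the n completions.
answers = []
for o in outputs[0].outputs:
    match = re.search(r'Answer:\s*(-?\d[\d,]*)', o.text)
    if match:
        # Allow thousands separators such as 7,006,652.
        answers.append(int(match.group(1).replace(',', '')))

# Majority vote; None if no completion produced a parseable answer.
final = collections.Counter(answers).most_common(1)[0][0] if answers else None
```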

### Best-of-N with Reward Model
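```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from vllm import SamplingParams

# 'reward-model' is a placeholder for a sequence-classification reward model checkpoint.
rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')

def score(prompt, response):
    # Plain concatenation is a simplification; many reward models expect their own chat template.
    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
    with torch.no_grad():  # scoring only, no gradients needed
        return rm_model(**inputs).logits[0, 0].item()

def best_of_n(prompt, n=16, T=0.8):
    # `llm` is the vLLM engine created in the self-consistency example.
    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
    outputs = llm.generate([prompt], sampling)[0].outputs
    scored = [(o.text, score(prompt, o.text)) for o in outputs]
    return max(scored, key=lambda x: x[1])[0]
```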

### Rejection sampling for fine-tune
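```python
def generate_rft_dataset(prompts, verifier, n=8):
    """Build an SFT dataset from verifier-approved samples.

    Assumes `generate_n(prompt, n)` (not defined here) returns candidates
    exposing `.text` and `.score`, and `verifier(prompt, candidate)` returns a bool.
    """
    dataset = []
    for prompt in prompts:
        candidates = generate_n(prompt, n=n)
        # Keep only candidates the verifier accepts.
        passing = [c for c in candidates if verifier(prompt, c)]
        if passing:
            # Among the passing candidates, keep the highest-scoring one.
            best = max(passing, key=lambda c: c.score)
            dataset.append({'prompt': prompt, 'response': best.text})
    return dataset

# Then run SFT on the resulting dataset.
```

→ Repeating generate, filter, and fine-tune forms a self-improvement loop.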

### Tree-of-Thought (simplified)
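```python
def tot_search(prompt, depth=3, breadth=4):
    """Beam-style simplification of Tree-of-Thought (no explicit backtracking).

    Assumes `generate_n(state, n)` returns candidates with `.text` and
    `evaluate(candidate)` returns a numeric score; neither is defined here.
    """
    state = [prompt]
    for _ in range(depth):
        candidates = []
        for s in state:
            # Expand every kept state into `breadth` child thoughts.
            children = generate_n(s, n=breadth)
            for c in children:
                value = evaluate(c)
                candidates.append((s + '\n' + c.text, value))
        # Keep the `breadth` highest-scoring partial solutions.
        candidates.sort(key=lambda x: -x[1])
        state = [c[0] for c in candidates[:breadth]]
    return state[0]
```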

### LLM-as-judge selection
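```python
import re
from vllm import SamplingParams

def format_candidates(candidates):
    # Number candidates from 0 so the judge's index maps directly onto the list.
    return "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))

def llm_judge(prompt, candidates):
    judge_prompt = f"""Given the prompt:
{prompt}

Rate each response 1-10. Pick the best.

{format_candidates(candidates)}

Reply with: BEST=<index>"""
    # vLLM returns RequestOutput objects; take the text of the first completion.
    params = SamplingParams(temperature=0.0, max_tokens=16)
    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
    match = re.search(r'BEST=(\d+)', judgment)
    idx = int(match.group(1)) if match else 0
    # Fall back to the first candidate if the reply is missing or out of range.
    return candidates[idx] if idx < len(candidates) else candidates[0]
```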

## 🤔 Decision criteria
| Situation | Method |
|---|---|
| Math / verifiable | Self-consistency (vote) |
| Code | Verifier (run tests) |
| General quality | RM-based BoN |
| Subjective | LLM-as-judge |
| Self-improve | RFT |
| Deep reasoning | Tree-of-Thought / o1-style |

**Default**: start with self-consistency (N = 8 to 16) as the baseline. If a reward model is available, use RM-based BoN.

## 🔗 Graph
- Parent: [[Test-Time-Compute]]
- Variants: [[Self-Consistency]] · [[Rejection-Sampling]]
- Applications: [[DeepSeek-R1]]
- Adjacent: [[Reward-Model]] · [[RLHF]] · [[Chain-of-Thought]] · [[LLM-as-Judge]]

## 🤖 LLM usage
**When**: verifiable tasks (math, code); quality matters more than latency; a reward model is available; self-improvement loops.
**When not**: strict latency budgets; neither a reward model nor a verifier is available; streaming responses.

## ❌ Anti-patterns
- **N=1 with temperature=0**: this is not BoN.
- **Identical sampling settings for every sample**: can limit diversity.
- **Ignoring reward hacking**: the selected sample may simply exploit the RM.
- **N → ∞**: cost keeps growing while quality plateaus.
- **No verifier and no RM**: BoN has no selection signal.
- **BoN on a latency-critical path**: wrong tool.

## 🧪 Verification / duplicates
- Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).
- Trust level: A.
- Related: [[Self-Consistency]] · [[Tree-of-Thought]] · [[RLHF]] · [[Chain-of-Thought]].

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods, economics, RFT, ToT, vLLM code |