2nd/10_Wiki/Topics/AI_and_ML/Best-of-N_Sampling.md
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
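Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 clusters of cross-folder duplicates (63 files → redirects)
- Link-name normalization: fixed broken links, pointed redirects directly at their targets, and mapped concepts (about 2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections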

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00


id: wiki-2026-0508-best-of-n-sampling
title: Best-of-N Sampling
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Best-of-N, BoN, rejection sampling, inference-time compute, majority voting, self-consistency
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: llm, inference, reasoning, reward-model, rejection-sampling, test-time-compute, o1, self-consistency
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: Transformers / vLLM / TRL

Best-of-N Sampling

📌 One-line insight

"Sample many, pick the best." Generate N responses, score each one with a reward model (RM), and output the single best. This is the simplest form of inference-time compute, and the base case of the principle underlying OpenAI o1 and DeepSeek R1.

📖 Core concepts

Algorithm

  1. Sample N responses for the prompt (temperature > 0).
  2. Score each response (reward model, verifier, or majority vote).
  3. Select the best one.
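
The three steps above can be sketched without committing to any particular model or scorer. This is a minimal illustration, not a reference implementation: `generate` and `score` are hypothetical callables supplied by the caller, and the toy stand-ins at the bottom exist only so the sketch runs on its own.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidates for the prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: sample N responses.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    # Steps 2 and 3: score every response and keep the best.
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): a "model" that guesses a number and a
# "scorer" that prefers guesses close to 42.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return str(rng.randint(0, 100))


def toy_score(prompt: str, response: str) -> float:
    return -abs(int(response) - 42)


best = best_of_n("Guess the number.", toy_generate, toy_score, n=16)
```

Keeping the generator and the scorer as plain callables means the same loop works whether the scorer is a reward model, a verifier, or an LLM judge.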

Selection methods

| Method | Use case |
| --- | --- |
| Reward Model | General-purpose (RLHF reward) |
| Verifier | Math, code (correctness) |
| Majority Vote (Self-Consistency) | Final answer of a reasoning task |
| Process Reward Model (PRM) | Step-by-step reasoning |
| LLM-as-judge | Subjective tasks (creative writing) |
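
For the verifier row, selection can be as simple as "return the first candidate that passes the tests". The sketch below assumes the candidates are Python source strings defining a function with a known name; `passes_tests` and `first_passing` are hypothetical helpers, and `exec` is used purely for illustration (real model output should only be executed inside a sandbox).

```python
from typing import List, Optional, Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def passes_tests(source: str, func_name: str, cases: Sequence[TestCase]) -> bool:
    """Return True if the candidate defines func_name and passes every case."""
    namespace: dict = {}
    try:
        # WARNING: illustration only; never exec untrusted code outside a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def first_passing(candidates: List[str], func_name: str,
                  cases: Sequence[TestCase]) -> Optional[str]:
    """Return the first candidate that passes the verifier, or None."""
    for source in candidates:
        if passes_tests(source, func_name, cases):
            return source
    return None


candidates = [
    "def add(a, b):\n    return a - b\n",   # wrong logic
    "def add(a, b)\n    return a + b\n",    # syntax error
    "def add(a, b):\n    return a + b\n",   # correct
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
chosen = first_passing(candidates, "add", cases)
```

Because a verifier gives a pass/fail signal rather than a score, there is no need to rank candidates; sampling can stop as soon as one passes.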

Inference-time compute

  • Instead of increasing model size, increase N at inference time.
  • A small model with N=64 can outperform a single sample from a large model.
  • An alternative to RL.
  • OpenAI o1 / o3 are described as relying on internal sampling over chains of thought.

Self-Consistency (Wang et al. 2022)

  • Generate N chain-of-thought responses.
  • Take a majority vote over the final answers.
  • Roughly a 17 percentage-point improvement on GSM8K.

Economics

| N | Quality gain | Cost |
| --- | --- | --- |
| 1 | baseline | 1× |
| 4 | +5-10 pp | 4× |
| 16 | +10-15 pp | 16× |
| 64 | +15-20 pp | 64× |
| 256 | diminishing returns | 256× |

→ The sweet spot is task-dependent.
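
One way to see why returns diminish while cost grows linearly is an idealized model, assumed here purely for illustration: each sample is independently correct with probability p, and a perfect verifier always picks a correct sample when one exists. The chance that at least one of N samples is correct is then 1 - (1 - p)^N. The function name `coverage` and the value p = 0.05 are hypothetical.

```python
def coverage(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct), given per-sample accuracy p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


# Cost grows linearly in N while coverage saturates toward 1.
for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3}  coverage={coverage(0.05, n):.3f}  cost={n}x")
```

Real samples are correlated and real scorers are imperfect, so this is an upper bound on what selection can achieve, not a prediction of the figures in the table.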

Variants

Rejection sampling fine-tuning (RFT)

  • Generate N responses → keep the ones that pass the verifier → run SFT on them.
  • Used by LLaMA-3 and DeepSeek.

Iterative refinement

  • Sample N → pick the best → sample N again from it → ... → repeat until convergence.
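
The refinement loop can be sketched as repeated best-of-N around the current best candidate. This is a minimal sketch under stated assumptions: `propose` (produce a revised candidate from the current best) and `score` are hypothetical callables, and "convergence" is taken to mean that a round fails to improve the score by more than `tol`.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def iterative_refine(seed: T,
                     propose: Callable[[T], T],
                     score: Callable[[T], float],
                     n: int = 4,
                     max_rounds: int = 10,
                     tol: float = 1e-6) -> T:
    """Repeat best-of-n around the current best until the score stops improving."""
    best, best_score = seed, score(seed)
    for _ in range(max_rounds):
        # Sample n revisions of the current best and keep the top one.
        candidates = [propose(best) for _ in range(n)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score - best_score <= tol:
            break  # no meaningful improvement: treat as converged
        best, best_score = top, top_score
    return best


# Toy example (hypothetical): step an integer toward the target value 3.
result = iterative_refine(
    seed=0,
    propose=lambda x: x + 1 if x < 3 else x,
    score=lambda x: -abs(x - 3),
)
```

The `max_rounds` cap matters in practice because total cost is N times the number of rounds.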

Tree-of-Thought (ToT)

  • BoN combined with search.
  • Backtracking is allowed.
  • N parallel branches with step-wise pruning.

Weaknesses

  1. Reward hacking: the selected sample exploits spurious features of the RM.
  2. Diversity collapse: if the temperature is not high enough, the N samples come out nearly identical.
  3. Cost: N× compute.
  4. Latency: poorly suited to user-facing paths.

→ Tune N with cost in mind.
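
Diversity collapse is cheap to detect before paying for scoring. The sketch below computes the fraction of distinct samples after light normalization; the function name `distinct_ratio` and the whitespace/case normalization are assumptions made for illustration, and exact-match comparison is only a crude proxy for semantic diversity.

```python
from typing import Sequence


def distinct_ratio(samples: Sequence[str]) -> float:
    """Fraction of samples that are distinct after whitespace and case normalization."""
    if not samples:
        raise ValueError("samples must not be empty")
    normalized = [" ".join(s.split()).lower() for s in samples]
    return len(set(normalized)) / len(normalized)


# Three samples that differ only in case and spacing count as one distinct answer.
collapsed = distinct_ratio(["The answer is 4.", "the answer  is 4.", "The answer is 4. "])
diverse = distinct_ratio(["4", "5", "four"])
```

A ratio close to 1/N suggests the extra samples add cost without adding choice, which is a signal to raise the temperature or lower N.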

💻 Patterns

Self-consistency (vote)

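import collections
import re

from vllm import LLM, SamplingParams

# Hugging Face model ID; swap in whichever model you actually serve.
llm = LLM(model='meta-llama/Meta-Llama-3-8B')
# n=8 samples per prompt; temperature > 0 so the samples differ.
sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

prompt = "What is 1234 * 5678? Show your reasoning step by step. End with 'Answer: <number>'."
outputs = llm.generate([prompt], sampling)

# Extract the final answer from each sampled chain of thought.
answers = []
for o in outputs[0].outputs:
    match = re.search(r'Answer:\s*(\d+)', o.text)
    if match:
        answers.append(int(match.group(1)))

# Majority vote; None if no sample produced a parseable answer.
final = collections.Counter(answers).most_common(1)[0][0] if answers else None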

Best-of-N with Reward Model

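import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from vllm import SamplingParams

# 'reward-model' is a placeholder; substitute a real reward-model checkpoint.
rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')
rm_model.eval()

def score(prompt, response):
    # Assumes the RM has a single-logit head that scores prompt + response.
    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return rm_model(**inputs).logits[0, 0].item()

def best_of_n(prompt, n=16, T=0.8):
    # `llm` is the vLLM engine created in the self-consistency example.
    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
    outputs = llm.generate([prompt], sampling)[0].outputs
    # Score every candidate with the RM and return the text of the best one.
    scored = [(o.text, score(prompt, o.text)) for o in outputs]
    return max(scored, key=lambda x: x[1])[0]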

Rejection sampling for fine-tune

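def generate_rft_dataset(prompts, verifier, n=8):
    # Assumes generate_n(prompt, n) returns candidates exposing .text and .score,
    # and verifier(prompt, candidate) returns True for acceptable responses.
    dataset = []
    for prompt in prompts:
        candidates = generate_n(prompt, n=n)
        passing = [c for c in candidates if verifier(prompt, c)]
        if passing:
            # Keep only the highest-scoring passing candidate per prompt.
            best = max(passing, key=lambda c: c.score)
            dataset.append({'prompt': prompt, 'response': best.text})
    return dataset

# Then run SFT on the resulting dataset.

→ This forms a self-improvement loop.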

Tree-of-Thought (simplified)

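def tot_search(prompt, depth=3, breadth=4):
    # Beam-style sketch without backtracking. generate_n (a sampler returning
    # objects with .text) and evaluate (a scorer for one step) are assumed helpers.
    state = [prompt]
    for _ in range(depth):
        candidates = []
        for s in state:
            for c in generate_n(s, n=breadth):
                candidates.append((s + '\n' + c.text, evaluate(c)))
        # Step-wise pruning: keep only the top `breadth` partial solutions.
        candidates.sort(key=lambda x: x[1], reverse=True)
        state = [c[0] for c in candidates[:breadth]]
    return state[0]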

LLM-as-judge selection

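import re
from vllm import SamplingParams

def llm_judge(prompt, candidates):
    # format_candidates is an assumed helper that renders the candidates as an
    # indexed list (0-based, to match the index parsed below).
    judge_prompt = f"""Given the prompt:
{prompt}

Rate each response 1-10. Pick the best.

{format_candidates(candidates)}

Reply with: BEST=<index>"""
    # vLLM returns a list of RequestOutput objects; take the first completion's text.
    params = SamplingParams(temperature=0.0, max_tokens=64)
    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
    match = re.search(r'BEST=(\d+)', judgment)
    if match is None or int(match.group(1)) >= len(candidates):
        raise ValueError(f"Could not parse a valid index from: {judgment!r}")
    return candidates[int(match.group(1))]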

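🤔 Decision criteria

| Situation | Method |
| --- | --- |
| Math / verifiable | Self-consistency (vote) |
| Code | Verifier (run tests) |
| General quality | RM-based BoN |
| Subjective | LLM-as-judge |
| Self-improvement | RFT |
| Deep reasoning | Tree-of-Thought / o1-style |

Default: start with self-consistency (N = 8 to 16) as the baseline. If an RM is available, use RM-based BoN.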

🔗 Graph

🤖 LLM usage

When to use: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.
When not to use: strict latency budgets; neither an RM nor a verifier is available; streaming responses.

Anti-patterns

  • N=1 with temperature=0: this is not BoN at all.
  • Same temperature for every sample: limits diversity.
  • Ignoring reward hacking: the RM gets exploited.
  • N → ∞: cost keeps climbing while quality plateaus.
  • No verifier and no RM: BoN has nothing to select with.
  • BoN on a latency-critical path: the wrong tool.

🧪 Verification / duplicates

🕓 Changelog

| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods, economics, RFT, ToT, and vLLM code |