| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-best-of-n-sampling | Best-of-N Sampling | 10_Wiki/Topics | verified | self | | none | A | 0.92 | applied | | | 2026-05-10 | pending | |
Best-of-N Sampling
📌 One-Line Insight
"Sample many, select the best." Generate N responses, score each one with a reward model (RM), and output the single best. This is the simplest form of inference-time compute, and it can be read as the base case of the principle behind reasoning models such as OpenAI o1 and DeepSeek R1.
📖 Core Concepts
Algorithm
- Sample N responses for a prompt (temperature > 0).
- Score each response (reward model, verifier, or majority vote).
- Select the best-scoring response.
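The three steps above can be sketched independently of any particular model or scorer. This is a minimal illustration, not a reference implementation: `generate`, `score`, `toy_generate`, and `toy_score` are hypothetical stand-ins for an LLM sampler and a reward model.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n responses and return the (response, score) pair with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins (hypothetical): a "model" that guesses a number,
# and a "reward model" that prefers guesses close to 42.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return str(rng.randint(0, 100))


def toy_score(prompt: str, response: str) -> float:
    return -abs(int(response) - 42)


best, best_score = best_of_n("guess", toy_generate, toy_score, n=16)
print(best, best_score)
```

Passing the sampler and scorer in as callables keeps the selection logic identical whether the scorer is an RM, a verifier, or a judge model.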
Selection methods
| Method | Use case |
|---|---|
| Reward Model | General-purpose quality (RLHF reward) |
| Verifier | Math and code (correctness) |
| Majority Vote (Self-Consistency) | Final answer of a reasoning task |
| Process Reward Model (PRM) | Step-by-step scoring |
| LLM-as-judge | Subjective tasks (creative writing) |
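As an illustration of the verifier row, the sketch below selects the first code candidate that passes a set of unit tests. It assumes each candidate is Python source defining a function with a known name; `passes_tests` and `first_passing` are hypothetical helpers, and a real system would run untrusted code in a sandbox rather than calling `exec` directly.

```python
from typing import List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def passes_tests(source: str, func_name: str, tests: List[TestCase]) -> bool:
    """Return True if the candidate defines func_name and it passes every test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: sandbox this when the code is untrusted
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def first_passing(candidates: List[str], func_name: str, tests: List[TestCase]) -> Optional[str]:
    """Return the first candidate that passes all tests, or None if none do."""
    for source in candidates:
        if passes_tests(source, func_name, tests):
            return source
    return None


candidates = [
    "def add(a, b):\n    return a - b\n",  # wrong result
    "def add(a, b)\n    return a + b\n",   # syntax error
    "def add(a, b):\n    return a + b\n",  # correct
]
tests = [((1, 2), 3), ((0, 0), 0)]
chosen = first_passing(candidates, "add", tests)
print(chosen)
```

Unlike an RM score, a verifier gives a binary signal, so any passing candidate is acceptable and sampling can stop at the first one.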
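Inference-time compute
- Instead of scaling up model size, scale up the number of samples N at inference.
- A small model with N=64 can outperform a single sample from a large model.
- An alternative to RL-based training.
- OpenAI o1 / o3 are commonly assumed to sample internally over chains of thought; the details are not public.
Self-Consistency (Wang et al. 2022)
- Generate N chain-of-thought responses.
- Take a majority vote over the final answers.
- The paper reports a gain of roughly 17 percentage points on GSM8K over greedy chain-of-thought decoding.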
Economics
| N | Quality (vs. N=1) | Cost (relative compute) |
|---|---|---|
| 1 | baseline | 1× |
| 4 | +5-10%p | 4× |
| 16 | +10-15%p | 16× |
| 64 | +15-20%p | 64× |
| 256 | diminishing returns | 256× |
→ The sweet spot is task-dependent.
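The diminishing-returns shape can be reproduced with a simple idealized model. Assume, purely for illustration, that each sample is independently correct with probability p and that a perfect verifier picks a correct sample whenever one exists; then the success probability is 1 - (1 - p)^N. Real samples are correlated and real scorers are imperfect, so this is an upper bound on what selection can achieve, not a prediction of the figures in the table. The function names are hypothetical.

```python
from typing import Optional


def success_probability(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct) = 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n


def smallest_n(p: float, target: float, max_n: int = 256) -> Optional[int]:
    """Smallest n whose success probability reaches target, or None if max_n is not enough."""
    for n in range(1, max_n + 1):
        if success_probability(p, n) >= target:
            return n
    return None


# A per-sample accuracy of 0.3 is an arbitrary illustrative value.
for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3}  P(success)={success_probability(0.3, n):.4f}  cost={n}x")

print(smallest_n(0.5, 0.9))
```

Cost grows linearly in N while the success probability saturates, which is why a per-task choice of N matters.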
Variants
Rejection sampling fine-tuning (RFT)
- Sample N responses → keep those that pass a verifier → run SFT on them.
- Used in Llama 3 and DeepSeek post-training.
Iterative refinement
- Sample N → keep the best → sample N again from it → ... → stop on convergence.
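The refinement loop can be sketched as repeated best-of-N around the current best candidate, stopping when a round brings no improvement. This is one possible reading of "converge", chosen for illustration; `propose`, `score`, and the toy functions are hypothetical stand-ins for an LLM-based reviser and a scorer.

```python
from typing import Callable, List


def iterative_refine(
    seed: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    n: int = 4,
    max_rounds: int = 10,
) -> str:
    """Repeat best-of-n around the current best until no candidate improves on it."""
    best, best_score = seed, score(seed)
    for _ in range(max_rounds):
        candidates = propose(best, n)
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score <= best_score:
            break  # converged: this round produced no improvement
        best, best_score = top, top_score
    return best


# Toy problem (hypothetical): candidates are integers as strings, the target is 10.
def toy_propose(current: str, n: int) -> List[str]:
    return [str(int(current) + delta) for delta in (-2, -1, 1, 2)][:n]


def toy_score(candidate: str) -> float:
    return -abs(int(candidate) - 10)


result = iterative_refine("0", toy_propose, toy_score)
print(result)
```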
Tree-of-Thought (ToT)
- BoN combined with search.
- Backtracking is allowed.
Beam search
- N parallel candidates with step-wise pruning.
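Step-wise pruning can be illustrated with a small generic beam search over strings. This is a sketch under assumptions: `expand` stands in for sampling next steps from a model and `score` for a partial-solution scorer such as a PRM; both are hypothetical, as is the toy task.

```python
from typing import Callable, List


def beam_search(
    start: str,
    expand: Callable[[str], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    steps: int = 3,
) -> str:
    """Extend every sequence in the beam, then keep only the top beam_width by score."""
    beam: List[str] = [start]
    for _ in range(steps):
        extended = [seq + token for seq in beam for token in expand(seq)]
        if not extended:
            break
        extended.sort(key=score, reverse=True)
        beam = extended[:beam_width]  # step-wise prune
    return beam[0]


# Toy task (hypothetical): build a 3-character string with as many "a" as possible.
def toy_expand(seq: str) -> List[str]:
    return ["a", "b"]


def toy_score(seq: str) -> float:
    return float(seq.count("a"))


result = beam_search("", toy_expand, toy_score)
print(result)
```

The difference from plain BoN is that weak candidates are discarded after every step rather than only once at the end, so compute is concentrated on promising prefixes.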
Weaknesses
- Reward hacking: the selected response exploits spurious features of the RM.
- Diversity collapse: if the temperature is too low, the N samples come out nearly identical.
- Cost: N× compute.
- Latency: a poor fit for user-facing responses.
→ Tune N with cost in mind.
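One inexpensive way to notice diversity collapse is to measure how many of the N samples are actually distinct. The sketch below is an assumption-laden heuristic rather than an established metric: `distinct_fraction` is a hypothetical helper that compares samples only after lowercasing and collapsing whitespace.

```python
from typing import List


def distinct_fraction(samples: List[str]) -> float:
    """Fraction of samples that are unique after case and whitespace normalization."""
    if not samples:
        return 0.0
    normalized = {" ".join(s.lower().split()) for s in samples}
    return len(normalized) / len(samples)


collapsed = ["The answer is 4."] * 8
varied = ["The answer is 4.", "4", "It is four.", "the  answer is 4."]

print(distinct_fraction(collapsed))
print(distinct_fraction(varied))
```

A value near 1/N suggests the extra samples add cost without adding choice, in which case raising the temperature or lowering N may be worth testing.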
💻 Patterns
Self-consistency (vote)
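import collections
import re

from vllm import LLM, SamplingParams

# Hugging Face model ID; swap in whichever model you actually serve.
llm = LLM(model='meta-llama/Meta-Llama-3-8B')
# n=8 samples per prompt; temperature > 0 so the reasoning paths differ.
sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)
prompt = "What is 1234 * 5678? Show your reasoning step by step. End with 'Answer: <number>'."
outputs = llm.generate([prompt], sampling)

answers = []
for o in outputs[0].outputs:
    # Accept digits with optional thousands separators, e.g. 7,006,652.
    match = re.search(r'Answer:\s*(\d[\d,]*)', o.text)
    if match:
        answers.append(int(match.group(1).replace(',', '')))
# Majority vote over the parsed answers; None if no sample produced a parsable answer.
final = collections.Counter(answers).most_common(1)[0][0] if answers else None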
Best-of-N with Reward Model
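import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 'reward-model' is a placeholder; use the ID of a sequence-classification reward model.
rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')
rm_model.eval()

def score(prompt, response):
    # The input format is model-specific; many reward models expect their chat template instead.
    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        return rm_model(**inputs).logits[0, 0].item()

def best_of_n(prompt, n=16, T=0.8):
    # Reuses `llm` and `SamplingParams` from the self-consistency example.
    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
    outputs = llm.generate([prompt], sampling)[0].outputs
    scored = [(o.text, score(prompt, o.text)) for o in outputs]
    return max(scored, key=lambda x: x[1])[0]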
Rejection sampling for fine-tuning
def generate_rft_dataset(prompts, verifier, n=8):
    # generate_n is a placeholder helper: it returns n candidates, each with .text and .score.
    dataset = []
    for prompt in prompts:
        candidates = generate_n(prompt, n=n)
        # Keep only the candidates the verifier accepts.
        passing = [c for c in candidates if verifier(prompt, c)]
        if passing:
            best = max(passing, key=lambda c: c.score)
            dataset.append({'prompt': prompt, 'response': best.text})
    return dataset
# Run SFT on the resulting dataset.
→ This forms a self-improvement loop.
Tree-of-Thought (simplified)
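def tot_search(prompt, depth=3, breadth=4):
    # generate_n and evaluate are placeholder helpers (a sampler and a state scorer).
    # This simplified version expands level by level and does not backtrack.
    state = [prompt]
    for d in range(depth):
        candidates = []
        for s in state:
            children = generate_n(s, n=breadth)
            for c in children:
                score = evaluate(c)
                candidates.append((s + '\n' + c.text, score))
        # Keep the `breadth` highest-scoring partial solutions.
        candidates.sort(key=lambda x: -x[1])
        state = [c[0] for c in candidates[:breadth]]
    return state[0]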
LLM-as-judge selection
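def llm_judge(prompt, candidates):
    # Number the candidates so the judge can refer to them by index.
    listing = '\n'.join(f'[{i}] {c}' for i, c in enumerate(candidates))
    judge_prompt = f"""Given the prompt:
{prompt}
Rate each response 1-10. Pick the best.
{listing}
Reply with: BEST=<index>"""
    # vLLM returns a list of RequestOutput objects; take the text of the first completion.
    params = SamplingParams(temperature=0.0, max_tokens=256)
    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
    match = re.search(r'BEST=(\d+)', judgment)
    idx = int(match.group(1)) if match else 0
    # Fall back to the first candidate if the reply is unparsable or out of range.
    return candidates[idx] if idx < len(candidates) else candidates[0]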
🤔 Decision Criteria
| Situation | Method |
|---|---|
| Math / verifiable | Self-consistency (vote) |
| Code | Verifier (run tests) |
| General quality | RM-based BoN |
| Subjective | LLM-as-judge |
| Self-improve | RFT |
| Deep reasoning | Tree-of-Thought / o1-style |
Default: self-consistency with N = 8 to 16 as the baseline. If a reward model is available, use RM-based BoN.
🔗 Graph
- Parent: Test-Time-Compute
- Variants: Self-Consistency · Rejection-Sampling
- Applications: DeepSeek-R1
- Adjacent: Reward-Model · RLHF · Chain-of-Thought · LLM-as-Judge
🤖 LLM Usage
When to use: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.
When not to use: strict latency budgets; neither an RM nor a verifier is available; streaming responses.
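❌ Anti-Patterns
- N=1 with temperature=0: this is not BoN at all.
- The same temperature for every sample: limits diversity across samples.
- Ignoring reward hacking: selection ends up exploiting the RM.
- N → ∞: cost keeps growing while quality plateaus.
- No verifier and no RM: BoN has nothing to select with.
- BoN on a latency-critical path: the wrong tool.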
🧪 Verification / Duplicates
- Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).
- Trust level: A.
- Related: Self-Consistency · Tree-of-Thought · RLHF · Chain-of-Thought.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: methods, economics, RFT, ToT, vLLM code |