Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

Best-of-N Sampling

📌 One-line insight

"Sample many, select the best." Generate N responses, score them with a reward model (RM), and output the single best one. This is the simplest form of inference-time compute, and the base case of the principle underlying OpenAI o1 / DeepSeek R1.

📖 Core

Algorithm

Prompt → N responses (temperature > 0).
Score each response (reward model / verifier / majority vote).
Select the best one.
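
The three steps above can be sketched as one small generic function. This is a minimal sketch rather than a reference implementation: `generate` and `score` are hypothetical callables standing in for the sampler and for whichever scorer (RM, verifier, judge) is available, and the toy stand-ins exist only so the example runs without a model.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # hypothetical: returns n sampled responses
    score: Callable[[str, str], float],         # hypothetical: higher is better
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n responses, score each one, return the top (response, score) pair."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = generate(prompt, n)                        # step 1: N responses
    scored = [(c, score(prompt, c)) for c in candidates]    # step 2: score each
    return max(scored, key=lambda pair: pair[1])            # step 3: keep the best

# Toy stand-ins (assumptions for illustration only).
def fake_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt} -> draft {i}" for i in range(n)]

def fake_score(prompt: str, response: str) -> float:
    return float(response.endswith("draft 2"))  # pretend draft 2 is the good one

best, best_score = best_of_n("2+2?", fake_generate, fake_score, n=4)
```

Keeping the sampler and scorer as parameters means the same loop covers every selection method in this note; only `score` changes.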

Selection methods

Method	Use case
Reward Model	General quality (RLHF reward)
Verifier	Math, code (correctness)
Majority Vote (Self-Consistency)	Final answer of a reasoning task
Process Reward Model (PRM)	Step-by-step scoring
LLM-as-judge	Subjective tasks (creative)
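
For the verifier row, a code verifier can be as simple as running each candidate against test cases. The following is a hedged sketch: `passes_tests` and `select_verified` are hypothetical helpers, and calling `exec` on model output is unsafe outside a sandbox, so treat this as an illustration of the selection logic only.

```python
from typing import List, Optional, Tuple

def passes_tests(source: str, func_name: str, tests: List[Tuple[tuple, object]]) -> bool:
    """Return True if the candidate defines func_name and passes every test case."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: unsafe for untrusted code; sandbox in real use
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failures.
        return False

def select_verified(candidates: List[str], func_name: str,
                    tests: List[Tuple[tuple, object]]) -> Optional[str]:
    """Return the first candidate that passes all tests, or None if none does."""
    for source in candidates:
        if passes_tests(source, func_name, tests):
            return source
    return None

candidates = [
    "def add(a, b): return a - b",   # wrong result
    "def add(a, b) return a + b",    # syntax error
    "def add(a, b): return a + b",   # correct
]
tests = [((1, 2), 3), ((0, 0), 0)]
chosen = select_verified(candidates, "add", tests)
```

Unlike an RM, a verifier gives a pass/fail signal, so any passing candidate is acceptable and the search can stop early.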

Inference-time compute

Instead of scaling model size, scale N at inference time.
A small model with N=64 can outperform a single sample from a large model.
An alternative to RL.
OpenAI o1 / o3: internal sampling over chain-of-thought.
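
One way to see why raising N can stand in for a larger model is a back-of-the-envelope coverage calculation. The sketch below assumes a perfect verifier and independent samples with per-sample success probability `p`; both are simplifying assumptions, and the probabilities used are made-up illustrative values, not measurements.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("need 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n

# Illustrative numbers only: a weak model sampled 64 times vs. a stronger model sampled once.
small_model_n64 = coverage(0.05, 64)   # roughly 0.96
large_model_n1 = coverage(0.40, 1)     # 0.40
```

This is an upper bound on what Best-of-N can deliver: with an imperfect RM, the correct sample may exist among the N but still not be the one selected.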

Self-Consistency (Wang et al. 2022)

Generate N chain-of-thought responses.
Take a majority vote over the final answers.
Roughly +17%p improvement on GSM8K.

Economics

N	Quality	Cost
1	baseline	1×
4	+5-10%p	4×
16	+10-15%p	16×
64	+15-20%p	64×
256	diminishing	256×

→ The sweet spot is task-dependent.
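
The diminishing returns in the table can be made concrete by computing the quality gain per additional sample. This sketch takes the midpoint of each range in the table above, which is an assumption; `pick_n` and its threshold are hypothetical and only illustrate one way to choose N.

```python
from typing import List, Tuple

# (N, quality gain in %p over N=1), using midpoints of the ranges in the table above.
table: List[Tuple[int, float]] = [(1, 0.0), (4, 7.5), (16, 12.5), (64, 17.5)]

def marginal_gain_per_sample(rows: List[Tuple[int, float]]) -> List[Tuple[int, float]]:
    """Quality gain per additional sample between consecutive rows."""
    out = []
    for (n_prev, q_prev), (n_next, q_next) in zip(rows, rows[1:]):
        out.append((n_next, (q_next - q_prev) / (n_next - n_prev)))
    return out

def pick_n(rows: List[Tuple[int, float]], min_gain_per_sample: float) -> int:
    """Largest N whose last step still pays at least min_gain_per_sample (%p per sample)."""
    chosen = rows[0][0]
    for n, gain in marginal_gain_per_sample(rows):
        if gain < min_gain_per_sample:
            break
        chosen = n
    return chosen

gains = marginal_gain_per_sample(table)
```

With these midpoints, each extra sample is worth 2.5%p when going from N=1 to N=4 but only about 0.1%p when going from N=16 to N=64, so the threshold a task can afford decides where to stop.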

Variants

Rejection sampling fine-tune (RFT)

N responses → keep those that pass the verifier → SFT on them.
Used by LLaMA-3 / DeepSeek.

Iterative refinement

N → best → N again → ... → converge.
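
The loop above might look like the following sketch. `propose` and `score` are hypothetical callables (a sampler conditioned on the current best, and a scorer), and "converge" is interpreted here as "no candidate improves on the current best", which is one possible stopping rule among several.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def iterative_refine(
    start: T,
    propose: Callable[[T, int], List[T]],  # hypothetical: n new candidates from the current best
    score: Callable[[T], float],           # hypothetical: higher is better
    n: int = 4,
    max_rounds: int = 10,
) -> T:
    """Repeat Best-of-N, seeding each round with the previous winner."""
    best, best_score = start, score(start)
    for _ in range(max_rounds):
        top = max(propose(best, n), key=score)
        top_score = score(top)
        if top_score <= best_score:
            break  # no improvement: treat as converged
        best, best_score = top, top_score
    return best

# Toy problem (illustration only): walk an integer toward the target 10.
def toy_propose(x: int, n: int) -> List[int]:
    return [x + step for step in range(1, n + 1)]

def toy_score(x: int) -> float:
    return -abs(10 - x)

result = iterative_refine(0, toy_propose, toy_score, n=4)
```

The `max_rounds` cap matters in practice because each round costs another N samples.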

Tree-of-Thought (ToT)

BoN + search.
Backtracking is allowed.

Beam search

N parallel candidates + step-wise pruning.

Weaknesses

Reward hacking: selection exploits spurious features of the RM.
Diversity collapse: without a sufficiently high temperature, the N samples come out the same.
Cost: N× compute.
Latency: poor fit for user-facing use.

→ Tune N with cost in mind.
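
Diversity collapse is cheap to detect before paying for scoring. A minimal sketch, assuming that exact-match after whitespace and case normalization is a good enough notion of "same" (for free-form text it often is not); `distinct_ratio` is a hypothetical helper.

```python
from typing import List

def distinct_ratio(samples: List[str]) -> float:
    """Fraction of samples that remain distinct after light normalization."""
    if not samples:
        return 0.0
    normalized = {" ".join(s.lower().split()) for s in samples}
    return len(normalized) / len(samples)

collapsed = ["The answer is 4.", "the answer is 4.", "The  answer is 4."]
varied = ["4", "It is 4.", "2+2 equals 4"]

collapsed_ratio = distinct_ratio(collapsed)
varied_ratio = distinct_ratio(varied)
```

A ratio near 1/N means the N samples are effectively one sample, so the N× compute buys nothing; raising the temperature or lowering N are the obvious responses.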

💻 Patterns

Self-consistency (vote)

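import collections
import re

from vllm import LLM, SamplingParams

# Model ID as given in the source; check the exact repo name on the Hugging Face Hub.
llm = LLM(model='meta-llama/Llama-3-8B')
# n=8 samples per prompt; temperature > 0 so the samples differ.
sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

prompt = "What is 1234 * 5678? Show your reasoning step by step. End with 'Answer: <number>'."
outputs = llm.generate([prompt], sampling)

# Extract the final answer from each sample; skip samples without a parseable one.
answers = []
for o in outputs[0].outputs:
    match = re.search(r'Answer:\s*(\d+)', o.text)
    if match:
        answers.append(int(match.group(1)))

# Majority vote; None if no sample produced a parseable answer.
final = collections.Counter(answers).most_common(1)[0][0] if answers else None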

Best-of-N with Reward Model

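import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from vllm import SamplingParams

# 'reward-model' is a placeholder; substitute a real sequence-classification RM checkpoint.
rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')
rm_model.eval()

def score(prompt, response):
    # Plain concatenation as in the source; many RMs expect their own chat template instead.
    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        return rm_model(**inputs).logits[0, 0].item()

def best_of_n(prompt, n=16, T=0.8):
    # Reuses the vLLM `llm` object created in the self-consistency example.
    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
    outputs = llm.generate([prompt], sampling)[0].outputs
    scored = [(o.text, score(prompt, o.text)) for o in outputs]
    return max(scored, key=lambda x: x[1])[0]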

Rejection sampling for fine-tune

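def generate_rft_dataset(prompts, verifier, n=8):
    # generate_n is an assumed helper returning n candidates, each with .text and .score.
    dataset = []
    for prompt in prompts:
        candidates = generate_n(prompt, n=n)
        # Keep only candidates the verifier accepts.
        passing = [c for c in candidates if verifier(prompt, c)]
        if passing:
            # Among passing candidates, keep the highest-scoring one.
            best = max(passing, key=lambda c: c.score)
            dataset.append({'prompt': prompt, 'response': best.text})
    return dataset

# Then run SFT on the resulting dataset.

→ Self-improvement loop.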

Tree-of-Thought (simplified)

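def tot_search(prompt, depth=3, breadth=4):
    # generate_n and evaluate are assumed helpers: generate_n returns candidates
    # with a .text attribute, evaluate returns a numeric score (higher is better).
    # Simplification: keeps the top `breadth` states per level and never backtracks.
    states = [prompt]
    for _ in range(depth):
        candidates = []
        for s in states:
            for c in generate_n(s, n=breadth):
                candidates.append((s + '\n' + c.text, evaluate(c)))
        # Prune: keep the best `breadth` partial solutions.
        candidates.sort(key=lambda x: x[1], reverse=True)
        states = [text for text, _ in candidates[:breadth]]
    return states[0]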

LLM-as-judge selection

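def llm_judge(prompt, candidates):
    # Uses re, llm, and SamplingParams from the examples above.
    # format_candidates is an assumed helper that renders the candidates with 0-based indices.
    judge_prompt = f"""Given the prompt:
{prompt}

Rate each response 1-10. Pick the best.

{format_candidates(candidates)}

Reply with: BEST=<index>"""
    # vLLM returns a list of RequestOutput objects; take the text of the first completion.
    params = SamplingParams(temperature=0, max_tokens=32)
    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
    match = re.search(r'BEST=(\d+)', judgment)
    if match is None or int(match.group(1)) >= len(candidates):
        raise ValueError(f"could not parse a valid index from judge reply: {judgment!r}")
    return candidates[int(match.group(1))]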

🤔 Decision criteria

Situation	Method
Math / verifiable	Self-consistency (vote)
Code	Verifier (run tests)
General quality	RM-based BoN
Subjective	LLM-as-judge
Self-improve	RFT
Deep reasoning	Tree-of-Thought / o1-style

Default: self-consistency (N = 8-16) as the baseline. If an RM is available, use BoN.

🔗 Graph

Parent: LLM-Inference · Test-Time-Compute
Variants: Self-Consistency · Rejection-Sampling · Tree-of-Thought · Beam-Search
Applications: OpenAI-o1 · DeepSeek-R1 · RFT · Process-Reward-Model
Adjacent: Reward-Model · RLHF · Chain-of-Thought · LLM-as-Judge

🤖 LLM usage

When: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.
When not: strict latency budgets; neither an RM nor a verifier; streaming responses.

❌ Anti-patterns

N=1 + temperature=0: not BoN at all.
Same temperature for every sample: limited diversity.
Ignoring reward hacking: selection ends up exploiting the RM.
N → ∞: cost keeps rising while quality plateaus.
No verifier and no RM: BoN is not applicable.
BoN on a latency-critical path: wrong tool.
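
For the same-temperature anti-pattern, one option is to split the N samples across a few temperatures. The sketch below only builds the plan; `temperature_plan` is a hypothetical helper and the default temperatures are illustrative values, not recommendations from the source. Each `(temperature, count)` pair would then feed a separate sampling call.

```python
from typing import List, Sequence, Tuple

def temperature_plan(
    n: int, temperatures: Sequence[float] = (0.5, 0.8, 1.1)
) -> List[Tuple[float, int]]:
    """Split n samples as evenly as possible across the given temperatures."""
    if n < 1 or not temperatures:
        raise ValueError("need n >= 1 and at least one temperature")
    base, extra = divmod(n, len(temperatures))
    # The first `extra` temperatures each receive one additional sample.
    plan = [(t, base + (1 if i < extra else 0)) for i, t in enumerate(temperatures)]
    # Drop temperatures that received no samples (happens when n < len(temperatures)).
    return [(t, count) for t, count in plan if count > 0]

plan = temperature_plan(8)
```

Lower temperatures contribute safe, near-greedy candidates while higher ones add variety, so the scorer has more distinct options to choose from at the same total N.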

🧪 Verification / duplicates

Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).
Confidence: A.
Related: Self-Consistency · Tree-of-Thought · RLHF · Chain-of-Thought.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: methods + economics + RFT + ToT + vLLM code
