[G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
@@ -1,134 +1,215 @@
 ---
 id: wiki-2026-0508-best-of-n-sampling
-title: Best of N Sampling
+title: Best-of-N Sampling
 category: 10_Wiki/Topics
-status: needs_review
+status: verified
 canonical_id: self
-aliases: []
+aliases: [Best-of-N, BoN, rejection sampling, inference-time compute, majority voting, self-consistency]
 duplicate_of: none
 source_trust_level: A
 confidence_score: 0.92
-tags: [auto-consolidated, technical-documentation]
+verification_status: applied
+tags: [llm, inference, reasoning, reward-model, rejection-sampling, test-time-compute, o1, self-consistency]
 raw_sources: []
-last_reinforced: 2026-05-08
+last_reinforced: 2026-05-10
 github_commit: pending
-inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
+tech_stack:
+  language: Python
+  framework: Transformers / vLLM / TRL
 ---

-# [[Best-of-N-Sampling|Best-of-N-Sampling]] (Best-of-N Sampling)
+# Best-of-N Sampling

-## 📌 One-Line Insight (The Karpathy Summary)
-> "No tree stays standing after ten strokes of the axe." Have the AI try N times, then use a reward model (RM) to pick the output that is 'closest to correct': a sure-win strategy.
+## 📌 One-Line Insight
+> **"Sample many, select the best."** Generate N responses, score them with a reward model (RM), and output the single best one. This is the simplest form of inference-time compute and the base case of the principle underlying OpenAI o1 / DeepSeek R1.

---
+## 📖 Core Concepts

-> "Sample many and pick the best": an inference-optimization technique that generates N responses from the model, then selects the single highest-quality answer using a separate reward model (RM) or scoring criteria.
+### Algorithm
+1. Sample N responses for the prompt (temperature > 0).
+2. Score each response (reward model, verifier, or majority vote).
+3. Select the best one.
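
The three steps above can be sketched independently of any particular model or scorer. This is a minimal illustration, not the note's own implementation: `sample` and `score` are hypothetical callables standing in for an LLM call and a reward model or verifier, and the toy versions below exist only so the snippet runs on its own.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidates for the prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: sample N candidates (the sampler is assumed to be stochastic).
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    # Steps 2 and 3: score every candidate and keep the best.
    return max(candidates, key=lambda c: score(prompt, c))


# Toy stand-ins (hypothetical): the "model" guesses a number, the "scorer"
# prefers guesses close to 42.
_rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return f"{prompt} -> {_rng.randint(0, 100)}"


def toy_score(prompt: str, response: str) -> float:
    return -abs(int(response.rsplit(" ", 1)[1]) - 42)


best = best_of_n("guess", toy_sample, toy_score, n=16)
```

Keeping the sampler and scorer as parameters means the same loop covers RM scoring, verifier checks, and LLM-as-judge selection.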

---
+### Selection methods
+| Method | Use case |
+|---|---|
+| Reward Model | General-purpose quality (RLHF reward) |
+| Verifier | Math, code (correctness) |
+| Majority Vote (Self-Consistency) | Final answers of reasoning tasks |
+| Process Reward Model (PRM) | Step-by-step scoring |
+| LLM-as-judge | Subjective tasks (creative) |

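-> "Finding the one general who is worth more than ten ministers." An inference strategy that maximizes quality by selecting, from N LLM-generated outputs, the single answer the reward model (RM) judges best.
+### Inference-time compute
+- Instead of increasing model size, increase the number of samples N at inference time.
+- A small model with N=64 can outperform a single sample from a large model on some tasks.
+- An alternative to RL.
+- OpenAI o1 / o3 are believed to apply similar sampling internally over chain-of-thought.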

---
+### Self-Consistency (Wang et al. 2022)
+- Generate N chain-of-thought responses.
+- Take a majority vote over the final answers.
+- Roughly +17%p improvement on GSM8K.

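-> "Brute-forcing intelligence: rather than straining to get the answer right in one shot, generate N answers at once and pick the most logical and accurate 'best answer' among them, an inference-optimization tactic that dramatically boosts reasoning ability."
+### Economics
+| N | Quality gain (vs. N=1) | Cost (relative compute) |
+|---|---|---|
+| 1 | baseline | 1× |
+| 4 | +5-10%p | 4× |
+| 16 | +10-15%p | 16× |
+| 64 | +15-20%p | 64× |
+| 256 | diminishing returns | 256× |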

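-## 📖 Structured Knowledge (Synthesized Content)
- **Inference-time Compute**:
-    - Instead of scaling up the model, spend more computation at inference time to raise answer quality. It is one of the core principles behind recent reasoning models such as OpenAI o1.
- **Reward Modeling (RM)**:
-    - A separate "appraiser AI" decides which of the N answers is best. An RM that reflects human preferences (RLHF) makes the final selection.
- **Majority Voting vs Selection**:
-    - For math problems, take the most frequent value among the answers (majority vote); for creative answers, take the one with the highest RM score.
+→ The sweet spot is task-dependent.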
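
The diminishing-returns shape in the table can be illustrated with a simple model. The sketch below assumes an idealized setting that the note does not state: a perfect verifier and independent samples, each correct with probability `p` (the value 0.05 is a made-up example). Under those assumptions the chance that at least one of N samples is correct is 1 - (1 - p)^N, while cost grows linearly in N. Gains with an imperfect RM will be lower than this upper bound.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-sample success rate; cost is N times a single sample.
for n in (1, 4, 16, 64, 256):
    print(f"N={n:>3}  cost={n:>3}x  coverage={coverage(0.05, n):.3f}")
```

Each 4× increase in N buys a smaller coverage gain once coverage approaches 1, which is why the useful range of N depends on how hard the task is for the base model.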

---
+### Variants

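- **Extracted pattern:** Separate the generation and verification stages to statistically suppress the risk of hallucination or low-quality responses that a single generation can produce.
- **Details:**
-    - **Generate N:** Obtain N independent candidate responses to the same prompt by varying the temperature.
-    - **Reward Model (RM):** Score each candidate response on logic, safety, and accuracy.
-    - **Rejection Sampling:** Discard low-scoring responses and select only the top-scoring one as the final output.
-    - **Compute cost:** Inference consumes N times the compute, but the reliability of the result rises dramatically.
+#### Rejection sampling fine-tuning (RFT)
+- Sample N responses → keep those that pass the verifier → run SFT on them.
+- Used in LLaMA-3 / DeepSeek training.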

---
+#### Iterative refinement
+- Sample N → pick the best → sample N again from that best → ... → repeat until convergence.
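
One way to read the loop above is as repeated best-of-N in which each round is seeded with the previous winner. The sketch below is an assumption about how that might look, not a method taken from the note: `propose(prompt, previous_best, n)` is a hypothetical function returning candidate strings, `score` is any scorer, and "convergence" is taken to mean that the best score stops improving.

```python
from typing import Callable, List, Optional


def iterative_refine(prompt: str,
                     propose: Callable[[str, Optional[str], int], List[str]],
                     score: Callable[[str], float],
                     n: int = 4,
                     max_rounds: int = 5,
                     tol: float = 1e-6) -> Optional[str]:
    """Repeat best-of-N, seeding each round with the previous best answer."""
    best: Optional[str] = None
    best_score = float("-inf")
    for _ in range(max_rounds):
        candidates = propose(prompt, best, n)
        if not candidates:
            break
        top = max(candidates, key=score)
        top_score = score(top)
        # Stop once a round fails to improve on the current best.
        if top_score <= best_score + tol:
            break
        best, best_score = top, top_score
    return best
```

The `max_rounds` cap bounds total cost at `n * max_rounds` samples even if the score keeps creeping upward.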

- **Generation & Scoring**:
-    - The policy model generates several independent answers to the same prompt, and a separate scoring (reward) model evaluates them.
- **Inference Time Compute**:
-    - An economical way to improve performance by increasing compute at the inference stage instead of making the model larger (Scaling Laws for Inference).
- **Quality Control**:
-    - Filter out answers that hallucinate or violate safety guidelines, and return the most logical result.
+#### Tree-of-Thought (ToT)
+- BoN combined with search.
+- Backtracking is allowed.

---
+#### Beam search
+- Keep N candidates in parallel and prune step by step.

-[[Best-of-N Sampling|Best-of-N Sampling]] is a decoding-time reranking technique used to improve the inference quality of large language models (LLMs).
+### Weaknesses
+1. **Reward hacking**: the selection exploits spurious features of the RM.
+2. **Diversity collapse**: if the temperature is not high enough, the N samples come out nearly identical.
+3. **Cost**: N× compute.
+4. **Latency**: a poor fit for user-facing paths.

-1.  **Mechanism**:
-    *   **Generation**: Generate N independent candidate answers to the same prompt by adjusting the temperature.
-    *   **Scoring (Reward Model)**: Evaluate the N generated answers with a reward model (RM) or specific verification logic (verifier).
-    *   **Selection**: Select the highest-scoring answer as the final output.
-2.  **Why it matters**:
-    *   It can reach [[SOTA|SOTA]]-level performance by spending additional inference compute, without further training the model itself. (Linked to [[Scalability|Scalability]])
+→ Tune N with cost in mind.
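
Diversity collapse in particular is cheap to detect before paying for scoring. The helper below is a hypothetical sketch (the name `distinct_ratio` and the normalization rule are assumptions): it reports what fraction of the N samples are unique after lowercasing and collapsing whitespace. A ratio near 1/N suggests the samples are effectively identical and that a higher temperature or a smaller N may be appropriate.

```python
from typing import List


def distinct_ratio(samples: List[str]) -> float:
    """Fraction of samples that are unique after case/whitespace normalization."""
    if not samples:
        return 0.0
    normalized = {" ".join(s.lower().split()) for s in samples}
    return len(normalized) / len(samples)
```

Exact-match uniqueness is a crude proxy; for reasoning tasks, counting distinct final answers is often the more useful signal.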

-## ⚠️ Contradictions & Updates
- As N grows, quality improves but cost and latency rise steeply. Real-time services need a compromise around N=3-5, and research has recently been shifting toward strengthening self-correction ability.
+## 💻 Patterns

---
+### Self-consistency (vote)
+```python
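+import collections
+import re
+from vllm import LLM, SamplingParams

- **Conflict with past data:** An evolution from simply picking the next token by probability to "verification-based inference" that evaluates the completeness of the whole context after the fact.
- **Policy change:** Adopted mainly in code generation and data-extraction agents, where accuracy is critical, rather than in chatbots, where real-time response matters.
+# Example model id; any instruction-tuned model you have access to works.
+llm = LLM(model='meta-llama/Meta-Llama-3-8B-Instruct')
+# n=8 samples per prompt; temperature > 0 keeps the reasoning chains diverse.
+sampling = SamplingParams(n=8, temperature=0.7, max_tokens=512)

---
+prompt = "What is 1234 * 5678? Show your reasoning step by step. End with 'Answer: <number>'."
+outputs = llm.generate([prompt], sampling)

- As N grows, quality improves but cost and latency rise steeply. Choosing an appropriate N for the service's real-time requirements is therefore the engineering craft here.
+# Extract the final answer from each sample (thousands separators tolerated).
+answers = []
+for o in outputs[0].outputs:
+    match = re.search(r'Answer:\s*(\d[\d,]*)', o.text)
+    if match:
+        answers.append(int(match.group(1).replace(',', '')))

---
+# Majority vote; None if no sample produced a parseable answer.
+final = collections.Counter(answers).most_common(1)[0][0] if answers else None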
+```

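- **Conflict with past data**: It was once assumed that always taking the most probable next token (Greedy [[Search|Search]]) was best, but current practice shows that securing diversity first and then applying post-verification is far more effective on complex reasoning problems (RL Update).
- **Policy change (RL Update)**: Recent reasoning-focused models such as OpenAI o1 are evolving beyond simply drawing N samples into systems that verify and revise the chain-of-thought (CoT) process itself. (Linked to Tree-of-Thought)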
+### Best-of-N with Reward Model
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from vllm import SamplingParams  # `llm` is the vLLM engine created in the previous block

-## 🔗 Knowledge Links (Graph)
- Related: Reinforcement Learning , AI model evaluation
- Context: [[Information Theory|Information Theory]]
+# 'reward-model' is a placeholder; substitute a sequence-classification RM checkpoint.
+rm_model = AutoModelForSequenceClassification.from_pretrained('reward-model')
+rm_tokenizer = AutoTokenizer.from_pretrained('reward-model')

---
+def score(prompt, response):
+    # Plain concatenation is a simplification; many RMs expect their own chat template.
+    inputs = rm_tokenizer(prompt + response, return_tensors='pt', truncation=True)
+    with torch.no_grad():  # inference only, no gradients needed
+        return rm_model(**inputs).logits[0, 0].item()

- **Parent:** 10_Wiki/💡 Topics/AI
- **Related:** Chain-of-Thought, Self-Consistency, Reward-Modeling
- **Raw Source:** 00_Raw/2026-04-20/[[Best-of-N Sampling|Best-of-N Sampling]].md
+def best_of_n(prompt, n=16, T=0.8):
+    # Sample n candidates, score each with the RM, return the top-scoring text.
+    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
+    outputs = llm.generate([prompt], sampling)[0].outputs
+    scored = [(o.text, score(prompt, o.text)) for o in outputs]
+    return max(scored, key=lambda x: x[1])[0]
+```

---
+### Rejection sampling for fine-tune
+```python
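+def generate_n(prompt, n=8, T=0.8):
+    # Helper: sample n candidates (objects with .text) using the vLLM engine `llm` from above.
+    sampling = SamplingParams(n=n, temperature=T, max_tokens=512)
+    return llm.generate([prompt], sampling)[0].outputs
+
+def generate_rft_dataset(prompts, verifier, n=8):
+    # `verifier(prompt, text) -> bool` is task-specific (answer check, unit tests, ...).
+    dataset = []
+    for prompt in prompts:
+        candidates = generate_n(prompt, n=n)
+        passing = [c for c in candidates if verifier(prompt, c.text)]
+        if passing:
+            # Break ties among passing candidates with the RM `score` defined above.
+            best = max(passing, key=lambda c: score(prompt, c.text))
+            dataset.append({'prompt': prompt, 'response': best.text})
+    return dataset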

- Related: [[Prompt-Engineering|Prompt-Engineering]] , [[Reinforcement-Learning|Reinforcement-Learning]]-from-Human-Feedback-(RLHF)
- Metric: Reward-Model-Training
+# Then run SFT on the resulting dataset.
+```

---
+→ This forms a self-improvement loop.

- [[Scalability|Scalability]], [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Tree-of-Thought, [[Search-Strategy|Search-Strategy]], Inference
- **Related Terms**: Rejection Sampling, Majority Voting, Thought-level Verifiers.
---
+### Tree-of-Thought (simplified)
+```python
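+def tot_search(prompt, evaluate, depth=3, breadth=4):
+    # Simplified: a beam search over partial reasoning paths, without explicit backtracking.
+    # `evaluate(path) -> float` is a task-specific value function (RM, PRM, or LLM judge).
+    # `generate_n(text, n)` samples n continuations, each exposing `.text`.
+    state = [prompt]
+    for _ in range(depth):
+        candidates = []
+        for s in state:
+            for c in generate_n(s, n=breadth):
+                path = s + '\n' + c.text
+                candidates.append((path, evaluate(path)))
+        # Keep only the `breadth` highest-valued paths for the next level.
+        candidates.sort(key=lambda x: -x[1])
+        state = [path for path, _ in candidates[:breadth]]
+    return state[0]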
+```

-## 🤖 LLM Usage Hints (How to Use This Knowledge)
+### LLM-as-judge selection
+```python
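+def llm_judge(prompt, candidates):
+    # `candidates` is a list of response strings; number them so the judge can cite an index.
+    listing = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
+    judge_prompt = f"""Given the prompt:
+{prompt}

-**When to use this knowledge:**
- *(TODO)*
+Rate each response 1-10. Pick the best.

-**When not to use it:**
- *(TODO)*
+{listing}

-## 🧪 Validation Status (Validation)
+Reply with: BEST=<index>"""
+    # Greedy decoding for a stable verdict; vLLM returns a list of RequestOutput objects.
+    params = SamplingParams(temperature=0.0, max_tokens=16)
+    judgment = llm.generate([judge_prompt], params)[0].outputs[0].text
+    match = re.search(r'BEST=(\d+)', judgment)
+    idx = int(match.group(1)) if match else 0  # fall back to the first candidate
+    return candidates[min(idx, len(candidates) - 1)]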
+```

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. Body needs verification.)*
+## 🤔 Decision Criteria
+| Situation | Method |
+|---|---|
+| Math / verifiable | Self-consistency (vote) |
+| Code | Verifier (run tests) |
+| General quality | RM-based BoN |
+| Subjective | LLM-as-judge |
+| Self-improve | RFT |
+| Deep reasoning | Tree-of-Thought / o1-style |

-## 🧬 Duplicate Check
+**Default**: start with self-consistency (N=8-16) as the baseline. If an RM is available, use BoN.

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template and missing fields filled in.
+## 🔗 Graph
+- Parents: [[LLM-Inference]] · [[Test-Time-Compute]]
+- Variants: [[Self-Consistency]] · [[Rejection-Sampling]] · [[Tree-of-Thought]] · [[Beam-Search]]
+- Applications: [[OpenAI-o1]] · [[DeepSeek-R1]] · [[RFT]] · [[Process-Reward-Model]]
+- Adjacent: [[Reward-Model]] · [[RLHF]] · [[Chain-of-Thought]] · [[LLM-as-Judge]]

-## 🕓 Changelog
+## 🤖 LLM Usage
+**When to use**: verifiable tasks (math, code); quality matters more than latency; an RM is available; self-improvement loops.
+**When not to use**: strict latency budgets; neither an RM nor a verifier is available; streaming responses.

-| Date | Change | Handling | Trust |
-|------|-----------|-----------|--------|
-| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
+## ❌ Anti-patterns
+- **N=1 + temperature=0**: this is not BoN.
+- **Same temperature for every sample**: limits diversity.
+- **Ignoring reward hacking**: the selection ends up exploiting the RM.
+- **N → ∞**: cost keeps climbing while quality plateaus.
+- **No verifier and no RM**: BoN has nothing to select with.
+- **BoN on latency-critical paths**: the wrong tool.
+
+## 🧪 Verification / Duplicates
+- Verified (Wang et al. 2022, OpenAI o1, Cobbe et al.).
+- Trust level A.
+- Related: [[Self-Consistency]] · [[Tree-of-Thought]] · [[RLHF]] · [[Chain-of-Thought]].
+
+## 🕓 Changelog
+| Date | Change |
+|---|---|
+| 2026-05-08 | Phase 1 |
+| 2026-05-10 | Manual cleanup: methods + economics + RFT + ToT + vLLM code |