---
id: [[P-Reinforce|P-Reinforce]]-AI-BEST-OF-N
category: Unified
confidence_score: 0.99
tags: [Best-of-N, Sampling, Inference, Reward Model, AI [[Alignment|Alignment]]]
last_reinforced: 2026-04-20
---

# [[Best-of-N-Sampling|Best-of-N-Sampling]] (Best-of-N Sampling)

## 📌 One-Line Insight (The Karpathy Summary)

> "Finding the one general who is worth more than ten ministers." An inference strategy that maximizes quality by generating N outputs from an LLM and keeping only the single answer that a reward model (Reward Model) judges to be the best.

## 📖 Structured Knowledge (Synthesized Content)

- **Generation & Scoring**:
  - For the same prompt, the policy model (Policy) generates several independent answers, and a separate scoring model (Reward) evaluates each of them.
- **Inference Time Compute**:
  - Instead of making the model larger, performance is improved by spending more compute at inference time, which can be a more economical way to raise quality (Scaling Laws for Inference).
- **Quality Control**:
  - Answers that contain hallucinations or violate safety guidelines are filtered out, to the extent the reward model penalizes them, and the most logically sound output is returned.

## ⚠️ Contradictions and Updates (RL Update)

- As N grows, quality tends to improve, typically with diminishing returns, while the cost of generating and scoring candidates grows roughly linearly with N. Latency also grows with N when candidates are sampled sequentially; sampling them in parallel trades that latency for more hardware. Choosing an appropriate N for the service's real-time requirements is therefore the central engineering trade-off.

## 🔗 Knowledge Links (Graph)

- Related: [[Prompt-Engineering|Prompt-Engineering]], [[Reinforcement-Learning|Reinforcement-Learning]]-from-Human-Feedback-(RLHF)
- Metric: Reward-Model-Training
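
The generate-then-score loop described under Generation & Scoring can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer are stand-ins for a real policy model and reward model, which this note does not specify.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Sample n candidates for `prompt` and return the highest-scoring one.

    Also returns every (candidate, score) pair so callers can inspect them.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # One policy call and one reward call per candidate: 2 * n model calls.
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(c, score(prompt, c)) for c in candidates]
    # max() keeps the first candidate in case of a tie.
    best, _ = max(scored, key=lambda pair: pair[1])
    return best, scored


# Toy stand-ins (assumptions for illustration only, not real models).
_rng = random.Random(0)
_CANNED = [
    "Paris",
    "paris, probably",
    "I am not sure",
    "The capital of France is Paris.",
]


def toy_generate(prompt: str) -> str:
    """Hypothetical policy: picks a canned answer at random."""
    return _rng.choice(_CANNED)


def toy_score(prompt: str, answer: str) -> float:
    """Hypothetical reward: favors answers naming Paris, then full sentences."""
    reward = 1.0 if "Paris" in answer else 0.0
    reward += 0.5 if answer.endswith(".") else 0.0
    return reward


if __name__ == "__main__":
    best, scored = best_of_n(
        "What is the capital of France?", toy_generate, toy_score, n=8
    )
    print(f"picked: {best!r} out of {len(scored)} candidates")
```

Passing the policy and reward model in as plain callables keeps the selection logic independent of any particular model API. The comment on model calls also makes the cost side of the trade-off concrete: each extra candidate adds one generation and one scoring call.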