--- id: P-REINFORCE-AUTO-BONS-001 category: "10_Wiki/๐Ÿ’ก Topics/AI" confidence_score: 0.94 tags: [auto-reinforced, best-of-n, sampling-strategy, inference-optimization, llm, reasoning, reranking] last_reinforced: 2026-04-20 --- # [[Best-of-N-Sampling|Best-of-N-Sampling]] ## ๐Ÿ“Œ ํ•œ ์ค„ ํ†ต์ฐฐ (The Karpathy Summary) > "์ง€๋Šฅ์˜ ๋ฌผ๋Ÿ‰ ๊ณต์„ธ: ํ•œ ๋ฒˆ์— ์ •๋‹ต์„ ๋งžํžˆ๋ ค ์• ์“ฐ๊ธฐ๋ณด๋‹ค, N๊ฐœ์˜ ๋‹ต๋ณ€์„ ๋™์‹œ์— ์ƒ์„ฑํ•œ ๋’ค ๊ทธ์ค‘ ๊ฐ€์žฅ ๋…ผ๋ฆฌ์ ์ด๊ณ  ์ •ํ™•ํ•œ '์ตœ์„ ์˜ ๋‹ต๋ณ€'์„ ๊ณจ๋ผ๋‚ด๋Š” ๋ฐฉ์‹์œผ๋กœ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ๋น„์•ฝ์ ์œผ๋กœ ๋Œ์–ด์˜ฌ๋ฆฌ๋Š” ์ธํผ๋Ÿฐ์Šค ์ตœ์ ํ™” ์ „์ˆ ." ## ๐Ÿ“– ๊ตฌ์กฐํ™”๋œ ์ง€์‹ (Synthesized Content) Best-of-N Sampling(์ตœ์  ์ƒ˜ํ”Œ๋ง)์€ ๊ฑฐ๋Œ€ ์–ธ์–ด ๋ชจ๋ธ(LLM)์˜ ์ถ”๋ก  ํ’ˆ์งˆ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋Š” ๋””์ฝ”๋”ฉ ์‹œ์ ์˜ ๋ฆฌ๋žญํ‚น(Reranking) ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. 1. **๋ฉ”์ปค๋‹ˆ์ฆ˜**: * **Generation**: ๋™์ผํ•œ ํ”„๋กฌํ”„ํŠธ์— ๋Œ€ํ•ด Temperature๋ฅผ ์กฐ์ ˆํ•˜์—ฌ N๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋‹ต๋ณ€ ํ›„๋ณด๋ฅผ ์ƒ์„ฑ. * **Scoring (Reward Model)**: ์ƒ์„ฑ๋œ N๊ฐœ์˜ ๋‹ต๋ณ€์„ ๋ณด์ƒ ๋ชจ๋ธ(RM)์ด๋‚˜ ํŠน์ • ๊ฒ€์ฆ ๋กœ์ง(Verifier)์œผ๋กœ ํ‰๊ฐ€. * **Selection**: ๊ฐ€์žฅ ๋†’์€ ์ ์ˆ˜๋ฅผ ๋ฐ›์€ ๋‹ต๋ณ€์„ ์ตœ์ข… ์ถœ๋ ฅ์œผ๋กœ ์„ ํƒ. 2. **์™œ ์ค‘์š”ํ•œ๊ฐ€?**: * ๋ชจ๋ธ ์ž์ฒด๋ฅผ ์ถ”๊ฐ€ ํ•™์Šต(Training)์‹œํ‚ค์ง€ ์•Š๊ณ ๋„, ์ถ”๋ก  ์‹œ์ ์˜ ์—ฐ์‚ฐ ์ž์›(Inference compute)์„ ์ถ”๊ฐ€ ํˆฌ์ž…ํ•˜์—ฌ SOTA ๊ธ‰์˜ ์„ฑ๋Šฅ์„ ๋‚ผ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž„. (Scalability์™€ ์—ฐ๊ฒฐ) ## โš ๏ธ ๋ชจ์ˆœ ๋ฐ ์—…๋ฐ์ดํŠธ (Contradictions & RL Update) - **๊ณผ๊ฑฐ ๋ฐ์ดํ„ฐ์™€์˜ ์ถฉ๋Œ**: ๊ณผ๊ฑฐ์—๋Š” ๋ฌด์กฐ๊ฑด '๊ฐ€์žฅ ํ™•๋ฅ  ๋†’์€ ๋‹ค์Œ ํ† ํฐ(Greedy search)'๋งŒ ์ฐพ๋Š” ๊ฒƒ์ด ์ตœ์„ ์ด๋ผ ์—ฌ๊ฒผ์œผ๋‚˜, ํ˜„๋Œ€ ์ •์ฑ…์€ ๋‹ค์–‘์„ฑ ์ •์ฑ…(Diversity)์„ ํ™•๋ณดํ•œ ๋’ค ์‚ฌํ›„ ๊ฒ€์ฆ ์ •์ฑ…(Post-verification)์„ ๊ฑฐ์น˜๋Š” ๊ฒƒ์ด ํ›จ์”ฌ ๋” ๋ณต์žกํ•œ ์ถ”๋ก  ๋ฌธ์ œ ์ •์ฑ…์— ํšจ๊ณผ์ ์ž„์„ ์ฆ๋ช…ํ•จ(RL Update). - **์ •์ฑ… ๋ณ€ํ™”(RL Update)**: ์ตœ๊ทผ OpenAI o1 ๋“ฑ ์ถ”๋ก  ์ „๋ฌธ ๋ชจ๋ธ ์ •์ฑ…์€ ๋‹จ์ˆœํžˆ N๊ฐœ๋ฅผ ๋ฝ‘๋Š” ์ˆ˜์ค€์„ ๋„˜์–ด, ์ƒ๊ฐ์˜ ์ฒด์ธ(CoT) ๊ณผ์ • ์ž์ฒด๋ฅผ ๊ฒ€์ฆํ•˜๊ณ  ์ˆ˜์ •ํ•˜๋Š” ์‹œ์Šคํ…œ์œผ๋กœ ์ง„ํ™” ์ค‘์ž„. (Tree-of-Thought์™€ ์—ฐ๊ฒฐ) ## ๐Ÿ”— ์ง€์‹ ์—ฐ๊ฒฐ (Graph) - [[Scalability|Scalability]], [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Tree-of-Thought, [[Search-Strategy|Search-Strategy]], Inference - **Related Terms**: Rejection Sampling, Majority Voting, Thought-level Verifiers. ---