---
id: RLHF-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, llm, reinforcement-learning, alignment, rlhf]
last_reinforced: 2026-04-26
---

# Reinforcement Learning from Human Feedback (RLHF)

## 📌 One-Line Insight (The Karpathy Summary)

> "Make human preference the AI's compass." A technique that trains a reward function (the Reward Model) on the scores or rankings people assign to model outputs, then uses it to align the AI with human intent and values (Alignment).

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** A three-stage alignment pattern that instills abstract values that are hard to define mathematically, such as helpfulness, harmlessness, and accuracy, into the model through direct human evaluation.
- **Detailed process:**
  - **1. Pre-training:** Learn the basic structure of language from large-scale text data.
  - **2. Reward Modeling:** Humans rank several candidate answers from the model, and a separate reward model is trained to predict those preferences.
  - **3. RL Fine-tuning:** Update the main model (Policy) with an algorithm such as PPO, in the direction the reward model scores highly.
- **Significance:** The key driver behind chatbots (ChatGPT and others) that go beyond plain next-word prediction and can converse with people naturally and safely.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with earlier data:** Early reinforcement learning needed an explicit reward metric such as a game score. RLHF turns subjective human feedback into the reward signal, which greatly widens the range of tasks RL can be applied to.
- **Policy change:** The Antigravity agent collects users' 'Thumbs Up/Down' feedback and runs a Mini-RLHF loop that corrects the local brain's answer style in real time.

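
The Reward Modeling step listed above can be illustrated with a small sketch. It assumes a linear reward over hand-made feature vectors and the Bradley-Terry pairwise loss that is commonly used for preference data; the feature meanings, the toy pairs, and every function name here are hypothetical and do not come from the source. A real reward model would typically be a neural network that scores text directly.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def reward(w, features):
    """Hypothetical linear reward model: r(x) = w . features(x)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def pairwise_loss(w, pairs):
    """Mean Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    total = 0.0
    for chosen, rejected in pairs:
        total += -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))
    return total / len(pairs)

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Fit w by plain SGD so that preferred answers receive higher reward."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # d(-log sigmoid(margin)) / d(margin) = -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                w[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return w

# Toy preference data (assumed): each answer is [helpfulness cue, rudeness cue].
# In each pair, the first vector is the answer the human ranked higher.
PAIRS = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]

weights = train_reward_model(PAIRS, dim=2)
```

The loss depends only on the reward difference between the preferred and the rejected answer, which is why rankings are enough and annotators do not have to agree on absolute scores. In the RL Fine-tuning stage, a trained scorer like this would stand in for the human when rating new policy outputs.
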
## 🔗 Knowledge Links (Graph)

- [[Reinforcement-Learning|Reinforcement-Learning]], [[Alignment|Alignment]], [[LLM|LLM]], PPO, AI-Safety
- **Raw Source:** 10_Wiki/Topics/AI/Reinforcement-Learning-from-Human-Feedback-RLHF.md