---
id: RL-PG-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, Reinforcement-Learning, policy-gradient, reinforce, ppo, trpo, continuous-control]
last_reinforced: 2026-04-26
---

# Policy Gradient Methods

## 📌 One-Line Insight (The Karpathy Summary)

> "Don't ask what an action is worth; directly reinforce the probability of the choices that lead to a win."

Policy gradient methods are a family of reinforcement learning algorithms that define the agent's policy as a parameterized function and update its weights along the policy gradient, in the direction that increases expected reward.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Direct Policy [[Optimization|Optimization]] and Log-probability Scaling". The gradient of the log-probability ($\log \pi$) of each action is scaled by the reward it produced, so actions that led to good rewards become more likely and actions that led to bad outcomes become less likely. Repeated over many updates, the model gradually learns a better action sequence.
- **Key algorithms:**
  - **REINFORCE:** The basic method. It waits until an episode ends and updates the policy from the total return of that episode.
  - **PPO (Proximal Policy Optimization):** Limits how far the policy can change in a single update, which stabilizes training. It is a widely used modern default.
  - **Actor-Critic:** Combines an Actor that selects actions with a Critic that estimates their value, which reduces the variance of the gradient estimate.
- **Significance:** Performs well on continuous control problems ([[Robotics|Robotics]]), where the action space is infinitely large, and serves as the core optimization engine of RLHF (Reinforcement Learning from Human Feedback), which incorporates human preferences.
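
The log-probability scaling pattern and PPO's clipping can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the two-armed bandit task, the function names `reinforce_bandit` and `ppo_clip_objective`, the running-mean baseline, and every hyperparameter are hypothetical choices made for the example. The update applied is $\theta \leftarrow \theta + \alpha \, A \, \nabla_\theta \log \pi_\theta(a)$, where $A$ is the return minus a baseline.

```python
import math
import random


def softmax(logits):
    """Convert raw action preferences into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a one-step task (hypothetical toy setup).

    Each episode is a single action, so the episode return is just the
    reward of the chosen arm. A running-mean baseline (an assumption of
    this sketch) is subtracted to reduce variance.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        ret = arm_rewards[action]
        advantage = ret - baseline
        baseline += 0.05 * (ret - baseline)
        # For a softmax policy: d log pi(a) / d logit_k = 1[k == a] - pi(k).
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            # Scale the log-probability gradient by the advantage.
            logits[k] += lr * advantage * grad_log_pi
    return softmax(logits)


def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample; ratio = pi_new(a) / pi_old(a)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum removes any incentive to move the ratio
    # outside the [1 - eps, 1 + eps] band in the improving direction.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    print(reinforce_bandit([0.0, 1.0]))
    print(ppo_clip_objective(1.5, 1.0))
```

With rewards `[0.0, 1.0]`, the learned policy should place most of its probability on the second arm. `ppo_clip_objective(1.5, 1.0)` evaluates to about 1.2 rather than 1.5, which shows how clipping caps the gain from a large policy change.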

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with past data:** Policy gradient methods were criticized as less data-efficient than value-based methods such as Q-Learning. Combined with stable update rules like PPO and large-scale parallel sampling, they have nonetheless become an essential technique for tuning modern very large AI models.
- **Policy change:** The Antigravity project uses a PPO-based policy gradient model to optimize the tool-selection probability at each step when an agent plans a complex, multi-step task.

## 🔗 Knowledge Links (Graph)

- [[Reinforcement-Learning|Reinforcement-Learning]], [[Proximal-Policy-Optimization-PPO]], [[Actor-Critic-Models|Actor-Critic-Models]], [[Off-policy-vs-On-policy-Learning|Off-policy-vs-On-policy-Learning]]
- **Raw Source:** 10_Wiki/Topics/AI/Policy-Gradient-Methods.md