---
id: PG-METHOD-001
category: "[[10_Wiki/💡 Topics/AI]]"
confidence_score: 1.0
tags: [reinforcement-learning, ai, policy-gradient, optimization]
last_reinforced: 2026-04-26
---

# [[Policy Gradient Methods (정책 경사법)]]

## 📌 One-Line Insight (The Karpathy Summary)
> "Don't compute the value of actions; directly raise the probability of good actions." A reinforcement learning technique in which a neural network outputs the policy ($\pi$) directly, without going through a value function, and the policy's parameters are updated in the direction that maximizes expected reward.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** When a sequence of actions taken by the agent earns a high reward, the probability of those actions is raised; when it earns a low reward, the probability is lowered. The optimal strategy is searched for directly in this way.
- **Details:**
  - **Stochastic Policy:** Actions are sampled from a probability distribution, so exploration happens naturally.
  - **High-dimensional Action Spaces:** Unlike value-based methods (DQN, etc.), which must take a maximum over all actions, policy gradients remain effective in continuous or very large action spaces.
  - **REINFORCE Algorithm:** The most basic policy gradient algorithm. It updates the policy using the rewards of an entire episode.
  - **Variance Problem:** Rewards vary widely, which can make training unstable; a baseline or an actor-critic architecture is used to address this.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)
- **Conflict with earlier data:** Early reinforcement learning focused only on learning value functions; the center of gravity has since shifted to policy-based learning, which allows more complex and flexible behavior control.
- **Policy change:** When training the boss AI for Skybound, an advanced policy gradient method such as PPO (Proximal Policy Optimization) was applied so that complex patterns are generated naturally.

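
To make the REINFORCE and baseline items under Synthesized Content more concrete, here is a minimal sketch, not a reference implementation. It assumes the common Monte Carlo estimate $\nabla_\theta J(\theta) \approx \sum_t (G_t - b)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, where $G_t$ is the discounted return from step $t$ and $b$ is a scalar baseline. The toy corridor environment, the tabular softmax policy, the moving-average baseline, and all names and hyperparameters (`run_episode`, `train_reinforce`, `lr`, `gamma`) are illustrative assumptions that do not come from this note.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def run_episode(theta, rng, n_states=5, max_steps=30):
    """Roll out one episode in a hypothetical toy corridor.

    The agent starts in state 0. Action 0 moves left (staying put at the
    wall), action 1 moves right. Reaching the last state ends the episode
    with reward +1; every other step gives reward 0.
    """
    state, trajectory = 0, []
    for _ in range(max_steps):
        probs = softmax(theta[state])
        action = rng.choice(2, p=probs)  # sample from the stochastic policy
        next_state = max(state - 1, 0) if action == 0 else state + 1
        done = next_state == n_states - 1
        trajectory.append((state, action, 1.0 if done else 0.0))
        state = next_state
        if done:
            break
    return trajectory


def train_reinforce(episodes=2000, lr=0.1, gamma=0.9, seed=0):
    """REINFORCE with a scalar moving-average baseline (assumed variant)."""
    rng = np.random.default_rng(seed)
    n_states = 5
    theta = np.zeros((n_states, 2))  # one row of action logits per state
    baseline = 0.0
    for _ in range(episodes):
        trajectory = run_episode(theta, rng, n_states)

        # Discounted return-to-go G_t, accumulated backwards over the episode.
        g, returns = 0.0, []
        for _, _, reward in reversed(trajectory):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()

        # Accumulate (G_t - b) * grad log pi(a_t | s_t) over the whole episode.
        grad = np.zeros_like(theta)
        for (s, a, _), g_t in zip(trajectory, returns):
            grad_log_pi = -softmax(theta[s])  # softmax gradient: onehot(a) - pi
            grad_log_pi[a] += 1.0
            grad[s] += (g_t - baseline) * grad_log_pi

        theta += lr * grad  # gradient ascent on expected return
        # Update the baseline only after the policy step, so it does not
        # depend on the actions of the episode it was just applied to.
        baseline += 0.05 * (returns[0] - baseline)
    return theta
```

Calling `train_reinforce()` and passing row 0 of the result through `softmax` gives the learned action probabilities in the start state. Subtracting the baseline leaves the gradient estimate unbiased while shrinking its magnitude, which is the variance-reduction idea named above; an actor-critic method would replace the scalar baseline with a learned state-value estimate.
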
## 🔗 Knowledge Links (Graph)
- [[Reinforcement-Learning]], [[Actor-Critic-Methods]], [[Q-Learning]], [[PPO]]
- **Raw Source:** [[10_Wiki/Topics/AI/Policy-Gradient-Methods.md]]