---
id: wiki-2026-0508-policy-optimization
title: Policy Optimization
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-POLO-001]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
tags: [auto-reinforced, Reinforcement-Learning, Optimization, policy-gradient, ai-training]
raw_sources: []
last_reinforced: 2026-04-20
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
tech_stack:
  language: unspecified
  framework: unspecified
---

# [[Policy-Optimization|Policy-Optimization]]

## 📌 One-Line Insight (The Karpathy Summary)
> "The evolution of a behavioral rulebook: the heart of reinforcement learning, in which trial and error plus reward are used to mathematically refine which choice is best for the agent in which situation (the policy)."

## 📖 Structured Knowledge (Synthesized Content)

Policy optimization is the family of reinforcement learning (RL) methods that train the agent's decision rule, the policy, directly so as to maximize expected cumulative reward.

1. **Core mechanism (Policy Gradient)**:
    * Update the weights so that an action followed by high reward becomes more probable and an action followed by low reward becomes less probable.
    * $\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, R\right]$, where $R$ is the return (cumulative reward) of the sampled trajectory.
2. **Main algorithms**:
    * **REINFORCE**: The most basic policy-gradient method; it weights each update by the total sum of rewards.
    * **PPO (Proximal Policy Optimization)**: Introduced by OpenAI and widely used as a default algorithm; it suppresses abrupt policy changes by clipping the probability ratio between the new and old policy, which greatly improves training stability.
    * **TRPO (Trust Region Policy Optimization)**: Restricts each policy change to a trust region (a bound on the KL divergence between the old and new policy), which yields a theoretical guarantee of monotonic improvement that holds approximately in practice.
3. **Advantages**:
    * Well suited to problems with continuous action spaces (e.g., controlling a robot arm).
    * A stochastic policy performs exploration naturally, because actions are sampled rather than chosen deterministically.

## ⚠️ Contradictions & Updates
- **Conflict with earlier data**: Value-based methods (Q-Learning) used to be the mainstream, but complex real-world problems are hard to capture with a value function alone, so direct policy optimization has become the dominant approach in modern AI.
- **Policy change (RL Update)**: To prevent the "reward hacking" and "safety violations" that can arise during policy optimization, "Safe RL" formulations, which include the constraints directly in the objective, have been adopted as a mandatory requirement for training autonomous-driving and medical AI systems.

## 🔗 Knowledge Links (Graph)
- [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Policy Gradient Methods, [[Optimization|Optimization]], Machine Learning, PPO (Proximal Policy Optimization)
- **Modern Tech/Tools**: OpenAI Spinning Up, Stable Baselines3, Ray RLlib.

---

## 🤖 LLM Usage Hints (How to Use This Knowledge)
**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 automatic normalization. The body still needs verification.)*

## 🧬 Duplicate Check
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template updated and missing fields filled in.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |

## 💻 Code Patterns

**Pattern 1:** *(TODO: structural skeleton that reflects this project's conventions)*

```text
# TODO
```

## 🤔 Decision Criteria

**When to choose option A:**
- *(TODO)*

**When to choose option B:**
- *(TODO)*

**Default:**
> *(TODO)*

## ❌ Anti-Patterns
- **[Anti-pattern]:** *(TODO: what not to do + why + what to do instead)*
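
The policy-gradient estimator and the PPO clipping idea described under Synthesized Content can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: it assumes a single-state softmax policy over discrete actions, and every name in it (`softmax`, `grad_log_softmax`, `reinforce_gradient`, `ppo_clipped_term`), the learning rate of `0.5`, and the default `clip_eps=0.2` are hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Turn action preferences (logits) into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi_theta(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_gradient(logits, samples):
    """Monte Carlo estimate of E[grad log pi(a) * R] from (action, return) pairs."""
    grad = [0.0] * len(logits)
    for action, ret in samples:
        g = grad_log_softmax(logits, action)
        for i in range(len(grad)):
            grad[i] += g[i] * ret / len(samples)
    return grad


def ppo_clipped_term(new_prob, old_prob, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r = new_prob / old_prob.
    clip_eps=0.2 is an assumed value chosen for illustration."""
    ratio = new_prob / old_prob
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    logits = [0.0, 0.0]                       # two actions, initially equally likely
    samples = [(0, 1.0), (1, 0.0), (0, 1.0)]  # action 0 was rewarded, action 1 was not
    grad = reinforce_gradient(logits, samples)
    # One gradient-ascent step; the step size 0.5 is arbitrary.
    new_logits = [t + 0.5 * g for t, g in zip(logits, grad)]
    print(softmax(new_logits))                # probability of action 0 rises above 0.5
    # The ratio 0.9 / 0.5 = 1.8 exceeds 1 + clip_eps, so the clipped term is used.
    print(ppo_clipped_term(0.9, 0.5, advantage=1.0))
```

The clipped term caps how much a single sample with positive advantage can reward moving away from the old policy, which is one way to read the "suppress abrupt policy changes" description of PPO. Real training code would add a neural-network policy, advantage estimation, and batching, typically through a library such as those listed under Modern Tech/Tools.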