---
id: RL-PPO-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, reinforcement-learning, ppo, proximal-policy-optimization, openai, stability, policy-gradient]
last_reinforced: 2026-04-26
---

# Proximal Policy Optimization (PPO)

## 📌 One-Line Insight (The Karpathy Summary)

> "Rein in abrupt policy changes with the bridle of clipping, and drive stable growth of intelligence that does not collapse even in complex environments."
>
> PPO is a reinforcement learning algorithm proposed by OpenAI. By limiting how far each policy update can move, it achieves training stability and efficiency at the same time, and it has become a standard technique in modern RL.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** "Clipped Surrogate Objective and Stability-First Learning". The probability ratio between the new policy and the old policy is clipped to a fixed range inside the objective, so an update gains nothing by pushing the ratio beyond that range. This prevents a single bad update from wrecking the entire model.
- **Core mechanisms:**
  - **Clipped Objective:** Confines the policy probability ratio to roughly [0.8, 1.2], which corresponds to a clip range of ε = 0.2, suppressing abrupt changes.
  - **Actor-Critic architecture:** Trains the Actor, which chooses actions, together with the Critic, which estimates value.
  - **Multi-epoch Update:** Reuses each batch of collected data for several update passes, improving sample efficiency.
- **Significance:** Relatively simple to implement, yet reliable enough to be among the most widely used algorithms in cutting-edge areas such as autonomous driving, robot control, game AI, and RLHF (Reinforcement Learning from Human Feedback) for LLMs.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with prior data:** PPO replaced TRPO (Trust Region Policy Optimization), which is mathematically more rigorous but very complex to implement, with a practical approximation. In doing so it demonstrated that practical robustness can matter more than theoretical perfection.
- **Policy change:** When optimizing agents' compound decision-making strategies, the Antigravity project adopts PPO as its primary algorithm because it carries a low risk of training divergence and is easy to tune.

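
The Clipped Objective listed under the core mechanisms can be sketched in a few lines. The following is a minimal illustration, not a reference implementation: the function name `clipped_surrogate`, its parameters, and the default `clip_eps=0.2` (chosen to match the [0.8, 1.2] range in this note) are assumptions made for the example. It uses plain Python lists so that it runs without any RL library.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    logp_new, logp_old: log-probabilities of the taken actions under the
        new and old policies (hypothetical inputs for this sketch).
    advantages: advantage estimates for the same actions.
    clip_eps: clip range; 0.2 gives the [0.8, 1.2] ratio band.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(lp_new - lp_old)
        # Ratio restricted to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Keep the smaller (more pessimistic) of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Example: the new policy makes a well-rated action 1.5x more likely.
# The objective only credits the change up to the 1.2 clip boundary.
objective = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
```

In this sketch, a ratio of 1.5 with an advantage of +1.0 yields about 1.2 instead of 1.5, so the update earns no extra credit for moving further from the old policy. With a negative advantage the unclipped term is kept, so a harmful change is still penalized in full. In practice the Multi-epoch Update would evaluate such an objective repeatedly over minibatches of the same collected batch.
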
## 🔗 Knowledge Links (Graph)

- [[Policy-Gradient-Methods|Policy-Gradient-Methods]], [[Actor-Critic-Models|Actor-Critic-Models]], [[Off-policy-vs-On-policy-Learning|Off-policy-vs-On-policy-Learning]], [[Reinforcement-Learning|Reinforcement-Learning]]
- **Raw Source:** 10_Wiki/Topics/AI/Proximal-Policy-Optimization.md