---
id: [[P-Reinforce|P-Reinforce]]-AUTO-GRPO-001
category: Unified
confidence_score: 0.94
tags: [auto-reinforced, grpo, [[Reinforcement-Learning|Reinforcement-Learning]], llm, [[Optimization|Optimization]], ppo, [[Deep-Learning|Deep-Learning]], deepseek]
last_reinforced: 2026-04-20
---

# [[GRPO|GRPO]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The efficiency of learning without a critic: whereas conventional PPO keeps a separate critic model and pays a heavy compute cost for it, this recent reinforcement learning technique scores each response relative to the others in a single group, making large language models dramatically more capable with far fewer resources."

## 📖 Structured Knowledge (Synthesized Content)

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm used to train recent large language models such as DeepSeek-V3.

1. **Innovations**:
    * **No Critic Model**: Removes the value function model that is central to conventional PPO, saving VRAM. (Linked to [[Efficiency|Efficiency]])
    * **Relative Reward**: Generates several responses to the same prompt (the group), rates each response's advantage relative to the group's mean score, and updates the policy accordingly.
2. **Why does it matter?**:
    * With AI training costs growing exponentially, it showed that algorithmic efficiency alone can yield high-performance [[Reasoning|Reasoning]] models.

## ⚠️ Contradictions and Updates (Contradictions & RL Update)

- **Conflict with earlier data**: PPO used to be the "gold standard" of reinforcement learning, but GRPO shows that in large-scale distributed training, statistical relative evaluation can be far more stable ([[Reliability|Reliability]]) than per-sample value estimation (RL Update).
- **Policy shift (RL Update)**: Beyond plain language modeling, it is becoming an essential technique for training math and coding specialist models that require complex multi-step reasoning. (Linked to Reasoning)

## 🔗 Knowledge Links (Graph)

- [[Efficiency|Efficiency]], [[Reliability|Reliability]], [[Reasoning|Reasoning]], [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Deep Learning (DL), [[Optimization|Optimization]]
- **Key Origin**: DeepSeek AI.

---
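
The group-relative scoring that this note describes as replacing the critic can be illustrated with a small sketch. It is not DeepSeek's implementation. The function name `group_relative_advantages`, the `eps` and `normalize_std` parameters, and the sample rewards are all hypothetical. The note only mentions the group mean as the baseline; dividing by the group's standard deviation is an added assumption that follows the commonly described GRPO formulation, and using the population standard deviation is likewise a choice made here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8, normalize_std=True):
    """Score each response against its own group instead of a learned critic.

    rewards: scalar scores for several responses sampled from ONE prompt.
    Returns one advantage per response, in the same order.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one score")

    # The group mean plays the role of the baseline that PPO's critic would predict.
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]

    if not normalize_std:
        return centered

    # Assumption: scale by the group's population standard deviation.
    # eps avoids division by zero when every response gets the same reward.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Hypothetical rewards for four sampled answers to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group mean receive a positive advantage and the rest a negative one, so the policy update can favor the better answers without a separate value model held in memory. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.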