---
id: wiki-2026-0508-grpo
title: GRPO
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-GRPO-001]
duplicate_of: none
source_trust_level: A
confidence_score: 0.94
tags: [auto-reinforced, grpo, Reinforcement-Learning, llm, Optimization, ppo, Deep-Learning, deepseek]
raw_sources: []
last_reinforced: 2026-04-20
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
---

# [[GRPO|GRPO]]

## 📌 One-Line Insight (The Karpathy Summary)

> "The efficiency of learning without a critic: conventional PPO is compute-heavy because it keeps a separate critic model, whereas this recent reinforcement learning technique scores each response relative to the others in its own group, improving large language models substantially with far fewer resources."

## 📖 Structured Knowledge (Synthesized Content)

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm used to train recent large language models such as DeepSeek-V3.

1. **Key innovations**:
    * **No Critic Model**: Removes the value function model that is central to PPO, which saves VRAM. (Linked to [[Efficiency|Efficiency]])
    * **Relative Reward**: Generates several answers to the same prompt (the Group), then rates each answer against the mean score of that group (Relative) and uses the result to update the policy.
2. **Why does it matter?**:
    * With AI training costs growing exponentially, it demonstrated that high-performance reasoning models ([[Reasoning|Reasoning]] models) can be built efficiently through algorithmic efficiency alone.

## ⚠️ Contradictions & Updates

- **Conflict with earlier data**: PPO used to be the "gold standard" of reinforcement learning, but GRPO shows that in large-scale distributed training, statistical relative evaluation can be considerably more stable ([[Reliability|Reliability]]) than per-sample value estimation (RL Update).
- **Policy shift (RL Update)**: Beyond plain language modeling, GRPO is becoming an essential technique for training math- and coding-specialized models that require complex multi-step reasoning. (Linked to Reasoning)

## 🔗 Knowledge Links (Graph)

- [[Efficiency|Efficiency]], [[Reliability|Reliability]], [[Reasoning|Reasoning]], [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], Deep Learning (DL), [[Optimization|Optimization]]
- **Key Origin**: DeepSeek AI.

---

## 🤖 LLM Usage Hints (How to Use This Knowledge)

**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)

- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body text needs verification.)*

## 🧬 Duplicate Check

- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (auto-normalization)
- **Reason:** Phase 1 normalization: filled in legacy template and missing fields.

## 🕓 Changelog

| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
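
## 🧮 Worked Example (Group-Relative Advantage)

The Relative Reward idea from the Synthesized Content section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the function name `group_relative_advantages`, the `eps` constant, and the sample rewards are hypothetical. Dividing by the group's standard deviation is an assumed normalization step; this note itself only mentions comparison against the group mean.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each answer relative to the other answers sampled for the same prompt.

    Hypothetical helper for illustration only.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one score")
    baseline = mean(rewards)  # the group mean replaces a learned critic's value estimate
    spread = pstdev(rewards)  # assumed normalization: population std of the group
    # eps keeps the division finite when every answer received the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four answers sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Answers that score above the group mean receive a positive advantage and are reinforced, while those below it receive a negative one. No value network is consulted at any point, which is the source of the VRAM saving described under No Critic Model.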