---
id: wiki-2026-0508-grpo
title: GRPO (Group Relative Policy Optimization)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GRPO, group relative policy optimization, DeepSeek R1, RL fine-tune]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [rl, grpo, deepseek, reasoning, llm-fine-tune, ppo-alternative]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: TRL / DeepSeek
---

# GRPO (Group Relative Policy Optimization)

## In one line

> **"A critic-free variant of PPO: the baseline comes from a group of samples for the same prompt."**

Introduced by DeepSeek (2024-2025). It is the RL method that enabled DeepSeek-R1's reasoning. A learned reward model is not required; a rule-based reward is sufficient when the answer can be verified. Popular in modern RLHF and reasoning training.

## Core ideas

### GRPO vs PPO

- **PPO**: uses a critic (value network) as the baseline.
- **GRPO**: uses the mean reward of a group of samples for the same prompt as the baseline.
- **Result**: simpler (no value network to train) and strong on reasoning tasks.

### Algorithm

1. For each prompt, sample G rollouts (different responses).
2. Score each rollout with the reward function.
3. Compute the advantage: advantage = (reward - group_mean) / group_std.
4. Optimize a PPO-style clipped objective.

### Notable uses

- **DeepSeek-Math** (2024).
- **DeepSeek-R1** (2025): reasoning behavior emerges.

### Applications

1. **Math reasoning**.
2. **Code generation**.
3. **Tool use**.
4. **Long CoT**.

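Steps 3 and 4 of the algorithm above are easier to follow with concrete numbers. The following is a minimal sketch using only the Python standard library. The function names `group_relative_advantages` and `clipped_term` are hypothetical and chosen for illustration, and the use of the sample standard deviation is an assumption (a population standard deviation would be an equally reasonable choice).

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: sample standard deviation (n - 1 in the denominator).
    Requires at least two rewards per group.
    """
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# One prompt, four rollouts: two correct (reward 1.0), two incorrect (reward 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Rollouts above the group mean get a positive advantage, the rest a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")

# If every rollout in a group gets the same reward, all advantages are zero,
# so that prompt contributes no policy-gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline is computed per prompt, a hard prompt where only one rollout succeeds still produces a strong positive signal for that rollout, without any learned value network.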
## 💻 Patterns

### Basic GRPO loop

```python
import torch

# Assumed interface (illustrative, not a real library API):
#   policy.generate(prompt, do_sample=True) -> response
#   policy.log_prob(prompt, response)       -> scalar tensor (sequence log-prob)
def grpo_step(policy, ref_policy, optimizer, prompts, reward_fn,
              group_size=8, beta=0.04, eps=0.2, ppo_epochs=4):
    advantages_all = []
    log_probs_old_all = []
    log_probs_ref_all = []
    responses_all = []

    for prompt in prompts:
        # Sample G rollouts for this prompt and score each one
        rollouts = []
        rewards = []
        for _ in range(group_size):
            response = policy.generate(prompt, do_sample=True)
            rollouts.append(response)
            rewards.append(reward_fn(prompt, response))
        rewards = torch.tensor(rewards, dtype=torch.float32)

        # Group baseline: normalize rewards within the group
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        advantages_all.extend(adv.tolist())

        # Log-probs under the sampling policy and the frozen reference (no gradients)
        with torch.no_grad():
            for resp in rollouts:
                log_probs_old_all.append(policy.log_prob(prompt, resp))
                log_probs_ref_all.append(ref_policy.log_prob(prompt, resp))
                responses_all.append((prompt, resp))

    # PPO-style update
    for _ in range(ppo_epochs):
        for (prompt, resp), adv, log_old, log_ref in zip(
                responses_all, advantages_all, log_probs_old_all, log_probs_ref_all):
            log_new = policy.log_prob(prompt, resp)
            ratio = (log_new - log_old).exp()
            obj1 = ratio * adv
            obj2 = ratio.clamp(1 - eps, 1 + eps) * adv
            policy_loss = -torch.min(obj1, obj2).mean()

            # KL penalty vs. the reference policy. This is the simple log-ratio
            # estimator; the DeepSeek-Math paper uses a non-negative per-token
            # estimator instead.
            kl = log_new - log_ref
            kl_loss = beta * kl.mean()

            loss = policy_loss + kl_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

### Rule-based reward (math)

```python
# extract_answer and has_required_format are assumed helpers defined elsewhere.
def math_reward(prompt, response):
    """DeepSeek-style rule reward: extract the answer and verify it."""
    answer = extract_answer(response)
    expected = extract_answer(prompt['solution'])
    correctness = 1.0 if answer == expected else 0.0
    format_bonus = 0.1 if has_required_format(response) else 0.0
    return correctness + format_bonus
```

### TRL implementation

```python
from trl import GRPOTrainer, GRPOConfig

# correctness_reward, format_reward and ds are assumed to be defined elsewhere.
trainer = GRPOTrainer(
    model='Qwen/Qwen2.5-7B',
    reward_funcs=[correctness_reward, format_reward],
    args=GRPOConfig(
        output_dir='out',
        num_generations=8,  # group size
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        max_prompt_length=512,
        max_completion_length=1024,
        beta=0.04,
    ),
    train_dataset=ds,
)
trainer.train()
```

### Multi-objective reward

```python
# correctness, check_format and check_reasoning_quality are assumed helpers.
def multi_reward(prompt, response):
    rewards = {}
    rewards['correctness'] = correctness(prompt, response)
    rewards['format'] = check_format(response)
    # Prefer a length of ~500; note that len() counts characters, not tokens
    rewards['length'] = -abs(len(response) - 500) / 1000
    rewards['cot_quality'] = check_reasoning_quality(response)

    weights = {'correctness': 1.0, 'format': 0.1, 'length': 0.05, 'cot_quality': 0.3}
    return sum(rewards[k] * weights[k] for k in rewards)
```

### Reasoning-focused (R1-style)

```python
THINK_FORMAT = """<think>
{reasoning}
</think>
<answer>
{answer}
</answer>"""

def r1_format_reward(response):
    has_think = '<think>' in response and '</think>' in response
    has_answer = '<answer>' in response and '</answer>' in response
    return 0.5 if (has_think and has_answer) else 0.0
```

### Self-consistency (majority vote over N samples at eval)

```python
from collections import Counter

def majority_vote_eval(model, prompt, n=16):
    responses = [model.generate(prompt, do_sample=True) for _ in range(n)]
    answers = [extract_answer(r) for r in responses]
    # Majority vote over the extracted answers
    return Counter(answers).most_common(1)[0][0]
```

### KL control

```python
def adaptive_beta(target_kl, current_kl, beta):
    # Raise beta when KL overshoots the target, lower it when KL undershoots
    if current_kl > 1.5 * target_kl:
        return beta * 1.5
    if current_kl < 0.5 * target_kl:
        return beta / 1.5
    return beta
```

### Reward hacking detection

```python
import numpy as np

# llm_judge_quality is an assumed helper returning a quality score in [0, 1].
def detect_reward_hacking(rollouts, rewards):
    """Flag rollouts that score a high reward but are judged low quality."""
    high_reward = [r for r, score in zip(rollouts, rewards) if score > 0.9]
    if not high_reward:
        return None
    quality = [llm_judge_quality(r) for r in high_reward]
    if np.mean(quality) < 0.5:
        return 'WARN: high reward but low quality, possibly hacking'
    return None
```

### Process reward (PRM)

```python
# prm_score is an assumed helper that scores a single reasoning step.
def process_reward(steps):
    """Verify step by step and average the per-step scores."""
    if not steps:
        return 0.0
    return sum(prm_score(step) for step in steps) / len(steps)
```

### Iterative training (R1-style)

```python
# sft, grpo, filter_high_quality and generate_many are assumed helpers.
def r1_pipeline(base_model, reasoning_data, dataset):
    # Stage 1: SFT on reasoning data
    sft_model = sft(base_model,
                    reasoning_data)
    # Stage 2: GRPO
    grpo_model = grpo(sft_model, dataset, math_reward)
    # Stage 3: rejection sampling, then SFT again on the high-quality samples
    rs_data = filter_high_quality(grpo_model.generate_many(dataset))
    final = sft(grpo_model, rs_data)
    return final
```

## Decision criteria

| Situation | Approach |
|---|---|
| Reasoning task | GRPO + rule reward |
| Preference align | DPO / PPO |
| Code | GRPO + execution reward |
| General chat | RLHF / DPO |
| Tool use | GRPO + success reward |
| Cost-aware | GRPO (no critic) |

**Default**: for reasoning, GRPO + rule reward + format reward + iterative training + KL control.

## 🔗 Graph

- Parent: [[RLHF]] · [[Reinforcement-Learning]]
- Variants: [[PPO]] · [[DPO]]
- Applications: [[DeepSeek-R1]]
- Adjacent: [[Fine-tuning]] · [[Foundation-Models]]

## 🤖 LLM usage

**When**: reasoning, math, code; tasks with a verifiable reward.
**When not**: subjective preference (use DPO).

## ❌ Anti-patterns

- **No KL control**: the policy drifts toward reward hacking.
- **Tiny group**: noisy advantage estimates.
- **No rule for format**: the model hacks the format.
- **Single-objective**: prone to reward hacking.

## 🧪 Verification / duplicates

- Verified (DeepSeek-Math 2024, DeepSeek-R1 2025, TRL docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GRPO + TRL / R1 / multi-reward / pipeline code |