---
id: wiki-2026-0508-grpo
title: GRPO (Group Relative Policy Optimization)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [GRPO, group relative policy optimization, DeepSeek R1, RL fine-tune]
duplicate_of: none
source_trust_level: A
confidence_score: 0.92
verification_status: applied
tags: [rl, grpo, deepseek, reasoning, llm-fine-tune, ppo-alternative]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: TRL / DeepSeek
---

# GRPO (Group Relative Policy Optimization)

## In one line

> **"A critic-free variant of PPO: the rewards of a group of sampled responses provide the baseline."** Introduced by DeepSeek (2024-2025). It enabled the reasoning behavior of R1. A learned reward model is not required when a rule-based reward is sufficient. Popular in modern RLHF and reasoning training.

## Key points

### vs PPO

- **PPO**: uses a critic (value network) to estimate the baseline.
- **GRPO**: uses the mean reward of the sampled group as the baseline.
- **Result**: a simpler setup that is strong on reasoning tasks.

### Algorithm

1. For each prompt, sample G rollouts (different responses).
2. Score each rollout with the reward function.
3. Compute the advantage: (reward - group_mean) / group_std.
4. Optimize a PPO-style clipped objective.
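
Step 3 can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `group_advantages` and the `eps` guard are assumptions made for the example.

```python
from statistics import mean, stdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - group_mean) / (group_std + eps)."""
    if len(rewards) < 2:
        raise ValueError("need at least two rollouts per group")
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (Bessel-corrected)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct and two incorrect rollouts for the same prompt
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero, so that prompt contributes nothing to the policy term of the update.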

### Notable uses

- **DeepSeek-Math** (2024).
- **DeepSeek-R1** (2025): reasoning behavior emerges.

### Applications

1. **Math reasoning**.
2. **Code generation**.
3. **Tool use**.
4. **Long CoT**.

## 💻 Patterns

### Basic GRPO loop

```python
import torch


def grpo_step(policy, ref_policy, prompts, reward_fn, optimizer,
              group_size=8, beta=0.04, eps=0.2, ppo_epochs=4):
    """One GRPO step. `generate` and `log_prob` are assumed wrapper methods on the policy."""
    advantages_all = []
    log_probs_old_all = []
    log_probs_ref_all = []
    responses_all = []

    for prompt in prompts:
        # Sample G rollouts for this prompt and score each one
        rollouts = []
        rewards = []
        for _ in range(group_size):
            response = policy.generate(prompt, do_sample=True)
            rollouts.append(response)
            rewards.append(reward_fn(prompt, response))

        # Cast to float so integer rewards do not break mean() and std()
        rewards = torch.tensor(rewards, dtype=torch.float32)
        # Group baseline: normalize rewards within the group
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        advantages_all.extend(adv.tolist())

        # Log-probs under the sampling policy and the frozen reference policy
        for resp in rollouts:
            log_probs_old_all.append(policy.log_prob(prompt, resp).detach())
            log_probs_ref_all.append(ref_policy.log_prob(prompt, resp).detach())
            responses_all.append((prompt, resp))

    # PPO-style update: several passes over the collected rollouts
    for _ in range(ppo_epochs):
        for (prompt, resp), adv, log_old, log_ref in zip(
                responses_all, advantages_all, log_probs_old_all, log_probs_ref_all):
            log_new = policy.log_prob(prompt, resp)
            ratio = (log_new - log_old).exp()

            # Clipped surrogate objective
            obj1 = ratio * adv
            obj2 = ratio.clamp(1 - eps, 1 + eps) * adv
            policy_loss = -torch.min(obj1, obj2).mean()

            # KL penalty against the reference policy (simple log-ratio estimate)
            kl = log_new - log_ref
            kl_loss = beta * kl.mean()

            loss = policy_loss + kl_loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```
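
Written out per sampled response, the loss assembled in the update loop above is roughly the following. The notation is an assumption introduced here: $q$ is the prompt, $o_i$ the $i$-th sampled response, $A_i$ its group-normalized advantage, and $\rho_i$ the probability ratio. It mirrors the sketch above rather than any exact published objective.

$$
\ell_i(\theta) = -\min\!\Big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\Big) + \beta \big(\log \pi_\theta(o_i \mid q) - \log \pi_{\mathrm{ref}}(o_i \mid q)\big),
\qquad
\rho_i = \exp\!\big(\log \pi_\theta(o_i \mid q) - \log \pi_{\mathrm{old}}(o_i \mid q)\big)
$$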

### Rule-based reward (math)

```python
def math_reward(prompt, response):
    """DeepSeek-style rule-based reward: extract the answer and verify it."""
    # extract_answer and has_required_format are helpers assumed to be defined elsewhere
    answer = extract_answer(response)
    expected = extract_answer(prompt['solution'])

    # A missing answer never counts as correct, even if the reference is also missing
    correctness = 1.0 if answer is not None and answer == expected else 0.0
    format_bonus = 0.1 if has_required_format(response) else 0.0

    return correctness + format_bonus
```
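
The reward above relies on `extract_answer` and `has_required_format`, which the note does not define. One possible sketch, assuming answers are wrapped in `<answer>` tags with a fallback to the last number in the text; both the regexes and the fallback rule are illustrative choices, not DeepSeek's actual implementation.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(text):
    """Return the final answer as a stripped string, or None if nothing is found."""
    tagged = ANSWER_TAG.findall(text)
    if tagged:
        return tagged[-1].strip()
    # Fallback (assumption): treat the last number in the text as the answer
    numbers = NUMBER.findall(text)
    return numbers[-1] if numbers else None


def has_required_format(text):
    """True when the text contains a complete <answer>...</answer> block."""
    return ANSWER_TAG.search(text) is not None
```

Answers are compared as strings here, so `4` and `4.0` would not match; a real verifier would normalize numeric forms first.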

### TRL implementation

```python
from trl import GRPOTrainer, GRPOConfig

# correctness_reward, format_reward, and ds are assumed to be defined elsewhere;
# ds needs a "prompt" column.
trainer = GRPOTrainer(
    model='Qwen/Qwen2.5-7B',
    reward_funcs=[correctness_reward, format_reward],
    args=GRPOConfig(
        output_dir='out',
        num_generations=8,  # group size
        # Some TRL versions require the effective batch size to be divisible by num_generations
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        max_prompt_length=512,
        max_completion_length=1024,
        beta=0.04,
    ),
    train_dataset=ds,
)
trainer.train()
```
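
TRL's documentation describes custom reward functions as callables that receive the batch of completions, with dataset columns passed as keyword arguments, and return one float per completion. A minimal sketch of the two functions named above, assuming a standard (non-conversational) dataset where completions are plain strings and a hypothetical `solution` column holds the reference answer:

```python
import re

FORMAT_PATTERN = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completions, **kwargs):
    """0.5 for each completion with a <think> block followed by an <answer> block."""
    return [0.5 if FORMAT_PATTERN.search(c) else 0.0 for c in completions]


def correctness_reward(completions, solution, **kwargs):
    """1.0 when the tagged answer matches the reference string exactly, else 0.0."""
    rewards = []
    for completion, expected in zip(completions, solution):
        match = ANSWER_PATTERN.search(completion)
        predicted = match.group(1).strip() if match else None
        rewards.append(1.0 if predicted == str(expected).strip() else 0.0)
    return rewards
```

With a conversational dataset, each completion is a list of message dicts rather than a string, so the text would have to be pulled out of the message content first.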

### Multi-objective reward

```python
def multi_reward(prompt, response):
    """Weighted sum of several reward terms (the scoring helpers are assumed to exist)."""
    rewards = {}
    rewards['correctness'] = correctness(prompt, response)
    rewards['format'] = check_format(response)
    # Prefer a length near 500; len() counts characters, so tokenize first to target tokens
    rewards['length'] = -abs(len(response) - 500) / 1000
    rewards['cot_quality'] = check_reasoning_quality(response)

    weights = {'correctness': 1.0, 'format': 0.1, 'length': 0.05, 'cot_quality': 0.3}
    return sum(rewards[k] * weights[k] for k in rewards)
```

### Reasoning-focused (R1-style)

```python
THINK_FORMAT = """
<think>
{reasoning}
</think>
<answer>
{answer}
</answer>
"""


def r1_format_reward(response):
    """Return 0.5 when both the <think> and <answer> blocks are present."""
    has_think = '<think>' in response and '</think>' in response
    has_answer = '<answer>' in response and '</answer>' in response
    return 0.5 if (has_think and has_answer) else 0.0
```

### Self-consistency (majority vote at eval)

```python
from collections import Counter


def best_of_n_eval(model, prompt, n=16):
    """Sample n responses and return the most common extracted answer (majority vote)."""
    responses = [model.generate(prompt, do_sample=True) for _ in range(n)]
    answers = [extract_answer(r) for r in responses]
    return Counter(answers).most_common(1)[0][0]
```

### KL control

```python
def adaptive_beta(target_kl, current_kl, beta):
    """Raise beta when KL overshoots the target, lower it when KL undershoots."""
    if current_kl > 1.5 * target_kl:
        return beta * 1.5
    if current_kl < 0.5 * target_kl:
        return beta / 1.5
    return beta
```
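
The controller above needs a `current_kl` measurement. One common choice is the non-negative estimator used in the GRPO formulation of the DeepSeekMath paper, `exp(d) - d - 1` with `d = log_ref - log_new`. A minimal sketch on per-token log-probabilities given as plain floats; the function names are hypothetical.

```python
import math


def kl_estimate(log_new, log_ref):
    """Per-token estimate exp(d) - d - 1 with d = log_ref - log_new; never negative."""
    delta = log_ref - log_new
    return math.exp(delta) - delta - 1.0


def mean_kl(log_new_tokens, log_ref_tokens):
    """Average the per-token estimate over one response; 0.0 for an empty response."""
    values = [kl_estimate(n, r) for n, r in zip(log_new_tokens, log_ref_tokens)]
    return sum(values) / len(values) if values else 0.0
```

Unlike the plain log-ratio, this estimate is zero only when the two log-probabilities agree, which makes it a steadier input for threshold rules such as the 1.5x and 0.5x bands above.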

### Reward hacking detection

```python
import numpy as np


def detect_reward_hacking(rollouts, rewards):
    """Flag rollouts that score high on reward but low on judged quality."""
    high_reward = [r for r, score in zip(rollouts, rewards) if score > 0.9]
    if not high_reward:
        return None
    # llm_judge_quality is an assumed helper returning a score in [0, 1]
    quality = [llm_judge_quality(r) for r in high_reward]
    if np.mean(quality) < 0.5:
        return 'WARN: high reward but low quality, possibly hacking'
    return None
```

### Process reward (PRM)

```python
def process_reward(steps):
    """Average a per-step score from a process reward model (prm_score is an assumed helper)."""
    if not steps:
        return 0.0
    return sum(prm_score(step) for step in steps) / len(steps)
```

### Iterative training (R1-style)

```python
def r1_pipeline(base_model, dataset, reasoning_data):
    """R1-style pipeline sketch; sft, grpo, and filter_high_quality are assumed helpers."""
    # Stage 1: SFT on reasoning data
    sft_model = sft(base_model, reasoning_data)

    # Stage 2: GRPO with the rule-based reward
    grpo_model = grpo(sft_model, dataset, math_reward)

    # Stage 3: rejection sampling, then SFT again on the high-quality samples
    rs_data = filter_high_quality(grpo_model.generate_many(dataset))
    final = sft(grpo_model, rs_data)

    return final
```
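
The rejection-sampling stage depends on `filter_high_quality`, which the note leaves undefined. A minimal sketch, assuming the samples arrive as `(prompt, response)` pairs and that "high quality" means reaching a reward threshold; the extra `reward_fn` and `threshold` parameters are illustrative additions, not part of the pipeline above.

```python
def filter_high_quality(samples, reward_fn, threshold=1.0):
    """Keep the (prompt, response) pairs whose reward reaches the threshold.

    samples: iterable of (prompt, response) pairs (assumed shape).
    reward_fn: callable (prompt, response) -> float.
    """
    kept = []
    for prompt, response in samples:
        if reward_fn(prompt, response) >= threshold:
            kept.append((prompt, response))
    return kept


# Toy reward: exact match against the reference stored on the prompt
def exact_match(prompt, response):
    return 1.0 if response == prompt["solution"] else 0.0


samples = [({"solution": "4"}, "4"), ({"solution": "4"}, "5")]
kept = filter_high_quality(samples, exact_match)
```

Filtering by a fixed threshold keeps every passing sample; a stricter variant could keep only the best response per prompt to avoid over-representing easy prompts.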

## Decision criteria

| Situation | Approach |
|---|---|
| Reasoning task | GRPO + rule reward |
| Preference alignment | DPO / PPO |
| Code | GRPO + execution reward |
| General chat | RLHF / DPO |
| Tool use | GRPO + success reward |
| Cost-aware | GRPO (no critic) |

**Default**: for reasoning, use GRPO with rule-based and format rewards, iterative training, and KL control.

## 🔗 Graph

- Parent: [[RLHF]] · [[Reinforcement-Learning]]
- Variants: [[PPO]] · [[DPO]]
- Applications: [[DeepSeek-R1]]
- Adjacent: [[Fine-tuning]] · [[Foundation-Models]]

## 🤖 LLM usage

**When to use**: reasoning, math, and code tasks with a verifiable reward.

**When not to use**: subjective preference alignment (use DPO instead).
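
## ❌ Anti-patterns

- **No KL control**: the policy drifts from the reference and reward hacking sets in.
- **Tiny group**: the group mean and std are noisy, so the advantages are noisy.
- **No rule for format**: the model games the output format.
- **Single-objective**: a single reward signal is easier to hack.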

## 🧪 Verification / duplicates

- Verified (DeepSeek-Math 2024, DeepSeek-R1 2025, TRL docs).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: GRPO + TRL / R1 / multi-reward / pipeline code |