---
id: wiki-2026-0508-policy-optimization
title: Policy Optimization
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [policy-gradient, ppo, trpo, grpo, dpo, rlhf-optimization]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reinforcement-learning, ppo, grpo, dpo, rlhf]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / TRL
---
# Policy Optimization
## One-liner
> **"Directly maximize the expected reward of a policy π_θ."** The lineage runs vanilla PG (REINFORCE) → A2C/A3C → TRPO (trust region) → PPO (clipped surrogate, 2017) → GRPO (group-relative, DeepSeek 2024) → DPO (preference, 2023). This family is the backbone of modern LLM RLHF.
## Core concepts
### Algorithm lineage
- **REINFORCE (1992)**: ∇J = E[∇log π · R]. High variance.
- **A2C/A3C (2016)**: actor-critic with advantage A = Q - V. Lower variance.
- **TRPO (2015)**: trust region enforced by a KL constraint. Monotonic improvement guarantee, but expensive (needs Fisher-vector products).
- **PPO (2017, Schulman)**: clipped surrogate min(r·A, clip(r, 1-ε, 1+ε)·A). First-order, simple, and dominant 2017-2023.
- **GRPO (2024, DeepSeek)**: removes PPO's critic and uses a group-relative advantage (baseline = mean reward of K samples for the same prompt). Efficient for LLM RL.
- **DPO (2023, Rafailov)**: bypasses the explicit reward model by turning the KL-regularized RLHF objective into a closed-form loss on preference pairs. A simplified RLHF.
- **KTO, ORPO** (2024): DPO-style preference-optimization variants; **GSPO** (2025, Qwen): sequence-level GRPO variant.
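
The first two entries of the lineage can be checked on a toy problem. The sketch below is framework-free and assumes a one-step bandit with a softmax policy; the function names (`softmax`, `grad_log_softmax`, `expected_pg`) are illustrative and not from any library. It computes the exact REINFORCE gradient E[∇log π · (R - b)] and suggests why subtracting a baseline (the idea behind the advantage in A2C) leaves the expected gradient unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def expected_pg(logits, rewards, baseline=0.0):
    """Exact E_a[ grad log pi(a) * (R(a) - baseline) ] for a one-step bandit."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p in enumerate(probs):
        g = grad_log_softmax(logits, a)
        for i in range(len(logits)):
            grad[i] += p * g[i] * (rewards[a] - baseline)
    return grad


logits = [0.0, 0.0]   # uniform policy over two actions
rewards = [1.0, 0.0]  # action 0 is the better one
g_plain = expected_pg(logits, rewards)
g_base = expected_pg(logits, rewards, baseline=0.5)
print(g_plain, g_base)
```

Both gradients push probability toward action 0. The baseline only changes the variance of single-sample estimates, not their mean.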
### PPO clip objective
```
L_CLIP(θ) = E[ min( r·A, clip(r, 1-ε, 1+ε)·A ) ]
where r = π_θ(a|s) / π_old(a|s)
```
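
To see when the clip actually bites, here is a minimal per-sample sketch of the objective above in plain Python. The helper name `clipped_surrogate` is an illustrative assumption; ε = 0.2 is the commonly used default.

```python
def clipped_surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)


# (ratio, advantage) pairs covering the four sign/direction combinations
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
values = [clipped_surrogate(r, a) for r, a in cases]
print(values)
```

The objective is capped only when the ratio has already moved in the direction that improves it (r > 1+ε with A > 0, or r < 1-ε with A < 0). In the other two cases the unclipped term is the smaller one, so the gradient still pulls the policy back.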
### GRPO (DeepSeek-Math/R1)
```
A_i = (R_i - mean(R)) / std(R) # group-relative
L = E[ min(r·A, clip(r, 1-ε, 1+ε)·A) - β·KL(π||π_ref) ]
```
No critic is used; the sample group itself serves as the baseline.
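
The group-relative advantage can be sketched without any framework. The helper name `group_relative_advantages`, the small `eps` in the denominator, and the use of the population standard deviation are assumptions of this sketch; real implementations may use the sample standard deviation instead.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of K completions sampled for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption of this sketch)
    return [(r - mean) / (std + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # half the group is correct
uniform = group_relative_advantages([0.5, 0.5, 0.5])     # every sample scored the same
single = group_relative_advantages([1.0])                # group size 1
print(mixed, uniform, single)
```

When every sample in a group receives the same reward, or the group has a single member, all advantages are zero and that prompt contributes no policy-gradient signal.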
### DPO objective
```
L_DPO = -E[ log σ( β·log(π(y_w|x)/π_ref(y_w|x)) - β·log(π(y_l|x)/π_ref(y_l|x)) ) ]
```
The loss directly optimizes the policy on pairs of a chosen response y_w and a rejected response y_l.
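
For a single preference pair, the loss reduces to a softplus of the scaled log-ratio margin. The sketch below assumes the four inputs are summed sequence log-probabilities; the function name `dpo_loss` and the example numbers are illustrative only.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(m) = log(1 + exp(-m)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log 2
loss_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen response and lowered the rejected one
loss_better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_init, loss_better)
```

Training starts at log 2 per pair when the policy is initialized from the reference, and the loss falls as the policy widens the chosen-versus-rejected margin relative to the reference.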
### Applications
1. LLM RLHF (PPO → GRPO → DPO).
2. Robot control (PPO).
3. Game playing (OpenAI Five, AlphaStar).
4. LLM reasoning (R1-style RL).
## 💻 Patterns
### PPO: minimal (CleanRL-style)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActorCritic(nn.Module):
    """Separate actor and critic MLPs for a discrete action space."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                   nn.Linear(64, 64), nn.Tanh(),
                                   nn.Linear(64, act_dim))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                    nn.Linear(64, 64), nn.Tanh(),
                                    nn.Linear(64, 1))


def ppo_update(net, opt, obs, acts, old_logp, advs, returns,
               eps=0.2, c_v=0.5, c_e=0.01):
    """One gradient step on the clipped PPO loss for a single minibatch."""
    logits = net.actor(obs)
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(acts)
    ratio = (logp - old_logp).exp()  # r = pi_theta / pi_old
    surr1 = ratio * advs
    surr2 = ratio.clamp(1 - eps, 1 + eps) * advs
    pg_loss = -torch.min(surr1, surr2).mean()  # maximize the clipped surrogate
    v = net.critic(obs).squeeze(-1)
    v_loss = F.mse_loss(v, returns)
    ent = dist.entropy().mean()  # entropy bonus encourages exploration
    loss = pg_loss + c_v * v_loss - c_e * ent
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), 0.5)
    opt.step()
```
### GAE (Generalized Advantage Estimation)
```python
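import torch


def gae(rewards, values, dones, last_v, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of 1-D tensors.

    dones[t] is 1.0 if the episode ended at step t; last_v bootstraps the final step.
    """
    advs = torch.zeros_like(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        next_v = last_v if t == len(rewards) - 1 else values[t + 1]
        # TD residual; (1 - done) stops bootstrapping across episode boundaries
        delta = rewards[t] + gamma * next_v * (1 - dones[t]) - values[t]
        g = delta + gamma * lam * (1 - dones[t]) * g
        advs[t] = g
    return advs, advs + values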
```
### GRPO: DeepSeek-style (TRL)
```python
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")


def reward_fn(prompts, completions, **kwargs):
    # e.g. correctness check for math problems;
    # check_answer is a user-supplied placeholder, not a TRL function
    return [1.0 if check_answer(c) else 0.0 for c in completions]


config = GRPOConfig(
    num_generations=8,  # group size K
    learning_rate=1e-6,
    beta=0.04,  # KL penalty
    max_prompt_length=512, max_completion_length=512,
)
# ds is a placeholder for a dataset with a "prompt" column
trainer = GRPOTrainer(model=model, reward_funcs=reward_fn, args=config,
                      train_dataset=ds, processing_class=tok)
trainer.train()
```
### DPO (TRL)
```python
from trl import DPOTrainer, DPOConfig

# Dataset: {"prompt": str, "chosen": str, "rejected": str}
config = DPOConfig(beta=0.1, learning_rate=5e-7, max_length=1024)
trainer = DPOTrainer(model=model, ref_model=ref_model, args=config,
                     train_dataset=preference_ds, processing_class=tok)
trainer.train()
```
### Reward shaping for GRPO (math + format)
```python
import re


def reward_correctness(completions, ground_truth, **k):
    # extract_answer is a user-supplied helper; ground_truth comes from a dataset column
    return [1.0 if extract_answer(c) == gt else 0.0
            for c, gt in zip(completions, ground_truth)]


def reward_format(completions, **k):
    # Enforce the <think>...</think><answer>...</answer> layout
    pat = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.S)
    return [0.5 if pat.search(c) else 0.0 for c in completions]


# Combine in TRL: pass as list reward_funcs=[reward_correctness, reward_format]
```
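
The correctness reward above relies on an `extract_answer` helper that the snippet does not define. One possible version is sketched here, assuming completions are plain strings and the answer sits inside `<answer>` tags; taking the last tag and stripping whitespace are choices made for this sketch, not part of TRL.

```python
import re

_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)


def extract_answer(completion):
    """Return the stripped text of the last <answer>...</answer> tag, or None."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None


sample = "<think>6 * 7 = 42</think>\n<answer> 42 </answer>"
print(extract_answer(sample))
```

Returning `None` for a missing tag makes the equality check in `reward_correctness` fail, so malformed completions score 0.0 there.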
### KL penalty (PPO-RLHF)
```python
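# Anchor the policy to a reference model so RLHF stays close to the SFT policy.
# Both log-probs are evaluated on tokens sampled from the current policy.
log_ratio = logp_ref - logp_policy  # log(pi_ref / pi)
kl = (log_ratio.exp() - 1 - log_ratio).mean()  # k3 estimator of KL(pi || pi_ref): unbiased, low variance
loss = pg_loss + beta * kl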
```
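
The direction of the log-ratio matters for the k3 estimator. With samples drawn from the policy π and r = π_ref/π, the expectation of (r - 1 - log r) equals KL(π || π_ref) exactly. The sketch below checks this on a small discrete distribution; the two distributions are made-up numbers for illustration.

```python
import math


def exact_kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def k3(logp_policy, logp_ref):
    """Per-sample k3 term with r = pi_ref / pi, for a sample drawn from pi."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - 1.0 - log_r


policy = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]
per_sample = [k3(math.log(p), math.log(q)) for p, q in zip(policy, ref)]
# Expectation under the policy, i.e. the sampling distribution
expected_k3 = sum(p * v for p, v in zip(policy, per_sample))
print(expected_k3, exact_kl(policy, ref))
```

Every per-sample term is non-negative, which is why k3 tends to have lower variance than the naive log(π/π_ref) estimator, whose individual samples can be negative.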
### TRPO line-search (sketch)
```python
# Modern code uses PPO; TRPO is shown for reference only
# 1. compute natural gradient: F^-1 g (Fisher inverse via conjugate gradient)
# 2. line-search with KL ≤ δ constraint
# 3. accept step if surrogate improves and KL within budget
```
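
Step 1 of the sketch above never forms the Fisher matrix explicitly; it only needs Fisher-vector products. A minimal conjugate-gradient solver in plain Python is shown below, with a hand-picked 2x2 symmetric positive-definite matrix standing in for the Fisher matrix. The names `conjugate_gradient` and `fisher_vec` and the example numbers are assumptions of this sketch.

```python
def conjugate_gradient(matvec, b, iters=10, tol=1e-10):
    """Approximately solve F x = b using only matrix-vector products with F."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - F x, with x = 0 initially
    p = list(b)  # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


F = [[4.0, 1.0], [1.0, 3.0]]  # stand-in for the Fisher matrix
g = [1.0, 2.0]                # stand-in for the policy gradient


def fisher_vec(v):
    """Fisher-vector product for the toy matrix above."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in F]


step_dir = conjugate_gradient(fisher_vec, g)  # approximates F^-1 g
print(step_dir)
```

In a real TRPO implementation the Fisher-vector product would typically come from automatic differentiation of the KL divergence, and the resulting direction would then be scaled and checked by the line search in steps 2 and 3.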
## Decision criteria
| Situation | Algorithm |
|---|---|
| Standard RL benchmark (Atari, MuJoCo) | PPO |
| LLM RL with verifiable reward | GRPO |
| LLM preference data (no reward model) | DPO |
| LLM RLHF (with RM) | PPO or GRPO |
| Sample-efficient continuous control | SAC (off-policy) |
| Monotonic improvement guarantee | TRPO (rare in practice) |
**Default**: PPO (RL benchmark) / GRPO (LLM RL) / DPO (LLM preference).
## 🔗 Graph
- Parents: [[Reinforcement-Learning]] · [[RLHF]]
- Variants: [[PPO]] · [[GRPO]] · [[DPO]] · [[TRPO]] · [[A2C]]
## 🤖 LLM usage
**When**: use PPO as the baseline RL algorithm, GRPO for LLM tasks with verifiable rewards (math, code), and DPO when only preference data is available.
**When not**: when sample efficiency is critical (prefer off-policy methods such as SAC or TD3), or when ground-truth labels exist (use supervised learning).
## ❌ Anti-patterns
- **Allowing a huge KL divergence**: the policy drifts far from the reference and collapses → reward hacking.
- **Not normalizing advantages**: batch advantage normalization is critical for PPO.
- **Single epoch only**: PPO is designed to reuse each batch for multiple epochs (3-10), which the importance ratio makes possible.
- **GRPO without a group**: with group size 1 the reward equals its own group mean → advantage = 0.
## 🧪 Verification / duplicates
- Verified (PPO Schulman 2017, GRPO DeepSeek-Math 2024, DPO Rafailov 2023, TRL docs).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — PPO/GRPO/DPO + GAE + TRL patterns |