---
id: wiki-2026-0508-reward-shaping-in-rl
title: Reward Shaping in RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward Shaping, Shaped Reward, Dense Reward Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, reward-design, RLHF, GRPO, sparse-reward]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/Gymnasium/TRL
---

# Reward Shaping in RL

## One-Line Summary
> **"Sparse reward → dense intermediate signal, without changing the optimal policy."** Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping F(s, s') = γΦ(s') − Φ(s) preserves the optimal policy. That result remains a foundation for reward design in modern RLHF/GRPO/RLVR pipelines.

## Core Ideas

### Core theorem (Ng et al. 1999)
- Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
- F(s, s') = γ·Φ(s') − Φ(s) (potential-based) → policy invariance guaranteed.
- Without a well-defined potential Φ, an arbitrary bonus can distort the optimal policy.

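The invariance claim can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not code from Ng et al.: a hypothetical 5-state deterministic chain with an absorbing goal, γ = 0.9, and Φ(s) set to the negative distance to the goal. It runs value iteration with and without the shaping term and compares the results.

```python
import numpy as np

GAMMA = 0.9
N = 5                      # hypothetical chain: states 0..4, state 4 is the goal
LEFT, RIGHT = 0, 1

# Deterministic transitions: P[s, a] is the next state.
P = np.zeros((N, 2), dtype=int)
for s in range(N):
    P[s, LEFT] = max(s - 1, 0)
    P[s, RIGHT] = min(s + 1, N - 1)
P[N - 1, :] = N - 1        # the goal is absorbing

# Sparse reward: +1 only for the transition that enters the goal.
R = np.zeros((N, 2))
R[N - 2, RIGHT] = 1.0

# Assumed potential: negative distance to the goal, so phi(goal) = 0.
phi = np.arange(N, dtype=float) - (N - 1)
F = GAMMA * phi[P] - phi[:, None]      # F(s, s') = gamma * phi(s') - phi(s)

def value_iteration(P, R, gamma, iters=500):
    """Tabular Q-value iteration for a deterministic MDP."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * Q.max(axis=1)[P]
    return Q

Q_plain = value_iteration(P, R, GAMMA)
Q_shaped = value_iteration(P, R + F, GAMMA)
pi_plain = Q_plain.argmax(axis=1)
pi_shaped = Q_shaped.argmax(axis=1)
```

Under these assumptions the two greedy policies coincide, and the shaped Q-values equal the original ones shifted by −Φ(s), which is the identity behind the theorem.
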
### Shaping types
- **Potential-based** (theory-safe): Φ(s) is a heuristic value estimate.
- **Curiosity / intrinsic motivation**: ICM, RND (exploration bonus).
- **Demonstrations (LfD)**: shaped reward from similarity to expert behavior.
- **Curriculum**: progressively harder targets.
- **RLHF reward model**: dense reward learned from human preferences.
- **RLVR (verifiable)**: rule-based pass/fail (math, code); sparse but exact.
- **GRPO advantages** (DeepSeek 2024-25): group-relative normalization replaces the critic.

### Applications
1. Sparse-reward locomotion / manipulation.
2. Game RL (StarCraft II, Atari hard-exploration).
3. RLHF for LLM alignment.
4. RLVR/GRPO for math/code (e.g. DeepSeek-R1; o1-style reasoning models, whose training details are not public).
5. Robotics imitation + RL hybrid.

## 💻 Patterns

### Potential-Based Shaping (Ng 1999)
```python
def potential(state) -> float:
    """Heuristic potential Φ(s), e.g. negative distance to the goal."""
    # goal_distance is an environment-specific helper assumed to exist.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99):
    # r' = r + γ·Φ(s') − Φ(s); use the same γ as the learning algorithm.
    return r + gamma * potential(s_next) - potential(s)
```

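One way to see why the potential-based term is harmless: along any trajectory the discounted shaping bonuses telescope, leaving only the potentials of the first and last states. The sketch below checks this on an arbitrary walk. The 1-D `goal_distance` and the trajectory are hypothetical stand-ins chosen for illustration.

```python
def goal_distance(state, goal=10):
    """Hypothetical 1-D distance; stands in for the environment helper."""
    return abs(goal - state)

def potential(state) -> float:
    return -float(goal_distance(state))

def shaping_bonus(s, s_next, gamma):
    return gamma * potential(s_next) - potential(s)

gamma = 0.99
trajectory = [0, 1, 2, 1, 2, 3, 4]   # arbitrary walk, including a backward step

# Discounted sum of the shaping bonuses actually paid along the walk.
discounted_bonus = sum(
    gamma ** t * shaping_bonus(s, s_next, gamma)
    for t, (s, s_next) in enumerate(zip(trajectory, trajectory[1:]))
)

# Telescoped form: gamma^T * phi(s_T) - phi(s_0).
T = len(trajectory) - 1
closed_form = gamma ** T * potential(trajectory[-1]) - potential(trajectory[0])
```

Because the total bonus depends only on the endpoints, detours earn nothing extra, so the agent cannot farm the shaping term by looping.
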
### Curiosity-Driven (RND)
```python
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor is trained to match the target on visited observations.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        # Prediction error is high on novel observations: use it as the bonus.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)
```

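The class above only computes the bonus; the predictor also has to be trained on visited observations so that the bonus decays for familiar states. A minimal sketch of that loop follows, assuming a fixed batch of random 4-dimensional observations as a stand-in for real rollouts. The optimizer, learning rate, and step count are illustrative choices, not values from the RND paper.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)

torch.manual_seed(0)
rnd = RND(obs_dim=4)
# Only the predictor is optimized; the target stays frozen.
opt = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-3)

seen = torch.randn(256, 4)                 # stand-in for revisited observations
before = rnd.intrinsic(seen).mean().item()
for _ in range(200):
    loss = rnd.intrinsic(seen).mean()      # predictor regresses onto the target
    opt.zero_grad()
    loss.backward()
    opt.step()
after = rnd.intrinsic(seen).mean().item()
```

After training, the bonus on the revisited batch is lower than before, which is the mechanism that pushes the agent toward states it has not seen yet.
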
### Curriculum Reward
```python
def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    # Linear ramp from easy_target to hard_target over ramp_episodes, then hold.
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
```

### RLHF Reward Model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    # The default is a gated checkpoint; any decoder backbone can be substituted.
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Score the last non-padding token (assumes right padding).
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

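To build intuition for the Bradley-Terry loss, it helps to evaluate it on scalar rewards. The sketch below is a dependency-free rewrite for illustration (the name `bt_loss_scalar` is hypothetical). It uses the identity −log σ(m) = log(1 + e^(−m)), written in a form that avoids overflow for large margins.

```python
import math

def bt_loss_scalar(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for a single preference pair."""
    margin = r_chosen - r_rejected
    # softplus(-margin), computed stably: max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

tie = bt_loss_scalar(0.0, 0.0)       # model cannot tell the pair apart
correct = bt_loss_scalar(2.0, 0.0)   # chosen response scored higher
wrong = bt_loss_scalar(0.0, 2.0)     # rejected response scored higher
```

A tie costs log 2, a correctly ordered pair costs less, and a misordered pair costs more, growing roughly linearly with the size of the wrong-way margin.
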
### RLVR (Verifiable Rule Reward)
```python
# extract_answer, run_unit_tests, and has_required_tags are task-specific
# helpers assumed to be defined elsewhere.
def rlvr_reward(generated: str, gold: str, task: str) -> float:
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")
```

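The helpers used by `rlvr_reward` are left undefined above. One possible shape for two of them is sketched here, purely as an assumption: answers are taken from the last `\boxed{...}` span, and the required format is a `<think>` block followed by an `<answer>` block. Real pipelines define their own conventions, and `run_unit_tests` is omitted because it needs a sandbox.

```python
import re

def extract_answer(generated: str):
    # Assumed convention: the final answer is inside the last \boxed{...}.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generated)
    return matches[-1].strip() if matches else None

def has_required_tags(generated: str) -> bool:
    # Assumed format: <think>...</think> followed by <answer>...</answer>.
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return re.fullmatch(pattern, generated, flags=re.DOTALL) is not None

sample = "<think>6 * 7</think>\n<answer>\\boxed{42}</answer>"
answer = extract_answer(sample)
well_formed = has_required_tags(sample)
```

Exact string comparison against `gold` is strict by design; equivalent forms such as `42.0` would need normalization before comparison.
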
### GRPO Advantage (DeepSeek 2024)
```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # epsilon guards against zero variance
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
```

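A small worked example makes the normalization concrete. The reward vectors below are made up for illustration: one group where 2 of 8 samples pass a binary check, and one where every sample passes.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# Hypothetical group: G=8 samples for one prompt, 2 pass the verifier.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
adv = grpo_advantages(rewards)

# Degenerate group: every sample gets the same reward.
uniform = grpo_advantages(np.ones(8))
```

With 2 of 8 passing, the passing samples get roughly +1.73 and the failing ones roughly −0.58. When all rewards in a group are equal, every advantage is zero, so that prompt contributes no learning signal.
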
### Combined Shaping
```python
def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    # Potential-based term: keeps the optimal policy unchanged.
    pot = gamma * potential(s_next) - potential(s)
    # Curiosity term (e.g. RND.intrinsic): not potential-based, so keep cur_w small.
    cur = model.intrinsic(obs).item()
    return r_env + pot_w * pot + cur_w * cur
```

### Reward Hacking Detector
```python
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag likely hacking: shaped reward trends up while true return trends down."""
    if len(rewards) < window:
        return False
    # Slope of a least-squares line over the last `window` episodes.
    rew_trend = np.polyfit(range(window), rewards[-window:], 1)[0]
    ret_trend = np.polyfit(range(window), true_returns[-window:], 1)[0]
    return rew_trend > 0.01 and ret_trend < 0
```

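The detector's behavior is easiest to see on synthetic logs. The series below are noise-free lines invented for illustration: a proxy reward that climbs steadily, paired once with a true return that also climbs and once with a true return that falls.

```python
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    if len(rewards) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend < 0)

steps = np.arange(200)
proxy = 0.05 * steps                 # shaped/proxy reward keeps rising
healthy_true = 0.04 * steps          # true return rises with it
hacked_true = 10.0 - 0.02 * steps    # true return falls while proxy rises

healthy_flag = detect_hacking(proxy, healthy_true)
hacked_flag = detect_hacking(proxy, hacked_true)
too_short_flag = detect_hacking(proxy[:50], hacked_true[:50])
```

Only the diverging pair is flagged. On real, noisy logs the fixed thresholds (0.01 and 0) are likely to need tuning to the reward scale.
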
## Decision Criteria
| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |

**Default**: potential-based shaping first; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.

## 🔗 Graph
- Parent: [[Reinforcement Learning]] · [[Reward Design]]
- Variants: [[GRPO]] · [[RLHF]]
- Adjacent: [[Reward Prediction Error]]

## 🤖 LLM Usage
**When**: reward model training (RLHF), reward function code generation, reward hacking analysis from logs.
**When not**: reward functions proposed by an LLM are prone to hacking; verify them with controlled rollouts before trusting them.

## ❌ Anti-patterns
- **Non-potential bonus**: an arbitrary +10 for reaching a sub-goal can distort the optimal policy.
- **Reward hacking ignored**: cumulative reward rises while the task fails, and nothing monitors the gap.
- **Over-shaping**: dense bonuses overwhelm the sparse task signal, so the agent ignores the task.
- **Static curriculum**: the agent has surpassed the current level but is still served easy targets.
- **No baseline check**: no ablation with vs. without shaping, so the actual gain is unknown.

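The "static curriculum" item in the list above can be avoided by tying difficulty to measured performance instead of the episode count. The class below is one possible sketch, not a standard API: `AdaptiveCurriculum`, its window size, and the 0.8 success threshold are all illustrative assumptions.

```python
from collections import deque

class AdaptiveCurriculum:
    """Raise the target only when the recent success rate clears a threshold."""

    def __init__(self, easy_target, hard_target, n_levels=10,
                 window=50, threshold=0.8):
        self.easy, self.hard = easy_target, hard_target
        self.n_levels = n_levels
        self.threshold = threshold
        self.level = 0
        self.results = deque(maxlen=window)   # rolling record of successes

    @property
    def target(self):
        t = self.level / self.n_levels
        return self.easy + t * (self.hard - self.easy)

    def report(self, success: bool) -> None:
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.threshold:
            self.level = min(self.level + 1, self.n_levels)
            self.results.clear()              # re-measure at the new level

cur = AdaptiveCurriculum(easy_target=1.0, hard_target=5.0, n_levels=4, window=10)
for _ in range(10):
    cur.report(True)          # agent masters the easy level
target_after_success = cur.target
for _ in range(10):
    cur.report(False)         # agent struggles at the new level
target_after_failure = cur.target
```

The target moves up one level after a full window of successes and then holds while the agent is failing, so easy targets are not served longer than needed.
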
## 🧪 Verification / Duplicates
- Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: potential-based + RND + RLHF + GRPO + RLVR |