Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
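Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture and unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections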

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-reward-shaping-in-rl
title: Reward Shaping in RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward Shaping, Shaped Reward, Dense Reward Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, reward-design, RLHF, GRPO, sparse-reward]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/Gymnasium/TRL
---
# Reward Shaping in RL
## One-Liner
> **"Turn a sparse reward into a dense intermediate signal without changing the optimal policy."** Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping F(s, s') = γ·Φ(s') - Φ(s) preserves the optimal policy. That result is a foundation for reward design in modern RLHF, GRPO, and RLVR pipelines.
## Key Points
### Core theorem (Ng et al. 1999)
- Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
- F(s, s') = γ·Φ(s') - Φ(s) (potential-based) → policy invariance guaranteed.
- Without a well-defined Φ, an arbitrary bonus can distort the optimal policy.
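
The invariance comes from a telescoping sum. A short sketch, assuming Φ is bounded and γ < 1 (for episodic tasks, additionally assume Φ = 0 at terminal states); the subscript "shaped" is notation introduced here, not taken from the source:

$$
\sum_{t=0}^{\infty} \gamma^{t} F(s_t, s_{t+1})
= \sum_{t=0}^{\infty} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr)
= -\Phi(s_0)
$$

$$
Q^{\pi}_{\text{shaped}}(s, a) = Q^{\pi}(s, a) - \Phi(s)
\quad\Longrightarrow\quad
\arg\max_a Q^{*}_{\text{shaped}}(s, a) = \arg\max_a Q^{*}(s, a)
$$

The shaping terms shift every action value in state s by the same amount Φ(s), so the ranking of actions, and hence the greedy policy, is unchanged.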
### Shaping types
- **Potential-based** (theory-safe): heuristic value Φ(s).
- **Curiosity / intrinsic motivation**: ICM, RND — exploration bonus.
- **Demonstrations (LfD)**: shaped reward from expert similarity.
- **Curriculum**: progressively harder targets.
- **RLHF reward model**: human-trained dense reward.
- **RLVR (verifiable)**: rule-based pass/fail (math, code) — sparse but exact.
- **GRPO advantages** (DeepSeek 2024-25): group-relative normalization replaces critic.
### Applications
1. Sparse-reward locomotion / manipulation.
2. Game RL (StarCraft II, Atari hard-exploration).
3. RLHF for LLM alignment.
4. RLVR/GRPO for math/code (e.g. DeepSeek-R1; o1 is also RL-trained for reasoning, but its exact method is unpublished).
5. Robotics imitation + RL hybrid.
## 💻 Patterns
### Potential-Based Shaping (Ng 1999)
```python
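def potential(state) -> float:
    """Heuristic potential, e.g. negative distance to the goal."""
    # goal_distance is a user-supplied helper, not defined here.
    return -goal_distance(state)


def shaped_reward(r, s, s_next, gamma=0.99):
    # r' = r + gamma * Phi(s') - Phi(s)
    return r + gamma * potential(s_next) - potential(s)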
```
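
A quick numerical check of the invariance claim, as a sketch on a hypothetical 6-state chain MDP. The environment, `phi`, and `subgoal_bonus` are illustrative assumptions, not from the source. Value iteration is run on the plain reward, with a potential-based term, and with an ad hoc sub-goal bonus that is not potential-based.

```python
GAMMA = 0.9
N = 6                  # states 0..5; state 5 is the terminal goal
ACTIONS = (-1, +1)     # index 0 = left, index 1 = right


def step(s, a):
    """Deterministic chain; reward 1.0 only on reaching the goal."""
    s_next = min(max(s + a, 0), N - 1)
    return s_next, (1.0 if s_next == N - 1 else 0.0)


def phi(s):
    """Potential: negative distance to the goal, so phi(goal) == 0."""
    return -float(N - 1 - s)


def potential_bonus(s, s_next):
    return GAMMA * phi(s_next) - phi(s)


def subgoal_bonus(s, s_next):
    """Ad hoc +1 every time the agent enters state 2 (not potential-based)."""
    return 1.0 if s_next == 2 else 0.0


def greedy_policy(bonus=None, sweeps=500):
    q = [[0.0, 0.0] for _ in range(N)]   # the terminal row stays at zero
    for _ in range(sweeps):
        for s in range(N - 1):
            for i, a in enumerate(ACTIONS):
                s_next, r = step(s, a)
                if bonus is not None:
                    r += bonus(s, s_next)
                q[s][i] = r + GAMMA * max(q[s_next])
    return [max((0, 1), key=lambda i: q[s][i]) for s in range(N - 1)]


plain = greedy_policy()
print(plain == greedy_policy(potential_bonus))   # potential-based: same greedy policy
print(greedy_policy(subgoal_bonus)[4])           # ad hoc bonus: action index chosen in state 4
```

Under these assumptions the potential-based variant recovers the same greedy policy as the plain reward, while the sub-goal bonus makes looping through state 2 more profitable than finishing the task, so the agent next to the goal turns back.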
### Curiosity-Driven (RND)
```python
import torch
import torch.nn as nn


class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # The predictor is trained to match the target; its error is the bonus.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        # Per-observation mean squared prediction error.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)
```
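
Prediction-error bonuses have no fixed scale and shrink as the predictor learns, so they are commonly divided by a running standard deviation before being mixed with the task reward. The sketch below is a dependency-free illustration; `RunningStd` and `normalized_bonus` are hypothetical names, and it normalizes by the standard deviation of the raw bonuses as a simplification.

```python
class RunningStd:
    """Welford's online algorithm for a running (population) standard deviation."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.n < 2:
            return 1.0                      # not enough data yet: leave the bonus unscaled
        std = (self.m2 / self.n) ** 0.5
        return std if std > 0 else 1.0      # guard against division by zero


def normalized_bonus(raw_bonus: float, stats: RunningStd) -> float:
    stats.update(raw_bonus)
    return raw_bonus / stats.std


stats = RunningStd()
for b in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    scaled = normalized_bonus(b, stats)
print(stats.std, scaled)
```

In a training loop, `raw_bonus` would be the value returned by an intrinsic-reward module for the current observation.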
### Curriculum Reward
```python
def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
```
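
A fixed ramp advances regardless of how the agent is doing. One possible alternative, sketched here with a hypothetical `AdaptiveCurriculum` class (its thresholds and window size are illustrative assumptions), raises the target only once the recent success rate clears a threshold.

```python
from collections import deque


class AdaptiveCurriculum:
    """Raise the target one level whenever the recent success rate clears a threshold."""

    def __init__(self, easy_target, hard_target, levels=10, window=50, threshold=0.8):
        self.easy, self.hard = easy_target, hard_target
        self.levels = levels
        self.threshold = threshold
        self.level = 0
        self.recent = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) >= self.threshold:
            self.level = min(self.level + 1, self.levels)
            self.recent.clear()   # require fresh evidence at the new difficulty

    def target(self) -> float:
        return self.easy + (self.level / self.levels) * (self.hard - self.easy)


cur = AdaptiveCurriculum(easy_target=1.0, hard_target=11.0, window=5)
for _ in range(5):
    cur.record(True)
print(cur.target())   # one level above the easy target
```

Clearing the window after each promotion is a design choice: it prevents successes at the easier level from counting toward the next promotion.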
### RLHF Reward Model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel


class RewardModel(nn.Module):
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Score the last non-padding token (assumes right padding);
        # out[:, -1] would read a pad position for shorter sequences.
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)


# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```
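
To make the Bradley-Terry loss concrete, here is a scalar, dependency-free sketch. The helper names are hypothetical; it models the preference probability as sigmoid(r_chosen - r_rejected) and the per-pair loss as its negative log.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


for margin in (0.0, 1.0, 3.0):
    p = preference_prob(margin, 0.0)
    loss = neg_log_sigmoid(margin)
    print(f"margin={margin:.1f}  P(chosen)={p:.3f}  loss={loss:.3f}")
```

At zero margin the model is indifferent (probability 0.5, loss ln 2), and the loss shrinks toward zero as the chosen response's score pulls ahead, so only the score difference matters, not the absolute scale.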
### RLVR — Verifiable Rule Reward
```python
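def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, and has_required_tags are
    # task-specific helpers assumed to be defined elsewhere.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")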
```
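
The verifier is only as good as its answer extraction. As one possible sketch, assuming the model is prompted to put its final answer in `\boxed{...}` (a convention assumed here, not stated in the source), an extractor and a math reward could look like this. `extract_answer` and `math_reward` are illustrative implementations.

```python
import re
from typing import Optional

# Matches \boxed{...} with no nested braces inside.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(generated: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the text, or None."""
    matches = _BOXED.findall(generated)
    return matches[-1].strip() if matches else None


def math_reward(generated: str, gold: str) -> float:
    answer = extract_answer(generated)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0


print(math_reward(r"So the total is \boxed{42}.", "42"))   # boxed and correct
print(math_reward("The total is 42.", "42"))               # correct but not boxed
```

This version does exact string matching and does not handle nested braces such as `\boxed{\frac{1}{2}}`; a production verifier would likely need answer normalization on top.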
### GRPO Advantage (DeepSeek 2024)
```python
import numpy as np


def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std


# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
```
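
A worked example of the group normalization, written in pure Python so it runs without NumPy. `group_advantages` is a hypothetical stand-in that uses the same formula, (r - mean) / (std + eps), with the population standard deviation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (population std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1.0, 0.0, 0.0, 1.0])     # half of the samples pass
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])   # every sample passes
print([round(a, 3) for a in mixed])
print(uniform)
```

With a mixed group, passing samples get positive advantages and failing ones negative. When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is worth keeping in mind with binary pass/fail rewards.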
### Combined Shaping
```python
def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    pot = gamma * potential(s_next) - potential(s)
    cur = model.intrinsic(obs).item()
    return r_env + pot_w * pot + cur_w * cur
```
### Reward Hacking Detector
```python
import numpy as np


def detect_hacking(rewards, true_returns, window=100):
    """Flag likely hacking: proxy reward trends up while true return is flat or falling."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend <= 0)
```
## Decision Criteria
| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |
**Default**: potential-based shaping as the primary choice; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.
## 🔗 Graph
- Parent: [[Reinforcement Learning]] · [[Reward Design]]
- Variants: [[GRPO]] · [[RLHF]]
- Adjacent: [[Reward Prediction Error]]
## 🤖 LLM Usage
**When to use**: reward model training (RLHF), reward function code generation, reward hacking analysis from logs.
**When not to use**: reward functions proposed by an LLM are prone to reward hacking; verify them with controlled rollouts before relying on them.
## ❌ Anti-patterns
- **Non-potential bonus**: an arbitrary +10 for reaching a sub-goal can distort the optimal policy.
- **Reward hacking ignored**: cumulative reward rises while the task fails, and nothing monitors the gap.
- **Over-shaping**: dense bonuses overwhelm the sparse task signal, so the agent ignores the task.
- **Static curriculum**: the agent has surpassed the current level but is still served easy targets.
- **No baseline check**: no ablation comparing training with and without shaping, so the actual gain is unknown.
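
For the baseline check, a minimal ablation summary might look like the sketch below. `ablation_gain` is a hypothetical helper and the per-seed returns are made-up illustrative numbers; both runs should be scored with the unshaped task reward so the comparison measures task performance rather than the bonus.

```python
from statistics import mean, variance


def ablation_gain(shaped_returns, plain_returns):
    """Mean difference in true return and its standard error (unequal variances)."""
    diff = mean(shaped_returns) - mean(plain_returns)
    se = (variance(shaped_returns) / len(shaped_returns)
          + variance(plain_returns) / len(plain_returns)) ** 0.5
    return diff, se


# Hypothetical per-seed returns, measured with the unshaped task reward.
diff, se = ablation_gain([10.0, 12.0, 11.0, 13.0], [8.0, 9.0, 7.0, 8.0])
print(f"gain = {diff:.2f} +/- {se:.2f}")
```

A gain that is small relative to its standard error suggests the shaping term is not yet shown to help and more seeds are needed.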
## 🧪 Verification / Duplicates
- Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — potential-based + RND + RLHF + GRPO + RLVR |