2nd/10_Wiki/Topics/AI_and_ML/Reward-Shaping-in-RL.md

---
id: wiki-2026-0508-reward-shaping-in-rl
title: Reward Shaping in RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward Shaping, Shaped Reward, Dense Reward Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, reward-design, RLHF, GRPO, sparse-reward]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/Gymnasium/TRL
---

# Reward Shaping in RL

## One-Line Summary
> **"Turn a sparse reward into a dense intermediate signal without changing the optimal policy."** Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping, F(s, s') = γΦ(s') − Φ(s), preserves the optimal policy. That result remains a reference point for reward design in modern RLHF, GRPO, and RLVR pipelines.

## Core Ideas

### Core theorem (Ng et al. 1999)
- Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
- If F(s, s') = γ·Φ(s') − Φ(s) for some state potential Φ (potential-based shaping), policy invariance is guaranteed.
- Without a well-defined Φ, an arbitrary bonus can distort the optimal policy.
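
The invariance claim follows from a telescoping sum. A short derivation sketch, assuming γ < 1 with bounded Φ (or Φ = 0 at terminal states in episodic tasks):

$$
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1})
= \sum_{t=0}^{T-1} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr)
= \gamma^{T}\Phi(s_T) - \Phi(s_0)
$$

As T grows, the first term vanishes, so the shaped return differs from the original return only by a quantity that depends on the start state and not on the actions taken:

$$
Q'^{\pi}(s,a) = Q^{\pi}(s,a) - \Phi(s)
\quad\Rightarrow\quad
\arg\max_a Q'^{*}(s,a) = \arg\max_a Q^{*}(s,a)
$$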

### Shaping types
- **Potential-based** (theory-safe): heuristic value Φ(s).
- **Curiosity / intrinsic motivation**: ICM, RND — exploration bonus.
- **Demonstrations (LfD)**: shaped reward from expert similarity.
- **Curriculum**: progressively harder targets.
- **RLHF reward model**: human-trained dense reward.
- **RLVR (verifiable)**: rule-based pass/fail (math, code) — sparse but exact.
- **GRPO advantages** (DeepSeek 2024-25): group-relative normalization replaces critic.
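
The demonstration-based entry in the list above has no pattern of its own, so here is a minimal sketch of one way to score "expert similarity". The function name `demo_similarity_bonus`, the Gaussian kernel, and the `bandwidth` parameter are illustrative assumptions, and states are assumed to be flat numeric sequences of equal length.

```python
import math
from typing import Sequence


def demo_similarity_bonus(state: Sequence[float],
                          demo_states: Sequence[Sequence[float]],
                          bandwidth: float = 1.0) -> float:
    """Bonus in (0, 1]: 1.0 on a demonstrated state, decaying with distance.

    Hypothetical helper: uses the squared Euclidean distance to the
    nearest demonstrated state and a Gaussian kernel.
    """
    if not demo_states:
        raise ValueError("demo_states must not be empty")
    nearest_sq = min(
        sum((a - b) ** 2 for a, b in zip(state, demo))
        for demo in demo_states
    )
    return math.exp(-nearest_sq / (2.0 * bandwidth ** 2))


demos = [[0.0, 0.0], [1.0, 1.0]]
on_demo = demo_similarity_bonus([0.0, 0.0], demos)
near = demo_similarity_bonus([0.2, 0.1], demos)
far = demo_similarity_bonus([5.0, 5.0], demos)
```

Added directly to the reward, this bonus is not potential-based. Using it as the potential Φ(s) instead would keep the policy-invariance guarantee.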

### Applications
1. Sparse-reward locomotion / manipulation.
2. Game RL (StarCraft II, Atari hard-exploration).
3. RLHF for LLM alignment.
4. RLVR/GRPO for math/code (e.g. DeepSeek-R1; OpenAI's o1 is reported to rely on large-scale RL, though its training details are unpublished).
5. Robotics imitation + RL hybrid.

## 💻 Patterns

### Potential-Based Shaping (Ng 1999)
```python
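def potential(state) -> float:
    """Heuristic potential Φ(s), e.g. negative distance to the goal."""
    # goal_distance is assumed to be supplied by the environment wrapper.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99, done=False):
    # Φ must be 0 at terminal states for the invariance result to hold
    # in episodic tasks.
    next_potential = 0.0 if done else potential(s_next)
    return r + gamma * next_potential - potential(s)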
```
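
To see why this form is safe, it helps to check the telescoping property numerically: along any trajectory, the discounted shaping terms sum to γ^T·Φ(s_T) − Φ(s_0), regardless of the path in between. The sketch below assumes a hypothetical list of potential values Φ(s_0), ..., Φ(s_T) for one trajectory; the numbers are made up for illustration.

```python
def discounted_shaping_sum(potentials, gamma):
    """Sum of gamma^t * F(s_t, s_{t+1}) given the potentials along a trajectory."""
    total = 0.0
    for t in range(len(potentials) - 1):
        shaping = gamma * potentials[t + 1] - potentials[t]
        total += (gamma ** t) * shaping
    return total


# Hypothetical potentials for a 4-step trajectory ending at the goal (Φ = 0).
phis = [-5.0, -4.0, -4.5, -2.0, 0.0]
gamma = 0.9

path_sum = discounted_shaping_sum(phis, gamma)
closed_form = gamma ** (len(phis) - 1) * phis[-1] - phis[0]
```

Both quantities agree up to floating-point error, so the detour through Φ = −4.5 adds nothing to the total: only the endpoints matter.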

### Curiosity-Driven (RND)
```python
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        # The target network is a fixed random projection and is never trained.
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

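    def intrinsic(self, obs):
        # Per-sample prediction error: large for rarely visited observations.
        # Its batch mean also serves as the predictor's training loss.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)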
```
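
Raw prediction errors shrink as the predictor trains, so their scale drifts over time. The RND paper divides the intrinsic reward by a running estimate of the standard deviation of intrinsic returns; the sketch below simplifies that to the running standard deviation of the raw bonuses. `RunningStd` and `normalize_bonus` are hypothetical names, not part of any library.

```python
import math


class RunningStd:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.count < 2:
            return 1.0  # no scale estimate yet; leave the bonus unscaled
        return math.sqrt(self.m2 / self.count)


def normalize_bonus(raw_bonus: float, stats: RunningStd, eps: float = 1e-8) -> float:
    """Update the running statistics, then rescale the bonus by the running std."""
    stats.update(raw_bonus)
    return raw_bonus / (stats.std + eps)


stats = RunningStd()
scaled = [normalize_bonus(b, stats) for b in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]]
```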

### Curriculum Reward
```python
def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    """Linearly interpolate from easy_target to hard_target over ramp_episodes."""
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
```
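
The linear ramp above advances on a fixed schedule whether or not the agent is keeping up. A minimal sketch of a success-gated alternative follows, assuming the caller reports a boolean success flag per episode. The class name `AdaptiveCurriculum` and the parameters `steps`, `window`, and `promote_at` are hypothetical.

```python
from collections import deque


class AdaptiveCurriculum:
    """Raise the difficulty one level once the recent success rate is high enough."""

    def __init__(self, easy_target, hard_target, steps=10, window=50, promote_at=0.8):
        self.easy = easy_target
        self.hard = hard_target
        self.steps = steps            # number of difficulty levels above the easiest
        self.promote_at = promote_at  # success rate required to advance
        self.level = 0
        self.recent = deque(maxlen=window)

    @property
    def target(self):
        t = self.level / self.steps
        return self.easy + t * (self.hard - self.easy)

    def report(self, success: bool) -> None:
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.recent.maxlen
        if not window_full or self.level >= self.steps:
            return
        if sum(self.recent) / len(self.recent) >= self.promote_at:
            self.level += 1
            self.recent.clear()  # require fresh evidence at the new level


cur = AdaptiveCurriculum(1.0, 11.0, steps=10, window=5, promote_at=0.8)
for _ in range(5):
    cur.report(True)    # five straight successes: promote to level 1
for _ in range(5):
    cur.report(False)   # five failures: stay at level 1
```

Clearing the window after each promotion is a design choice: it prevents successes at the easier level from counting toward the next promotion.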

### RLHF Reward Model
```python
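import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attn).last_hidden_state
        # Score the last non-padding token. out[:, -1] would read a pad
        # token under right padding; this indexing assumes right-padded batches.
        last_idx = attn.sum(dim=1) - 1
        batch_idx = torch.arange(out.size(0), device=out.device)
        last = out[batch_idx, last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()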
```
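
The Bradley-Terry loss models the probability that the chosen response beats the rejected one as sigmoid(r_chosen − r_rejected). A dependency-free scalar sketch makes its behaviour easy to inspect; `bt_loss_scalar` is a hypothetical helper written for illustration, not part of TRL or PyTorch.

```python
import math


def bt_loss_scalar(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed as a stable softplus."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == softplus(-m) == max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


tie = bt_loss_scalar(0.0, 0.0)          # equals log(2): the model is indifferent
correct = bt_loss_scalar(2.0, 0.0)      # chosen scored higher: small loss
wrong = bt_loss_scalar(0.0, 2.0)        # rejected scored higher: large loss
```

Only the margin between the two scores matters, so the reward model's absolute scale is unconstrained by this loss.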

### RLVR — Verifiable Rule Reward
```python
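def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, and has_required_tags are assumed
    # to be task-specific helpers defined elsewhere.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")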
```
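
The helpers used by `rlvr_reward` are not defined in this note. One possible sketch of two of them follows, assuming answers are wrapped in `\boxed{...}` and that the required format is a `<think>` block followed by an `<answer>` block. Both conventions are assumptions chosen for illustration, and the regex deliberately ignores nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re


def extract_answer(generated: str):
    r"""Return the content of the last \boxed{...} in the text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generated)
    return matches[-1].strip() if matches else None


def has_required_tags(generated: str, tags=("think", "answer")) -> bool:
    """True if each tag occurs exactly once as <tag>...</tag>, in the given order."""
    pos = 0
    for tag in tags:
        open_tag, close_tag = f"<{tag}>", f"</{tag}>"
        if generated.count(open_tag) != 1 or generated.count(close_tag) != 1:
            return False
        start = generated.find(open_tag, pos)
        if start == -1:
            return False  # tag exists but appears before the previous block ended
        end = generated.find(close_tag, start + len(open_tag))
        if end == -1:
            return False
        pos = end + len(close_tag)
    return True


good = "<think>2 * 21</think><answer>\\boxed{42}</answer>"
swapped = "<answer>\\boxed{42}</answer><think>2 * 21</think>"
```

Exact string comparison against the gold answer is brittle ("42" vs "42.0"), so a real verifier would likely normalize both sides first.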

### GRPO Advantage (DeepSeek 2024)
```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no learned critic is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
```
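
A small worked example shows what the normalization does with binary RLVR rewards, including the degenerate case where every sample in the group gets the same reward. This sketch re-implements the same formula with the standard library so it runs without NumPy; `grpo_advantages_list` is a hypothetical name and the reward values are made up.

```python
import statistics


def grpo_advantages_list(group_rewards, eps=1e-8):
    """(r - mean) / (std + eps) within one prompt's group, using population std."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


# G=8 samples for one prompt; two of them pass the verifier.
mixed = grpo_advantages_list([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Every sample passes: there is no within-group contrast to learn from.
uniform = grpo_advantages_list([1.0] * 8)
```

Passing samples receive positive advantages and failing ones negative. When the whole group agrees, all advantages are zero and the prompt contributes no gradient signal, which is one reason prompt difficulty matters for GRPO-style training.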

### Combined Shaping
```python
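import torch

def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    pot = gamma * potential(s_next) - potential(s)
    # The curiosity bonus is not potential-based, so it carries no
    # policy-invariance guarantee; keep cur_w small or decay it.
    with torch.no_grad():
        cur = model.intrinsic(obs).item()  # obs: a single observation
    return r_env + pot_w * pot + cur_w * cur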
```
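
When several terms are summed like this, it is worth logging how much of the total reward comes from shaping rather than from the environment. The sketch below is a rough heuristic and not an established metric; `shaping_share` is a hypothetical helper and the example values are made up.

```python
def shaping_share(env_rewards, shaping_terms):
    """Fraction of the total absolute reward mass contributed by shaping terms."""
    env_mass = sum(abs(r) for r in env_rewards)
    shaping_mass = sum(abs(f) for f in shaping_terms)
    total = env_mass + shaping_mass
    return shaping_mass / total if total > 0 else 0.0


# One hypothetical episode: sparse env reward at the end, dense shaping throughout.
share = shaping_share(env_rewards=[0.0, 0.0, 1.0],
                      shaping_terms=[0.5, 0.5, 1.0])
```

A share that stays close to 1.0 over training suggests the agent may be optimizing the bonus rather than the task, and that the shaping weights need lowering.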

### Reward Hacking Detector
```python
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag likely hacking: proxy reward trends up while true return is flat or falling."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend <= 0)
```

## Decision Criteria
| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |

**Default**: potential-based shaping first; RLVR + GRPO for verifiable LLM tasks; an RLHF reward model for subjective tasks.

## 🔗 Graph
- Parent: [[Reinforcement Learning]] · [[Reward Design]]
- Variants: [[GRPO]] · [[RLHF]]
- Adjacent: [[Reward Prediction Error]]

## 🤖 LLM Usage
**When**: reward model training (RLHF), reward function code generation, reward hacking analysis from logs.
**When not**: reward functions proposed by an LLM are prone to hacking; do not deploy them unverified, and check them with controlled rollouts first.

## ❌ Anti-patterns
- **Non-potential bonus**: an arbitrary +10 for reaching a sub-goal can distort the optimal policy.
- **Reward hacking ignored**: cumulative reward climbs while the task fails, and nobody monitors the gap.
- **Over-shaping**: dense bonuses overwhelm the sparse signal, so the agent ignores the task.
- **Static curriculum**: the agent has surpassed the current level but is still served easy targets.
- **No baseline check**: no ablation with vs. without shaping, so the actual gain is unknown.
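
The baseline check in the last item can be as simple as comparing true (unshaped) evaluation returns from runs with and without shaping. A minimal bootstrap sketch follows, assuming each list holds one final evaluation return per training seed. The function name `bootstrap_mean_diff` and the numbers are hypothetical.

```python
import random
import statistics


def bootstrap_mean_diff(with_shaping, without_shaping, n_boot=2000, seed=0):
    """Point estimate and 95% percentile bootstrap interval for the mean difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(with_shaping) for _ in with_shaping]
        b = [rng.choice(without_shaping) for _ in without_shaping]
        diffs.append(statistics.fmean(a) - statistics.fmean(b))
    diffs.sort()
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot) - 1]
    point = statistics.fmean(with_shaping) - statistics.fmean(without_shaping)
    return point, low, high


# Made-up returns from five seeds per condition, measured on the unshaped task reward.
shaped_runs = [10.0, 11.0, 12.0, 13.0, 14.0]
plain_runs = [1.0, 2.0, 3.0, 4.0, 5.0]
point, low, high = bootstrap_mean_diff(shaped_runs, plain_runs)
```

If the interval includes zero, the ablation gives no evidence that shaping helped. With only a handful of seeds the interval is rough, so it is better read as a sanity check than as a significance test.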

## 🧪 Verification / Duplicates
- Verified against Ng, Harada, and Russell (ICML 1999); the DeepSeek-R1 paper (2025); Sutton & Barto, Ch. 17.
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — potential-based + RND + RLHF + GRPO + RLVR |