---
id: wiki-2026-0508-reward-shaping-in-rl
title: Reward Shaping in RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward Shaping, Shaped Reward, Dense Reward Design]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, reward-design, RLHF, GRPO, sparse-reward]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/Gymnasium/TRL
---

# Reward Shaping in RL

## One-liner

> **"Turn a sparse reward into a dense intermediate signal without changing the optimal policy."**

Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping, F(s, s') = γ·Φ(s') − Φ(s), preserves the optimal policy. This result is the theoretical starting point for modern RLHF/GRPO/RLVR reward design.

## Core

### Core theorem (Ng et al. 1999)
- Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
- If F(s, s') = γ·Φ(s') − Φ(s) for some potential function Φ over states (potential-based), policy invariance is guaranteed.
- Without a well-defined Φ, an arbitrary bonus can distort the optimal policy.

### Shaping types
- **Potential-based** (theory-safe): Φ(s) is a heuristic value estimate.
- **Curiosity / intrinsic motivation**: ICM, RND; an exploration bonus.
- **Demonstrations (LfD)**: shaped reward from similarity to expert behavior.
- **Curriculum**: progressively harder targets.
- **RLHF reward model**: a learned reward model trained on human preference data.
- **RLVR (verifiable)**: rule-based pass/fail (math, code); sparse but exact.
- **GRPO advantages** (DeepSeek 2024-25): group-relative normalization replaces the critic.

### Applications
1. Sparse-reward locomotion / manipulation.
2. Game RL (StarCraft II, Atari hard-exploration).
3. RLHF for LLM alignment.
4. RLVR/GRPO for math/code (DeepSeek-R1; o1 reportedly uses large-scale RL, but its training details are unpublished).
5. Robotics imitation + RL hybrid.

## 💻 Patterns

### Potential-Based Shaping (Ng 1999)
```python
def potential(state) -> float:
    """Heuristic potential, e.g.
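    the negative distance to the goal."""
    # goal_distance is an environment-specific helper assumed to exist.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99, done=False):
    # Terminal states get zero potential so episodic tasks keep the
    # policy-invariance guarantee.
    phi_next = 0.0 if done else potential(s_next)
    return r + gamma * phi_next - potential(s)
```

### Curiosity-Driven (RND)
```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: prediction error acts as a novelty bonus."""

    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor is trained to match the target on visited observations.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        # Per-sample squared error; its mean also serves as the predictor loss.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)
```

### Curriculum Reward
```python
def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    # Linearly interpolate from the easy to the hard target, then hold.
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
```

### RLHF Reward Model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    # Default checkpoint ID follows the Hugging Face Hub naming for Llama 3 8B.
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Score the last non-padding token (assumes right padding).
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

### RLVR: Verifiable Rule Reward
```python
def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, and has_required_tags are
    # task-specific helpers assumed to be defined elsewhere.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")
```

### GRPO Advantage (DeepSeek 2024)
```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic required."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # epsilon guards against zero variance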
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
```

### Combined Shaping
```python
def combined_reward(r_env, s, s_next, model, obs, gamma=0.99, pot_w=1.0, cur_w=0.1):
    # Potential-based term: policy-invariant for any constant weight.
    pot = gamma * potential(s_next) - potential(s)
    # Curiosity term: not policy-invariant, so keep cur_w small.
    # obs must be a single observation so that .item() yields a scalar.
    cur = model.intrinsic(obs).item()
    return r_env + pot_w * pot + cur_w * cur
```

### Reward Hacking Detector
```python
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag hacking: shaped reward trends up while the true return stagnates or falls."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    # Slope of a linear fit over the most recent window.
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return rew_trend > 0.01 and ret_trend <= 0
```

## Decision criteria
| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |

**Default**: potential-based shaping first; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.

## 🔗 Graph
- Parent: [[Reinforcement Learning]] · [[Reward Design]]
- Variants: [[GRPO]] · [[RLHF]]
- Adjacent: [[Reward Prediction Error]]

## 🤖 LLM usage
**When**: reward model training (RLHF), reward function code generation, reward hacking analysis from logs.

**When not**: reward functions proposed by an LLM are prone to hacking; verify them with controlled rollouts before relying on them.

## ❌ Anti-patterns
- **Non-potential bonus**: an arbitrary +10 for reaching a sub-goal distorts the optimal policy.
- **Reward hacking ignored**: cumulative reward rises while the task fails, and nobody monitors the gap.
- **Over-shaping**: dense bonuses overwhelm the sparse signal, so the agent ignores the task.
- **Static curriculum**: the agent has surpassed the easy targets but is still served them.
- **No baseline check**: no ablation of shaping vs. no shaping, so the actual gain is unknown.
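
The non-potential-bonus and no-baseline-check anti-patterns can be checked numerically. The following is a minimal sketch, not taken from the cited sources: a hypothetical 5-state chain MDP solved by value iteration. The names `step`, `env_reward`, `phi`, `potential_bonus`, `subgoal_bonus`, and `greedy_policy` are illustrative, and the 0.5 bonus and γ = 0.9 are arbitrary assumptions. It compares the greedy policy with no shaping, with potential-based shaping, and with a non-potential sub-goal bonus.

```python
import numpy as np

# Hypothetical toy problem: states 0..4 on a line, state 4 is the terminal goal.
N_STATES, GOAL, GAMMA = 5, 4, 0.9
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic chain dynamics; the left wall blocks movement."""
    return max(s - 1, 0) if a == LEFT else min(s + 1, GOAL)

def env_reward(s_next):
    """Sparse task reward: 1 only on entering the goal."""
    return 1.0 if s_next == GOAL else 0.0

def phi(s):
    """Potential: negative distance to the goal, so phi(GOAL) == 0."""
    return -float(GOAL - s)

def potential_bonus(s, s_next):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return GAMMA * phi(s_next) - phi(s)

def subgoal_bonus(s, s_next):
    """Arbitrary (non-potential) bonus for stepping from state 1 into state 2."""
    return 0.5 if (s, s_next) == (1, 2) else 0.0

def greedy_policy(bonus=None, sweeps=500):
    """Run value iteration and return the greedy action per non-terminal state."""
    q = np.zeros((N_STATES, 2))          # the terminal row stays at zero
    for _ in range(sweeps):
        v = q.max(axis=1)                # state values from the previous sweep
        for s in range(GOAL):
            for a in (LEFT, RIGHT):
                s_next = step(s, a)
                r = env_reward(s_next)
                if bonus is not None:
                    r += bonus(s, s_next)
                q[s, a] = r + GAMMA * v[s_next]
    return q[:GOAL].argmax(axis=1)

print("no shaping:      ", greedy_policy())
print("potential-based: ", greedy_policy(potential_bonus))
print("sub-goal bonus:  ", greedy_policy(subgoal_bonus))
```

Under these assumptions, the unshaped and potential-shaped policies both move right (action 1) in every state. With the sub-goal bonus, states 2 and 3 turn back left, because looping between states 1 and 2 to re-collect the bonus is worth more than reaching the goal. Running the unshaped case alongside the shaped ones is the ablation that the last anti-pattern asks for.
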
## 🧪 Verification / Duplicates
- Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: potential-based + RND + RLHF + GRPO + RLVR |