| id | title | category | status | canonical_id | aliases | duplicate_of | source_trust_level | confidence_score | verification_status | tags | raw_sources | last_reinforced | github_commit | tech_stack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wiki-2026-0508-reward-shaping-in-rl | Reward Shaping in RL | 10_Wiki/Topics | verified | self | | none | A | 0.95 | applied | | | 2026-05-10 | pending | |
Reward Shaping in RL
In one line
"Turn a sparse reward into a dense intermediate signal without changing the optimal policy." Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping, F(s, s') = γΦ(s') − Φ(s), preserves the optimal policy. That result remains a reference point for reward design in modern RLHF, GRPO, and RLVR pipelines.
Key points
Core theorem (Ng et al. 1999)
- Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
- F(s, s') = γ·Φ(s') − Φ(s) (potential-based) → policy invariance guaranteed.
- Without a well-defined Φ, the guarantee is lost: an arbitrary bonus can distort the optimal policy.
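Why the potential-based form is safe follows from a short telescoping argument. The sketch below assumes the infinite-horizon discounted setting with γ < 1 and bounded Φ (for episodic tasks, Φ is taken to be 0 at terminal states); Q* and Q'* are shorthand introduced here for the optimal action values under r and r'.

```latex
\begin{aligned}
\sum_{t=0}^{\infty} \gamma^{t} F(s_t, s_{t+1})
  &= \sum_{t=0}^{\infty} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr) \\
  &= \sum_{t=0}^{\infty} \gamma^{t+1}\Phi(s_{t+1}) - \sum_{t=0}^{\infty} \gamma^{t}\Phi(s_t)
   = -\Phi(s_0), \\
Q'^{*}(s, a) &= Q^{*}(s, a) - \Phi(s), \\
\arg\max_{a} Q'^{*}(s, a) &= \arg\max_{a} Q^{*}(s, a).
\end{aligned}
```

The shaping terms add up to a quantity that depends only on the start state, so every action at s is shifted by the same amount and the ranking of actions is unchanged.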
Shaping types
- Potential-based (theory-safe): heuristic value Φ(s).
- Curiosity / intrinsic motivation: ICM, RND — exploration bonus.
- Demonstrations (LfD): shaped reward from expert similarity.
- Curriculum: progressively harder targets.
- RLHF reward model: human-trained dense reward.
- RLVR (verifiable): rule-based pass/fail (math, code) — sparse but exact.
- GRPO advantages (DeepSeek 2024-25): group-relative normalization replaces critic.
Applications
- Sparse-reward locomotion / manipulation.
- Game RL (StarCraft II, Atari hard-exploration).
- RLHF for LLM alignment.
- RLVR/GRPO for math/code (DeepSeek-R1; o1 is often cited alongside it, though its training recipe is unpublished).
- Robotics imitation + RL hybrid.
💻 Patterns
Potential-Based Shaping (Ng et al. 1999)
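def potential(state) -> float:
    """Heuristic potential Φ(s), e.g. negative distance to the goal."""
    # goal_distance is assumed to be provided by the environment.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99):
    # r' = r + γ·Φ(s') − Φ(s); use the same γ as the learner's discount.
    return r + gamma * potential(s_next) - potential(s)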
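The invariance claim can be checked numerically on a toy problem. The following is a minimal sketch, assuming a made-up 5-state deterministic chain with a terminal goal and Φ(s) equal to the negative distance to the goal; `step`, `phi`, `q_values`, and `greedy` are illustrative names, not part of any library.

```python
GAMMA = 0.9
N = 5                 # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)    # move left / move right

def step(s, a):
    """Deterministic chain: move left or right, clipped to [0, N-1]."""
    s_next = min(max(s + a, 0), N - 1)
    reward = 1.0 if s_next == N - 1 else 0.0
    return s_next, reward

def phi(s):
    """Potential: negative distance to the goal (0 at the terminal state)."""
    return -float(N - 1 - s)

def q_values(shaped, sweeps=200):
    """Tabular value iteration on Q, with or without the shaping term."""
    q = {(s, a): 0.0 for s in range(N - 1) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in q:
            s_next, r = step(s, a)
            if shaped:
                r += GAMMA * phi(s_next) - phi(s)
            terminal = s_next == N - 1
            future = 0.0 if terminal else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] = r + GAMMA * future
    return q

def greedy(q):
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N - 1)}

q_plain, q_shaped = q_values(False), q_values(True)
print(greedy(q_plain) == greedy(q_shaped))
```

In this toy setting the two greedy policies coincide and the shaped values differ from the plain ones by exactly Φ(s), which is what the theorem predicts.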
Curiosity-Driven (RND)
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor is trained to match the target's features.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        # Per-observation prediction error; large for rarely seen states.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)
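Raw prediction errors shrink as the predictor trains, so the bonus is commonly rescaled before it is mixed with the environment reward. Below is a dependency-free sketch of one way to do that, assuming a running standard deviation over raw bonuses (a simplification; the RND paper normalizes by the deviation of intrinsic returns). `RunningStd` and `normalize_bonus` are hypothetical names.

```python
import math

class RunningStd:
    """Welford's online estimate of mean and (population) standard deviation."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.count < 2:
            return 1.0  # no scale information yet; leave values unscaled
        return math.sqrt(self.m2 / self.count)

def normalize_bonus(raw_bonus: float, stats: RunningStd, eps: float = 1e-8) -> float:
    """Update the running statistics, then divide the bonus by the current std."""
    stats.update(raw_bonus)
    return raw_bonus / (stats.std + eps)
```

Dividing without subtracting the mean keeps the bonus non-negative, so it still reads as "more novelty, more reward".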
Curriculum Reward
def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    # Linear ramp from easy_target to hard_target over ramp_episodes, then hold.
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
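A fixed ramp advances regardless of how the agent is doing. One alternative, sketched here under the assumption that each episode yields a boolean success flag, is to promote the difficulty only when the recent success rate clears a threshold. `AdaptiveCurriculum` and its parameters are hypothetical.

```python
from collections import deque

class AdaptiveCurriculum:
    """Raise the target one level whenever the recent success rate is high enough."""

    def __init__(self, easy_target, hard_target, levels=10, window=50, promote_at=0.8):
        self.easy, self.hard = easy_target, hard_target
        self.levels = levels          # number of promotion steps from easy to hard
        self.level = 0
        self.promote_at = promote_at  # success rate required to advance
        self.results = deque(maxlen=window)

    def target(self):
        t = self.level / self.levels
        return self.easy + t * (self.hard - self.easy)

    def record(self, success: bool) -> None:
        self.results.append(1.0 if success else 0.0)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.promote_at:
            self.level = min(self.level + 1, self.levels)
            self.results.clear()      # re-measure success at the new difficulty
```

Clearing the window after each promotion means the agent has to earn the next level at the new difficulty rather than on stale successes.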
RLHF Reward Model
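import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    # Default is a gated Hub checkpoint; any decoder backbone can be substituted.
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Score the last non-padding token (assumes right padding).
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()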
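To make the pairwise loss concrete, here is a scalar version in plain Python. It is a sketch for intuition, not a replacement for the batched tensor loss; `preference_prob` and `bt_pair_loss` are names introduced here.

```python
import math

def preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(margin)), written as a numerically stable softplus(-margin)."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Equal scores give a coin flip; a wider margin in the right direction lowers the loss.
for rc, rr in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
    print(rc, rr, preference_prob(rc, rr), bt_pair_loss(rc, rr))
```

Only the difference between the two scores matters, so a reward model trained this way is identified up to an additive constant.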
RLVR: Verifiable Rule Reward
def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, has_required_tags are task-specific helpers.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")
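The helpers called above are not defined in this note. As an illustration only, two of them might look like the following, assuming math answers are wrapped in `\boxed{...}` and the required format is a `<think>` block followed by an `<answer>` block; both conventions are assumptions, and these bodies are hypothetical.

```python
import re
from typing import Optional

def extract_answer(generated: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} (no nested braces), or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generated)
    return matches[-1].strip() if matches else None

def has_required_tags(generated: str) -> bool:
    """True if the text starts with <think>...</think> and ends with <answer>...</answer>."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return re.fullmatch(pattern, generated, flags=re.DOTALL) is not None
```

Exact string comparison against the gold answer is brittle ("0.5" vs "1/2"), so a real checker would usually normalize both sides before comparing.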
GRPO Advantage (DeepSeek 2024)
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
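A small worked case shows what the normalization does with binary rewards. This sketch restates the same computation with the standard library (`statistics.pstdev` is the population standard deviation, matching NumPy's default); `group_advantages` is a name introduced here and the reward values are made up.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Subtract the group mean and divide by the group (population) std."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_advantages([1.0, 0.0, 0.0, 1.0])    # two passes, two fails
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])  # every sample passes

print(mixed)    # passes get about +1, fails about -1
print(uniform)  # all zeros
```

When every sample in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason prompt difficulty matters for this style of training.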
Combined Shaping
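def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    pot = gamma * potential(s_next) - potential(s)
    # Curiosity bonus for a single observation; it is not potential-based,
    # so policy invariance is no longer guaranteed once cur_w > 0.
    cur = model.intrinsic(obs).item()
    return r_env + pot_w * pot + cur_w * cur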
Reward Hacking Detector
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag hacking: proxy reward trends up while true return is flat or falling."""
    if min(len(rewards), len(true_returns)) < window:
        return False
    rew_trend = np.polyfit(range(window), rewards[-window:], 1)[0]
    ret_trend = np.polyfit(range(window), true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend <= 0)
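The trend test is just a comparison of two least-squares slopes. The dependency-free sketch below spells that out on a synthetic log; the series are made up for illustration, and `slope` is a hypothetical helper.

```python
def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

# Synthetic log: the proxy reward climbs while the true return decays.
proxy_rewards = [0.05 * t for t in range(100)]
true_returns = [10.0 - 0.02 * t for t in range(100)]

flagged = slope(proxy_rewards) > 0.01 and slope(true_returns) <= 0
print(flagged)
```

The 0.01 threshold depends on the reward scale, so it would need tuning per task; on noisy logs a longer window makes the slopes more stable.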
Decision criteria
| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |
Default: potential-based shaping as the primary choice; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.
🔗 Graph
- Parent: Reinforcement Learning · Reward Design
- Variants: GRPO · RLHF
- Adjacent: Reward Prediction Error
🤖 LLM usage
When to use: reward model training (RLHF), reward function code generation, reward hacking analysis from logs. When not to: reward functions proposed by an LLM are themselves prone to hacking; verify them with controlled rollouts before relying on them.
❌ Anti-patterns
- Non-potential bonus: an arbitrary +10 for reaching a sub-goal can distort the optimal policy.
- Reward hacking ignored: cumulative reward rises while the task fails, and nobody is monitoring for it.
- Over-shaping: a dense bonus overwhelms the sparse signal, so the agent ignores the actual task.
- Static curriculum: the agent has surpassed the easy targets but is still being served them.
- No baseline check: no ablation with and without shaping, so the actual gain is unknown.
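The first anti-pattern is easy to reproduce. Below is a minimal sketch, assuming a made-up 5-state chain with a terminal goal at state 4 and a hypothetical +10 bonus paid every time the agent enters state 2; all names are illustrative.

```python
GAMMA, N, ACTIONS = 0.9, 5, (-1, +1)   # chain 0..4; state 4 is the terminal goal
SUBGOAL, BONUS = 2, 10.0               # hypothetical "+10 for reaching the sub-goal"

def greedy_policy(bonus, sweeps=500):
    """Value iteration on Q, then the greedy action for each non-terminal state."""
    q = {(s, a): 0.0 for s in range(N - 1) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in q:
            s_next = min(max(s + a, 0), N - 1)
            r = 1.0 if s_next == N - 1 else 0.0
            if s_next == SUBGOAL:
                r += bonus             # paid on every entry, not potential-based
            terminal = s_next == N - 1
            future = 0.0 if terminal else max(q[(s_next, b)] for b in ACTIONS)
            q[(s, a)] = r + GAMMA * future
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N - 1)}

print(greedy_policy(0.0))    # every state moves right, toward the goal
print(greedy_policy(BONUS))  # state 3 turns back toward the sub-goal
```

With the bonus in place, stepping back into the sub-goal over and over is worth far more than finishing the task, so the agent next to the goal walks away from it. A potential-based term cannot create this kind of loop because its contributions telescope.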
🧪 Verification / duplicates
- Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
- Trust level: A.
🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — potential-based + RND + RLHF + GRPO + RLVR |