id: wiki-2026-0508-reward-shaping-in-rl
title: Reward Shaping in RL
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Reward Shaping, Shaped Reward, Dense Reward Design
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: reinforcement-learning, reward-design, RLHF, GRPO, sparse-reward
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language: Python; framework: PyTorch/Gymnasium/TRL

Reward Shaping in RL

One-line summary

"Turn a sparse reward into a dense intermediate signal without changing the optimal policy." Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping F(s, s') = γ·Φ(s') - Φ(s) preserves the optimal policy. That result is a theoretical foundation for reward design in modern RLHF/GRPO/RLVR.

Key points

Core theorem (Ng et al. 1999)

  • Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
  • F(s, s') = γ·Φ(s') - Φ(s) (potential-based) → policy invariance guaranteed.
  • Without a well-defined Φ, an arbitrary bonus can distort the optimal policy.
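
Why the potential-based form is safe can be seen from a short telescoping argument. The sketch below assumes a bounded Φ and γ < 1 (or an episodic task with Φ set to zero at terminal states); the symbols Q* and Q'* for the optimal action values under r and r' are notation introduced here for illustration.

```latex
\begin{aligned}
\sum_{t=0}^{\infty} \gamma^{t} F(s_t, s_{t+1})
  &= \sum_{t=0}^{\infty} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr) \\
  &= \sum_{t=1}^{\infty} \gamma^{t}\,\Phi(s_t) - \sum_{t=0}^{\infty} \gamma^{t}\,\Phi(s_t)
   = -\Phi(s_0), \\[4pt]
Q'^{*}(s, a) &= Q^{*}(s, a) - \Phi(s)
  \;\Longrightarrow\;
  \arg\max_{a} Q'^{*}(s, a) = \arg\max_{a} Q^{*}(s, a).
\end{aligned}
```

The shaping terms add up to an offset that depends only on the start state, so every action at a given state is shifted by the same amount and the ranking of actions is unchanged.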

Shaping types

  • Potential-based (theory-safe): heuristic value Φ(s).
  • Curiosity / intrinsic motivation: ICM, RND — exploration bonus.
  • Demonstrations (LfD): shaped reward from expert similarity.
  • Curriculum: progressively harder targets.
  • RLHF reward model: human-trained dense reward.
  • RLVR (verifiable): rule-based pass/fail (math, code) — sparse but exact.
  • GRPO advantages (DeepSeek 2024-25): group-relative normalization replaces critic.

Applications

  1. Sparse-reward locomotion / manipulation.
  2. Game RL (StarCraft II, Atari hard-exploration).
  3. RLHF for LLM alignment.
  4. RLVR/GRPO for math/code (DeepSeek-R1, o1).
  5. Robotics imitation + RL hybrid.

💻 Patterns

Potential-Based Shaping (Ng 1999)

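def potential(state) -> float:
    """Heuristic potential, e.g. negative distance to the goal."""
    # goal_distance is an environment-specific helper defined elsewhere.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99):
    # r' = r + F(s, s'), with F(s, s') = gamma * Phi(s') - Phi(s)
    return r + gamma * potential(s_next) - potential(s)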
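
The invariance claim can be checked numerically on a single finite trajectory: the shaped return should equal the environment return plus a term that depends only on the first and last states. This is a minimal sketch; the 1-D states, the goal at position 5, and the `discounted_return` helper are assumptions made for the example.

```python
def potential(state: int) -> float:
    """Hypothetical potential: negative distance to a goal at position 5."""
    return -abs(5 - state)

def shaped_reward(r: float, s: int, s_next: int, gamma: float = 0.99) -> float:
    return r + gamma * potential(s_next) - potential(s)

def discounted_return(rewards, gamma: float) -> float:
    return sum(gamma ** t * r for t, r in enumerate(rewards))

gamma = 0.9
states = [0, 1, 2, 1, 2, 3, 4, 5]      # includes a detour (2 -> 1 -> 2)
env_rewards = [0.0] * 6 + [1.0]        # sparse: reward only on reaching the goal

shaped = [shaped_reward(r, s, s2, gamma)
          for r, s, s2 in zip(env_rewards, states, states[1:])]

g_env = discounted_return(env_rewards, gamma)
g_shaped = discounted_return(shaped, gamma)

# Telescoping sum: gamma^T * Phi(s_T) - Phi(s_0), independent of the path taken
T = len(env_rewards)
correction = gamma ** T * potential(states[-1]) - potential(states[0])
assert abs(g_shaped - (g_env + correction)) < 1e-9
```

Because the goal has potential zero here, the correction reduces to -Φ(s_0), which is the same for every trajectory that starts in state 0 and ends at the goal.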

Curiosity-Driven (RND)

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        for p in self.target.parameters(): p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)
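
The raw prediction error from `intrinsic` has an arbitrary scale that shrinks as the predictor learns, so it is usually rescaled before being mixed with the environment reward. The sketch below assumes a running standard deviation over the raw bonuses; `RunningStd` and `normalized_bonus` are hypothetical names, and published RND implementations may normalize differently (for example over intrinsic returns).

```python
import math

class RunningStd:
    """Welford's online algorithm for a running mean and standard deviation."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.count < 2:
            return 1.0  # no scale information yet; leave the bonus unscaled
        return math.sqrt(self.m2 / self.count)  # population standard deviation

def normalized_bonus(raw_bonus: float, stats: RunningStd, eps: float = 1e-8) -> float:
    """Update the running statistics, then divide the bonus by the running std."""
    stats.update(raw_bonus)
    return raw_bonus / (stats.std + eps)
```

Dividing by the standard deviation without subtracting the mean keeps the bonus non-negative, which matches its role as an exploration reward.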

Curriculum Reward

def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    t = min(episode_idx / ramp_episodes, 1.0)
    return easy_target + t * (hard_target - easy_target)
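
The ramp above advances with the episode count regardless of how the agent is doing. An alternative is to advance only when recent episodes succeed often enough. This is a sketch under assumptions: the class name, the success-rate threshold, and the window size are illustrative choices, not values from the original page.

```python
from collections import deque

class AdaptiveCurriculum:
    """Raise the target only after the agent succeeds often enough at the current one."""

    def __init__(self, easy_target: float, hard_target: float,
                 step: float = 0.1, window: int = 20, promote_at: float = 0.8) -> None:
        self.easy, self.hard = easy_target, hard_target
        self.step, self.promote_at = step, promote_at
        self.progress = 0.0                    # 0.0 = easiest, 1.0 = hardest
        self.results = deque(maxlen=window)    # recent success flags

    def target(self) -> float:
        return self.easy + self.progress * (self.hard - self.easy)

    def record(self, success: bool) -> None:
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        if window_full and sum(self.results) / len(self.results) >= self.promote_at:
            self.progress = min(self.progress + self.step, 1.0)
            self.results.clear()               # re-measure at the new difficulty

# Usage: four successes in a row with window=4 promote the target once
cur = AdaptiveCurriculum(easy_target=1.0, hard_target=11.0, step=0.5, window=4)
for _ in range(4):
    cur.record(True)
```

Clearing the window after a promotion means the success rate is always measured at the current difficulty, not carried over from an easier one.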

RLHF Reward Model

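import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    # The default checkpoint name is kept from the original; confirm the exact Hub id before use.
    def __init__(self, base="meta-llama/Llama-3-8b"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Pool at the last non-padding token (assumes right padding);
        # out[:, -1] would read a pad position for shorter sequences.
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()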
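
To see what the Bradley-Terry loss rewards, it helps to evaluate it on plain numbers. The sketch below is a scalar restatement using only the standard library; `bt_loss_scalar` and `preference_prob` are names introduced for this example.

```python
import math

def preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_loss_scalar(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(margin), computed as softplus(-margin) for numerical stability."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to avoid overflow
    return -margin + math.log1p(math.exp(margin))

tie = bt_loss_scalar(0.0, 0.0)        # equal scores: loss is log(2)
correct = bt_loss_scalar(2.0, 0.0)    # chosen scored higher: smaller loss
wrong = bt_loss_scalar(0.0, 2.0)      # chosen scored lower: larger loss
```

Only the difference between the two scores enters the loss, so the reward model's absolute scale is not pinned down by this objective.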

RLVR — Verifiable Rule Reward

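def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, and has_required_tags are task-specific helpers defined elsewhere.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")  # avoid silently returning None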
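
`rlvr_reward` relies on helpers that the page does not define. One possible shape for two of them is sketched below, assuming math answers are wrapped in `\boxed{...}` and that the format check expects a `<think>` block followed by an `<answer>` block. Both conventions are assumptions for illustration; a real task would substitute its own.

```python
import re
from typing import Optional

def extract_answer(generated: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the text, or None if absent.

    Simplification: nested braces such as \\boxed{\\frac{1}{2}} are not handled.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generated)
    return matches[-1].strip() if matches else None

def has_required_tags(generated: str) -> bool:
    """True if the text is exactly one <think> block followed by one <answer> block."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return re.fullmatch(pattern, generated, flags=re.DOTALL) is not None
```

Taking the last boxed expression lets the model revise an earlier answer inside its reasoning without being penalized for it.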

GRPO Advantage (DeepSeek 2024)

import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
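
A small worked example shows what the normalization does with binary verifier rewards. The function is restated here with the standard library so the block runs on its own; the group size of 4 and the reward values are made up for illustration.

```python
from statistics import fmean, pstdev

def grpo_advantages(group_rewards, eps: float = 1e-8):
    """Standardize rewards within one prompt's group of sampled outputs."""
    mean = fmean(group_rewards)
    std = pstdev(group_rewards) + eps   # population std, like np.std's default
    return [(r - mean) / std for r in group_rewards]

# Two of four samples passed the verifier: passes get about +1, failures about -1
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Every sample passed: all advantages are zero
uniform = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

When every sample in a group gets the same reward, the advantages are all zero and that prompt contributes no policy gradient, so prompts that are always solved or never solved carry no learning signal under this scheme.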

Combined Shaping

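def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    # Potential-based term: preserves the optimal policy.
    pot = gamma * potential(s_next) - potential(s)
    # Curiosity term (e.g. RND.intrinsic on a single observation): not potential-based,
    # so the invariance guarantee no longer holds; keep cur_w small.
    cur = model.intrinsic(obs).item()
    return r_env + pot_w * pot + cur_w * cur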

Reward Hacking Detector

import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag likely reward hacking: shaped reward trends up while the true return trends down."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    # Slope of a least-squares line over the most recent window
    rew_trend = np.polyfit(range(window), rewards[-window:], 1)[0]
    ret_trend = np.polyfit(range(window), true_returns[-window:], 1)[0]
    return rew_trend > 0.01 and ret_trend < 0
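
The detector's behaviour is easiest to judge on synthetic logs. The sketch below restates it with a hand-written least-squares slope so it runs without NumPy; the `slope` helper and all three series are made up for illustration.

```python
def slope(values) -> float:
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def detect_hacking(rewards, true_returns, window: int = 100) -> bool:
    if min(len(rewards), len(true_returns)) < window:
        return False
    return slope(rewards[-window:]) > 0.01 and slope(true_returns[-window:]) < 0

steps = range(200)
proxy = [0.05 * t for t in steps]      # shaped reward keeps climbing
healthy = [0.04 * t for t in steps]    # true return climbs with it
# True return rises for 100 steps, then declines while the proxy still climbs
hacked = [0.04 * min(t, 100) - 0.02 * max(t - 100, 0) for t in steps]

flag_healthy = detect_hacking(proxy, healthy)
flag_hacked = detect_hacking(proxy, hacked)
```

The fixed thresholds (0.01 and 0) depend on the reward scale, so in practice they would need tuning per task or replacing with a scale-free comparison.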

Decision criteria

| Situation | Approach |
|---|---|
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |

Default: potential-based shaping first; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.

🔗 Graph

🤖 LLM usage

When to use: reward model training (RLHF), reward function code generation, reward hacking analysis from logs. When not to: do not trust a reward function proposed by an LLM as-is, since such functions are prone to hacking; verify them with controlled rollouts.

Anti-patterns

  • Non-potential bonus: an arbitrary +10 for reaching a sub-goal distorts the optimal policy.
  • Reward hacking ignored: cumulative reward rises while the task fails, and nobody monitors the gap.
  • Over-shaping: dense bonuses overwhelm the sparse signal, so the agent ignores the task.
  • Static curriculum: the agent has surpassed the easy targets but is still served them.
  • No baseline check: no ablation with vs. without shaping, so the actual gain is unknown.
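
The first anti-pattern can be reproduced on a three-state toy problem. The sketch below runs value iteration under three reward variants and compares the greedy action at the sub-goal state. The MDP, the bonus size, and all names are assumptions made for this example.

```python
GAMMA = 0.9
GOAL = 2  # terminal state
# Deterministic toy MDP: (state, action) -> (next_state, environment reward)
TRANSITIONS = {
    (0, "forward"): (1, 0.0),
    (1, "forward"): (GOAL, 1.0),
    (1, "back"): (0, 0.0),
}
PHI = {0: 0.0, 1: 10.0, GOAL: 0.0}  # potential; zero at the terminal state

def no_bonus(s, s_next):
    return 0.0

def subgoal_bonus(s, s_next):
    """Arbitrary +10 every time the sub-goal (state 1) is reached."""
    return 10.0 if s_next == 1 else 0.0

def potential_bonus(s, s_next):
    """Potential-based shaping term built from PHI."""
    return GAMMA * PHI[s_next] - PHI[s]

def greedy_policy(bonus_fn, sweeps: int = 500):
    """Value iteration with reward r + bonus_fn(s, s'); returns the greedy action per state."""
    v = {0: 0.0, 1: 0.0, GOAL: 0.0}

    def q(s, a):
        s_next, r = TRANSITIONS[(s, a)]
        return r + bonus_fn(s, s_next) + GAMMA * v[s_next]

    def actions(s):
        return [a for (state, a) in TRANSITIONS if state == s]

    for _ in range(sweeps):
        for s in (0, 1):
            v[s] = max(q(s, a) for a in actions(s))
    return {s: max(actions(s), key=lambda a: q(s, a)) for s in (0, 1)}

baseline = greedy_policy(no_bonus)
distorted = greedy_policy(subgoal_bonus)
preserved = greedy_policy(potential_bonus)
```

With the arbitrary bonus, the greedy action at state 1 becomes "back": the agent loops between states 0 and 1 to collect the +10 repeatedly and never reaches the goal. With the potential-based term the greedy action stays "forward", the same as without any shaping.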

🧪 Verification / duplicates

  • Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: potential-based + RND + RLHF + GRPO + RLVR |