
Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00


id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

Reward Shaping in RL

One-line summary

"Turn sparse rewards into dense intermediate signals without changing the optimal policy." Ng, Harada, and Russell (1999, "Policy Invariance Under Reward Transformations") proved that potential-based shaping, F(s, s') = γΦ(s') − Φ(s), preserves the optimal policy. That result is a foundation for reward design in modern RLHF, GRPO, and RLVR pipelines.

Key points

Core theorem (Ng et al. 1999)

Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
If F(s, s') = γ·Φ(s') − Φ(s) for some state potential Φ (potential-based), policy invariance is guaranteed.
Without a well-defined Φ, an arbitrary bonus can distort the optimal policy.
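
A short telescoping argument shows why the potential-based form is safe. The sketch below assumes a discounted setting in which γ^t Φ(s_t) vanishes as t grows (bounded Φ with γ < 1, or an episodic task with Φ = 0 at terminal states). Q* and Q'* denote the optimal action values under the original and the shaped reward; these symbols are introduced here for illustration.

```latex
\begin{align*}
\sum_{t=0}^{\infty} \gamma^{t} F(s_t, s_{t+1})
  &= \sum_{t=0}^{\infty} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr)
   = -\Phi(s_0), \\
Q'^{*}(s, a) &= Q^{*}(s, a) - \Phi(s), \\
\arg\max_{a} Q'^{*}(s, a) &= \arg\max_{a} Q^{*}(s, a).
\end{align*}
```

The shaping terms shift every return from a given start state by the same constant −Φ(s_0), so the ranking of actions in each state is unchanged.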

Shaping types

Potential-based (theory-safe): heuristic value Φ(s).
Curiosity / intrinsic motivation: ICM, RND (exploration bonus).
Demonstrations (LfD): shaped reward from expert similarity.
Curriculum: progressively harder targets.
RLHF reward model: dense reward learned from human preferences.
RLVR (verifiable): rule-based pass/fail (math, code); sparse but exact.
GRPO advantages (DeepSeek, 2024-25): group-relative normalization replaces the critic.

Applications

Sparse-reward locomotion / manipulation.
Game RL (StarCraft II, Atari hard-exploration).
RLHF for LLM alignment.
RLVR/GRPO for math/code (e.g. DeepSeek-R1; OpenAI's o1 is also RL-trained for reasoning, though its method is not published in detail).
Robotics imitation + RL hybrid.

💻 Patterns

Potential-Based Shaping (Ng 1999)

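def potential(state) -> float:
    """Heuristic potential Φ(s), e.g. negative distance to goal."""
    # goal_distance is an environment-specific helper assumed to exist.
    return -goal_distance(state)

def shaped_reward(r, s, s_next, gamma=0.99, done=False):
    # Φ must be 0 at terminal states, otherwise episodic invariance breaks.
    next_potential = 0.0 if done else potential(s_next)
    return r + gamma * next_potential - potential(s)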
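
As a sanity check on the invariance claim, the sketch below runs value iteration on a toy 6-state chain with and without shaping and compares the greedy policies. The chain MDP, its constants, and the helper names (`step`, `greedy_policy`) are illustrative assumptions, not part of the original note.

```python
import numpy as np

N_STATES, GOAL, GAMMA = 6, 5, 0.9   # toy chain: states 0..5, state 5 is terminal
ACTIONS = (-1, +1)                  # move left / move right

def step(s, a):
    """Deterministic transition; reward 1 only on reaching the goal."""
    s_next = min(max(s + a, 0), GOAL)
    return s_next, float(s_next == GOAL)

def potential(s):
    """Negative distance to goal; zero at the terminal state."""
    return -float(GOAL - s)

def greedy_policy(shaped, iters=200):
    """Value iteration on Q; returns greedy action indices and the Q table."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(iters):
        v = q.max(axis=1)
        v[GOAL] = 0.0  # terminal state has no future value
        for s in range(GOAL):
            for i, a in enumerate(ACTIONS):
                s_next, r = step(s, a)
                if shaped:
                    r += GAMMA * potential(s_next) - potential(s)
                q[s, i] = r + GAMMA * v[s_next]
    return q[:GOAL].argmax(axis=1), q

plain_pi, q_plain = greedy_policy(shaped=False)
shaped_pi, q_shaped = greedy_policy(shaped=True)
print(plain_pi, shaped_pi)
```

Both runs pick "move right" in every non-terminal state, and the shaped Q values differ from the plain ones only by −Φ(s) per state, which is the offset the theorem predicts.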

Curiosity-Driven (RND)

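import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: bonus = predictor error against a frozen random target."""

    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        # The target keeps its random initialization and is never trained.
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))

    def intrinsic(self, obs):
        # Per-sample MSE; its mean also serves as the predictor's training loss.
        return ((self.predictor(obs) - self.target(obs)) ** 2).mean(-1)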
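
The bonus only becomes useful once the predictor is trained on visited observations, so that familiar states score low and novel ones stay high. A minimal sketch of that loop follows, assuming small stand-in networks, synthetic Gaussian observations, and an Adam optimizer; none of these choices come from the original note.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(obs_dim, feat_dim):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))

obs_dim, feat_dim = 4, 16
target = mlp(obs_dim, feat_dim)      # fixed random network
predictor = mlp(obs_dim, feat_dim)   # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

def intrinsic(obs):
    """Per-observation prediction error, used as the exploration bonus."""
    return ((predictor(obs) - target(obs)) ** 2).mean(-1)

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)
familiar = torch.randn(256, obs_dim)        # stand-in for frequently visited states
novel = torch.randn(256, obs_dim) + 5.0     # stand-in for unvisited states

with torch.no_grad():
    bonus_before = intrinsic(familiar).mean().item()

for _ in range(500):
    loss = intrinsic(familiar).mean()  # predictor loss is the same error the bonus uses
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    bonus_familiar = intrinsic(familiar).mean().item()
    bonus_novel = intrinsic(novel).mean().item()
print(bonus_before, bonus_familiar, bonus_novel)
```

In practice the intrinsic reward is usually normalized (for example by a running standard deviation) before it is mixed with the environment reward; that step is omitted here for brevity.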

Curriculum Reward

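def curriculum_target(episode_idx, easy_target, hard_target, ramp_episodes=10000):
    """Linearly interpolate from easy_target to hard_target over ramp_episodes, then hold."""
    t = min(max(episode_idx / ramp_episodes, 0.0), 1.0)
    return easy_target + t * (hard_target - easy_target)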

RLHF Reward Model

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base="meta-llama/Meta-Llama-3-8B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attn):
        out = self.backbone(input_ids=input_ids, attention_mask=attn).last_hidden_state
        # Score the last non-padding token (assumes right padding), not position -1.
        last_idx = attn.sum(dim=1) - 1
        last = out[torch.arange(out.size(0), device=out.device), last_idx]
        return self.head(last).squeeze(-1)

# Bradley-Terry pairwise loss
def bt_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()
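
To build intuition for the Bradley-Terry loss, the sketch below evaluates it on three hand-picked reward pairs. The scalar rewards are made-up illustrative values, and `bt_loss` is redefined so the block runs on its own.

```python
import math
import torch
import torch.nn.functional as F

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response wins under Bradley-Terry."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

tied = bt_loss(torch.tensor([0.0]), torch.tensor([0.0])).item()      # no preference
correct = bt_loss(torch.tensor([2.0]), torch.tensor([-1.0])).item()  # margin +3
wrong = bt_loss(torch.tensor([-1.0]), torch.tensor([2.0])).item()    # margin -3
print(tied, correct, wrong)
```

A tied pair costs log 2 (about 0.693), a correctly ordered pair with a large margin costs close to zero, and a mis-ordered pair is penalized roughly linearly in the margin. Only the difference r_chosen − r_rejected matters, so the reward model's absolute scale is not pinned down by this loss.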

RLVR — Verifiable Rule Reward

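def rlvr_reward(generated: str, gold: str, task: str) -> float:
    # extract_answer, run_unit_tests, has_required_tags are task-specific helpers assumed to exist.
    if task == "math":
        return 1.0 if extract_answer(generated) == gold else 0.0
    elif task == "code":
        return float(run_unit_tests(generated))  # expected to return a bool or a pass rate
    elif task == "format":
        return 1.0 if has_required_tags(generated) else 0.0
    raise ValueError(f"unknown task: {task!r}")  # avoid silently returning None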
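
The math branch depends on how the final answer is pulled out of the generation. One possible version is sketched below, assuming answers are wrapped in a LaTeX `\boxed{...}` span and compared by exact string match. This `extract_answer` and the wrapper `math_reward` are hypothetical helpers written for illustration; nested braces and numeric equivalence (e.g. `0.5` vs `1/2`) are not handled.

```python
import re
from typing import Optional

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(generated: str) -> Optional[str]:
    """Return the content of the last boxed span, or None if there is none."""
    matches = BOXED.findall(generated)
    return matches[-1].strip() if matches else None

def math_reward(generated: str, gold: str) -> float:
    """1.0 on exact string match with the gold answer, else 0.0."""
    answer = extract_answer(generated)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0

print(math_reward(r"So the total is \boxed{42}.", "42"))
print(math_reward("The total is 42.", "42"))
```

Taking the last boxed span rather than the first means intermediate boxed results do not count as the final answer; an unboxed answer scores zero even when it is correct, which is a deliberate strictness trade-off in this sketch.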

GRPO Advantage (DeepSeek 2024)

import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # epsilon guards groups with identical rewards
    return (group_rewards - mean) / std

# Usage: sample G=8 outputs per prompt, compute rewards, normalize within group
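
A worked example makes the normalization concrete. The sketch below assumes one prompt with G=8 sampled completions and a binary verifier reward where 2 of the 8 are correct; the reward values are made up for illustration.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Standardize rewards within one prompt's group of samples."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std

# One prompt, 8 completions, 2 pass the verifier.
rewards = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
adv = grpo_advantages(rewards)

# A group where every completion gets the same reward carries no learning signal.
flat = grpo_advantages(np.ones(8))
print(adv.round(3), flat)
```

With mean 0.25 and standard deviation √0.1875, the two correct samples get an advantage of about +1.73 and the six incorrect ones about −0.58. The all-equal group yields zero advantages everywhere, so prompts that are always solved or never solved contribute no gradient.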

Combined Shaping

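import torch

def combined_reward(r_env, s, s_next, model, obs, gamma=0.99,
                    pot_w=1.0, cur_w=0.1):
    pot = gamma * potential(s_next) - potential(s)
    # The curiosity bonus is not potential-based, so invariance no longer holds; keep cur_w small.
    with torch.no_grad():
        cur = model.intrinsic(obs).item()  # obs must be a single observation
    return r_env + pot_w * pot + cur_w * cur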

Reward Hacking Detector

import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Flag likely reward hacking: shaped reward trends up while true return is flat or falling."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend <= 0)
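
The detector can be exercised on synthetic logs before it is trusted on real ones. The sketch below uses noise-free linear series as stand-ins for training logs (the slopes are arbitrary illustrative values) and a slope-based check in the spirit of the pattern above, redefined so the block runs on its own.

```python
import numpy as np

def detect_hacking(rewards, true_returns, window=100):
    """Proxy reward rising while the true return is not rising."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]
    return bool(rew_trend > 0.01 and ret_trend <= 0)

steps = np.arange(200)
proxy = 0.05 * steps                # shaped/proxy reward keeps climbing
hacked_true = 1.0 - 0.002 * steps   # true task return slowly degrades
healthy_true = 0.03 * steps         # true task return improves along with the proxy

flag_hacked = detect_hacking(proxy, hacked_true)
flag_healthy = detect_hacking(proxy, healthy_true)
flag_short = detect_hacking(proxy[:50], hacked_true[:50])
print(flag_hacked, flag_healthy, flag_short)
```

Real logs are noisy, so the fixed slope thresholds would likely need tuning to the reward scale, and a single linear fit over the window can miss hacking that starts late in the window.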

Decision criteria

Situation	Approach
Sparse reward, known heuristic	Potential-based shaping
Hard exploration	RND / ICM curiosity
Have expert demos	LfD-shaped reward + BC pretrain
LLM alignment, subjective	RLHF reward model
LLM math/code	RLVR (rule-based) + GRPO
Robotic manipulation	Combined: potential + curiosity + demo

Defaults: potential-based shaping as the primary choice; RLVR + GRPO for verifiable LLM tasks; an RLHF reward model for subjective tasks.

🔗 Graph

Parent: Reinforcement Learning · Reward Design
Variants: Potential-Based Shaping · Curiosity-Driven · GRPO · RLHF · RLVR
Applications: Sparse Reward · Hard Exploration · LLM Post-Training
Adjacent: Reward Prediction Error · Inverse RL · Imitation Learning

🤖 LLM usage

When to use: reward model training (RLHF), reward function code generation, reward hacking analysis from logs. When not to: reward functions proposed by an LLM are prone to reward hacking, so do not adopt them unverified; check them with controlled rollouts first.

❌ Anti-patterns

Non-potential bonus: an arbitrary +10 for reaching a sub-goal can distort the optimal policy.
Reward hacking ignored: cumulative reward climbs while the task fails, and nothing monitors the gap.
Over-shaping: dense bonuses overwhelm the sparse signal, so the agent ignores the task.
Static curriculum: still serving easy targets after the agent has surpassed them.
No baseline check: no ablation with vs. without shaping, so the actual gain is unknown.
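
One way to avoid the static-curriculum anti-pattern is to tie difficulty to measured performance instead of the episode count. The sketch below promotes to the next target once the recent success rate clears a threshold; the class name `AdaptiveCurriculum`, the 0.8 threshold, and the window size are hypothetical choices for illustration.

```python
from collections import deque

class AdaptiveCurriculum:
    """Raise difficulty when the recent success rate clears a threshold."""

    def __init__(self, levels, promote_at=0.8, window=50):
        self.levels = list(levels)
        self.idx = 0
        self.promote_at = promote_at
        self.results = deque(maxlen=window)

    @property
    def target(self):
        return self.levels[self.idx]

    def report(self, success: bool) -> None:
        """Record one episode outcome and promote if the window is full and good enough."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        if window_full and rate >= self.promote_at and self.idx < len(self.levels) - 1:
            self.idx += 1
            self.results.clear()  # re-measure success at the new difficulty

cur = AdaptiveCurriculum(levels=[1.0, 2.0, 4.0], window=10)
for _ in range(10):
    cur.report(True)
print(cur.target)
```

Clearing the window after each promotion forces the agent to demonstrate competence at the new level before moving on again; a demotion rule for collapsing success rates could be added the same way.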

🧪 Verification / duplicates

Verified (Ng/Harada/Russell 1999 ICML; DeepSeek-R1 paper 2025; Sutton & Barto Ch 17).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — potential-based + RND + RLHF + GRPO + RLVR
