"매 sparse reward → dense intermediate signal — without changing optimal policy.". Ng, Harada, Russell 1999 ("Policy Invariance Under Reward Transformations") 의 prove 의 매 potential-based shaping F(s,s') = γΦ(s') − Φ(s) 가 optimal policy 의 preserve, 매 modern RLHF/GRPO/RLVR 의 reward design 의 foundation 의.
Key points
Key theorem (Ng et al. 1999)
Shaped reward: r'(s, a, s') = r(s, a, s') + F(s, s').
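The reason the shaping term leaves the optimal policy untouched can be sketched with a telescoping sum. The derivation below assumes a discount γ < 1 with bounded Φ, or Φ = 0 at terminal states, so that the boundary term γ^T Φ(s_T) vanishes:

```latex
\begin{align}
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1})
  &= \sum_{t=0}^{T-1} \gamma^{t}\bigl(\gamma\Phi(s_{t+1}) - \Phi(s_t)\bigr)
   = \gamma^{T}\Phi(s_T) - \Phi(s_0), \\
Q'^{*}(s, a) &= Q^{*}(s, a) - \Phi(s), \\
\arg\max_{a} Q'^{*}(s, a) &= \arg\max_{a} Q^{*}(s, a).
\end{align}
```

The shaped return differs from the original only by a term that depends on the start state, not on the actions taken, so the ranking of actions in every state is unchanged.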
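```python
def potential(state) -> float:
    """Heuristic potential Φ(s), e.g. negative distance to the goal."""
    # goal_distance is assumed to be supplied by the environment.
    return -goal_distance(state)


def shaped_reward(r, s, s_next, gamma=0.99):
    """Return r + γΦ(s') − Φ(s), the potential-based shaped reward."""
    return r + gamma * potential(s_next) - potential(s)
```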
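To make the invariance claim concrete, here is a minimal sketch that runs tabular value iteration on a toy six-state chain, once with the sparse reward and once with the shaped reward. The chain, its dynamics, and names such as `step` and `q_values` are illustrative assumptions, not part of the original note.

```python
import numpy as np

N_STATES = 6            # states 0..5; state 5 is the terminal goal (assumed toy MDP)
GOAL = N_STATES - 1
ACTIONS = (-1, +1)      # move left or right
GAMMA = 0.9


def step(s, a):
    """Deterministic chain dynamics, clipped at the left wall."""
    return min(max(s + a, 0), GOAL)


def base_reward(s, a, s_next):
    """Sparse reward: +1 only on reaching the goal."""
    return 1.0 if s_next == GOAL else 0.0


def potential(s):
    """Negative distance to the goal; zero at the terminal state."""
    return -float(GOAL - s)


def shaped_reward(s, a, s_next):
    return base_reward(s, a, s_next) + GAMMA * potential(s_next) - potential(s)


def q_values(reward_fn, sweeps=200):
    """Tabular value iteration; the goal is absorbing with value 0."""
    q = np.zeros((N_STATES, len(ACTIONS)))
    for _ in range(sweeps):
        v = q.max(axis=1)
        v[GOAL] = 0.0
        for s in range(GOAL):
            for i, a in enumerate(ACTIONS):
                s_next = step(s, a)
                q[s, i] = reward_fn(s, a, s_next) + GAMMA * v[s_next]
    return q


q_plain = q_values(base_reward)
q_shaped = q_values(shaped_reward)
phi = np.array([potential(s) for s in range(N_STATES)])

# Greedy action index per non-terminal state under each reward.
policy_plain = q_plain[:GOAL].argmax(axis=1)
policy_shaped = q_shaped[:GOAL].argmax(axis=1)
print(policy_plain, policy_shaped)
```

In this toy setting the two greedy policies coincide, and the shaped Q-values equal the original ones shifted by −Φ(s), which is the relationship the theorem predicts.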
```python
import numpy as np


def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Group-relative normalization; no critic is needed."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8  # epsilon avoids division by zero
    return (group_rewards - mean) / std


# Usage: sample G=8 outputs per prompt, compute rewards, normalize within the group.
```
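As a usage sketch, the snippet below pairs group-relative normalization with a rule-based reward in the RLVR style. The verifier `verify_answer`, the sample strings, and the reference answer are all hypothetical; a real verifier would parse the model output far more carefully.

```python
import numpy as np


def verify_answer(output: str, reference: str) -> float:
    """Hypothetical rule-based verifier: 1.0 if the last token matches, else 0.0."""
    tokens = output.strip().split()
    return 1.0 if tokens and tokens[-1] == reference else 0.0


def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """Standalone copy of the group-relative normalization."""
    mean = group_rewards.mean()
    std = group_rewards.std() + 1e-8
    return (group_rewards - mean) / std


# G = 8 sampled outputs for one prompt (made-up strings for illustration).
samples = [
    "x = 12", "the answer is 12", "maybe 13", "12",
    "I get 21", "so 12", "11", "answer: 12",
]
rewards = np.array([verify_answer(s, "12") for s in samples])
adv = grpo_advantages(rewards)

# A group where every sample earns the same reward carries no learning signal.
flat_adv = grpo_advantages(np.ones(8))
print(adv.round(3), flat_adv)
```

Correct samples receive positive advantages and incorrect ones negative. When every sample in a group scores the same, all advantages collapse to zero, so prompts that are always solved or never solved contribute nothing to the update.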
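```python
import numpy as np


def detect_hacking(rewards, true_returns, window=100):
    """Flag hacking: reward trends up while the true return is flat or falling."""
    if len(rewards) < window or len(true_returns) < window:
        return False
    x = np.arange(window)
    rew_trend = np.polyfit(x, rewards[-window:], 1)[0]  # slope of the proxy reward
    ret_trend = np.polyfit(x, true_returns[-window:], 1)[0]  # slope of the true return
    return bool(rew_trend > 0.01 and ret_trend <= 0)
```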
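A quick way to sanity-check a trend-based detector is to feed it synthetic logs. The sketch below uses a standalone re-implementation of the slope comparison so it runs on its own. The series, noise level, and slope threshold are made-up assumptions, not measurements.

```python
import numpy as np


def slope(values, window):
    """Least-squares slope over the last `window` points."""
    y = np.asarray(values[-window:], dtype=float)
    return np.polyfit(np.arange(window), y, 1)[0]


def detect_hacking(rewards, true_returns, window=100, min_reward_slope=0.01):
    """Flag runs where the proxy reward rises but the true return does not."""
    if min(len(rewards), len(true_returns)) < window:
        return False
    rising = slope(rewards, window) > min_reward_slope
    not_improving = slope(true_returns, window) <= 0.0
    return bool(rising and not_improving)


rng = np.random.default_rng(0)
steps = np.arange(200)
proxy = 0.05 * steps + rng.normal(0, 0.1, 200)          # proxy reward keeps climbing
healthy = 0.03 * steps + rng.normal(0, 0.1, 200)        # true return climbs too
hacked = 5.0 - 0.02 * steps + rng.normal(0, 0.1, 200)   # true return decays

print(detect_hacking(proxy, healthy), detect_hacking(proxy, hacked))
```

A linear fit over a fixed window is a coarse signal. In practice the window length and thresholds would need tuning to the noise level of the logged returns.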
Decision criteria
| Situation | Approach |
| --- | --- |
| Sparse reward, known heuristic | Potential-based shaping |
| Hard exploration | RND / ICM curiosity |
| Have expert demos | LfD-shaped reward + BC pretrain |
| LLM alignment, subjective | RLHF reward model |
| LLM math/code | RLVR (rule-based) + GRPO |
| Robotic manipulation | Combined: potential + curiosity + demo |
Default: potential-based shaping as the primary choice; RLVR + GRPO for verifiable LLM tasks; RLHF for subjective tasks.
When to use an LLM: reward model training (RLHF), reward function code generation, reward hacking analysis from logs.
When not to: reward functions proposed by an LLM are prone to hacking, so do not adopt them unverified; check them with controlled rollouts first.
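One possible form of a controlled-rollout check is to score rollouts whose outcome is already known and ask how often the candidate reward ranks a success above a failure. The rollout format, the toy line-world with `GOAL = 5`, and both candidate reward functions below are hypothetical illustrations.

```python
import numpy as np

GOAL = 5  # assumed goal position in a toy 1-D world


def pairwise_agreement(candidate_reward, rollouts):
    """Fraction of (success, failure) rollout pairs where the success scores higher."""
    scores = np.array([sum(candidate_reward(s) for s in r["states"]) for r in rollouts])
    success = np.array([r["success"] for r in rollouts], dtype=bool)
    wins = scores[success][:, None] > scores[~success][None, :]
    return float(wins.mean())


# Hand-built rollouts with known outcomes.
rollouts = [
    {"states": [0, 1, 2, 3, 4, 5], "success": True},
    {"states": [0, 1, 2, 2, 3, 4, 5], "success": True},
    {"states": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1], "success": False},
    {"states": [0, 0, 0, 0, 0, 0, 0, 0], "success": False},
]


def distance_reward(s):
    """Candidate A: penalize distance to the goal at every step."""
    return -abs(GOAL - s)


def step_count_reward(s):
    """Candidate B: +1 per step, which rewards stalling rather than finishing."""
    return 1.0


agree_a = pairwise_agreement(distance_reward, rollouts)
agree_b = pairwise_agreement(step_count_reward, rollouts)
print(agree_a, agree_b)
```

A candidate whose agreement sits near or below 0.5 is ranking failures as well as or better than successes, which is a strong hint that optimizing it would invite hacking.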
❌ Anti-patterns
Non-potential bonus: an arbitrary +10 for reaching a sub-goal distorts the optimal policy.
Reward hacking ignored: cumulative reward rises while the task fails, and nothing monitors the gap.
Over-shaping: dense bonuses overwhelm the sparse signal, so the agent ignores the actual task.
Static curriculum: the agent has outgrown the curriculum but is still served easy targets.
No baseline check: no ablation of training with vs. without shaping, so the actual gain is unknown.
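The first anti-pattern can be illustrated with plain arithmetic on two hand-written trajectories: one that goes straight to the goal, and one that bounces in and out of a sub-goal to collect a repeatable bonus. The chain layout, the bonus size, and the potential are assumptions chosen for the illustration.

```python
GAMMA = 0.9
GOAL, SUBGOAL = 4, 2  # assumed positions on a 1-D chain


def base_reward(s, s_next):
    """Sparse task reward: +1 on reaching the goal."""
    return 1.0 if s_next == GOAL else 0.0


def naive_bonus(s, s_next):
    """Anti-pattern: +10 every time the sub-goal is entered."""
    return base_reward(s, s_next) + (10.0 if s_next == SUBGOAL else 0.0)


def potential(s):
    """Negative distance to the goal."""
    return -float(abs(GOAL - s))


def potential_shaped(s, s_next):
    return base_reward(s, s_next) + GAMMA * potential(s_next) - potential(s)


def discounted_return(states, reward_fn):
    """Discounted sum of rewards along a fixed state sequence."""
    pairs = zip(states, states[1:])
    return sum(GAMMA ** t * reward_fn(s, s_next) for t, (s, s_next) in enumerate(pairs))


direct = [0, 1, 2, 3, 4]              # solves the task
loop = [0, 1] + [2, 1] * 10 + [2]     # farms the sub-goal, never finishes

results = {
    name: (discounted_return(direct, fn), discounted_return(loop, fn))
    for name, fn in [("base", base_reward), ("naive", naive_bonus), ("potential", potential_shaped)]
}
for name, (d, l) in results.items():
    print(f"{name:9s} direct={d:.3f} loop={l:.3f}")
```

Under the naive bonus the looping trajectory earns more than solving the task, which is exactly the distortion the list warns about. Under the potential-based term the shaping contribution telescopes to a boundary value, so in this example the direct trajectory stays ahead.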