"Actual reward minus predicted reward: the driver of learning." Wolfram Schultz's 1997 dopamine experiments showed that the firing of monkey VTA neurons has the same signature as the TD error δ = r + γV(s') − V(s). This was a historic moment connecting neuroscience and RL, and it is the foundation of modern dopamine-based RL theory.
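To make the TD error δ = r + γV(s') − V(s) concrete, here is a minimal sketch of tabular TD(0) on a toy cue, delay, reward chain. The function name `run_trials`, the chain length, and the learning rate are illustrative assumptions, and holding the pre-cue baseline value at 0 is a modeling simplification rather than anything taken from the experiments.

```python
def run_trials(n_trials=500, n_steps=5, alpha=0.1, gamma=0.95, reward=1.0):
    """Tabular TD(0) on a cue -> delay -> reward chain (toy model).

    State 0 is the cue; the reward arrives on leaving the last state.
    The pre-cue baseline value is held at 0 because the cue's timing is
    assumed to be unpredictable.
    Returns the per-step RPEs of the first and the last trial.
    """
    V = [0.0] * n_steps
    first, last = None, None
    for trial in range(n_trials):
        # RPE at cue onset: transition from baseline (V = 0, r = 0) into the cue state.
        deltas = [gamma * V[0] - 0.0]
        for s in range(n_steps):
            r = reward if s == n_steps - 1 else 0.0
            v_next = V[s + 1] if s + 1 < n_steps else 0.0  # terminal value is 0
            delta = r + gamma * v_next - V[s]  # delta = r + gamma*V(s') - V(s)
            V[s] += alpha * delta
            deltas.append(delta)
        if trial == 0:
            first = deltas
        last = deltas
    return first, last


first, last = run_trials()
print("first trial RPEs:", [round(d, 3) for d in first])
print("last trial RPEs: ", [round(d, 3) for d in last])
```

On the first trial the only non-zero RPE is at the reward. After training, the RPE at the reward is close to zero and the largest RPE sits at cue onset, which is the shift in response timing that the TD account predicts.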
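```python
import torch


def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    # The bootstrap target uses a detached next-state value (semi-gradient TD).
    v_s, v_next = critic(s), critic(s_next).detach()
    rpe = r + gamma * v_next - v_s  # RPE = TD error, used as the advantage
    critic_loss = rpe.pow(2).mean()  # mean() keeps the loss scalar for batched input
    # actor(s) is assumed to return a torch.distributions object.
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * rpe.detach()).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return rpe.mean().item()
```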
Distributional RPE (C51-style)
```python
# Modern variant: instead of a scalar RPE, learn the full return distribution.
def distributional_td_target(r, p_next, support, gamma=0.99):
    """p_next: prob over atoms; support: atom values."""
    Tz = r + gamma * support  # shifted support
    return Tz, p_next  # project onto original support next
```
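The projection step mentioned in the comment above can be sketched as follows. This is a minimal pure-Python version of the categorical projection used by C51-style methods, assuming an evenly spaced support and ignoring terminal transitions; the name `project_onto_support` and the example numbers are illustrative, not part of the original code.

```python
import math


def project_onto_support(Tz, p_next, support):
    """Redistribute each shifted atom's mass onto the two nearest atoms
    of the fixed support (assumed sorted and evenly spaced)."""
    v_min, v_max = support[0], support[-1]
    dz = (v_max - v_min) / (len(support) - 1)
    m = [0.0] * len(support)
    for tz, p in zip(Tz, p_next):
        tz = min(max(tz, v_min), v_max)   # clip to the support range
        b = (tz - v_min) / dz             # fractional atom index
        lo = int(math.floor(b))
        hi = min(lo + 1, len(support) - 1)
        w_hi = b - lo                     # share of mass for the upper atom
        m[lo] += p * (1.0 - w_hi)
        m[hi] += p * w_hi
    return m


support = [0.0, 1.0, 2.0, 3.0, 4.0]
p_next = [0.2] * 5                              # uniform next-state distribution
r, gamma = 0.5, 0.5
Tz = [r + gamma * z for z in support]           # shifted atoms: 0.5, 1.0, ..., 2.5
target = project_onto_support(Tz, p_next, support)
print([round(x, 3) for x in target])            # approximately [0.1, 0.4, 0.4, 0.1, 0.0]
```

The projected target stays a valid probability vector on the original atoms and preserves the mean of the shifted distribution (1.5 in this example). The distributional analogue of the RPE is then a divergence between this target and the current prediction, typically a cross-entropy.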
RLHF reward model RPE
```python
def rlhf_advantage(rewards, values, gamma=1.0, lam=0.95):
    """GAE: generalized advantage estimation. Each step δ_t = RPE."""
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        # Value after the final step is treated as 0 (episode ends).
        v_next = values[t + 1] if t + 1 < len(values) else 0
        delta = rewards[t] + gamma * v_next - values[t]  # RPE
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
```
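As a sanity check on the GAE recursion, the two limits of λ can be worked through on a tiny episode: λ = 0 reduces to the one-step RPE, and λ = 1 (with γ = 1) reduces to return-to-go minus the value estimate. The sketch below re-implements the same recursion under the hypothetical name `gae` so that it runs on its own; the reward and value numbers are made up for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Same recursion as above: A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]  # one-step RPE
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]   # sparse terminal reward, e.g. from a reward model
values = [0.5, 0.6, 0.8]    # hypothetical critic outputs

td_only = gae(rewards, values, lam=0.0)      # approximately [0.1, 0.2, 0.2]
monte_carlo = gae(rewards, values, lam=1.0)  # approximately [0.5, 0.4, 0.2]
print(td_only, monte_carlo)
```

With λ = 1 each advantage equals the total remaining reward (1.0) minus the critic's estimate at that step, so intermediate values of λ trade the bias of the one-step estimate against the variance of the full return.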
Phasic vs Tonic Dopamine simulation
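```python
def simulate_dopamine(trial, cue_time, reward_time, predicted=True):
    """Phasic burst at the predictive cue (after learning) or at an
    unpredicted reward; no burst at a fully predicted reward.
    Dips for omitted rewards are not modeled here."""
    signal = []
    for t in range(trial):
        if t == cue_time and predicted:
            signal.append(+1.0)  # phasic burst at the cue
        elif t == reward_time and not predicted:
            signal.append(+1.0)  # phasic burst at the unexpected reward
        elif t == reward_time and predicted:
            signal.append(0.0)  # predicted reward: RPE is about 0
        else:
            signal.append(0.05)  # tonic baseline
    return signal
```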
Decision criteria
| Situation | Approach |
| --- | --- |
| Tabular, small state space | TD(0) / Q-learning |
| Continuous state, value-based | DQN (RPE = TD target − Q) |
| Policy + value | Actor-critic with RPE as advantage |
| Distribution matters | Distributional RL (C51, QR-DQN) |
| LLM RLHF | PPO with GAE (discounted sum of RPEs) |
Default: PPO + GAE, the modern actor-critic instantiation of RPE.
When to use: to understand the advantage computation in RLHF/DPO/GRPO, or to debug a reward model.
When not to use: when a conceptual account of RPE does not help with the LLM problem at hand. It is a computational analogy, not raw neural data.
❌ Anti-patterns
Confusing reward and RPE: r is not the RPE. RPE = r − prediction.
Assuming RPE is always positive: it is not. Negative RPE (reward omission) is critical for extinction learning.
Ignoring the discount: do not omit γ, or temporal credit assignment breaks.
Dopamine = pleasure: it is not. Dopamine does not signal reward itself; it signals a prediction error.
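The first two anti-patterns can be illustrated with a short acquisition-then-extinction run. This sketch uses the Rescorla-Wagner rule, which is the single-step special case of the TD error (no γ and no next-state value), so it is a simplification rather than the full TD model; the function name, trial counts, and learning rate are illustrative assumptions.

```python
def rescorla_wagner(rewards, alpha=0.2):
    """Return (rpes, values); each RPE is r - V measured before the update."""
    V, rpes, values = 0.0, [], []
    for r in rewards:
        rpe = r - V        # RPE = actual reward minus predicted reward
        V += alpha * rpe   # the prediction moves toward the outcome
        rpes.append(rpe)
        values.append(V)
    return rpes, values


schedule = [1.0] * 30 + [0.0] * 30   # 30 rewarded trials, then 30 omissions
rpes, values = rescorla_wagner(schedule)

print("RPE on first rewarded trial:", round(rpes[0], 3))
print("RPE on last rewarded trial: ", round(rpes[29], 3))
print("RPE on first omission:      ", round(rpes[30], 3))
print("prediction after extinction:", round(values[-1], 3))
```

The reward is 1.0 on both the first and the thirtieth trial, yet the RPE falls from 1.0 to nearly 0, so reward and RPE are different quantities. The first omission produces a strongly negative RPE, and those negative errors are what drive the prediction back toward zero during extinction.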