---
id: wiki-2026-0508-reward-prediction-error
title: Reward Prediction Error
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [RPE, TD Error, Dopamine Prediction Error]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [neuroscience, reinforcement-learning, dopamine, td-learning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/Gymnasium
---

# Reward Prediction Error

## One-Line Summary

> **"Actual reward minus predicted reward: the driver of learning."** Wolfram Schultz's recordings from midbrain dopamine neurons (VTA and substantia nigra) in monkeys, analyzed in Schultz, Dayan & Montague (1997), showed firing with the same signature as the TD error δ = r + γV(s') − V(s). This was a historic moment connecting neuroscience and RL, and it is the foundation of modern dopamine-based RL theory.

## Core Concepts

### RPE definition

- **δ_t = r_t + γ·V(s_{t+1}) − V(s_t)**: the temporal difference (TD) error.
- δ > 0: better than expected → strengthen the association.
- δ = 0: as predicted → no learning needed.
- δ < 0: worse than expected → weaken / extinguish the association.

### Dopamine signature (Schultz 1997)

1. Untrained: dopamine neurons fire at reward delivery.
2. After conditioning: they fire at the predictive cue, not at the reward.
3. Cue followed by no reward: firing dips below baseline at the expected reward time (negative RPE).
4. This pattern matches the TD error δ.

### Applications

1. Classical/operant conditioning models.
2. RL algorithms (TD, Q-learning, Actor-Critic).
3. Addiction theory (drugs hijack the RPE signal).
4. Interpreting reward-model signals in RLHF (LLM training).
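
### Sketch: RPE moving from reward to cue

The dopamine signature listed above can be reproduced with a few lines of tabular TD(0). The following is a minimal sketch under stated assumptions, not a model taken from the cited papers: the function name `run_trial`, the five time steps between cue and reward, `alpha=0.1`, `gamma=1.0`, and a pre-cue baseline state whose value is pinned at 0 (because the cue arrives unpredictably) are all illustrative choices.

```python
def run_trial(V, reward, alpha=0.1, gamma=1.0, learn=True):
    """Run one cue -> reward trial and return the RPE at every step.

    V[t] is the learned value of the t-th time step after cue onset.
    The pre-cue baseline state has its value pinned at 0 (toy assumption).
    """
    n = len(V)
    # RPE on the transition from the baseline state into the first cue state.
    deltas = [gamma * V[0] - 0.0]
    for t in range(n):
        last = t == n - 1
        r = reward if last else 0.0          # reward arrives only on the final step
        v_next = 0.0 if last else V[t + 1]   # the terminal state has value 0
        delta = r + gamma * v_next - V[t]    # RPE for this step
        if learn:
            V[t] += alpha * delta
        deltas.append(delta)
    return deltas


V = [0.0] * 5                                    # hypothetical: 5 steps from cue to reward
naive = run_trial(V, reward=1.0)                 # first exposure to the reward
for _ in range(2000):                            # conditioning trials
    run_trial(V, reward=1.0)
trained = run_trial(V, reward=1.0, learn=False)  # probe trial, reward delivered
omitted = run_trial(V, reward=0.0, learn=False)  # probe trial, reward withheld

print("naive   RPE at cue / reward:", naive[0], naive[-1])
print("trained RPE at cue / reward:", round(trained[0], 3), round(trained[-1], 3))
print("omitted RPE at reward time: ", round(omitted[-1], 3))
```

In each result list, index 0 is the RPE at cue onset and index -1 is the RPE at the reward time. Before learning the error sits at the reward; after conditioning it sits at the cue; withholding the reward after conditioning gives a negative error at the expected reward time.
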
## 💻 Patterns

### TD(0) Update

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    # RPE: observed reward plus discounted next-state value, minus the current estimate
    rpe = r + gamma * V[s_next] - V[s]
    V[s] += alpha * rpe  # move the estimate toward the TD target
    return V, rpe
```

### Q-Learning RPE

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q is a 2-D array indexed as Q[state, action]
    target = r + gamma * np.max(Q[s_next])  # greedy bootstrap target
    rpe = target - Q[s, a]                  # RPE for the action actually taken
    Q[s, a] += alpha * rpe
    return Q, rpe
```

### Actor-Critic with RPE as Advantage

```python
import torch

def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    # Assumes actor(s) returns a torch.distributions object and s_next is non-terminal.
    v_s = critic(s)
    v_next = critic(s_next).detach()   # no gradient through the bootstrap target
    rpe = r + gamma * v_next - v_s     # RPE, used as the advantage estimate
    critic_loss = rpe.pow(2).mean()    # reduce to a scalar before backward()
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * rpe.detach()).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return rpe.mean().item()
```

### Distributional RPE (C51-style)

```python
# Modern view: predict a distribution over returns instead of a scalar value.
def distributional_td_target(r, p_next, support, gamma=0.99):
    """p_next: probabilities over atoms; support: array of atom values."""
    Tz = r + gamma * support  # shifted and scaled support
    return Tz, p_next         # the caller still has to project onto the original support
```

### RLHF reward model RPE

```python
def rlhf_advantage(rewards, values, gamma=1.0, lam=0.95):
    """GAE (generalized advantage estimation). Each per-step delta is an RPE."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap with 0 when no value exists past the final step.
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]  # RPE at step t
        gae = delta + gamma * lam * gae                  # exponentially weighted sum of RPEs
        advantages[t] = gae
    return advantages
```

### Phasic vs Tonic Dopamine simulation

```python
def simulate_dopamine(trial, cue_time, reward_time, predicted=True, rewarded=True):
    """Toy firing trace: burst at a learned cue or an unpredicted reward,
    dip below baseline when a predicted reward is omitted."""
    baseline = 0.05  # tonic baseline
    signal = []
    for t in range(trial):
        if t == cue_time and predicted:
            signal.append(1.0)       # phasic burst at the predictive cue
        elif t == reward_time and rewarded and not predicted:
            signal.append(1.0)       # phasic burst at an unpredicted reward
        elif t == reward_time and predicted and not rewarded:
            signal.append(0.0)       # dip below baseline: negative RPE
        else:
            signal.append(baseline)  # no event, or a fully predicted reward
    return signal
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Tabular, small state space | TD(0) / Q-learning |
| Continuous state, value-based | DQN (RPE = TD target − Q) |
| Policy + value | Actor-Critic with RPE as advantage |
| Distribution matters | Distributional RL (C51, QR-DQN) |
| LLM RLHF | PPO with GAE (exponentially weighted sum of RPEs) |

**Default**: PPO + GAE, the modern actor-critic instantiation of RPE.

## 🔗 Graph

- Parent: [[Reinforcement Learning]]
- Variants: [[TD Learning]] · [[Distributional RL]]
- Applications: [[Actor-Critic]] · [[RLHF]]
- Adjacent: [[Dopamine]] · [[데이터 사이언스 및 ML 엔지니어링|Bellman Equation]]

## 🤖 LLM Usage

**When**: understanding advantage computation in RLHF/DPO/GRPO, and debugging reward models.

**When not**: an LLM's conceptual explanation of RPE is not a substitute for analysis of raw neural data.

## ❌ Anti-Patterns

- **Confusing reward and RPE**: r is not the RPE; RPE = r − prediction.
- **Assuming RPE is always positive**: negative RPE (reward omission) is critical for extinction learning.
- **Ignoring the discount**: do not omit γ; without it, temporal credit assignment breaks.
- **Dopamine = pleasure**: the phasic dopamine signal encodes prediction error, not reward or pleasure itself.

## 🧪 Verification / Duplicates

- Verified (Schultz, Dayan, Montague 1997 Science; Sutton & Barto Ch 6).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE neuroscience + RL bridge |