id: wiki-2026-0508-reward-prediction-error
title: Reward Prediction Error
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: RPE, TD Error, Dopamine Prediction Error
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: neuroscience, reinforcement-learning, dopamine, td-learning
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language Python; framework PyTorch/Gymnasium

Reward Prediction Error

In One Line

"Actual reward minus predicted reward: the driver of learning." Wolfram Schultz's 1997 dopamine experiments showed that the firing of monkey VTA neurons carries the same signature as the TD error δ = r + γV(s') - V(s). This was a historic moment connecting neuroscience and RL, and it is the foundation of modern dopamine RL theory.

Key Points

RPE definition

  • δ_t = r_t + γ·V(s_{t+1}) - V(s_t): the temporal difference error.
  • δ > 0: better than expected → strengthen the association.
  • δ = 0: as predicted → no learning needed.
  • δ < 0: worse than expected → weaken / extinguish the association.
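The three sign cases above can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `rpe` and the numeric values are illustrative assumptions, not part of the original note.

```python
def rpe(r, v_s, v_next, gamma=0.9):
    """TD error: delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

# Reward larger than the current prediction: positive RPE.
better = rpe(r=1.0, v_s=0.5, v_next=0.0)

# No immediate reward, but the next state is worth exactly what was predicted.
as_predicted = rpe(r=0.0, v_s=0.9, v_next=1.0)

# A predicted reward fails to arrive: negative RPE.
worse = rpe(r=0.0, v_s=0.5, v_next=0.0)
```

The middle case shows that a zero RPE does not require a reward at that step; the discounted value of the next state can account for the whole prediction.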

Dopamine signature (Schultz 1997)

  1. Untrained: dopamine neurons fire at reward delivery.
  2. After conditioning: they fire at the predictive cue, not at the reward.
  3. Cue followed by no reward: firing dips below baseline (negative RPE).
  4. This pattern matches the TD error δ exactly.
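The four observations above can be reproduced with a small TD(0) simulation. The sketch below is a toy model under stated assumptions, not a fit to the recorded data: time within a trial is the state, states before the cue are fixed at value 0 (the cue arrives unpredictably), and the names `run_trial`, `cue_time`, `reward_time`, and `n_steps` are hypothetical.

```python
import numpy as np

def run_trial(V, cue_time, reward_time, n_steps, reward=1.0, alpha=0.3, gamma=1.0):
    """Run one conditioning trial and return the TD error at every time step.

    V[k] is the learned value k steps after cue onset. Time steps before the
    cue and from reward delivery onward are treated as value 0.
    Assumes 1 <= cue_time < reward_time < n_steps.
    """
    def value(t):
        return V[t - cue_time] if cue_time <= t < reward_time else 0.0

    delta = np.zeros(n_steps)
    for t in range(1, n_steps):
        r = reward if t == reward_time else 0.0
        delta[t] = r + gamma * value(t) - value(t - 1)   # RPE on arriving at t
        if cue_time <= t - 1 < reward_time:              # only post-cue states learn
            V[t - 1 - cue_time] += alpha * delta[t]
    return delta

cue_time, reward_time, n_steps = 2, 6, 8
V = np.zeros(reward_time - cue_time)

first = run_trial(V, cue_time, reward_time, n_steps)                # untrained
for _ in range(200):
    late = run_trial(V, cue_time, reward_time, n_steps)             # after conditioning
omitted = run_trial(V, cue_time, reward_time, n_steps, reward=0.0)  # cue, no reward
```

In `first` the only nonzero error sits at `reward_time`; in `late` it has moved to `cue_time`; in `omitted` the error at `reward_time` is negative, which corresponds to the dip below baseline.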

Applications

  1. Classical/operant conditioning models.
  2. RL algorithms (TD, Q-learning, Actor-Critic).
  3. Addiction theory (drugs hijack RPE signal).
  4. RLHF reward model interpretation (LLM training).

💻 Patterns

TD(0) Update

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    rpe = r + gamma * V[s_next] - V[s]   # RPE (TD error)
    V[s] += alpha * rpe
    return V, rpe

Q-Learning RPE

import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])
    rpe = target - Q[s, a]
    Q[s, a] += alpha * rpe
    return Q, rpe

Actor-Critic with RPE as Advantage

import torch

def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    v_s, v_next = critic(s), critic(s_next).detach()
    rpe = r + gamma * v_next - v_s        # RPE, used as the advantage estimate

    critic_loss = rpe.pow(2)
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * rpe.detach())

    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return rpe.item()

Distributional RPE (C51-style)

# Modern variant: model the full return distribution instead of a scalar RPE.
def distributional_td_target(r, p_next, support, gamma=0.99):
    """p_next: prob over atoms; support: atom values."""
    Tz = r + gamma * support      # shifted support
    return Tz, p_next             # project onto original support next
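The function above stops before the projection step. A minimal NumPy sketch of a C51-style categorical projection follows, assuming evenly spaced atoms and a single (unbatched) distribution; the function name `project_onto_support` is hypothetical.

```python
import numpy as np

def project_onto_support(Tz, p_next, support):
    """Redistribute the mass of shifted atoms Tz onto the fixed atoms in support."""
    n_atoms = len(support)
    v_min, v_max = support[0], support[-1]
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Fractional index of each shifted atom on the original grid.
    b = (np.clip(Tz, v_min, v_max) - v_min) / delta_z
    b = np.clip(b, 0, n_atoms - 1)            # guard against float round-off
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    on_atom = lower == upper                  # shifted atom lands exactly on the grid

    m = np.zeros(n_atoms)
    # Split each atom's probability between its two neighbours by distance.
    np.add.at(m, lower, p_next * np.where(on_atom, 1.0, upper - b))
    np.add.at(m, upper, p_next * np.where(on_atom, 0.0, b - lower))
    return m

support = np.linspace(0.0, 10.0, 11)
p_next = np.zeros(11)
p_next[4] = 1.0                               # all mass on the atom with value 4.0
Tz = 0.5 + 0.5 * support                      # r = 0.5, gamma = 0.5
m = project_onto_support(Tz, p_next, support)
```

Here the atom at 4.0 is shifted to 2.5, so its mass is split evenly between the atoms at 2.0 and 3.0. The distributional analogue of the RPE is then a divergence between `m` and the current predicted distribution rather than a scalar difference.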

RLHF reward model RPE

def rlhf_advantage(rewards, values, gamma=1.0, lam=0.95):
    """GAE (generalized advantage estimation). Each per-step delta_t is an RPE."""
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0
        delta = rewards[t] + gamma * v_next - values[t]   # per-step RPE
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
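The role of `lam` is easiest to see at its two extremes. The sketch below restates the same recursion compactly so that it runs on its own (the name `gae` and the toy numbers are assumptions chosen for illustration): with `lam=0` the advantage is just the one-step RPE, and with `lam=1` it is the discounted sum of all future RPEs, which equals return-to-go minus the value estimate.

```python
import numpy as np

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Same recursion as above; the value after the last step is taken as 0."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]   # per-step RPE
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]     # sparse reward on the last step only
values = [0.5, 0.5, 0.5]      # critic predicts 0.5 everywhere

one_step = gae(rewards, values, gamma=1.0, lam=0.0)      # per-step RPE only
monte_carlo = gae(rewards, values, gamma=1.0, lam=1.0)   # return-to-go minus value
```

With `lam=0` only the final step sees a nonzero advantage; with `lam=1` the terminal surprise is credited to every earlier step. Intermediate values trade bias for variance between these two.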

Phasic vs Tonic Dopamine simulation

def simulate_dopamine(trial, cue_time, reward_time, predicted=True, omitted=False):
    """Phasic burst at the predictive cue (after learning); dip if a predicted reward is omitted."""
    baseline = 0.05   # tonic baseline
    signal = []
    for t in range(trial):
        if t == cue_time and predicted:
            signal.append(+1.0)       # phasic burst at the cue
        elif t == reward_time and not predicted:
            signal.append(+1.0)       # unpredicted reward: burst at delivery
        elif t == reward_time and omitted:
            signal.append(0.0)        # predicted reward omitted: dip below baseline
        else:
            signal.append(baseline)   # includes a delivered, fully predicted reward
    return signal

Decision Criteria

| Situation | Approach |
|---|---|
| Tabular, small state space | TD(0) / Q-learning |
| Continuous state, value-based | DQN (RPE = TD target - Q) |
| Policy + value | Actor-Critic with RPE as advantage |
| Distribution matters | Distributional RL (C51, QR-DQN) |
| LLM RLHF | PPO with GAE (per-step RPEs, discounted and summed) |

Default: PPO + GAE, the modern actor-critic instantiation of RPE.
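For the DQN row, the RPE is the gap between the bootstrapped target and the Q-value of the action actually taken. A minimal batched sketch in NumPy follows, assuming the Q-values have already been produced by an online network and a target network; the function name `dqn_td_error` and its argument names are hypothetical.

```python
import numpy as np

def dqn_td_error(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Per-sample RPE for a batch: (r + gamma * max_a' Q_target(s', a')) - Q(s, a).

    q_online:      (batch, n_actions) Q-values for the current states
    q_target_next: (batch, n_actions) target-network Q-values for the next states
    dones:         (batch,) 1.0 where the episode ended, else 0.0
    """
    q_sa = q_online[np.arange(len(actions)), actions]
    # No bootstrapping from terminal transitions.
    target = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    return target - q_sa

q_online = np.array([[1.0, 2.0], [3.0, 4.0]])
q_target_next = np.array([[0.0, 2.0], [5.0, 5.0]])
actions = np.array([1, 0])
rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])

td = dqn_td_error(q_online, q_target_next, actions, rewards, dones, gamma=0.5)
```

The first sample is exactly as predicted (RPE 0); the second ends the episode without reward, so the whole predicted value of 3.0 turns into a negative RPE.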

🤖 LLM Usage

When to use: understanding advantage computation in RLHF/DPO/GRPO, and debugging reward models. When not to use: an LLM helps with a conceptual explanation of RPE, but it is no substitute for raw neural data.

Anti-patterns

  • Confusing reward and RPE: r is not the RPE. RPE = r - prediction.
  • Assuming RPE is always positive: it is not. Negative RPE (reward omission) is critical for extinction learning.
  • Ignoring the discount: do not omit γ. Without it, temporal credit assignment breaks.
  • Dopamine = pleasure: incorrect. Dopamine is not a reward signal; it is a prediction error signal.
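The first anti-pattern can be made concrete with a constant reward. In this minimal sketch (the function name `rpe_over_trials` and the learning rate are illustrative assumptions), the reward is identical on every trial while the RPE shrinks as the prediction catches up.

```python
def rpe_over_trials(reward=1.0, alpha=0.5, n_trials=5):
    """Deliver the same reward every trial and record the RPE each time."""
    prediction, rpes = 0.0, []
    for _ in range(n_trials):
        rpe = reward - prediction     # RPE = r - prediction
        rpes.append(rpe)
        prediction += alpha * rpe     # the prediction moves toward the reward
    return rpes

rpes = rpe_over_trials()
```

The reward never changes, yet the error halves on each trial: what drives learning is the surprise, not the reward itself.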

🧪 Verification / Duplicates

  • Verified (Schultz, Dayan, Montague 1997 Science; Sutton & Barto Ch 6).
  • Trust level: A.

🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE neuroscience + RL bridge |