
Reward Prediction Error

One-line summary

"Actual reward minus predicted reward: the driver of learning." Wolfram Schultz's dopamine recordings, reported in 1997, showed that the firing of monkey VTA neurons has the same signature as the TD error δ = r + γV(s') − V(s). This was a historic point of contact between neuroscience and RL, and it is the foundation of modern dopamine RL theory.

Key points

RPE definition

δ_t = r_t + γ·V(s_{t+1}) − V(s_t) — temporal difference error.
δ > 0: better than expected → strengthen association.
δ = 0: as predicted → no learning needed.
δ < 0: worse than expected → weaken / extinguish the association.
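
The three sign cases can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name `rpe` and all numbers are hypothetical, chosen only to produce one case of each sign.

```python
def rpe(r, v_s, v_next, gamma=0.99):
    """TD error: delta = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

# Hypothetical values: the agent predicted 0.5 and the next state is terminal (V = 0).
better = rpe(r=1.0, v_s=0.5, v_next=0.0)        # reward exceeded the prediction
as_expected = rpe(r=0.5, v_s=0.5, v_next=0.0)   # reward matched the prediction
worse = rpe(r=0.0, v_s=0.5, v_next=0.0)         # predicted reward was omitted

print(better, as_expected, worse)
```

Only the first and last cases would change V(s) under a TD update; the middle case leaves it untouched.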

Dopamine signature (Schultz 1997)

Untrained: dopamine neurons fire at reward delivery.
After conditioning: they fire at the predictive cue, not at the reward.
Cue followed by no reward: firing dips below baseline (negative RPE).
This pattern matches the TD error δ.
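
The shift of the response from reward to cue falls out of TD learning itself. The sketch below is a toy model under stated assumptions, not a reproduction of Schultz's analysis: a hypothetical chain with state 0 before the cue, states 1..n after it, reward on the last step, and the pre-cue value pinned at 0 because the cue arrives unpredictably. The name `run_trial` and its parameters are illustrative.

```python
def run_trial(V, reward, alpha=0.2, gamma=1.0):
    """Run one trial over the chain and return the TD error at every transition.

    deltas[0] is the error at cue onset; deltas[-1] is the error at reward time.
    """
    n = len(V) - 1
    deltas = []
    for s in range(n + 1):
        v_next = V[s + 1] if s < n else 0.0   # value after the last state is 0
        r = reward if s == n else 0.0         # reward arrives on the last step only
        delta = r + gamma * v_next - V[s]
        if s > 0:                             # state 0 stays at 0: the cue is unpredictable
            V[s] += alpha * delta
        deltas.append(delta)
    return deltas

V = [0.0] * 4                                 # pre-cue state plus three post-cue states
first = run_trial(V, reward=1.0)              # untrained: error at reward time
for _ in range(500):
    run_trial(V, reward=1.0)                  # conditioning
trained = run_trial(V, reward=1.0)            # trained: error at cue onset
omitted = run_trial(V, reward=0.0)            # trained, reward withheld
```

Under these assumptions, `first` has its only non-zero error at reward time, `trained` has it at cue onset, and `omitted` ends with a negative error, mirroring the three recorded cases.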

Applications

Classical/operant conditioning models.
RL algorithms (TD, Q-learning, Actor-Critic).
Addiction theory (drugs hijack RPE signal).
RLHF reward model interpretation (LLM training).

💻 Patterns

TD(0) Update

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    rpe = r + gamma * V[s_next] - V[s]   # RPE (TD error); V[s_next] should be 0 for a terminal state
    V[s] += alpha * rpe
    return V, rpe

Q-Learning RPE

import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * np.max(Q[s_next])
    rpe = target - Q[s, a]
    Q[s, a] += alpha * rpe
    return Q, rpe

Actor-Critic with RPE as Advantage

import torch

def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
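    # Assumes actor(s) returns a torch.distributions object and critic(s) a one-element tensor.
    v_s, v_next = critic(s), critic(s_next).detach()   # no gradient through the bootstrap target
    rpe = r + gamma * v_next - v_s        # RPE, used as the advantage

    critic_loss = rpe.pow(2).sum()        # squared TD error, reduced to a scalar
    log_prob = actor(s).log_prob(a)
    actor_loss = -(log_prob * rpe.detach()).sum()   # policy gradient weighted by the RPE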

    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return rpe.item()

Distributional RPE (C51-style)

# Modern variant: learn a full return distribution instead of a scalar RPE.
def distributional_td_target(r, p_next, support, gamma=0.99):
    """p_next: prob over atoms; support: atom values."""
    Tz = r + gamma * support      # shifted support
    return Tz, p_next             # caller must still project Tz back onto the original support
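
The function above stops before the projection step. A minimal sketch of the categorical projection used in C51-style methods follows, assuming an evenly spaced, increasing `support` and a `p_next` that sums to 1. The name `project_onto_support` is hypothetical, and this version handles a single transition rather than a batch.

```python
import numpy as np

def project_onto_support(r, p_next, support, gamma=0.99):
    """Project the shifted atoms r + gamma * z back onto the fixed support."""
    support = np.asarray(support, dtype=float)
    p_next = np.asarray(p_next, dtype=float)
    v_min = support[0]
    dz = support[1] - support[0]
    # Fractional atom index of each shifted atom, clipped to the valid range.
    b = np.clip((r + gamma * support - v_min) / dz, 0, len(support) - 1)
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    target = np.zeros_like(p_next)
    # Split each atom's mass between its two neighbours, proportional to proximity.
    np.add.at(target, lower, p_next * (upper - b))
    np.add.at(target, upper, p_next * (b - lower))
    # If b lands exactly on an atom, both weights above are 0; give it the full mass.
    np.add.at(target, lower, p_next * (lower == upper))
    return target

# Hypothetical three-atom example.
target = project_onto_support(r=0.5, p_next=[0.2, 0.5, 0.3],
                              support=[0.0, 1.0, 2.0], gamma=1.0)
print(target, target.sum())
```

The projected `target` is what the predicted distribution at (s, a) would be trained toward, typically with a cross-entropy loss; that loss plays the role the squared scalar RPE plays in TD(0).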

RLHF reward model RPE

def rlhf_advantage(rewards, values, gamma=1.0, lam=0.95):
    """GAE — generalized advantage estimation. Each step δ_t = RPE."""
    advantages = []
    gae = 0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0
        delta = rewards[t] + gamma * v_next - values[t]   # one-step RPE
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages
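
The role of λ is easiest to see at its two extremes. The sketch below restates the same recursion as `rlhf_advantage` under the hypothetical name `gae` so that it runs on its own; the rewards and critic values are made up for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """GAE recursion; the value after the last step is taken as 0."""
    advantages, running = [], 0.0
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * v_next - values[t]   # one-step RPE
        running = delta + gamma * lam * running
        advantages.append(running)
    return advantages[::-1]

rewards = [0.0, 0.0, 1.0]   # hypothetical: reward only on the final step
values = [0.5, 0.6, 0.8]    # hypothetical critic outputs

one_step = gae(rewards, values, lam=0.0)      # each entry is the single-step RPE
monte_carlo = gae(rewards, values, lam=1.0)   # each entry is return-to-go minus V(s_t)
print(one_step, monte_carlo)
```

With λ = 0 the advantage is the one-step RPE (low variance, more bias from the critic); with λ = 1 and γ = 1 it is the remaining return minus the critic's estimate. Intermediate values such as 0.95 trade between the two.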

Phasic vs Tonic Dopamine simulation

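def simulate_dopamine(trial, cue_time, reward_time, predicted=True, rewarded=True):
    """Toy phasic signal: burst at a learned cue, burst at an unpredicted reward,
    baseline at a fully predicted reward, dip when a predicted reward is omitted."""
    baseline = 0.05   # tonic baseline
    signal = []
    for t in range(trial):
        if t == cue_time and predicted:
            signal.append(1.0)        # phasic burst at the predictive cue
        elif t == reward_time and rewarded and not predicted:
            signal.append(1.0)        # unexpected reward: positive RPE
        elif t == reward_time and predicted and not rewarded:
            signal.append(0.0)        # omitted reward: dip below baseline
        else:
            signal.append(baseline)   # includes a fully predicted reward (RPE = 0)
    return signal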

Decision criteria

Situation	Approach
Tabular small state space	TD(0) / Q-learning
Continuous state, value-based	DQN (RPE = TD target − Q)
Policy + value	Actor-Critic with RPE as advantage
Distribution matters	Distributional RL (C51, QR-DQN)
LLM RLHF	PPO with GAE (discounted sum of RPEs)

Default: PPO + GAE, the modern actor-critic instantiation of RPE.

🔗 Graph

Parent: Reinforcement Learning
Variants: TD Learning · Distributional RL
Applications: Actor-Critic · RLHF
Adjacent: Dopamine · Data Science and ML Engineering

🤖 LLM usage

When to use: understanding the advantage computation behind RLHF/DPO/GRPO-style training, and debugging reward models. When not to use: an LLM can help with a conceptual explanation of RPE, but not with analyzing raw neural data.

❌ Anti-patterns

Confusing reward and RPE: r is not the RPE. RPE = r − prediction.
Assuming RPE is always positive: it is not. Negative RPE (reward omission) is critical for extinction learning.
Ignoring the discount: do not drop γ. Without it, temporal credit assignment breaks.
Dopamine = pleasure: it is not. Phasic dopamine signals the prediction error, not the reward itself.
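
The point about negative RPE can be made concrete with a toy extinction run. This is a minimal sketch with hypothetical names and numbers: a cue whose learned value is 1.0 is presented repeatedly without reward, once with a normal learner and once with a deliberately broken learner that discards negative errors.

```python
def extinction(v_cue, trials, alpha=0.2, allow_negative=True):
    """Present the cue with no reward `trials` times and track V(cue)."""
    history = []
    for _ in range(trials):
        delta = 0.0 - v_cue               # r = 0 and the next state is terminal
        if not allow_negative:
            delta = max(delta, 0.0)       # broken learner: negative RPE is dropped
        v_cue += alpha * delta
        history.append(v_cue)
    return history

with_dips = extinction(v_cue=1.0, trials=30)
without_dips = extinction(v_cue=1.0, trials=30, allow_negative=False)
print(with_dips[-1], without_dips[-1])
```

In this toy setting the normal learner's value decays toward 0, while the learner without negative errors keeps predicting the reward indefinitely.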

🧪 Verification / duplicates

Verified (Schultz, Dayan, Montague 1997 Science; Sutton & Barto Ch 6).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — RPE neuroscience + RL bridge
