---
id: wiki-2026-0508-rl-neuroscience
title: RL Neuroscience
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reinforcement Learning Neuroscience, Computational Neuroscience of RL, Dopamine RPE]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [reinforcement-learning, neuroscience, dopamine, computational-neuroscience]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy
---

# RL Neuroscience

## One-line summary
> **"Dopamine = reward prediction error (RPE)."** Single-cell recordings (Schultz 1997) confirmed a brain analogue of TD learning. The basal ganglia implement an actor-critic; the prefrontal cortex supports model-based planning. As of 2026, distributional RL (Dabney 2020) has found support as a dopamine population code, and the bridge between deep RL and neuroscience remains active.

## Core

### Key findings
- **Dopamine = RPE** (Schultz, Dayan, Montague 1997): the firing of VTA / SNc dopamine neurons encodes the TD error δ = R + γV(s') − V(s).
- **Phasic vs tonic**: a phasic burst signals positive RPE and a dip signals negative RPE; tonic levels are linked to uncertainty / motivation.
- **Distributional dopamine** (Dabney/Kurth-Nelson 2020 Nature): different DA neurons encode different quantile-like statistics of the return distribution (expectile-like in the paper's model).
- **Basal ganglia as actor-critic**: striatum (D1 direct = go, D2 indirect = no-go) = actor, dopamine = critic signal.
- **Model-based control in PFC + hippocampus**: replay, planning, successor representation.

### Brain ↔ RL mapping
| Brain | RL concept |
|---|---|
| VTA / SNc dopamine | TD error δ |
| Striatum (D1/D2) | actor / policy |
| Ventral striatum | state value V(s) |
| OFC | expected outcome / Q(s,a) |
| dlPFC | working memory / model-based |
| Hippocampus | successor representation, replay |
| Anterior cingulate | exploration / volatility |
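
One way to read the mapping above is as an actor-critic loop in which a single dopamine-like δ trains both the value estimate and the policy. The following is a minimal sketch under stated assumptions: the two-armed bandit, the softmax actor, and all names (`actor_critic_bandit`, `p_reward`, `alpha_critic`, `alpha_actor`) are illustrative and not taken from any cited paper.

```python
import numpy as np

def actor_critic_bandit(p_reward=(0.2, 0.8), n_trials=2000,
                        alpha_critic=0.1, alpha_actor=0.1, seed=0):
    """Single-state actor-critic on a hypothetical two-armed bandit."""
    rng = np.random.default_rng(seed)
    V = 0.0                      # critic: state value (ventral-striatum-like)
    prefs = np.zeros(2)          # actor: action preferences (striatal policy)
    for _ in range(n_trials):
        # Softmax policy over the action preferences
        p = np.exp(prefs - prefs.max())
        p /= p.sum()
        a = rng.choice(2, p=p)
        r = float(rng.random() < p_reward[a])
        delta = r - V            # one-step TD error (dopamine-like signal)
        V += alpha_critic * delta
        # Policy-gradient step: the same delta teaches the actor
        grad = -p
        grad[a] += 1.0
        prefs += alpha_actor * delta * grad
    return V, prefs

V, prefs = actor_critic_bandit()
```

The design choice worth noting is that the actor never sees the reward directly; it only sees δ, which is the sense in which dopamine acts as the critic signal in the table.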

### Model-free vs model-based
- **Model-free** (habit, dorsolateral striatum): TD, slow, cached.
- **Model-based** (goal-directed, dorsomedial striatum + PFC): plan, fast adapt, costly.
- **Arbitrator** (Daw 2005): uncertainty-weighted blend; habits come to dominate after extensive training.
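
The arbitration idea above can be sketched with inverse-variance weighting. This is a simplification, not the Bayesian scheme of Daw 2005: the function names and the uncertainty schedules (model-free variance shrinking as 1/n, model-based variance fixed at 0.05) are assumptions chosen only to show the qualitative shift toward habit.

```python
import numpy as np

def arbitration_weight(var_mf, var_mb):
    """Model-based weight under inverse-variance (precision) weighting."""
    return var_mf / (var_mf + var_mb)

def blended_value(q_mf, q_mb, var_mf, var_mb):
    """Blend the two controllers' action values by relative reliability."""
    w_mb = arbitration_weight(var_mf, var_mb)
    return w_mb * q_mb + (1.0 - w_mb) * q_mf

# Assumed uncertainty schedules (illustrative, not fitted to data):
# model-free uncertainty shrinks with experience, model-based stays constant
n_trials = np.array([1, 10, 100, 1000])
var_mf = 1.0 / n_trials
var_mb = np.full(n_trials.shape, 0.05)
w_mb = arbitration_weight(var_mf, var_mb)   # falls as training accumulates
```

Under these assumptions the model-based weight decreases monotonically with training, which matches the claim that cached habits take over once they become reliable.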

### Applications
1. Computational psychiatry (addiction, depression, OCD as RL dysfunction).
2. Drug action modeling (cocaine, SSRI, ketamine).
3. Brain-inspired RL (distributional, hierarchical, replay).
4. Neural prosthetics (BCI with RL decoding).

## 💻 Patterns

### TD-learning dopamine simulation
```python
import numpy as np

def td_trial(V, rewards, cs_t, gamma=0.9, alpha=0.1):
    """Run one trial of tabular TD(0); update V in place and return the RPEs."""
    rpes = np.zeros(len(rewards))
    for t in range(len(rewards) - 1):
        # RPE on arriving at step t+1: the dopamine-like signal
        rpe = rewards[t + 1] + gamma * V[t + 1] - V[t]
        if t >= cs_t:            # pre-CS values stay 0: CS onset is unpredictable
            V[t] += alpha * rpe
        rpes[t + 1] = rpe
    return rpes

# Cue-reward conditioning in the style of Schultz 1997
n_steps, cs_t, reward_t = 10, 3, 7
rewards = np.zeros(n_steps)
rewards[reward_t] = 1.0          # only the reward is rewarding; the CS is just a state
V = np.zeros(n_steps)            # V persists across trials, so learning accumulates
trials = np.array([td_trial(V, rewards, cs_t) for _ in range(100)])
# Early trials: positive RPE at the reward (t=7)
# Late trials: the RPE shifts to CS onset (t=3), i.e. prediction-error transfer
```
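
The negative-RPE dip from the phasic/tonic finding can be probed the same way: train on cue-reward pairings, then withhold the reward. This is a self-contained sketch; `run_trial`, the `learn` flag, and the choice to hold pre-CS values at zero (CS onset treated as unpredictable) are assumptions made for illustration.

```python
import numpy as np

def run_trial(V, rewards, cs_t, gamma=0.9, alpha=0.1, learn=True):
    """One TD(0) pass over a trial; returns the RPE on arrival at each step."""
    rpes = np.zeros(len(rewards))
    for t in range(len(rewards) - 1):
        rpe = rewards[t + 1] + gamma * V[t + 1] - V[t]
        if learn and t >= cs_t:  # pre-CS values stay at 0 (CS onset unpredictable)
            V[t] += alpha * rpe
        rpes[t + 1] = rpe
    return rpes

n_steps, cs_t, reward_t = 10, 3, 7
rewarded = np.zeros(n_steps)
rewarded[reward_t] = 1.0
V = np.zeros(n_steps)
for _ in range(300):             # train until the reward is well predicted
    run_trial(V, rewarded, cs_t)

omitted = np.zeros(n_steps)      # probe trial: CS appears, reward is withheld
probe = run_trial(V, omitted, cs_t, learn=False)
# probe[cs_t] is positive (cue burst); probe[reward_t] is negative (omission dip)
```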

### Distributional TD (Dabney 2020, neural evidence)
```python
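import numpy as np

# Each "DA neuron" i has its own asymmetry tau_i in (0, 1) that scales its RPEs
def quantile_td(returns, taus, lr=0.05):
    Q = np.zeros(len(taus))   # one float estimate per neuron
    for r in returns:
        for i, tau in enumerate(taus):
            delta = r - Q[i]
            # Asymmetric scaling: positive RPEs weighted by tau, negative by (1 - tau).
            # Because the step is proportional to delta, the fixed points are
            # expectiles; stepping on np.sign(delta) instead gives true quantiles.
            Q[i] += lr * (tau if delta > 0 else (1 - tau)) * delta
    return Q   # population code for the return distribution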
```
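
To make the difference between delta-proportional and sign-based asymmetric updates concrete, the sketch below runs both on a hypothetical bimodal reward distribution. The function `asymmetric_td`, the `use_sign` switch, and the two-point distribution (rewards of 1 or 9, equally likely) are assumptions for illustration only.

```python
import numpy as np

def asymmetric_td(returns, taus, lr=0.02, use_sign=False):
    """Population of value predictors with asymmetric learning rates."""
    taus = np.asarray(taus, dtype=float)
    est = np.zeros(len(taus))
    for r in returns:
        delta = r - est
        weight = np.where(delta > 0, taus, 1.0 - taus)
        # use_sign=True: quantile-regression step; False: expectile-like step
        step = np.sign(delta) if use_sign else delta
        est += lr * weight * step
    return est

rng = np.random.default_rng(0)
# Hypothetical bimodal rewards: small (1.0) or large (9.0), equally likely
returns = rng.choice([1.0, 9.0], size=20000)
taus = np.array([0.1, 0.3, 0.7, 0.9])
expectiles = asymmetric_td(returns, taus)
quantiles = asymmetric_td(returns, taus, use_sign=True)
```

With this distribution the expectile-like estimates spread out between the two reward sizes, while the sign-based estimates settle near one mode or the other, so the choice of update rule changes what the population appears to encode.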

### Successor representation
```python
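import numpy as np

def successor_repr(transitions, n_states, gamma=0.9, alpha=0.1):
    """TD-learn the successor representation from (s, s') index pairs."""
    M = np.zeros((n_states, n_states))   # sized by states, not by number of transitions
    I = np.eye(n_states)
    for s, sp in transitions:
        # SR TD update: one-hot occupancy of s plus discounted SR of the successor
        M[s] += alpha * (I[s] + gamma * M[sp] - M[s])
    return M   # hippocampal SR (Stachenfeld 2017)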
```
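
A useful property of the SR is that state values follow from it by a single matrix-vector product, V = M r, so a change in reward does not require relearning the transitions. The sketch below checks a TD-learned SR against the closed form (I − γP)⁻¹; the ring-shaped random walk and the reward vector are hypothetical.

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.05
rng = np.random.default_rng(0)

# Hypothetical environment: unbiased random walk on a ring of 5 states
P = np.zeros((n_states, n_states))
for s in range(n_states):
    P[s, (s - 1) % n_states] = 0.5
    P[s, (s + 1) % n_states] = 0.5

# TD-learn the SR from sampled transitions
M = np.zeros((n_states, n_states))
I = np.eye(n_states)
s = 0
for _ in range(20000):
    sp = rng.choice(n_states, p=P[s])
    M[s] += alpha * (I[s] + gamma * M[sp] - M[s])
    s = sp

M_exact = np.linalg.inv(I - gamma * P)    # closed-form SR when P is known
r = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # reward only in state 2
V_sr = M @ r                              # values from the learned SR
V_exact = M_exact @ r
```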

### Two-step task (Daw 2011 model-based vs model-free)
```python
# Stage 1: choice A leads to S2_left with p=0.7 (common) and S2_right with p=0.3 (rare)
# Stage 2: reward probability varies over time
# Model-free: stay if rewarded, regardless of transition type
# Model-based: stay if rewarded after a common transition, or unrewarded after a rare one
def two_step_choice(prev_choice, prev_reward, prev_common, w_mb=0.5):
    # w_mb is the model-based weight in [0, 1]
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if bool(prev_reward) == bool(prev_common) else -1
    score = (1 - w_mb) * mf_pref + w_mb * mb_pref
    # Ties (score == 0, e.g. w_mb = 0.5 with conflicting preferences) resolve to switching
    return prev_choice if score > 0 else 1 - prev_choice
```
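
The behavioral signature of the two-step task is usually read from stay probabilities across the four reward × transition conditions. The sketch below turns the weighted preference into a probability with a logistic function; `stay_probability` and the inverse temperature `beta` are assumptions for illustration, not the regression model used in Daw 2011.

```python
import numpy as np

def stay_probability(rewarded, common, w_mb, beta=2.0):
    """P(repeat first-stage choice) from a weighted MF/MB preference (sketch)."""
    mf_pref = 1.0 if rewarded else -1.0
    mb_pref = 1.0 if rewarded == common else -1.0
    score = (1.0 - w_mb) * mf_pref + w_mb * mb_pref
    return 1.0 / (1.0 + np.exp(-beta * score))

# (rewarded, common) for the four trial types
conditions = [(True, True), (True, False), (False, True), (False, False)]
for w_mb in (0.0, 0.5, 1.0):
    probs = [stay_probability(r, c, w_mb) for r, c in conditions]
    print(f"w_mb={w_mb:.1f}", np.round(probs, 2))
```

A pure model-free agent (w_mb = 0) shows only a main effect of reward, while a pure model-based agent (w_mb = 1) shows a reward × transition interaction; intermediate weights mix the two patterns.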

### Volatility-weighted learning rate (Behrens 2007)
```python
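import numpy as np

# The ACC is proposed to track volatility; higher volatility calls for a higher learning rate.
# Crude proxy: rolling variance of recent RPEs (Behrens 2007 uses a Bayesian learner instead).
def volatility_lr(rpes, base_lr=0.05, window=10):
    recent = np.asarray(rpes, dtype=float)[-window:]
    vol = np.var(recent) if recent.size else 0.0   # no history yet: fall back to base_lr
    return min(base_lr * (1 + vol), 1.0)           # keep the learning rate at or below 1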
```

### TD addiction model (Redish 2004)
```python
import numpy as np

# Cocaine sets a floor on the RPE: the drug-induced RPE cannot be predicted away
def cocaine_td(rewards, drug_mask, gamma=0.9, alpha=0.1, drug_floor=1.0):
    V = np.zeros_like(rewards, dtype=float)
    for t in range(len(rewards) - 1):
        delta = rewards[t] + gamma * V[t+1] - V[t]
        if drug_mask[t]:
            delta = max(delta, drug_floor)   # always-positive RPE, so V[t] keeps growing (compulsion)
        V[t] += alpha * delta
    return V
```
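
The consequence of an RPE floor is easiest to see over repeated trials: the value of a natural reward saturates, while the value of a drug outcome keeps climbing. This is a minimal one-step sketch; `learn_cue_value`, the floor of 0.5, and the equal nominal reward size are assumptions for illustration.

```python
import numpy as np

def learn_cue_value(reward, is_drug, n_trials, alpha=0.1, drug_floor=0.5):
    """Value of a cue followed by a terminal outcome, tracked per trial."""
    V, history = 0.0, []
    for _ in range(n_trials):
        delta = reward - V                  # one-step RPE at outcome delivery
        if is_drug:
            delta = max(delta, drug_floor)  # pharmacological RPE is never predicted away
        V += alpha * delta
        history.append(V)
    return np.array(history)

natural = learn_cue_value(reward=1.0, is_drug=False, n_trials=500)
drug = learn_cue_value(reward=1.0, is_drug=True, n_trials=500)
# natural converges to the reward size; drug grows by alpha * drug_floor per trial
```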

## Decision criteria
| Situation | Approach |
|---|---|
| Modeling phasic DA | classic TD with γ ≈ 0.9 |
| Modeling DA population variance | distributional TD with quantiles |
| Modeling habits vs goals | hybrid MF + MB with arbitrator |
| Modeling replay | SR + offline updates |
| Computational psychiatry | param fit per subject (hBayesDM, JAGS) |
| Drug / lesion effect | parameter perturbation (lower α, biased ε) |

**Defaults**: start with single-RPE TD. Use distributional TD for modern population-level DA fits. Reach for SR or an MB-MF arbitrator when prefrontal-hippocampal richness is needed.
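
For the per-subject fitting row, the core step is a likelihood over choices. The sketch below recovers a learning rate and inverse temperature by grid search on simulated data, using plain numpy rather than hBayesDM or JAGS; the bandit task, the function names, and the grid ranges are all assumptions for illustration.

```python
import numpy as np

def simulate_bandit(alpha, beta, p_reward=(0.3, 0.7), n_trials=300, seed=0):
    """Simulate a softmax Q-learner on a hypothetical two-armed bandit."""
    rng = np.random.default_rng(seed)
    Q = np.zeros(2)
    choices, rewards = [], []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (Q[1] - Q[0])))
        a = int(rng.random() < p1)
        r = float(rng.random() < p_reward[a])
        Q[a] += alpha * (r - Q[a])
        choices.append(a)
        rewards.append(r)
    return np.array(choices), np.array(rewards)

def neg_log_lik(alpha, beta, choices, rewards):
    """Negative log-likelihood of the observed choices under the same model."""
    Q = np.zeros(2)
    nll = 0.0
    for a, r in zip(choices, rewards):
        p1 = 1.0 / (1.0 + np.exp(-beta * (Q[1] - Q[0])))
        p_choice = p1 if a == 1 else 1.0 - p1
        nll -= np.log(p_choice + 1e-12)   # small constant guards against log(0)
        Q[a] += alpha * (r - Q[a])
    return nll

choices, rewards = simulate_bandit(alpha=0.2, beta=4.0)
alphas = np.linspace(0.1, 0.9, 9)
betas = np.linspace(1.0, 10.0, 10)
grid = np.array([[neg_log_lik(a, b, choices, rewards) for b in betas]
                 for a in alphas])
i, j = np.unravel_index(np.argmin(grid), grid.shape)
alpha_hat, beta_hat = alphas[i], betas[j]   # maximum-likelihood grid estimates
```

Grid search is used here only because it is transparent; hierarchical Bayesian tools pool information across subjects and are the better choice for real datasets.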

## 🔗 Graph
- Parent: [[Reinforcement-Learning]] · [[Computational-Neuroscience-RL|Computational-Neuroscience]]
- Variants: [[Distributional-RL]]
- Adjacent: [[Dopamine]] · [[Basal-Ganglia]] · [[Bayesian-Brain]]

## 🤖 LLM use
**When**: literature digests (Schultz, Dayan, Niv, Daw papers), scaffolding TD / SR simulations, hypothesis generation for fitting tasks.
**When not**: empirical claims about specific brain areas; verify these against primary sources. LLMs occasionally mix up model-based and model-free terminology.

## ❌ Anti-patterns
- **DA = reward**: wrong. DA encodes RPE; only unpredicted rewards evoke a burst.
- **Single-RPE for all DA**: the distributional account is the newer view.
- **Equating the brain with deep RL**: deep nets are brain-inspired, not identical. The brain is sample-efficient, cortical, and multi-system.
- **Ignoring tonic DA**: motivation / vigor are separate from phasic RPE.
- **Behaviorism only**: ignoring neural data; the path from brain to behavior is multi-level.

## 🧪 Verification / duplicates
- Verified (Schultz 1997, Sutton & Barto 2018 ch 15, Dabney 2020 Nature, Daw 2011, Niv 2009 review, Stachenfeld 2017 SR).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — TD/distributional/SR/two-step patterns + brain-RL mapping |