---
id: wiki-2026-0508-rl-neuroscience
title: RL Neuroscience
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reinforcement Learning Neuroscience, Computational Neuroscience of RL, Dopamine RPE]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [reinforcement-learning, neuroscience, dopamine, computational-neuroscience]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: numpy
---

# RL Neuroscience

## One-line summary

> **"Dopamine = reward prediction error (RPE)."** The single-cell recordings analyzed in Schultz 1997 showed that dopamine firing behaves like the error signal of TD-learning. The basal ganglia are modeled as an actor-critic, and the prefrontal cortex supports model-based planning. As of 2026, distributional RL (Dabney 2020) is supported as a description of the dopamine population code, and deep RL and neuroscience remain actively connected.

## Core

### Key findings

- **Dopamine = RPE** (Schultz, Dayan, Montague 1997): the firing of VTA / SNc dopamine neurons encodes the TD error R + γV(s') − V(s).
- **Phasic vs tonic**: a phasic burst signals a positive RPE and a dip signals a negative RPE; tonic levels are linked to uncertainty and motivation.
- **Distributional dopamine** (Dabney/Kurth-Nelson 2020 Nature): different DA neurons encode different levels of the return distribution, from pessimistic to optimistic (quantile-like codes).
- **Basal ganglia as actor-critic**: the striatum (D1 direct pathway = go, D2 indirect pathway = no-go) acts as the actor, and dopamine carries the critic signal.
- **PFC + hippocampus as model-based systems**: replay, planning, successor representation.

### Brain ↔ RL mapping

| Brain | RL concept |
|---|---|
| VTA / SNc dopamine | TD error δ |
| Striatum (D1/D2) | actor / policy |
| Ventral striatum | state value V(s) |
| OFC | expected outcome / Q(s,a) |
| dlPFC | working memory / model-based |
| Hippocampus | successor representation, replay |
| Anterior cingulate | exploration / volatility |

### Model-free vs model-based

- **Model-free** (habit, dorsolateral striatum): TD learning over cached values; cheap to evaluate but slow to adapt.
- **Model-based** (goal-directed, dorsomedial striatum + PFC): planning over a world model; adapts quickly but is computationally costly.
- **Arbitrator** (Daw 2005): uncertainty-weighted blend of the two systems; habits dominate once behavior is extensively trained.

### Applications

1. Computational psychiatry (addiction, depression, OCD as RL dysfunction).
2. Drug action modeling (cocaine, SSRI, ketamine).
3. Brain-inspired RL (distributional, hierarchical, replay).
4. Neural prosthetics (BCI with RL decoding).

## 💻 Patterns

### TD-learning dopamine sim

```python
import numpy as np

def td_value(rewards, V=None, cs_time=0, gamma=0.9, alpha=0.1):
    """One trial of TD(0). V persists across trials when passed back in."""
    if V is None:
        V = np.zeros(len(rewards))
    rpes = np.zeros(len(rewards))
    for t in range(1, len(rewards)):
        # RPE on arriving at step t: the dopamine-like signal
        rpe = rewards[t] + gamma * V[t] - V[t - 1]
        # CS onset is unpredictable, so pre-CS steps keep V = 0;
        # only values from the CS onward are learned.
        if t - 1 >= cs_time:
            V[t - 1] += alpha * rpe
        rpes[t] = rpe
    return V, rpes

# Cue-reward conditioning in the style of Schultz 1997
V = None
trials = []
for trial in range(100):
    seq = np.zeros(10)
    seq[7] = 1.0  # reward at t=7; the CS at t=3 is a cue, not a reward
    V, rpes = td_value(seq, V, cs_time=3)
    trials.append(rpes)
# Early trials: phasic burst at reward (t=7)
# Late trials: burst shifts to the CS (t=3), i.e. prediction-error transfer
```

### Distributional TD (Dabney 2020, neural)

```python
import numpy as np

# Each "DA neuron" has its own level tau_i in (0, 1) and asymmetric scaling
def quantile_td(returns, taus, lr=0.05):
    Q = np.zeros(len(taus))
    for r in returns:
        for i, tau in enumerate(taus):
            delta = r - Q[i]
            # Asymmetric: positive RPEs are weighted by tau, negative by (1 - tau).
            # Scaling delta itself (not only its sign) makes Q[i] settle at the
            # tau-expectile; a strict quantile update would use sign(delta).
            Q[i] += lr * (tau if delta > 0 else (1 - tau)) * delta
    return Q  # distribution-encoding population
```

### Successor representation

```python
import numpy as np

def successor_repr(transitions, gamma=0.9, lr=0.1):
    """TD-learn the SR matrix M from observed (s, s_next) index pairs."""
    transitions = np.asarray(transitions, dtype=int)
    n = int(transitions.max()) + 1  # number of states, inferred from the largest index
    M = np.zeros((n, n))
    I = np.eye(n)
    for s, sp in transitions:
        M[s] += lr * (I[s] + gamma * M[sp] - M[s])
    return M  # hippocampal SR (Stachenfeld 2017)
```

### Two-step task (Daw 2011 model-based vs model-free)

```python
# Stage 1: A -> S2_left with p = 0.7 (common), S2_right with p = 0.3 (rare)
# Stage 2: reward probability varies
# Model-free: stay if rewarded, regardless of transition
# Model-based: stay if rewarded AND the transition was common
#              (or unrewarded AND the transition was rare)
def two_step_choice(prev_choice, prev_reward, prev_common, w_mb=0.5):
    # w_mb is the model-based weight
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if bool(prev_reward) == bool(prev_common) else -1
    score = (1 - w_mb) * mf_pref + w_mb * mb_pref
    # A tie (score == 0) falls through to switching
    return prev_choice if score > 0 else 1 - prev_choice
```

### Volatility-weighted learning rate (Behrens 2007)

```python
import numpy as np

# The ACC tracks volatility: high volatility -> high learning rate.
# Rolling RPE variance is a crude proxy here, not a Bayesian volatility estimate.
def volatility_lr(rpes, base_lr=0.05):
    vol = np.var(rpes[-10:])  # rolling variance over the last 10 RPEs
    return base_lr * (1 + vol)
```

### TD addiction model (Redish 2004)

```python
import numpy as np

# Cocaine imposes an RPE floor: the drug-induced RPE cannot be predicted away
def cocaine_td(rewards, drug_mask, gamma=0.9, alpha=0.1, drug_floor=1.0):
    V = np.zeros_like(rewards, dtype=float)
    for t in range(len(rewards) - 1):
        delta = rewards[t] + gamma * V[t + 1] - V[t]
        if drug_mask[t]:
            delta = max(delta, drug_floor)  # always-positive RPE -> compulsion
        V[t] += alpha * delta
    return V
```

## Decision criteria

| Situation | Approach |
|---|---|
| Modeling phasic DA | classic TD with γ ≈ 0.9 |
| Modeling DA population variance | distributional TD with quantiles |
| Modeling habits vs goals | hybrid MF + MB with arbitrator |
| Modeling replay | SR + offline updates |
| Computational psychiatry | param fit per subject (hBayesDM, JAGS) |
| Drug / lesion effect | parameter perturbation (lower α, biased ε) |

**Default**: single-RPE TD as the starting model. Distributional TD for modern population-level DA fits. SR or an MB-MF arbitrator when prefrontal-hippocampal richness is needed.

## 🔗 Graph

- Parent: [[Reinforcement-Learning]] · [[Computational-Neuroscience-RL|Computational-Neuroscience]]
- Variants: [[Distributional-RL]]
- Adjacent: [[Dopamine]] · [[Basal-Ganglia]] · [[Bayesian-Brain]]

## 🤖 LLM usage

**When**: literature digests (Schultz, Dayan, Niv, Daw papers), TD / SR simulation scaffolding, hypothesis generation for fitting tasks.

**When not**: empirical claims about specific brain areas; verify these against a primary source. LLMs occasionally mix up model-based and model-free terminology.

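
Because the model-free and model-based labels are easy to swap, a small sanity check can pin them down. The sketch below applies the simplified stay/switch heuristic for the two-step task at its two extremes. The helper names `stay_score` and `stay_table` are hypothetical, chosen for illustration, and the ±1 preference coding is an assumption of this sketch rather than a fitted model from Daw 2011.

```python
from itertools import product

def stay_score(rewarded, common, w_mb):
    """Signed preference for repeating the previous first-stage choice."""
    mf_pref = 1 if rewarded else -1            # model-free: reward alone matters
    mb_pref = 1 if rewarded == common else -1  # model-based: reward x transition
    return (1 - w_mb) * mf_pref + w_mb * mb_pref

def stay_table(w_mb):
    """Map (rewarded, common) -> True if the agent repeats its choice."""
    return {(r, c): stay_score(r, c, w_mb) > 0
            for r, c in product([True, False], repeat=2)}

model_free = stay_table(w_mb=0.0)   # pure habit system
model_based = stay_table(w_mb=1.0)  # pure planning system
for key in model_free:
    print(key, "MF stays:", model_free[key], "| MB stays:", model_based[key])
```

The two tables differ only on rare transitions: after a rewarded rare transition the model-free agent stays while the model-based agent switches, and after an unrewarded rare transition the pattern reverses. That reward-by-transition interaction is the signature to check before labeling a behavior model-based.
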
## ❌ Anti-patterns

- **DA = reward**: wrong. DA encodes the RPE, so only unpredicted rewards evoke a burst.
- **Single-RPE for all DA**: the distributional account is the newer view.
- **Equating the brain with deep RL**: deep nets are brain-inspired, not identical. The brain is sample-efficient, cortical, and multi-system.
- **Ignoring tonic DA**: motivation / vigor is separate from the phasic RPE.
- **Behaviorism only**: ignoring neural data. The path from brain to behavior is multi-level.

## 🧪 Verification / duplicates

- Verified (Schultz 1997, Sutton & Barto 2018 ch 15, Dabney 2020 Nature, Daw 2011, Niv 2009 review, Stachenfeld 2017 SR).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TD/distributional/SR/two-step patterns + brain-RL mapping |