"Dopamine = reward prediction error (RPE)." Schultz's single-cell recordings (1997) confirmed a brain analogue of TD learning. In the standard mapping, the basal ganglia implement an actor-critic and the prefrontal cortex supports model-based planning. As of 2026, distributional RL has been supported by dopamine population codes (Dabney 2020), and deep RL ↔ neuroscience remains an active bridge.
Core
Key findings
Dopamine = RPE (Schultz, Dayan, Montague 1997): the firing of VTA / SNc dopamine neurons encodes the TD error δ = R + γV(s') − V(s).
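As a quick worked check of the TD error, the sketch below evaluates it for the three classic conditions: unpredicted reward, fully predicted reward, and omitted reward. The helper name `rpe` and the numeric values are illustrative assumptions, and the trial is assumed to end after the reward, so V(s') = 0.

```python
def rpe(reward, v_next, v_now, gamma=0.9):
    """TD error: delta = R + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_now

# Hypothetical values; the trial ends after the outcome, so V(s') = 0.
unpredicted = rpe(reward=1.0, v_next=0.0, v_now=0.0)  # reward arrives, nothing expected
predicted = rpe(reward=1.0, v_next=0.0, v_now=1.0)    # reward arrives, fully expected
omitted = rpe(reward=0.0, v_next=0.0, v_now=1.0)      # expected reward withheld

print(unpredicted, predicted, omitted)  # 1.0 0.0 -1.0
```

The sign pattern (burst, no response, dip) is the qualitative signature that the RPE account predicts for phasic dopamine.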
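```python
import numpy as np

def td_value(rewards, V, cs_t, gamma=0.9, alpha=0.1):
    """One TD(0) pass over a trial, updating V in place."""
    rpes = np.zeros(len(rewards))
    for t in range(len(rewards) - 1):
        # RPE on arriving at t+1: the dopamine signal
        rpe = rewards[t + 1] + gamma * V[t + 1] - V[t]
        if t >= cs_t:  # pre-CS steps cannot predict the cue, so their value stays 0
            V[t] += alpha * rpe
        rpes[t + 1] = rpe
    return V, rpes

# Schultz 1997 cue-reward conditioning
cs_t, reward_t = 3, 7
seq = np.zeros(10)
seq[reward_t] = 1.0  # reward at t=7; the CS at t=3 is a cue, not a reward
V = np.zeros(10)     # shared across trials so that learning accumulates
trials = []
for trial in range(100):
    V, rpes = td_value(seq, V, cs_t)
    trials.append(rpes)
# early trials: phasic burst at reward (t=7)
# late trials: burst shifts to CS (t=3): prediction-error transfer
```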
Distributional TD (Dabney 2020, neural evidence)
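```python
import numpy as np

# Each "DA neuron" has its own asymmetry τᵢ ∈ (0, 1) and scales RPEs asymmetrically.
def quantile_td(returns, taus, lr=0.05):
    Q = np.zeros(len(taus), dtype=float)
    for r in returns:
        for i, tau in enumerate(taus):
            delta = r - Q[i]
            # Asymmetric: positive RPEs are weighted by tau, negative RPEs by (1 - tau).
            # Because the step scales with delta, Q[i] settles at the tau-expectile;
            # replace delta with np.sign(delta) for a true quantile estimate.
            Q[i] += lr * (tau if delta > 0 else (1 - tau)) * delta
    return Q  # distribution-encoding population
```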
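To make the difference between a quantile code and an expectile code concrete, the following standard-library sketch runs the same asymmetric update in both modes on a bimodal reward. The function name `asymmetric_population`, the `use_sign` switch, and the reward distribution are assumptions chosen for illustration, not taken from Dabney 2020.

```python
import random

def asymmetric_population(samples, taus, lr=0.01, use_sign=True):
    """Each unit i weights positive errors by tau_i and negative errors by (1 - tau_i).

    use_sign=True  -> step depends only on the sign of the error (quantile code)
    use_sign=False -> step is proportional to the error (expectile code)
    """
    est = [0.0] * len(taus)
    for r in samples:
        for i, tau in enumerate(taus):
            delta = r - est[i]
            weight = tau if delta > 0 else (1.0 - tau)
            step = (1.0 if delta > 0 else -1.0) if use_sign else delta
            est[i] += lr * weight * step
    return est

rng = random.Random(0)
# Hypothetical bimodal reward: small (1.0) or large (5.0) with equal probability, plus noise.
samples = [rng.choice([1.0, 5.0]) + rng.gauss(0.0, 0.2) for _ in range(20000)]
taus = [0.1, 0.3, 0.5, 0.7, 0.9]

quantile_code = asymmetric_population(samples, taus, use_sign=True)
expectile_code = asymmetric_population(samples, taus, use_sign=False)
print([round(q, 2) for q in quantile_code])
print([round(e, 2) for e in expectile_code])
```

In this setup the pessimistic and optimistic quantile units should settle near the two reward modes, while the expectile units spread more evenly between them and the τ = 0.5 unit tracks the mean. Either way, the population as a whole carries information about the shape of the reward distribution rather than just its mean.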
Successor representation
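```python
import numpy as np

def successor_repr(transitions, gamma=0.9, alpha=0.1):
    """TD-learn the SR matrix M from an array of (s, s') state-index pairs."""
    transitions = np.asarray(transitions, dtype=int)
    n = int(transitions.max()) + 1  # number of states, not number of transitions
    M = np.zeros((n, n))
    I = np.eye(n)
    for s, sp in transitions:
        M[s] += alpha * (I[s] + gamma * M[sp] - M[s])
    return M  # hippocampal SR (Stachenfeld 2017)
```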
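One way to sanity-check an SR learner is an environment where the answer is known in closed form. The sketch below assumes a deterministic 4-state ring and uses plain Python lists; `learn_sr` and `n_states` are names chosen for the example. It also shows how values follow from the SR by a dot product with a reward vector.

```python
def learn_sr(transitions, n_states, gamma=0.9, alpha=0.1):
    """TD-learn the successor matrix M from (s, s_next) pairs."""
    M = [[0.0] * n_states for _ in range(n_states)]
    for s, s_next in transitions:
        for j in range(n_states):
            onehot = 1.0 if j == s else 0.0
            M[s][j] += alpha * (onehot + gamma * M[s_next][j] - M[s][j])
    return M

# Hypothetical environment: a deterministic ring 0 -> 1 -> 2 -> 3 -> 0.
n, gamma = 4, 0.9
lap = [(s, (s + 1) % n) for s in range(n)]
M = learn_sr(lap * 3000, n_states=n, gamma=gamma)

# Closed form for this ring: M[s][j] = gamma**d / (1 - gamma**n), d = steps from s to j.
exact = [[gamma ** ((j - s) % n) / (1 - gamma ** n) for j in range(n)] for s in range(n)]
max_err = max(abs(M[s][j] - exact[s][j]) for s in range(n) for j in range(n))

# Values follow from the SR and a reward vector: V = M @ R.
R = [0.0, 0.0, 1.0, 0.0]  # reward only in state 2
V = [sum(M[s][j] * R[j] for j in range(n)) for s in range(n)]
print(max_err, [round(v, 3) for v in V])
```

Because the reward vector enters only in the last step, changing `R` revalues every state without relearning `M`, which is the main appeal of the SR as a middle ground between model-free and model-based control.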
Two-step task (Daw 2011 model-based vs model-free)
```python
# stage 1: A → 0.7 → S2_left, 0.3 → S2_right
# stage 2: reward varies
# model-free: stay if rewarded, regardless of transition
# model-based: stay if rewarded after a common transition,
#              or unrewarded after a rare one
def two_step_choice(prev_choice, prev_reward, prev_common, w_mb=0.5):
    # w_mb: weight on the model-based preference (0 = pure MF, 1 = pure MB)
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if bool(prev_reward) == bool(prev_common) else -1
    score = (1 - w_mb) * mf_pref + w_mb * mb_pref
    # A tie (score == 0, e.g. w_mb=0.5 with conflicting preferences) counts as a switch.
    return prev_choice if score > 0 else 1 - prev_choice
```
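The usual behavioral readout for this task is the stay probability broken down by reward × transition type. The sketch below turns the weighted preference into a logistic stay probability; `stay_probability` and the inverse temperature `beta` are illustrative assumptions rather than fitted quantities.

```python
import math

def stay_probability(rewarded, common, w_mb, beta=2.0):
    """Logistic stay probability from a weighted MF / MB preference.

    beta is a hypothetical inverse temperature, not a fitted value.
    """
    mf_pref = 1.0 if rewarded else -1.0            # MF: reward alone matters
    mb_pref = 1.0 if rewarded == common else -1.0  # MB: reward x transition interaction
    score = (1.0 - w_mb) * mf_pref + w_mb * mb_pref
    return 1.0 / (1.0 + math.exp(-beta * score))

for w_mb in (0.0, 0.5, 1.0):
    row = {
        (rewarded, common): round(stay_probability(rewarded, common, w_mb), 2)
        for rewarded in (True, False)
        for common in (True, False)
    }
    print(f"w_mb={w_mb}: {row}")
```

With `w_mb=0.0` the stay probability depends only on reward (a main effect); with `w_mb=1.0` it depends on the reward × transition interaction; intermediate weights give the mixed pattern that a hybrid account predicts.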
Volatility-weighted learning rate (Behrens 2007)
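```python
import numpy as np

# ACC tracks volatility; high volatility → high learning rate.
# Heuristic proxy only: Behrens 2007 estimated volatility with a Bayesian learner.
def volatility_lr(rpes, base_lr=0.05):
    recent = np.asarray(rpes[-10:], dtype=float)
    vol = recent.var() if recent.size else 0.0  # rolling variance over the last 10 RPEs
    return min(base_lr * (1 + vol), 1.0)        # cap so the learning rate stays ≤ 1
```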
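A small simulation can show what a variance-scaled learning rate does around a change point. The sketch below is a standard-library illustration under assumed settings (outcome mean jumps from 0 to 10 at trial 50, unit noise); `adaptive_lr` and `max_lr` are hypothetical names. Recent RPE variance is a crude stand-in for volatility and does not separate noise from genuine change.

```python
import random
from statistics import pvariance

def adaptive_lr(recent_rpes, base_lr=0.05, max_lr=1.0):
    """Scale the learning rate by the variance of recent RPEs (heuristic volatility proxy)."""
    vol = pvariance(recent_rpes) if len(recent_rpes) > 1 else 0.0
    return min(base_lr * (1.0 + vol), max_lr)

rng = random.Random(1)
means = [0.0] * 50 + [10.0] * 50  # hypothetical change point at trial 50
v, rpes, lrs = 0.0, [], []
for mean in means:
    outcome = mean + rng.gauss(0.0, 1.0)
    lr = adaptive_lr(rpes[-10:])   # learning rate from the last 10 RPEs
    delta = outcome - v
    v += lr * delta
    rpes.append(delta)
    lrs.append(lr)

print(round(lrs[49], 3), round(max(lrs[50:60]), 3), round(v, 2))
```

The learning rate should stay low during the stable block, rise sharply for a few trials after the change point, and relax again once the estimate has caught up.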
TD-learning addiction model (Redish 2004)
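```python
import numpy as np

# Cocaine RPE floor: the drug-induced RPE cannot be predicted away.
def cocaine_td(rewards, drug_mask, gamma=0.9, alpha=0.1, drug_floor=1.0, n_trials=1):
    V = np.zeros(len(rewards), dtype=float)
    for _ in range(n_trials):  # repeat the episode to see the drug-state value keep growing
        for t in range(len(rewards) - 1):
            delta = rewards[t] + gamma * V[t + 1] - V[t]
            if drug_mask[t]:
                delta = max(delta, drug_floor)  # always-positive RPE → compulsion
            V[t] += alpha * delta
    return V
```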
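The consequence of an RPE floor is easiest to see in a one-state reduction. The sketch below assumes a single pre-outcome state followed by a terminal state with V = 0 and compares a natural reward with a drug reward of the same nominal size; `value_after_training` is a name invented for the example.

```python
def value_after_training(n_trials, is_drug, reward=1.0, alpha=0.1, drug_floor=1.0):
    """Value of a single pre-outcome state after n_trials; the next state is terminal (V = 0)."""
    v = 0.0
    for _ in range(n_trials):
        delta = reward - v                  # TD error with V(s') = 0
        if is_drug:
            delta = max(delta, drug_floor)  # the error cannot fall below the floor
        v += alpha * delta
    return v

for n_trials in (10, 100, 1000):
    natural = value_after_training(n_trials, is_drug=False)
    drug = value_after_training(n_trials, is_drug=True)
    print(n_trials, round(natural, 2), round(drug, 2))
```

The natural-reward value saturates at the reward magnitude because its error shrinks to zero, whereas the drug value keeps climbing by roughly `alpha * drug_floor` per trial. That unbounded growth is the model's account of why drug-seeking can come to dominate other options.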
Decision criteria
| Situation | Approach |
| --- | --- |
| Modeling phasic DA | classic TD with γ ≈ 0.9 |
| Modeling DA population variance | distributional TD with quantiles |
| Modeling habits vs goals | hybrid MF + MB with arbitrator |
| Modeling replay | SR + offline updates |
| Computational psychiatry | param fit per subject (hBayesDM, JAGS) |
| Drug / lesion effect | parameter perturbation (lower α, biased ε) |
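For the per-subject fitting row, the basic idea can be sketched without any modeling package: simulate choices from a softmax Q-learner, then recover the learning rate by maximizing the choice log-likelihood over a grid. This is a plain maximum-likelihood stand-in, not the hierarchical Bayesian estimation that hBayesDM or JAGS would perform, and every name and parameter value here is an assumption.

```python
import math
import random

def simulate(alpha, beta, n_trials, p_reward=(0.8, 0.2), seed=0):
    """Simulate a softmax Q-learner on a two-armed bandit; returns (choices, rewards)."""
    rng = random.Random(seed)
    q = [0.0, 0.0]
    choices, rewards = [], []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))  # P(choose arm 1)
        c = 1 if rng.random() < p1 else 0
        r = 1.0 if rng.random() < p_reward[c] else 0.0
        q[c] += alpha * (r - q[c])
        choices.append(c)
        rewards.append(r)
    return choices, rewards

def log_likelihood(alpha, beta, choices, rewards):
    """Log-likelihood of the observed choices under a softmax Q-learner."""
    q = [0.0, 0.0]
    ll = 0.0
    for c, r in zip(choices, rewards):
        p1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        ll += math.log(p1 if c == 1 else 1.0 - p1)
        q[c] += alpha * (r - q[c])
    return ll

true_alpha, beta = 0.3, 3.0  # hypothetical "subject" parameters
choices, rewards = simulate(true_alpha, beta, n_trials=500)
grid = [i / 20 for i in range(1, 20)]  # candidate alphas 0.05 .. 0.95
best_alpha = max(grid, key=lambda a: log_likelihood(a, beta, choices, rewards))
print(best_alpha)
```

A drug or lesion effect can then be framed as a shift in the fitted parameter (for example a lower α) between conditions, which is the "parameter perturbation" approach in the table.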
Default: start with a single-RPE TD model. Use distributional TD for modern population-level DA fits. Use SR or an MB-MF arbitrator when prefrontal-hippocampal richness is needed.
When to use: literature digests (Schultz, Dayan, Niv, Daw papers), TD / SR simulation scaffolding, hypothesis generation for fitting tasks.
When not to use: empirical claims about specific brain areas; verify these against the primary source. LLMs occasionally mix up model-based and model-free terminology.
❌ Anti-patterns
DA = reward: wrong. DA encodes RPE, so only an unpredicted reward elicits a burst.
Single-RPE for all DA: the distributional account is the newer view.
Equating the brain with deep RL: deep nets are brain-inspired, not identical. The brain is sample-efficient, cortical, and multi-system.
Ignoring tonic DA: motivation / vigor is separate from phasic RPE.
Behaviorism only: ignoring neural data. Brain → behavior is multi-level.