"매 dopamine 은 reward 자체 X, 매 reward prediction error 의 signal". 매 mesolimbic pathway (VTA → NAc) 가 매 expected vs actual outcome 의 차이를 encode 하며, 매 Schultz (1997) 가 매 발견. 매 modern RL (TD-learning, RLHF) 의 매 biological 의 root.
Key points
Core circuit
VTA (ventral tegmental area): the neurons that are the source of dopamine.
NAc (nucleus accumbens): encodes reward salience.
PFC (prefrontal cortex): value-based decision-making.
Amygdala: encodes valence (positive/negative).
RPE (Reward Prediction Error)
RPE = actual_reward - expected_reward.
Positive RPE → dopamine burst → the action is reinforced.
Negative RPE → dopamine dip → the action is weakened.
Zero RPE (fully predicted reward) → no change from baseline firing.
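The three cases above can be sketched in a few lines. This is a minimal illustration, not a neural model: the function names `reward_prediction_error` and `dopamine_signal`, the tolerance `tol`, and the reward values are all assumptions made for the example.

```python
def reward_prediction_error(actual_reward: float, expected_reward: float) -> float:
    """RPE = actual_reward - expected_reward."""
    return actual_reward - expected_reward


def dopamine_signal(rpe: float, tol: float = 1e-9) -> str:
    """Map the sign of the RPE to the qualitative response listed above."""
    if rpe > tol:
        return "burst"  # better than expected: reinforce the action
    if rpe < -tol:
        return "dip"    # worse than expected: weaken the action
    return "none"       # fully predicted: no phasic change


# Hypothetical (actual, expected) reward pairs, one per case
cases = {
    "unexpected reward": (1.0, 0.0),
    "omitted reward": (0.0, 1.0),
    "fully predicted reward": (1.0, 1.0),
}
for label, (actual, expected) in cases.items():
    rpe = reward_prediction_error(actual, expected)
    print(f"{label}: RPE={rpe:+.1f} -> {dopamine_signal(rpe)}")
```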
Applications
RL algorithms: the TD-learning error term has the same mathematical form as the RPE.
RLHF: the reward model is a learned proxy for human preference and supplies the reward signal.
Addiction research: hijacked dopamine signaling → compulsive behavior.
UX design: variable reward schedules (the slot machine effect).
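One way to see why a variable reward schedule keeps the signal alive is to compare it with a fully predictable one. The sketch below is a simplification under stated assumptions: the schedule is modeled as an independent Bernoulli reward on each trial, the prediction is a running average updated by a delta rule, and `mean_abs_rpe`, `reward_prob`, and the parameter values are hypothetical names chosen for the example.

```python
import random


def mean_abs_rpe(reward_prob: float, trials: int = 2000,
                 alpha: float = 0.1, seed: int = 0) -> float:
    """Average |RPE| over the second half of training for a Bernoulli schedule."""
    rng = random.Random(seed)
    expected = 0.0
    abs_rpes = []
    for t in range(trials):
        actual = 1.0 if rng.random() < reward_prob else 0.0
        rpe = actual - expected
        expected += alpha * rpe  # delta-rule update of the prediction
        if t >= trials // 2:     # only measure after learning has settled
            abs_rpes.append(abs(rpe))
    return sum(abs_rpes) / len(abs_rpes)


fixed = mean_abs_rpe(reward_prob=1.0)     # reward on every trial
variable = mean_abs_rpe(reward_prob=0.5)  # reward on a random half of trials
print(f"fixed schedule: {fixed:.4f}, variable schedule: {variable:.4f}")
```

Under the fixed schedule the prediction converges and the RPE vanishes. Under the variable schedule no single prediction fits every trial, so a sizeable RPE persists.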
    # Temporal Difference learning: the RPE is the update signal
    import numpy as np

    def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
        """V[s] ← V[s] + α(r + γV[s'] - V[s])"""
        rpe = reward + gamma * V[next_state] - V[state]  # TD error, the RPE analog
        V[state] += alpha * rpe  # move the estimate toward the target
        return V, rpe
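A short usage sketch shows the RPE shrinking as a reward becomes predicted, then turning negative when the reward is omitted. The two-state task (`CUE`, `END`), the trial count, and the reward values are assumptions made for illustration. `td_update` is repeated so the block runs on its own.

```python
import numpy as np


def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """Same rule as above: V[s] <- V[s] + alpha * (r + gamma * V[s'] - V[s])."""
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe
    return V, rpe


CUE, END = 0, 1  # hypothetical two-state task: cue -> terminal state
V = np.zeros(2)  # V[END] stays 0 because it is never updated

rpes = []
for _ in range(200):  # the cue is always followed by reward 1.0
    V, rpe = td_update(V, CUE, END, reward=1.0)
    rpes.append(rpe)

# Omission trial on a copy of V: the learned prediction is not met
_, omission_rpe = td_update(V.copy(), CUE, END, reward=0.0)

print(f"first RPE: {rpes[0]:.3f}")
print(f"last RPE: {rpes[-1]:.2e}")
print(f"omission RPE: {omission_rpe:.3f}")
```

The first trial gives the full surprise (RPE of 1.0), later trials approach zero, and the omission trial yields a negative RPE close to -1. These match the burst, no-signal, and dip cases.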
Dopamine neuron simulation
    def dopamine_response(predicted_r, actual_r, baseline=1.0):
        """Toy model of the phasic firing rate, after Schultz (1997)."""
        rpe = actual_r - predicted_r
        return baseline * np.exp(rpe)  # scale baseline firing; requires numpy as np
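As a quick check of the toy model, the sketch below evaluates it on three hypothetical trials. The exponential scaling is the note's own modeling choice, not a fitted neural response, and the trial values are assumptions. The function is repeated so the block is self-contained.

```python
import numpy as np


def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy model from above: firing rate = baseline * exp(RPE)."""
    rpe = actual_r - predicted_r
    return baseline * np.exp(rpe)


# Hypothetical trials: surprise, fully predicted, omission
predicted = np.array([0.0, 1.0, 1.0])
actual = np.array([1.0, 1.0, 0.0])

rates = dopamine_response(predicted, actual)
for name, rate in zip(["surprise", "predicted", "omission"], rates):
    print(f"{name}: {rate:.3f} x baseline")
```

Because the exponential is always positive, the modeled rate rises above baseline for a positive RPE, equals baseline at zero RPE, and falls toward (but never below) zero for a negative RPE.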
RLHF reward model (modern bridge)
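    # transformers + trl (sketch; PPOTrainer's signature differs across trl versions)
    from transformers import AutoTokenizer
    from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

    # The reward model is a learned proxy for human preference
    config = PPOConfig(model_name="meta-llama/Llama-3.1-8B")
    model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)  # older trl call style

    # In this call style, scores from the reward model are passed to
    # trainer.step(queries, responses, rewards). The reward signal drives the
    # policy update, loosely analogous to a dopamine-driven update.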
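How a reward model comes to stand in for human preference can be sketched without any training library. The block below assumes the commonly used pairwise (Bradley-Terry style) objective, which the note itself does not specify. The function name `pairwise_preference_loss` and the scores are hypothetical.

```python
import numpy as np


def pairwise_preference_loss(r_chosen, r_rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(m)) == log(1 + exp(-m)); logaddexp keeps this numerically stable
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Hypothetical reward-model scores for three (chosen, rejected) response pairs
agrees = pairwise_preference_loss([2.0, 1.5, 3.0], [0.0, 0.5, 1.0])
disagrees = pairwise_preference_loss([0.0, 0.5, 1.0], [2.0, 1.5, 3.0])

print(f"loss when scores match the human choices: {agrees:.3f}")
print(f"loss when scores contradict them: {disagrees:.3f}")
```

Minimizing this loss pushes the model to score human-preferred responses higher. The resulting scalar is what the policy update then treats as its reward.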