---
id: wiki-2026-0508-neurobiology-of-reward
title: Neurobiology of Reward
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Dopamine System, Mesolimbic Pathway]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [neuroscience, reward, dopamine, RL]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: neuroscience-RL
---

# Neurobiology of Reward

## One-Line Summary

> **"Dopamine is not the reward itself; it is the signal for reward prediction error."** The mesolimbic pathway (VTA → NAc) encodes the difference between expected and actual outcomes, as reported by Schultz (1997). It is the biological root of modern RL (TD-learning, RLHF).

## Core Concepts

### Core Circuitry

- **VTA (ventral tegmental area)**: the source of dopamine neurons.
- **NAc (nucleus accumbens)**: encodes reward salience.
- **PFC (prefrontal cortex)**: value-based decision-making.
- **Amygdala**: encodes valence (positive/negative).

### RPE (Reward Prediction Error)

- RPE = actual_reward - expected_reward.
- Positive RPE → dopamine burst → the action is reinforced.
- Negative RPE → dopamine dip → the action is weakened.
- Zero RPE (fully predicted reward) → no change from baseline firing.

### Applications

1. **RL algorithms**: the TD-learning error has the same mathematical form as the RPE.
2. **RLHF**: the reward model acts as a learned proxy for human preference, the counterpart of the RPE-generating reward signal.
3. **Addiction research**: hijacked dopamine signaling → compulsive behavior.
4. **UX design**: variable reward schedules (the slot machine effect).

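
The RPE rules above can be sketched as a small trial-by-trial simulation. This is a minimal illustration, not a model from the cited papers: the function names (`rpe`, `dopamine_signal`, `learn_expectation`) are hypothetical, and the expectation is tracked with a simple delta rule (Rescorla-Wagner style), which is an assumption added here. A reward is delivered on four trials and then omitted, so the error shrinks as the reward becomes predicted and turns negative on omission.

```python
def rpe(actual_reward, expected_reward):
    """RPE = actual_reward - expected_reward."""
    return actual_reward - expected_reward


def dopamine_signal(error, tol=1e-9):
    """Map the sign of an RPE to the qualitative dopamine response."""
    if error > tol:
        return "burst"      # better than expected
    if error < -tol:
        return "dip"        # worse than expected
    return "baseline"       # fully predicted


def learn_expectation(rewards, alpha=0.5, expected=0.0):
    """Track expected reward with a delta rule (assumed here for illustration).

    Returns the final expectation and the RPE observed on each trial.
    """
    errors = []
    for r in rewards:
        error = rpe(r, expected)
        expected += alpha * error  # move the expectation toward the outcome
        errors.append(error)
    return expected, errors


# Reward on four trials, then an omitted reward on the fifth.
expected, errors = learn_expectation([1.0, 1.0, 1.0, 1.0, 0.0])
signals = [dopamine_signal(e) for e in errors]
for trial, (e, s) in enumerate(zip(errors, signals), start=1):
    print(f"trial {trial}: RPE={e:+.4f} -> {s}")
```

The learning rate `alpha=0.5` is chosen only to make the trend visible in a few trials; a smaller value would show the same pattern more slowly.
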
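## 💻 Patterns

### TD-learning (Sutton & Barto; the RL analog of biological RPE)

```python
# Temporal-difference learning: the RPE is the update signal
import numpy as np

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """TD(0) update: V[s] ← V[s] + α(r + γV[s'] - V[s])."""
    rpe = reward + gamma * V[next_state] - V[state]  # TD error, the RPE analog
    V[state] += alpha * rpe  # updates V in place
    return V, rpe
```

### Dopamine neuron simulation

```python
import numpy as np

def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy model of phasic firing rate as a function of RPE (after Schultz, 1997)."""
    rpe = actual_r - predicted_r
    # Exponential scaling keeps the rate positive: rpe > 0 gives a burst above
    # baseline, rpe < 0 a dip below it, and rpe == 0 returns the baseline.
    return baseline * np.exp(rpe)
```

### RLHF reward model (modern bridge)

```python
# Sketch using the classic trl PPO API (releases before roughly 0.12);
# newer trl releases changed the PPOTrainer constructor, so check your version.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

# The reward model is a learned approximation of human preference. Its scores
# are passed to each PPO step (reward_fn is a placeholder you must supply):
#   rewards = reward_fn(queries, responses)
#   trainer.step(query_tensors, response_tensors, rewards)
# The reward signal drives the policy update, analogous to a dopamine-driven update.
```

### Variable reward schedule (UX)

```python
import random

def variable_reward(action_count, p_reward=0.3):
    """Intermittent reinforcement, which makes behavior resistant to extinction.

    action_count is kept for interface compatibility but is unused: each
    action is rewarded independently with probability p_reward.
    """
    if random.random() < p_reward:  # 30% reward by default
        return "reward"
    return "no_reward"
```

### Aversive learning (negative valence)

```python
def negative_rpe_update(V, s, s_, r, alpha=0.1, gamma=0.99):
    """Aversive (amygdala-mediated) learning: the same TD update, with r typically negative."""
    # gamma added for consistency with td_update; pass gamma=1.0 for the undiscounted form
    rpe = r + gamma * V[s_] - V[s]
    V[s] += alpha * rpe
    return V
```

## Decision Criteria

| Question | Answer |
|---|---|
| Is dopamine pleasure? | No. It is an RPE signal (wanting ≠ liking) |
| Is the reward in RL dopamine? | As a functional analog, yes (Schultz) |
| Is addiction a dopamine excess? | No. It is dysregulated RPE / hijacked salience |
| Is RLHF brain-like? | At the level of reward-driven policy updates, yes |

**Default**: remember the dissociation: dopamine = "wanting / RPE", opioids = "liking".
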
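
To connect the variable reward schedule to the RPE framing, the sketch below compares a fully predictable schedule with a 30% schedule. It is an illustration under stated assumptions: `simulate_schedule` is a hypothetical helper, the expectation is learned with a simple delta rule, and the trial count, learning rate, and seed are arbitrary. Under a fixed schedule the prediction error fades as the reward becomes expected; under a variable schedule every trial is still either better or worse than expected, so the error signal never settles.

```python
import random


def simulate_schedule(p_reward, n_trials=5000, alpha=0.02, seed=0):
    """Learn an expected reward under a Bernoulli(p_reward) schedule.

    Returns the final expectation and the mean |RPE| over the last 1000
    trials. All parameters are illustrative assumptions.
    """
    rng = random.Random(seed)  # local generator keeps the run reproducible
    expected = 0.0
    abs_errors = []
    for _ in range(n_trials):
        reward = 1.0 if rng.random() < p_reward else 0.0
        error = reward - expected          # RPE for this trial
        expected += alpha * error          # delta-rule update
        abs_errors.append(abs(error))
    tail = abs_errors[-1000:]
    return expected, sum(tail) / len(tail)


fixed_expected, fixed_err = simulate_schedule(1.0)   # always rewarded
var_expected, var_err = simulate_schedule(0.3)       # rewarded 30% of the time

print(f"fixed schedule:    expectation={fixed_expected:.3f}, mean |RPE|={fixed_err:.3f}")
print(f"variable schedule: expectation={var_expected:.3f}, mean |RPE|={var_err:.3f}")
```

The persistent error under the variable schedule is one way to read the "slot machine effect": the outcome never becomes fully predicted.
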
## 🔗 Graph

- Parent: [[Reinforcement-Learning]]
- Applications: [[RLHF]] · [[TD-Learning]] · [[Addiction]]
- Adjacent: [[Operant-Conditioning]] · [[Habit-Formation]]

## 🤖 LLM Usage

**When to use**: building intuition for reward modeling, debugging RLHF reward shaping, explaining motivation frameworks.

**When not to use**: clinical psychiatry, which is specialist territory.

## ❌ Anti-patterns

- **Dopamine = pleasure**: a popular myth; dopamine actually signals RPE / wanting.
- **More dopamine = better**: dysregulation in either direction is pathological. Excess dopamine signaling is associated with psychosis in schizophrenia, and dopamine deficiency with the Parkinson's off-state.
- **Reward hacking**: an RL agent exploits its reward signal; addiction is the brain analog.

## 🧪 Verification / Duplicates

- Verified against Schultz (1997, *Science*); Berridge & Robinson (1998, wanting/liking); Sutton & Barto, *RL Book* (2nd ed., 2018).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE biology + RL bridge + RLHF analog |