---
id: wiki-2026-0508-neurobiology-of-reward
title: Neurobiology of Reward
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Dopamine System, Mesolimbic Pathway]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [neuroscience, reward, dopamine, RL]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: neuroscience-RL
---

# Neurobiology of Reward

## One-line summary
> **"Dopamine is not the reward itself; it signals reward prediction error."** The mesolimbic pathway (VTA → NAc) encodes the difference between expected and actual outcomes, as reported by Schultz (1997). This is the biological root of modern RL (TD-learning, RLHF).

## Core concepts

### Core circuit
- **VTA (ventral tegmental area)**: source of the dopamine neurons.
- **NAc (nucleus accumbens)**: encodes reward salience.
- **PFC (prefrontal cortex)**: value-based decision-making.
- **Amygdala**: encodes valence (positive/negative).

### RPE (Reward Prediction Error)
- RPE = actual_reward - expected_reward.
- Positive RPE → dopamine burst → reinforce the action.
- Negative RPE → dopamine dip → weaken the action.
- Zero RPE (fully predicted reward) → no signal.
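
The three cases above can be sketched as a small classifier. The function name `classify_rpe` and the tolerance `tol` are illustrative assumptions, not part of the source.

```python
def classify_rpe(actual_reward, expected_reward, tol=1e-9):
    """Return (rpe, label) for one outcome; `tol` is an assumed cutoff for 'zero'."""
    rpe = actual_reward - expected_reward
    if rpe > tol:
        return rpe, "burst"      # better than expected: reinforce the action
    if rpe < -tol:
        return rpe, "dip"        # worse than expected: weaken the action
    return rpe, "no signal"      # fully predicted reward


# Unexpected reward, omitted reward, fully predicted reward
for actual, expected in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]:
    print(actual, expected, classify_rpe(actual, expected))
```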

### Applications
1. **RL algorithms**: the TD-learning error term has the same form as the RPE.
2. **RLHF**: the reward model serves as a learned proxy for human preference, which drives RPE-like policy updates.
3. **Addiction research**: hijacked dopamine signaling leads to compulsive behavior.
4. **UX design**: variable reward schedules (the slot-machine effect).

## 💻 Patterns

### TD-learning (Sutton & Barto, RL biological analog)
```python
# Temporal Difference learning: the RPE is the update signal
import numpy as np

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """V[s] ← V[s] + α(r + γV[s'] - V[s])"""
    rpe = reward + gamma * V[next_state] - V[state]  # TD error, i.e. the RPE
    V[state] += alpha * rpe  # move the estimate a fraction alpha toward the target
    return V, rpe
```
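
A usage sketch, assuming a hypothetical two-state task (a cue state followed by a terminal state) that is not in the source. It repeats the update rule so it runs on its own, and shows that as the value estimate converges, the RPE for the now-predicted reward shrinks toward zero, which matches the "fully predicted reward → no signal" case.

```python
import numpy as np

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    # Same update rule as above, repeated so this sketch is self-contained.
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe
    return V, rpe

# Hypothetical episode: state 0 (cue) -> state 1 (terminal), reward 1.0 each time.
V = np.zeros(2)  # V[1] is never updated because state 1 is terminal
rpes = []
for _ in range(200):
    V, rpe = td_update(V, state=0, next_state=1, reward=1.0)
    rpes.append(rpe)

print(f"first RPE: {rpes[0]:.3f}, last RPE: {rpes[-1]:.2e}, V[0]: {V[0]:.3f}")
```

The first trial yields the largest RPE because nothing is predicted yet; each later trial yields a smaller one as `V[0]` approaches the reward.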

### Dopamine neuron simulation
```python
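import numpy as np

def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy phasic firing rate after Schultz (1997): baseline scaled by exp(RPE)."""
    rpe = actual_r - predicted_r
    # exp(rpe) > 1 for positive RPE (burst), < 1 for negative RPE (dip), 1 at zero
    return baseline * np.exp(rpe)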
```

### RLHF reward model (modern bridge)
```python
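# trl + transformers. Schematic only: this assumes the classic trl PPO API
# (releases up to roughly 0.11); later releases changed the PPOTrainer signature.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The reward model is a learned approximation of human preference
config = PPOConfig(model_name="meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
trainer = PPOTrainer(config=config, model=model, ref_model=None, tokenizer=tokenizer)
# Per batch, scores from the reward model are passed in as `rewards`:
#   stats = trainer.step(query_tensors, response_tensors, rewards)
# The reward signal drives the policy update, the analog of a dopamine update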
```

### Variable reward schedule (UX)
```python
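import random

def variable_reward(action_count=None, p_reward=0.3):
    """Intermittent reinforcement: unpredictable rewards make behavior resistant to extinction."""
    # action_count is unused: a random-ratio schedule has no memory of past actions
    if random.random() < p_reward:  # rewarded 30% of the time by default
        return "reward"
    return "no_reward"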
```
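
One way to connect this schedule back to the RPE, as a minimal sketch: with a probabilistic reward, a simple running expectation settles near the reward probability, so every outcome still deviates from it and the RPE never vanishes. The learning rate, seed, and trial count below are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)        # fixed seed so the run is repeatable
p_reward, alpha = 0.3, 0.05   # assumed schedule probability and learning rate
V = 0.0                       # learned expectation of reward
abs_rpes = []
for _ in range(5000):
    r = 1.0 if rng.random() < p_reward else 0.0
    rpe = r - V               # prediction error on this trial
    V += alpha * rpe          # running estimate of the expected reward
    abs_rpes.append(abs(rpe))

late = abs_rpes[-1000:]
print(f"V = {V:.2f}, mean |RPE| over last 1000 trials = {sum(late) / len(late):.2f}")
```

Contrast this with a fully predictable reward, where the same update drives the RPE to zero.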

### Aversive learning (negative valence)
```python
def negative_rpe_update(V, s, s_, r, alpha=0.1, gamma=0.99):
    """Aversive (amygdala-mediated) learning: same TD rule, with r typically negative."""
    rpe = r + gamma * V[s_] - V[s]  # discount the next-state value, as in td_update
    V[s] += alpha * rpe
    return V
```

## Decision criteria
| Question | Answer |
|---|---|
| Is dopamine pleasure? | No. It is an RPE signal (wanting ≠ liking) |
| Is the RL reward dopamine? | As a functional analog, yes (Schultz) |
| Is addiction a dopamine excess? | No. It is dysregulated RPE / hijacked salience |
| Is RLHF brain-like? | Only at the level of reward-driven policy updates |

**Default**: remember the dissociation between dopamine ("wanting" / RPE) and opioids ("liking").

## 🔗 Graph
- Parent: [[Reinforcement-Learning]]
- Applications: [[RLHF]] · [[TD-Learning]] · [[Addiction]]
- Adjacent: [[Operant-Conditioning]] · [[Habit-Formation]]

## 🤖 LLM usage
**When**: building intuition for reward modeling, debugging RLHF reward shaping, explaining motivation frameworks.
**When not**: clinical psychiatry, which is a specialist domain.

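## ❌ Anti-patterns
- **Dopamine = pleasure**: a popular myth. In practice it signals RPE / wanting.
- **More dopamine = better**: excess dopamine signaling is linked to psychosis in schizophrenia, whereas the Parkinson's off-state reflects too little. The relationship is not monotonic.
- **Reward hacking**: an RL agent exploits the reward signal; the brain analog is addiction.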

## 🧪 Verification / Duplicates
- Verified (Schultz 1997 *Science*; Berridge & Robinson 1998 wanting/liking; Sutton & Barto *RL Book* 2018 2e).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE biology + RL bridge + RLHF analog |