---
id: wiki-2026-0508-neurobiology-of-reward
title: Neurobiology of Reward
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Dopamine System, Mesolimbic Pathway]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [neuroscience, reward, dopamine, RL]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: neuroscience-RL
---

# Neurobiology of Reward

## In one line

> **"Dopamine is not the reward itself; it is the signal for reward prediction error."** The mesolimbic pathway (VTA → NAc) encodes the difference between expected and actual outcomes, a finding reported by Schultz (1997). It is the biological grounding of modern RL (TD-learning, RLHF).

## Core

### Core circuit

- **VTA (ventral tegmental area)**: source of the dopamine neurons.
- **NAc (nucleus accumbens)**: encodes reward salience.
- **PFC (prefrontal cortex)**: value-based decision-making.
- **Amygdala**: encodes valence (positive/negative).

### RPE (Reward Prediction Error)

- RPE = actual_reward - expected_reward.
- Positive RPE → dopamine burst → reinforces the action.
- Negative RPE → dopamine dip → weakens the action.
- Zero RPE (fully predicted reward) → no phasic signal; firing stays at baseline.

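The three RPE cases listed above can be sketched with a simple delta-rule learner. This is a minimal illustration, not a model from the cited literature: the function name `run_trials`, the learning rate `alpha`, and the reward magnitude 1.0 are all assumptions made for the example. The expectation is nudged toward each delivered reward, so the RPE starts positive, decays toward zero as the reward becomes predicted, and turns negative when the reward is omitted.

```python
def run_trials(rewards, alpha=0.5, expected=0.0):
    """Return the RPE on each trial for a sequence of delivered rewards.

    Assumed delta rule: expected <- expected + alpha * rpe.
    """
    rpes = []
    for actual in rewards:
        rpe = actual - expected      # RPE = actual_reward - expected_reward
        expected += alpha * rpe      # move the expectation toward the outcome
        rpes.append(rpe)
    return rpes


# Ten rewarded trials, then one trial where the reward is omitted.
rpes = run_trials([1.0] * 10 + [0.0])
print([round(x, 3) for x in rpes])
```

The first entry corresponds to the burst for an unexpected reward, the tenth to the near-zero response for a predicted reward, and the last to the dip for an omitted one.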
### Applications

1. **RL algorithms**: the TD error in TD-learning has the same mathematical form as the RPE.
2. **RLHF**: the reward model acts as a proxy for a human-preference RPE.
3. **Addiction research**: hijacked dopamine signaling → compulsive behavior.
4. **UX design**: variable reward schedules (the slot machine effect).

## 💻 Patterns

### TD-learning (Sutton & Barto, RL biological analog)

```python
# Temporal Difference learning: the RPE is the update signal
import numpy as np


def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """V[s] ← V[s] + α(r + γV[s'] - V[s])"""
    rpe = reward + gamma * V[next_state] - V[state]  # RPE (TD error)
    V[state] += alpha * rpe
    return V, rpe
```

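A usage sketch for the update rule above, under stated assumptions: a hypothetical four-state trial (`ITI → CUE → WAIT → END`), a reward of 1.0 on the last transition, and a cue that arrives unpredictably, so the value of the inter-trial state is held at zero. None of these names or numbers come from the source. It shows the pattern usually attributed to dopamine recordings: with training, the prediction error moves from the time of the reward to the time of the cue.

```python
import numpy as np


def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """One TD(0) step; returns the updated values and the RPE."""
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe
    return V, rpe


# Hypothetical four-state trial: ITI -> CUE -> WAIT -> END (terminal).
ITI, CUE, WAIT, END = 0, 1, 2, 3
V = np.zeros(4)

history = []
for trial in range(500):
    # Assumption: the cue arrives unpredictably, so V[ITI] is held at 0
    # and the ITI -> CUE transition is scored but never learned from.
    rpe_cue = 0.0 + 0.99 * V[CUE] - V[ITI]
    V, _ = td_update(V, CUE, WAIT, reward=0.0)
    V, rpe_reward = td_update(V, WAIT, END, reward=1.0)
    history.append((rpe_cue, rpe_reward))

print("first trial (cue, reward):", history[0])
print("last trial  (cue, reward):", history[-1])
```

On the first trial the whole error sits at the reward; after training the reward is fully predicted and the error appears at cue onset instead.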
### Dopamine neuron simulation

```python
import numpy as np


def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy model of the phasic firing rate described by Schultz (1997)."""
    rpe = actual_r - predicted_r
    return baseline * np.exp(rpe)  # scale baseline firing; exp() is an illustrative choice
```

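A short usage sketch for the toy firing-rate model above, covering the burst, baseline, and dip cases. The baseline of 5.0 and the reward values are arbitrary assumptions chosen for the example, and the function is repeated so the block runs on its own.

```python
import numpy as np


def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy firing-rate model: baseline scaled by exp(RPE)."""
    return baseline * np.exp(actual_r - predicted_r)


# (predicted_r, actual_r) pairs for the three RPE cases.
cases = {
    "unexpected reward": (0.0, 1.0),   # positive RPE
    "predicted reward":  (1.0, 1.0),   # zero RPE
    "omitted reward":    (1.0, 0.0),   # negative RPE
}
rates = {name: dopamine_response(p, a, baseline=5.0) for name, (p, a) in cases.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Because the rate is scaled by `exp(rpe)`, it never goes below zero: a dip can at most silence the neuron, while a burst has no such ceiling in this toy model.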
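### RLHF reward model (modern bridge)

```python
# transformers + trl (legacy PPO API, trl < 0.12; newer releases changed these signatures)
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The reward model is a learned approximation of human preference (RPE analog)
config = PPOConfig(model_name="meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

# Rewards come from a separate reward model that scores each response:
# stats = trainer.step(query_tensors, response_tensors, rewards)
# The reward signal drives the policy update, an analog of the dopamine update
```
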
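Reward models of this kind are commonly fit to pairwise human preferences. The following is a minimal NumPy sketch of a pairwise (Bradley-Terry style) loss, assuming each response has already been reduced to a scalar score; the function name `preference_loss` and the example scores are hypothetical, and this is not the TRL implementation.

```python
import numpy as np


def preference_loss(score_chosen, score_rejected):
    """Pairwise loss: -log sigmoid(score_chosen - score_rejected), averaged."""
    margin = np.asarray(score_chosen, dtype=float) - np.asarray(score_rejected, dtype=float)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -margin)))


good = preference_loss([2.0, 1.5], [0.0, -0.5])   # chosen responses scored higher
bad = preference_loss([0.0, -0.5], [2.0, 1.5])    # ranking inverted
print(good, bad)
```

The loss is small when the model already ranks the human-preferred response higher and large when it disagrees, which is the sense in which the trained scorer stands in for human judgment.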
### Variable reward schedule (UX)

```python
import random


def variable_reward(action_count, p=0.3):
    """Intermittent reinforcement, the schedule most resistant to extinction."""
    # action_count is unused: each action is rewarded independently with probability p
    if random.random() < p:  # 30% reward by default
        return "reward"
    return "no_reward"
```

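One way to connect the variable schedule back to RPE is to track a learned expectation under a certain versus a probabilistic reward. The sketch below is an illustration under assumptions (a delta-rule learner, reward magnitude 1.0, and the hypothetical name `simulate_schedule`), not a claim about how any product or experiment is implemented.

```python
import random


def simulate_schedule(p_reward, n_trials=5000, alpha=0.05, seed=0):
    """Track the expectation and mean |RPE| under a probabilistic reward."""
    rng = random.Random(seed)
    expected, abs_rpes = 0.0, []
    for _ in range(n_trials):
        actual = 1.0 if rng.random() < p_reward else 0.0
        rpe = actual - expected
        expected += alpha * rpe              # delta-rule update of the expectation
        abs_rpes.append(abs(rpe))
    tail = abs_rpes[n_trials // 2:]          # skip the initial learning phase
    return expected, sum(tail) / len(tail)


for p in (1.0, 0.3):
    expected, mean_abs_rpe = simulate_schedule(p)
    print(f"p={p}: expected={expected:.2f}, mean |RPE|={mean_abs_rpe:.2f}")
```

With a guaranteed reward the error vanishes once the expectation converges; with a 30% schedule every outcome still deviates from the expectation, so the error signal never settles.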
### Aversive learning (negative valence)

```python
def negative_rpe_update(V, s, s_, r, alpha=0.1, gamma=0.99):
    """Amygdala-mediated aversive learning (same TD rule, negative reward)."""
    rpe = r + gamma * V[s_] - V[s]  # r is typically negative
    V[s] += alpha * rpe
    return V
```

## Decision criteria

| Question | Answer |
|---|---|
| Is dopamine pleasure? | No: it is an RPE signal (wanting ≠ liking) |
| Is the RL reward dopamine? | As a functional analog, yes; dopamine maps to the TD error rather than the raw reward (Schultz) |
| Is addiction a dopamine excess? | No: dysregulated RPE / hijacked salience |
| Is RLHF brain-like? | At the level of reward-driven updates, yes (policy update) |

**Default**: remember the dissociation: dopamine = "wanting / RPE", opioids = "liking".

## 🔗 Graph

- Parent: [[Reinforcement-Learning]]
- Applications: [[RLHF]] · [[TD-Learning]] · [[Addiction]]
- Adjacent: [[Operant-Conditioning]] · [[Habit-Formation]]

## 🤖 LLM usage

**When**: building intuition for reward modeling, debugging RLHF reward shaping, explaining motivation frameworks.

**When not**: clinical psychiatry, which is specialist territory.

## ❌ Anti-patterns

- **Dopamine = pleasure**: a popular myth; in fact it is RPE / wanting.
- **More dopamine = better**: both directions are pathological; tonic excess is associated with schizophrenia, and a deficit with the Parkinson's off-state.
- **Reward hacking**: an RL agent exploits the reward signal; the brain analog is addiction.

## 🧪 Verification / Duplicates

- Verified (Schultz 1997 *Science*; Berridge & Robinson 1998 wanting/liking; Sutton & Barto *RL Book* 2018 2e).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE biology + RL bridge + RLHF analog |