---
id: wiki-2026-0508-neurobiology-of-reward
title: Neurobiology of Reward
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Dopamine System, Mesolimbic Pathway]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [neuroscience, reward, dopamine, RL]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: neuroscience-RL
---
# Neurobiology of Reward
## One-liner
> **"Dopamine is not the reward itself; it is the signal of a reward prediction error."** Dopamine neurons in the mesolimbic pathway (VTA → NAc) encode the difference between expected and actual outcomes, a finding reported by Schultz (1997). This signal is the biological root of modern RL (TD-learning, RLHF).
## Core
### Core circuit
- **VTA (ventral tegmental area)**: the source of the dopamine neurons.
- **NAc (nucleus accumbens)**: encodes reward salience.
- **PFC (prefrontal cortex)**: value-based decision-making.
- **Amygdala**: encodes valence (positive/negative).
### RPE (Reward Prediction Error)
- RPE = actual_reward - expected_reward.
- Positive RPE → dopamine burst → the action is reinforced.
- Negative RPE → dopamine dip → the action is weakened.
- Zero RPE (fully predicted reward) → no change from baseline firing.
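
The three cases above can be sketched as a small function. This is only an illustration: the name `classify_rpe` and the tolerance `tol` are assumptions introduced here, not part of any cited model.

```python
def classify_rpe(actual_reward, expected_reward, tol=1e-9):
    """Return (rpe, label), where label names the qualitative dopamine response.

    Hypothetical helper: `tol` treats tiny differences as a zero RPE.
    """
    rpe = actual_reward - expected_reward
    if rpe > tol:
        return rpe, "burst"      # better than expected
    if rpe < -tol:
        return rpe, "dip"        # worse than expected
    return rpe, "baseline"       # fully predicted


print(classify_rpe(1.0, 0.0))  # unexpected reward  -> (1.0, 'burst')
print(classify_rpe(0.0, 1.0))  # omitted reward     -> (-1.0, 'dip')
print(classify_rpe(1.0, 1.0))  # predicted reward   -> (0.0, 'baseline')
```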
### Applications
1. **RL algorithms**: the TD error in TD-learning has the same mathematical form as the RPE.
2. **RLHF**: the reward model serves as a proxy for a human-preference RPE.
3. **Addiction research**: hijacked dopamine signaling → compulsive behavior.
4. **UX design**: variable reward schedules (the slot machine effect).
## 💻 Patterns
### TD-learning (Sutton & Barto, RL biological analog)
```python
# Temporal Difference learning: the RPE is the update signal
import numpy as np


def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """V[s] ← V[s] + α(r + γV[s'] - V[s])"""
    rpe = reward + gamma * V[next_state] - V[state]  # TD error = RPE
    V[state] += alpha * rpe  # updates V in place
    return V, rpe
```
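
A possible usage sketch for the update rule above, assuming a three-state chain (cue → delay → terminal) with a reward of 1 on the last transition. The chain, the episode count, and the use of a plain list for `V` are illustrative assumptions. It shows the RPE at reward time shrinking toward zero as the reward becomes predicted.

```python
def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """Same TD(0) rule as above; V is a plain list here to stay dependency-free."""
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe
    return V, rpe


# Hypothetical chain: 0 = cue, 1 = delay, 2 = terminal (its value stays 0).
V = [0.0, 0.0, 0.0]
reward_time_rpes = []
for episode in range(200):
    V, _ = td_update(V, 0, 1, reward=0.0)    # cue -> delay, no reward yet
    V, rpe = td_update(V, 1, 2, reward=1.0)  # delay -> terminal, reward arrives
    reward_time_rpes.append(rpe)

print("first RPE at reward time:", reward_time_rpes[0])
print("last RPE at reward time:", reward_time_rpes[-1])
print("learned values:", V)
```

Once the reward is fully predicted, the value of the cue state approaches `gamma` times the value of the delay state, so the prediction has propagated back to the earliest reliable predictor.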
### Dopamine neuron simulation
```python
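import numpy as np

def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy model of the phasic firing rate described by Schultz (1997)."""
    rpe = actual_r - predicted_r
    # exp() keeps the rate positive: above baseline for RPE > 0, below for RPE < 0
    return baseline * np.exp(rpe)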
```
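
A short usage sketch of the toy response function, assuming a baseline of 5 spikes per second (an illustrative number, not a measured one). `math.exp` stands in for `np.exp` here, which is equivalent for scalar inputs.

```python
import math


def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy mapping from RPE to firing rate; the exponential form is a modeling choice."""
    rpe = actual_r - predicted_r
    return baseline * math.exp(rpe)


BASELINE_HZ = 5.0  # assumed tonic rate, for illustration only

burst = dopamine_response(0.0, 1.0, baseline=BASELINE_HZ)   # unexpected reward
dip = dopamine_response(1.0, 0.0, baseline=BASELINE_HZ)     # omitted reward
steady = dopamine_response(1.0, 1.0, baseline=BASELINE_HZ)  # predicted reward

print(f"burst={burst:.2f} Hz, steady={steady:.2f} Hz, dip={dip:.2f} Hz")
```

With this mapping a dip can never go below zero, while a burst is unbounded, so positive and negative errors of the same size move the rate by different amounts.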
### RLHF reward model (modern bridge)
```python
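# transformers + trl; sketch against the classic trl PPO API
# (older releases; newer trl versions changed the PPOTrainer signature)
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The reward model is a learned approximation of a human-preference RPE
config = PPOConfig(model_name="meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)
# reward_fn (placeholder) scores each response; the reward signal drives
# the policy update, the analog of a dopamine-driven update:
# stats = trainer.step(query_tensors, response_tensors, rewards)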
```
### Variable reward schedule (UX)
```python
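import random


def variable_reward(action_count):
    """Intermittent reinforcement: the schedule that drives the strongest learning."""
    # action_count is unused: every action pays out with the same probability
    if random.random() < 0.3:  # 30% chance of reward
        return "reward"
    return "no_reward"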
```
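
One way to connect the variable schedule back to RPE is a small simulation, sketched below under stated assumptions: a single value estimate updated by a delta rule, a learning rate of 0.05, and a fixed seed. `simulate_schedule` is a hypothetical helper, not an established API. It compares a 30% schedule with a fully predictable one.

```python
import random


def simulate_schedule(p_reward, n_trials=5000, alpha=0.05, seed=0):
    """Track a value estimate under a Bernoulli reward schedule.

    Returns the final value estimate and the mean |RPE| over the last 1000 trials.
    """
    rng = random.Random(seed)  # local generator, so runs are reproducible
    value = 0.0
    abs_rpes = []
    for _ in range(n_trials):
        reward = 1.0 if rng.random() < p_reward else 0.0
        rpe = reward - value          # prediction error on this trial
        value += alpha * rpe          # delta-rule update
        abs_rpes.append(abs(rpe))
    tail = abs_rpes[-1000:]
    return value, sum(tail) / len(tail)


v_var, rpe_var = simulate_schedule(0.3)      # variable schedule
v_fixed, rpe_fixed = simulate_schedule(1.0)  # reward on every action

print(f"variable: value={v_var:.2f}, mean |RPE|={rpe_var:.2f}")
print(f"fixed:    value={v_fixed:.2f}, mean |RPE|={rpe_fixed:.2e}")
```

Under the fixed schedule the RPE decays to zero once the reward is predicted. Under the variable schedule the estimate hovers near the reward probability and every trial still produces a nonzero RPE, which is one reading of why intermittent rewards keep the signal alive.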
### Aversive learning (negative valence)
```python
def negative_rpe_update(V, s, s_, r, alpha=0.1):
    """Amygdala-mediated aversive learning."""
    rpe = r + V[s_] - V[s]  # r is typically negative; undiscounted (gamma = 1)
    V[s] += alpha * rpe
    return V
```
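
A usage sketch for the aversive update, assuming a two-state setup in which a cue is always followed by a punishment of -1 and then a terminal state. The setup, the trial count, and the plain-list `V` are illustrative assumptions.

```python
def negative_rpe_update(V, s, s_, r, alpha=0.1):
    """Same undiscounted update as above."""
    rpe = r + V[s_] - V[s]
    V[s] += alpha * rpe
    return V


# Hypothetical states: 0 = cue, 1 = terminal (value stays 0).
V = [0.0, 0.0]
history = []
for _ in range(50):
    V = negative_rpe_update(V, 0, 1, r=-1.0)  # cue is followed by punishment
    history.append(V[0])

print("cue value after 1 trial:", history[0])
print("cue value after 50 trials:", history[-1])
```

The cue's value moves steadily toward -1, so the cue itself comes to carry negative valence before the punishment arrives.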
## Decision criteria
| Question | Answer |
|---|---|
| Is dopamine pleasure? | No: it is an RPE signal (wanting ≠ liking) |
| Is the reward in RL dopamine? | As a functional analog, yes (Schultz) |
| Is addiction an excess of dopamine? | No: dysregulated RPE / hijacked salience |
| Is RLHF brain-like? | At the level of the reward-driven policy update, yes |
**Default**: remember the dissociation between dopamine ("wanting" / RPE) and opioids ("liking").
## 🔗 Graph
- Parent: [[Reinforcement-Learning]]
- Applications: [[RLHF]] · [[TD-Learning]] · [[Addiction]]
- Adjacent: [[Operant-Conditioning]] · [[Habit-Formation]]
## 🤖 LLM usage
**When to use**: building intuition for reward modeling, debugging RLHF reward shaping, explaining motivation frameworks.
**When not to use**: clinical psychiatry, which is specialist territory.
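## ❌ Anti-patterns
- **Dopamine = pleasure**: a popular myth; in reality dopamine signals RPE / wanting.
- **More dopamine = better**: dopamine dysregulation is pathological in both directions, with excess signaling implicated in schizophrenia and depletion underlying the Parkinson's off-state.
- **Reward hacking**: an RL agent exploits its reward signal; the brain analog is addiction.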
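
The reward hacking item can be illustrated with a toy two-armed bandit, sketched below. Everything in it is an assumption made for illustration: the action names, the proxy and true reward tables, and the epsilon-greedy settings. The agent learns only from the proxy reward, so its RPE-driven updates pull it toward the action the designer did not intend.

```python
import random

# Hypothetical reward tables: the agent only ever sees PROXY.
PROXY = {"intended": 1.0, "exploit": 2.0}  # signal the agent is trained on
TRUE = {"intended": 1.0, "exploit": 0.0}   # what the designer actually wants


def run_bandit(n_steps=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy value learning on the proxy reward."""
    rng = random.Random(seed)
    q = {action: 0.0 for action in PROXY}
    true_return = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.choice(list(PROXY))  # explore
        else:
            action = max(q, key=q.get)        # exploit current estimates
        rpe = PROXY[action] - q[action]       # error is computed on the proxy
        q[action] += alpha * rpe
        true_return += TRUE[action]
    return q, true_return


q, true_return = run_bandit()
preferred = max(q, key=q.get)
print("learned proxy values:", q)
print("preferred action:", preferred)
print("true return:", true_return, "of a possible", 2000.0)
```

The learning rule works exactly as designed; the failure lies in the gap between the proxy and the true objective, which is the sense in which the note compares reward hacking to addiction.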
## 🧪 Verification / duplicates
- Verified (Schultz 1997 *Science*; Berridge & Robinson 1998 wanting/liking; Sutton & Barto *RL Book* 2018 2e).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE biology + RL bridge + RLHF analog |