---
id: wiki-2026-0508-dopamine
title: Dopamine
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Reinforcement Signal, Prediction Error]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [neuroscience, reinforcement-learning, motivation, ux]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: rl
---
# Dopamine
## One-liner
> **"A reward prediction error signal."** In the modern view, dopamine does not encode pleasure; it broadcasts the *difference between expected and actual reward*. Schultz's (1997) recordings from monkey VTA neurons established its isomorphism with the TD error of reinforcement learning. It is a shared substrate for product UX, addiction design, and RL algorithms.
## Core ideas
### RPE (Reward Prediction Error)
- **Positive RPE**: better than expected. Dopamine burst.
- **Zero RPE**: fully predicted. Baseline firing.
- **Negative RPE**: worse than expected. Firing dip.
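The three cases reduce to the sign of a single number. A minimal sketch, assuming the RPE of one trial is simply `actual - expected`; `rpe_response` and its `tol` threshold are hypothetical names used only for illustration:

```python
def rpe_response(expected: float, actual: float, tol: float = 1e-9) -> tuple[float, str]:
    """Return (delta, response), where delta = actual - expected is the reward prediction error."""
    delta = actual - expected
    if delta > tol:
        return delta, "burst"     # positive RPE: better than expected
    if delta < -tol:
        return delta, "dip"       # negative RPE: worse than expected
    return delta, "baseline"      # zero RPE: fully predicted


print(rpe_response(expected=0.5, actual=1.0))  # (0.5, 'burst')
print(rpe_response(expected=1.0, actual=1.0))  # (0.0, 'baseline')
print(rpe_response(expected=1.0, actual=0.0))  # (-1.0, 'dip')
```

The response depends only on the gap, not on the size of the reward: `rpe_response(0.2, 0.5)` is also a burst even though the reward is smaller than in the first call.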
### Mapping to the TD error in RL
- $\delta = r + \gamma V(s') - V(s)$, with reward $r$, discount factor $\gamma$, and state-value estimate $V$.
- The firing rate of dopamine neurons encodes $\delta$ (Schultz, Dayan & Montague 1997).
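The observation behind this mapping is that, as learning proceeds, the burst moves from the reward to the cue that predicts it, and omitting an expected reward produces a dip at the time the reward was due. A minimal tabular TD(0) sketch reproduces this. It assumes a deterministic chain of `T` steps from cue to reward and a pre-cue baseline whose value is fixed at 0 (cue onset is unpredictable); `cue_reward_trials` and `omit_on_last` are hypothetical names:

```python
import numpy as np


def cue_reward_trials(n_trials=300, T=5, alpha=0.2, gamma=1.0, omit_on_last=False):
    """Tabular TD(0) on a hypothetical chain: cue -> T steps -> reward.

    Returns one (delta_at_cue, delta_at_reward_time) pair per trial.
    """
    V = np.zeros(T + 1)   # V[0] = cue state, V[T] = post-reward terminal state (stays 0)
    V_baseline = 0.0      # assumption: pre-cue state has value 0 (cue timing is unpredictable)
    history = []
    for trial in range(n_trials):
        omit = omit_on_last and trial == n_trials - 1
        delta_cue = gamma * V[0] - V_baseline  # TD error at the baseline -> cue transition
        delta_reward = 0.0
        for t in range(T):
            r = 1.0 if (t == T - 1 and not omit) else 0.0
            delta = r + gamma * V[t + 1] - V[t]
            V[t] += alpha * delta
            if t == T - 1:
                delta_reward = delta
        history.append((float(delta_cue), float(delta_reward)))
    return history


trials = cue_reward_trials()
print(trials[0])   # (0.0, 1.0): before learning, the burst is at the reward
print(trials[-1])  # after learning: delta_at_cue near 1.0, delta_at_reward near 0.0
print(cue_reward_trials(omit_on_last=True)[-1])  # omitted reward: delta_at_reward near -1.0 (a dip)
```

With `gamma < 1` the learned cue response converges to `gamma**T` instead of 1, so longer cue-to-reward delays produce a smaller cue burst in this model.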
### Applications
1. Variable-ratio schedules (slot machines, social media feeds): rewards never become fully predictable, so RPE stays high.
2. Habit formation (intermittent reward).
3. Anhedonia and addiction as dopaminergic dysregulation.
4. RL agent design (curiosity, intrinsic motivation).
## 💻 Patterns
### TD-learning (dopamine analog)
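```python
import numpy as np

rng = np.random.default_rng(0)

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update of value table V. delta = TD error = 'dopamine signal'."""
    delta = r + gamma * V[s_next] - V[s]  # ← RPE
    V[s] += alpha * delta
    return delta  # log this; it's the 'dopamine'

def sample_transition(n_states=10):
    """Hypothetical toy environment: a chain of states; reward 1.0 on entering the last one."""
    s = int(rng.integers(n_states - 1))
    s_next = s + 1
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s, r, s_next

V = np.zeros(10)
for episode in range(1000):
    s, r, s_next = sample_transition()
    rpe = td_update(V, s, r, s_next)
```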
### Curiosity-driven exploration (intrinsic dopamine analog)
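```python
import torch
import torch.nn as nn

# Random Network Distillation (Burda 2018)
class RND(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed, randomly initialized target network
        self.target = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        # Predictor network, trained to imitate the target on visited observations
        self.predictor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        for p in self.target.parameters():
            p.requires_grad = False

    def intrinsic_reward(self, obs):
        # Prediction error stays high on rarely seen observations: novelty bonus
        with torch.no_grad():
            target = self.target(obs)
            pred = self.predictor(obs)
        return ((target - pred) ** 2).mean(-1)

    def predictor_loss(self, obs):
        # Minimize this w.r.t. self.predictor so familiar observations stop paying a bonus
        return ((self.target(obs) - self.predictor(obs)) ** 2).mean()
```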
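The novelty bonus only decays because the predictor is trained on the observations the agent actually visits. A dependency-free sketch of that effect, assuming a simplification that is not the architecture of Burda et al.: a fixed random `tanh` projection as the target and a single linear layer as the predictor, trained by plain gradient descent (`LinearRND` and its parameters are hypothetical):

```python
import numpy as np


class LinearRND:
    """Simplified RND: fixed random tanh target, linear predictor trained by gradient descent."""

    def __init__(self, obs_dim=8, feat_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W_target = rng.normal(size=(obs_dim, feat_dim))  # fixed, never trained
        self.W_pred = np.zeros((obs_dim, feat_dim))            # trained to imitate the target

    def _target(self, x):
        return np.tanh(x @ self.W_target)

    def bonus(self, x):
        # Intrinsic reward = squared prediction error, high for unfamiliar observations
        return float(np.mean((self._target(x) - x @ self.W_pred) ** 2))

    def train_step(self, x, lr=0.02):
        # One gradient step on 0.5 * ||x @ W_pred - target(x)||^2 with respect to W_pred
        err = x @ self.W_pred - self._target(x)
        self.W_pred -= lr * np.outer(x, err)


rng = np.random.default_rng(1)
rnd = LinearRND()
familiar, novel = rng.normal(size=8), rng.normal(size=8)
before = rnd.bonus(familiar)
for _ in range(500):
    rnd.train_step(familiar)  # "visit" the same observation repeatedly
after = rnd.bonus(familiar)
print(before, after, rnd.bonus(novel))  # familiar bonus has collapsed; the novel one has not
```

Training only on `familiar` leaves `novel` unexplained, so its bonus stays high: the novelty signal is always relative to what the agent has already seen.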
### Variable-ratio schedule simulator
```python
import numpy as np

def variable_ratio_session(p_reward=0.1, n_pulls=100):
    rpe_log = []
    expected = p_reward  # learned expectation
    for _ in range(n_pulls):
        r = 1.0 if np.random.rand() < p_reward else 0.0  # reward on a random fraction of pulls
        rpe = r - expected
        expected += 0.05 * rpe  # slow learning
        rpe_log.append(rpe)
    return rpe_log
# Pattern: high-amplitude RPE persists → "addictive" engagement
```
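The last comment can be checked against a fully predictable schedule with the same average payout. A sketch, assuming a delta-rule learner that starts with zero expectation and the illustrative values p = 0.1, learning rate 0.05, and 2000 pulls (`mean_abs_rpe` is a hypothetical helper):

```python
import numpy as np


def mean_abs_rpe(rewards, lr=0.05):
    """Average |RPE| of a delta-rule learner that starts with zero expectation."""
    expected, total = 0.0, 0.0
    for r in rewards:
        rpe = r - expected
        expected += lr * rpe
        total += abs(rpe)
    return total / len(rewards)


rng = np.random.default_rng(0)
n = 2000
variable = (rng.random(n) < 0.1).astype(float)  # pays 1.0 on about 10% of pulls, at random
fixed = np.full(n, 0.1)                          # pays 0.1 on every pull: same mean, fully predictable

print(mean_abs_rpe(variable))  # stays large (roughly 0.18): the surprise never goes away
print(mean_abs_rpe(fixed))     # about 0.001: RPE decays to zero once the payout is learned
```

Both schedules pay the same on average; only the unpredictability differs, and that alone keeps the RPE from extinguishing.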
### Hyperbolic discounting (valuing future rewards)
```python
def hyperbolic_value(reward, delay, k=0.1):
    """Human and animal choices fit a hyperbolic discount curve better than an exponential one."""
    return reward / (1 + k * delay)
```
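The practical signature of hyperbolic discounting is preference reversal: a smaller-sooner reward wins when it is immediate, but the larger-later one wins once both are pushed into the future. Exponential discounting with a constant rate never reverses, because the ratio between the two discounted values does not depend on the shared front delay. A sketch with illustrative numbers (amounts 10 vs 15, a 10-step gap, k = 0.1, exponential rate 0.05; all assumptions):

```python
import math


def hyperbolic_value(reward, delay, k=0.1):
    return reward / (1 + k * delay)


def exponential_value(reward, delay, rate=0.05):
    return reward * math.exp(-rate * delay)


def prefers_larger_later(value_fn, front_delay, small=10.0, large=15.0, gap=10.0):
    """Smaller-sooner reward at front_delay vs larger-later reward `gap` steps after it."""
    return value_fn(large, front_delay + gap) > value_fn(small, front_delay)


for d in (0, 20):
    print(d, prefers_larger_later(hyperbolic_value, d), prefers_larger_later(exponential_value, d))
# 0 False False
# 20 True False
```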
### Opponent process (reward + aversion)
```python
# Two-system model (hypothesized): dopamine (reward) + serotonin (aversion / patience)
def dual_system_update(V_reward, V_aversion, r_pos, r_neg, s, s_next, alpha=0.1, gamma=0.9):
    # Separate TD errors for the reward channel and the aversion channel
    delta_reward = r_pos + gamma * V_reward[s_next] - V_reward[s]
    delta_aversion = r_neg + gamma * V_aversion[s_next] - V_aversion[s]
    V_reward[s] += alpha * delta_reward
    V_aversion[s] += alpha * delta_aversion
    return delta_reward, delta_aversion
```
## Decision criteria
| Situation | Insight |
|---|---|
| Habit-forming product | Variable-ratio reward (slot-machine schedule) |
| Sustained engagement | Mix predictable and unpredictable wins |
| Avoid burnout | Avoid pure RPE maximization (ethical concern) |
| RL exploration stuck | Add intrinsic reward (RND, ICM) |
| Anhedonia in user | Lower expectations; surprise with low-cost wins |

**Default**: RPE-aware design, but give ethics real weight (risk of manipulation).
## 🔗 Graph
- Parents: [[Reinforcement Learning]] · [[Neuroscience]]
- Variants: [[Serotonin]] · [[Norepinephrine]]
- Applications: [[Habit Formation]] · [[Game Design]] · [[Recommender Systems]]
- Adjacent: [[TD-Learning]] · [[Curiosity-Driven RL]] · [[Behavioral Economics]]
## 🤖 Using LLMs
**When**: auditing the retention mechanics of a product's UX; detecting dark patterns.
**When not**: clinical advice. Do not let an LLM make medical claims.
## ❌ Anti-patterns
- **Equating dopamine with pleasure**: wrong. Dopamine carries the RPE signal; pleasure is a separate, opioid-mediated system.
- **Pure exploitation (no novelty)**: the user's RPE falls to zero and they disengage.
- **Manipulative dark patterns**: an ethical violation. Auditing the design is mandatory.
## 🧪 Verification / Duplicates
- Verified (Schultz 1997 Science, Sutton & Barto 2018, Berridge 2007).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE / TD-learning isomorphism + UX implications |