---
id: wiki-2026-0508-dopamine
title: Dopamine
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Reinforcement Signal, Prediction Error]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [neuroscience, reinforcement-learning, motivation, ux]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: rl
---
# Dopamine
## One-liner
> **"The reward prediction error signal."** The modern view holds that dopamine does not encode pleasure; it broadcasts the *difference between expected and actual reward*. Schultz's (1997) recordings from monkey VTA dopamine neurons established the isomorphism with the TD error of reinforcement learning. The same signal is the shared substrate of product UX, addictive design, and RL algorithms.
## Core
### RPE (Reward Prediction Error)
- **Positive RPE**: outcome better than expected → phasic dopamine burst.
- **Zero RPE**: reward fully predicted → baseline firing.
- **Negative RPE**: outcome worse than expected (e.g., a predicted reward is omitted) → dip in firing below baseline.
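The three cases can be reproduced with a minimal sketch of a delta-rule learner (a Rescorla-Wagner-style update chosen here for illustration; the learning rate `lr=0.2`, the reward of 1.0, and the function name `rpe_phases` are assumptions, not from the source). The first reward is a surprise (positive RPE), the RPE shrinks toward zero as the reward becomes predicted, and withholding the reward afterwards gives a negative RPE.

```python
def rpe_phases(n_trials=50, reward=1.0, lr=0.2):
    """Hypothetical delta-rule learner; returns (first-trial RPE, last-trial RPE, omission RPE)."""
    expected = 0.0
    rpes = []
    for _ in range(n_trials):
        rpe = reward - expected  # actual minus expected reward
        expected += lr * rpe     # expectation moves toward the outcome
        rpes.append(rpe)
    omission_rpe = 0.0 - expected  # reward withheld once it is fully expected
    return rpes[0], rpes[-1], omission_rpe

first, learned, omitted = rpe_phases()
print(f"positive: {first:.2f}, near zero: {learned:.4f}, negative: {omitted:.2f}")
```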
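### Mapping to the TD error in RL
- $\delta = r + \gamma V(s') - V(s)$, where $r$ is the reward, $\gamma$ the discount factor, and $V(s)$ the predicted value of state $s$.
- The phasic firing rate of dopamine neurons encodes $\delta$ (Schultz, Dayan & Montague 1997).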
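The TD mapping also accounts for the best-known feature of the recordings: with training, the response moves from the reward to the cue that predicts it. A minimal sketch, assuming a fixed chain of time-step states after the cue (a "complete serial compound" representation), `gamma=1`, and a pre-cue baseline value of 0 because cue onset itself is unpredictable; `td_trial_rpes` and all parameter values are illustrative.

```python
import numpy as np

def td_trial_rpes(n_trials=500, n_steps=5, alpha=0.2, gamma=1.0):
    """TD(0) over a cue -> ... -> reward chain (hypothetical helper).

    Returns the per-step TD errors of the first and last trial.
    Index 0 is cue onset; the last index is reward delivery.
    """
    V = np.zeros(n_steps + 1)  # V[n_steps] is the post-reward terminal state (stays 0)
    history = []
    for _ in range(n_trials):
        # Cue onset: jump from an unpredictable baseline (value 0) into state 0
        rpes = [float(gamma * V[0] - 0.0)]
        for s in range(n_steps):
            r = 1.0 if s == n_steps - 1 else 0.0  # reward arrives on the final step
            delta = r + gamma * V[s + 1] - V[s]
            V[s] += alpha * delta
            rpes.append(float(delta))
        history.append(rpes)
    return history[0], history[-1]

first, last = td_trial_rpes()
print("first trial:", np.round(first, 3))  # RPE only at reward time
print("last trial: ", np.round(last, 3))   # RPE has moved to cue onset
```

A fully predicted reward produces no TD error at delivery, matching the "Zero RPE" case above.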
### Applications
1. Variable-ratio schedules (slot machines, social media feeds): rewards never become predictable, so large RPEs keep occurring.
2. Habit formation (intermittent reward).
3. Dopaminergic dysregulation in anhedonia and addiction.
4. RL agent design (curiosity, intrinsic motivation).
## 💻 Patterns
### TD-learning (dopamine-analog)
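```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update of value table V. δ = TD error = 'dopamine signal'."""
    delta = r + gamma * V[s_next] - V[s]  # ← RPE
    V[s] += alpha * delta
    return delta  # log this; it's the 'dopamine'

rng = np.random.default_rng(0)

def sample_transition():
    """Hypothetical toy environment (placeholder): a random step s -> s+1
    along a 10-state chain, rewarded only on reaching the last state."""
    s = int(rng.integers(0, 9))
    s_next = s + 1
    r = 1.0 if s_next == 9 else 0.0
    return s, r, s_next

V = np.zeros(10)
for episode in range(1000):
    s, r, s_next = sample_transition()
    rpe = td_update(V, s, r, s_next)
```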
### Curiosity-driven exploration (intrinsic dopamine analog)
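```python
import torch
import torch.nn as nn

# Random Network Distillation (Burda 2018)
class RND(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed, randomly initialized target network (never trained)
        self.target = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        # Predictor trained to imitate the target on visited observations
        self.predictor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        for p in self.target.parameters():
            p.requires_grad = False

    def intrinsic_reward(self, obs):
        # Prediction error stays high for rarely seen observations
        with torch.no_grad():
            target = self.target(obs)
            pred = self.predictor(obs)
        return ((target - pred) ** 2).mean(-1)  # novelty bonus

    def predictor_loss(self, obs):
        # Minimizing this on visited states shrinks the bonus for familiar states
        return ((self.target(obs) - self.predictor(obs)) ** 2).mean()
```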
### Variable-ratio schedule simulator
```python
import numpy as np

def variable_ratio_session(p_reward=0.1, n_pulls=100):
    rpe_log = []
    expected = p_reward  # expectation, initialized at the true reward rate
    for _ in range(n_pulls):
        r = 1.0 if np.random.rand() < p_reward else 0.0
        rpe = r - expected
        expected += 0.05 * rpe  # slow learning
        rpe_log.append(rpe)
    return rpe_log
# Pattern: high-amplitude RPE persists → "addictive" engagement
```
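A back-of-the-envelope check on why unpredictable rewards keep RPE alive, assuming a reward of 1.0 delivered with probability `p` and an expectation that has fully converged to `p` (`mean_abs_rpe` is a hypothetical helper): a hit gives RPE `1 - p`, a miss gives `-p`, so the mean absolute RPE is `2p(1 - p)`. It vanishes for a certain (`p = 1`) or impossible (`p = 0`) reward and peaks at `p = 0.5`.

```python
def mean_abs_rpe(p):
    """Expected |RPE| for a Bernoulli(p) reward of 1.0 once expectation has converged to p."""
    hit = p * abs(1.0 - p)         # rewarded pull: RPE = 1 - p
    miss = (1 - p) * abs(0.0 - p)  # unrewarded pull: RPE = -p
    return hit + miss

for p in (0.1, 0.5, 0.9, 1.0):
    print(p, round(mean_abs_rpe(p), 3))
```

Under this simple model, a fully predictable schedule (`p = 1`) stops generating RPE entirely, while any intermittent schedule keeps producing it indefinitely.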
### Hyperbolic discounting (valuing future rewards)
```python
def hyperbolic_value(reward, delay, k=0.1):
    """Humans and animals discount delayed reward closer to hyperbolically than exponentially."""
    return reward / (1 + k * delay)
```
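The practical consequence of the hyperbolic form is preference reversal. A minimal sketch with illustrative numbers (assumed, not from the source): a smaller-sooner reward of 50 now versus a larger-later reward of 100 after 15 steps. With `k=0.1` the hyperbolic agent takes the immediate 50, but once both options are pushed 20 steps into the future it switches to the 100. Exponential discounting scales both options by the same factor `gamma ** shift`, so its choice never flips. `exponential_value` and `prefers_larger_later` are hypothetical helpers.

```python
def hyperbolic_value(reward, delay, k=0.1):
    return reward / (1 + k * delay)

def exponential_value(reward, delay, gamma=0.9):
    return reward * gamma ** delay

def prefers_larger_later(value_fn, shift=0, sooner=(50, 0), later=(100, 15)):
    """True if the (reward, delay) `later` option beats `sooner` when both delays grow by `shift`."""
    return value_fn(later[0], later[1] + shift) > value_fn(sooner[0], sooner[1] + shift)

for name, fn in (("hyperbolic", hyperbolic_value), ("exponential", exponential_value)):
    print(name, [prefers_larger_later(fn, shift) for shift in (0, 20)])
```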
### Opponent process (reward + aversion)
```python
# Two-system model: dopamine (reward) + serotonin (aversion / patience).
# The serotonin-as-aversion pairing is a hypothesis, not settled physiology.
def dual_system_update(V_reward, V_aversion, r_pos, r_neg, s, s_next, alpha=0.1, gamma=0.9):
    # Independent TD errors for the appetitive and aversive channels
    delta_reward = r_pos + gamma * V_reward[s_next] - V_reward[s]
    delta_aversion = r_neg + gamma * V_aversion[s_next] - V_aversion[s]
    V_reward[s] += alpha * delta_reward
    V_aversion[s] += alpha * delta_aversion
    return delta_reward, delta_aversion
```
## Decision criteria
| Situation | Insight |
|---|---|
| Habit-forming product | Variable-ratio reward (slot-machine schedule) |
| Sustained engagement | Mix predictable + unpredictable wins |
| Avoid burnout | Avoid pure RPE-maximization (ethical concern) |
| RL exploration stuck | Add intrinsic reward (RND, ICM) |
| Anhedonia in user | Reduce expectation, surprise with low-cost wins |
**Default**: RPE-aware design, but give ethics real weight (risk of manipulation).
## 🔗 Graph
- Parent: [[Reinforcement Learning]]
- Applications: [[Habit Formation]] · [[Game Design]] · [[Recommender Systems]]
- Adjacent: [[TD-Learning]] · [[Behavioral Economics]]
## 🤖 LLM usage
**When**: auditing retention mechanics in product UX; detecting dark patterns.
**When not**: clinical advice. An LLM should not make medical claims here.
## ❌ Anti-patterns
- **Equating dopamine with pleasure**: wrong. Dopamine carries the RPE signal; pleasure is mediated by a separate (opioid) system.
- **Pure exploitation (no novelty)**: once every reward is predicted, the user's RPE falls to zero and they disengage.
- **Manipulative dark patterns**: an ethical violation; auditing the design is mandatory.
## 🧪 Verification / duplicates
- Verified (Schultz 1997 Science, Sutton & Barto 2018, Berridge 2007).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — RPE / TD-learning isomorphism + UX implication |