---
id: wiki-2026-0508-dopamine
title: Dopamine
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Reward System, Reinforcement Signal, Prediction Error]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [neuroscience, reinforcement-learning, motivation, ux]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: rl
---

# Dopamine

## One-line summary

> **"The reward prediction error signal."** In the modern view, dopamine does not signal pleasure; it broadcasts *the difference between expected and actual reward*. Schultz (1997), recording from monkey VTA neurons, established the isomorphism between this signal and the TD error of reinforcement learning. It is a shared substrate for product UX, addiction design, and RL algorithms.

## Core concepts

### RPE (Reward Prediction Error)

- **Positive RPE**: the outcome is better than expected. Dopamine neurons fire a burst.
- **Zero RPE**: the outcome is fully predicted. Firing stays at baseline.
- **Negative RPE**: the outcome is worse than expected. Firing dips below baseline.

### Mapping to the TD error in RL

- δ = r + γV(s') − V(s), where r is the received reward, γ is the discount factor, and V is the learned value estimate.
- The firing rate of dopamine neurons encodes δ (Schultz, Dayan, Montague 1997).

### Applications

1. Variable-ratio schedules (slot machines, social media feeds): the reward stays unpredictable, so RPE never settles at zero.
2. Habit formation (intermittent reward).
3. Dopaminergic dysregulation in anhedonia and addiction.
4. RL agent design (curiosity, intrinsic motivation).

## 💻 Patterns

### TD-learning (dopamine-analog)

```python
import numpy as np

rng = np.random.default_rng(0)

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update on value table V. delta is the TD error, the 'dopamine signal'."""
    delta = r + gamma * V[s_next] - V[s]  # reward prediction error (RPE)
    V[s] += alpha * delta
    return delta  # log this; it is the 'dopamine' trace

def sample_transition(n_states=10):
    """Placeholder environment (an assumption, not part of the original note):
    uniformly random transitions, reward 1.0 for landing in the last state."""
    s, s_next = int(rng.integers(n_states)), int(rng.integers(n_states))
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s, r, s_next

V = np.zeros(10)
for step in range(1000):  # each iteration is a single transition
    s, r, s_next = sample_transition()
    rpe = td_update(V, s, r, s_next)
```

### Curiosity-driven exploration (intrinsic dopamine analog)

```python
import torch
import torch.nn as nn

# Random Network Distillation (Burda et al., 2018)
class RND(nn.Module):
    def __init__(self):
        super().__init__()
        # Fixed, randomly initialized target network
        self.target = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        # Predictor network trained to match the target's output
        self.predictor = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
        for p in self.target.parameters():
            p.requires_grad = False

    def intrinsic_reward(self, obs):
        with torch.no_grad():
            target = self.target(obs)
        pred = self.predictor(obs)
        # Prediction error is large on unfamiliar observations: novelty bonus
        return ((target - pred) ** 2).mean(-1)
```

### Variable-ratio schedule simulator

```python
import numpy as np

def variable_ratio_session(p_reward=0.1, n_pulls=100):
    """Each pull pays out with probability p_reward (mean ratio 1 / p_reward)."""
    rpe_log = []
    expected = p_reward  # learned expectation, initialized at the true rate
    for _ in range(n_pulls):
        r = 1.0 if np.random.rand() < p_reward else 0.0
        rpe = r - expected
        expected += 0.05 * rpe  # slow learning
        rpe_log.append(rpe)
    return rpe_log

# Pattern: because each outcome is random, RPE never decays to zero,
# which is the mechanism behind "addictive" engagement.
```

### Hyperbolic discounting (dopamine-future)

```python
def hyperbolic_value(reward, delay, k=0.1):
    """Human and animal choices fit hyperbolic discounting better than exponential."""
    return reward / (1 + k * delay)
```

### Opponent process (reward + aversion)

```python
# Two-system sketch: dopamine (reward) and serotonin (aversion / patience).
# The serotonin role is a hypothesized opponency, not an established mapping.
def dual_system_update(V_reward, V_aversion, r_pos, r_neg, s, s_next,
                       alpha=0.1, gamma=0.9):
    # Separate TD errors for the appetitive and aversive value tables
    delta_reward = r_pos + gamma * V_reward[s_next] - V_reward[s]
    delta_aversion = r_neg + gamma * V_aversion[s_next] - V_aversion[s]
    V_reward[s] += alpha * delta_reward
    V_aversion[s] += alpha * delta_aversion
    return delta_reward, delta_aversion
```

## Decision criteria

| Situation | Insight |
|---|---|
| Habit-forming product | Variable-ratio reward (slot-machine schedule) |
| Sustained engagement | Mix predictable and unpredictable wins |
| Avoid burnout | Avoid pure RPE maximization (ethical concern) |
| RL exploration stuck | Add intrinsic reward (RND, ICM) |
| Anhedonia in user | Lower expectations, surprise with low-cost wins |

**Default**: RPE-aware design, weighed against ethics (risk of manipulation).

## 🔗 Graph

- Parent: [[Reinforcement Learning]]
- Applications: [[Habit Formation]] · [[Game Design]] · [[Recommender Systems]]
- Adjacent: [[TD-Learning]] · [[Behavioral Economics]]

## 🤖 LLM usage

**When to use**: auditing the retention mechanics of a product's UX; detecting dark patterns.

**When not to use**: clinical advice. An LLM should not make medical claims.

## ❌ Anti-patterns

- **"Dopamine = pleasure" simplification**: incorrect. Dopamine signals RPE; pleasure is handled by a separate system (opioid).
- **Pure exploitation (no novelty)**: the user's RPE falls to zero and they disengage.
- **Manipulative dark patterns**: an ethical violation. A design audit is mandatory.

## 🧪 Verification / duplicates

- Verified against Schultz 1997 (Science), Sutton & Barto 2018, Berridge 2007.
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: RPE / TD-learning isomorphism + UX implication |
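
## Worked example: RPE sign across learning

The three RPE cases in the RPE list (positive, zero, negative) can be reproduced with a single cue whose value is learned trial by trial. This is a minimal sketch, not a model of real neural data: the function name `rpe_trace`, the learning rate `alpha=0.2`, and the trial counts are illustrative assumptions. The trial ends at the reward, so the `γV(s')` term of δ is zero and the error reduces to `r − V`.

```python
def rpe_trace(rewards, alpha=0.2):
    """Return the RPE on each trial for one cue with learned value V.

    Hypothetical helper for illustration; the trial terminates at the
    reward, so the TD error reduces to delta = r - V.
    """
    V = 0.0
    deltas = []
    for r in rewards:
        delta = r - V       # reward prediction error for this trial
        V += alpha * delta  # move the expectation toward the outcome
        deltas.append(delta)
    return deltas

# 50 rewarded trials, then one trial where the expected reward is omitted
deltas = rpe_trace([1.0] * 50 + [0.0])

first_trial = deltas[0]     # unexpected reward: positive RPE (burst)
learned_trial = deltas[49]  # fully predicted reward: RPE close to zero
omitted_trial = deltas[50]  # expected reward missing: negative RPE (dip)
```

Under these assumptions the first trial yields an RPE of exactly 1.0, the last rewarded trial yields an RPE near zero, and the omission trial yields an RPE near -1.0, matching the burst, baseline, and dip pattern described in the RPE list.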