---
id: wiki-2026-0508-computational-neuroscience-rl
title: Computational Neuroscience & Reinforcement Learning
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [computational neuroscience RL, dopamine RPE, TD learning, basal ganglia, distributional RL, meta-RL]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [neuroscience, reinforcement-learning, dopamine, td-learning, distributional-rl, meta-learning, schultz, bayesian-brain]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: neuroscience / RL
  applicable_to: [RL Algorithms, Brain Disease Models, Bio-Inspired AI]
---

# Computational Neuroscience & RL

## In one line

> **"Dopamine = reward prediction error."** Schultz's findings (1990s) showed that phasic dopamine activity closely matches the error term of TD learning, one of the deepest known connections between brains and AI. Modern extensions: distributional RL, model-based RL, meta-RL.

## Core ideas

### Schultz's dopamine findings
- Dopamine neurons signal the reward prediction error (RPE), not reward itself.
- **Positive RPE** (better than expected): dopamine ↑.
- **Negative RPE** (worse than expected): dopamine ↓.
- → This closely parallels the TD error.

### TD Learning (Sutton-Barto)
- TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
- Update: $V(s_t) \leftarrow V(s_t) + \alpha \delta_t$
- The dopamine signal corresponds to $\delta_t$.

### The brain's RL circuit
- **Basal ganglia**: action selection (actor).
- **VTA / SNc**: dopamine source.
- **Striatum**: value function (critic).
- **Prefrontal cortex**: model-based planning.
- **Hippocampus**: episodic memory / replay.

### Modern findings

#### Distributional RL (Bellemare 2017, Dabney 2020)
- Learns a distribution over returns rather than a single expected value.
- **Quantile Regression DQN, IQN**.
- Dopamine neurons appear to carry a distributional code.
- Enables risk-sensitive decisions.

#### Model-based RL
- Analogous to simulation in prefrontal cortex.
- Dreamer, MuZero.
- Improves sample efficiency.

#### Meta-RL
- Analogous to fast adaptation in prefrontal cortex.
- PEARL, RL².

#### Successor representation
- Models the hippocampus as a cognitive map.
- Supports transfer learning.

#### Replay
- Hippocampal replay during sleep.
- Counterpart in RL: experience replay.

### Disease modeling
- **Parkinson's**: dopamine deficit, modeled as a reduced RL learning rate.
- **Addiction**: drugs hijack the RPE signal ([[Addiction-Neuroscience]]).
- **Depression**: bias toward negative RPEs.
- **OCD**: imbalance between model-based and model-free (habitual) control.
- **Schizophrenia**: altered precision of prediction errors.

### Applications
1. **AI design**: brain-inspired RL.
2. **Drug development**: dopamine modulators.
3. **BCI**: reward-signal interfaces.
4. **Behavioral therapy**: reframing RPEs.
5. **Marketing / nudge**: reward schedule design.

## 💻 Patterns

### TD(0) value learning
```python
import numpy as np

class TDLearner:
    def __init__(self, n_states, alpha=0.1, gamma=0.95):
        self.V = np.zeros(n_states)  # tabular state values
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # discount factor

    def update(self, state, reward, next_state):
        td_error = reward + self.gamma * self.V[next_state] - self.V[state]
        self.V[state] += self.alpha * td_error
        return td_error  # analog of the dopamine signal
```

### Q-Learning (off-policy)
```python
import numpy as np

class QLearner:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise exploit
        if np.random.random() < self.eps:
            return np.random.randint(self.Q.shape[1])
        return self.Q[state].argmax()

    def update(self, s, a, r, s_next):
        # Off-policy target: bootstrap from the greedy action in s_next
        td_target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])
```

### Distributional RL (C51)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C51(nn.Module):
    def __init__(self, state_dim, n_actions, n_atoms=51, v_min=-10.0, v_max=10.0):
        # state_dim: size of the input state vector
        super().__init__()
        self.n_actions = n_actions
        self.n_atoms = n_atoms
        # Fixed support of return atoms; a buffer so it follows .to(device)
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.delta_z = (v_max - v_min) / (n_atoms - 1)
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions * n_atoms),
        )

    def forward(self, state):
        logits = self.net(state).view(-1, self.n_actions, self.n_atoms)
        probs = F.softmax(logits, dim=-1)
        return probs  # one return distribution per action

    def q_values(self, state):
        probs = self(state)
        return (probs * self.support).sum(-1)  # expected return per action
```

### Eligibility trace (TD(λ))
```python
import numpy as np

class TDLambda:
    def __init__(self, n_states, alpha=0.1, gamma=0.95, lam=0.9):
        self.V = np.zeros(n_states)
        self.e = np.zeros(n_states)  # eligibility trace
        self.alpha = alpha
        self.gamma = gamma
        self.lam = lam

    def update(self, state, reward, next_state):
        td_error = reward + self.gamma * self.V[next_state] - self.V[state]
        self.e *= self.gamma * self.lam  # decay all traces
        self.e[state] += 1               # accumulating trace for the current state
        self.V += self.alpha * td_error * self.e
        return td_error
```

### Successor representation
```python
import numpy as np

def learn_sr(transitions, n_states, alpha=0.05, gamma=0.95):
    """SR(s, s') = expected discounted future visits to s', starting from s."""
    M = np.eye(n_states)
    for s, s_next in transitions:
        # One-hot of the current state s, consistent with the identity
        # initialization (each state counts its own visit at t = 0)
        onehot = np.eye(n_states)[s]
        M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    return M

# V(s) = M[s, :] @ R
```

### Brain-inspired Dreamer (model-based)
```python
import random

class Dreamer:
    """Schematic only: world_model, actor, and critic are placeholder
    components with assumed interfaces, not the actual Dreamer architecture."""

    def __init__(self, world_model, actor, critic):
        self.world_model = world_model  # prefrontal-like
        self.actor = actor
        self.critic = critic

    def imagine(self, init_state, horizon=15):
        """Simulate a trajectory inside the world model."""
        states, actions, rewards = [init_state], [], []
        for _ in range(horizon):
            a = self.actor(states[-1])
            s_next, r = self.world_model(states[-1], a)
            actions.append(a)
            rewards.append(r)
            states.append(s_next)
        return states, actions, rewards

    def train(self, real_trajectories, updates=100):
        # 1. Train the world model (predict next state + reward)
        self.world_model.train(real_trajectories)
        # 2. Train actor + critic on imagined trajectories
        for _ in range(updates):
            init = random.choice(real_trajectories)[0]  # first state of a real trajectory
            states, actions, rewards = self.imagine(init)
            self.critic.train(states, rewards)
            self.actor.train(states, self.critic)
```

### Disease modeling (Parkinson's)
```python
def parkinson_simulation(td_learner, dopamine_deficit=0.5):
    """Model a dopamine deficit as a lower effective learning rate."""
    td_learner.alpha *= (1 - dopamine_deficit)
    # Result: slow learning, reduced motivation.
    return td_learner
```

### RPE-based UI feedback (gamification done right)
```python
def calibrate_reward(expected, actual):
    """Give the user explicit feedback on expected vs. actual outcome."""
    rpe = actual - expected
    if rpe > 0.3:
        return 'GREAT - exceeded expectations!'
    elif rpe < -0.3:
        return 'Try again - fell short.'
    return 'On track.'
```

## 🤔 Decision criteria

| Use case | Approach |
|---|---|
| Discrete env | Q-Learning / DQN |
| Continuous | DDPG / SAC |
| High-dim state | DQN / Rainbow |
| Model-based | Dreamer / MuZero |
| Risk-sensitive | Distributional RL |
| Sparse reward | Curiosity / RND |
| Few-shot | Meta-RL |
| Brain disease modeling | RPE + lesion |

**Default**: PPO / SAC, plus distributional value estimates or replay where they apply. Brain-inspired choice: Dreamer.

## 🔗 Graph
- Parent: [[Reinforcement-Learning]] · [[Computational-Neuroscience-RL|Computational-Neuroscience]]
- Variants: [[TD-Learning]] · [[Distributional-RL]] · [[Meta-RL]]
- Applications: [[Disease-Modeling]]
- Adjacent: [[Bayesian-Brain-Hypothesis]] · [[Biological-Intelligence]] · [[Addiction-Neuroscience]] · [[Brain-Derived Neurotrophic Factor (BDNF)]]
- Concept: [[Reward-Prediction-Error]] · [[Dopamine]] · [[Basal-Ganglia]]

## 🤖 LLM usage

**When to use**: RL algorithm design, brain-inspired AI, disease models, reward schedules.

**When not to use**: purely supervised problems; specific clinical decisions (consult a physician).

## ❌ Anti-patterns
- **Scalar reward only**: loses distributional information.
- **No model (always model-free)**: sample inefficient.
- **Noisy TD errors**: unstable learning.
- **Over-claiming biological literalness**: most brain-to-algorithm mappings are metaphors.
- **Expecting a disease cure from a model**: simulations have limits.

## 🧪 Verification / duplicates
- Verified (Schultz dopamine, Sutton-Barto RL book, Dabney distributional, Hafner Dreamer).
- Trust level: A.
- Related: [[Bayesian-Brain-Hypothesis]] · [[Biological-Intelligence]] · [[Addiction-Neuroscience]] · [[Brain-Derived Neurotrophic Factor (BDNF)]] · [[Reinforcement-Learning]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: Schultz + TD + distributional / SR / Dreamer code + disease |
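
## Worked example: RPE shifting from reward to cue

The claim that dopamine behaves like the TD error $\delta$ can be illustrated with a toy conditioning trial. The sketch below is a minimal simulation under stated assumptions, not a model taken from the cited sources: a cue is followed by `n_steps` delay states and then a reward, each delay step is its own tabular state, and the pre-cue baseline value is held at 0 because cue onset is assumed to be unpredictable. The function name `run_conditioning` and all parameters are hypothetical, chosen for illustration.

```python
def run_conditioning(n_steps=5, n_trials=500, alpha=0.1, gamma=0.95, reward=1.0):
    """TD(0) on a cue -> delay -> reward trial.

    Returns the learned values plus the RPE at cue onset and at reward
    delivery for every trial.
    """
    # V[0..n_steps-1] are the post-cue states; V[n_steps] is terminal (stays 0).
    V = [0.0] * (n_steps + 1)
    cue_rpe, reward_rpe = [], []
    for _ in range(n_trials):
        # Transition from the baseline state (value assumed fixed at 0, no reward)
        # into the first post-cue state.
        cue_rpe.append(0.0 + gamma * V[0] - 0.0)
        for t in range(n_steps):
            r = reward if t == n_steps - 1 else 0.0  # reward only on the last step
            delta = r + gamma * V[t + 1] - V[t]       # TD error, as in the TD section
            V[t] += alpha * delta
            if t == n_steps - 1:
                reward_rpe.append(delta)
    return V, cue_rpe, reward_rpe


V, cue_rpe, reward_rpe = run_conditioning()

# Probe: withhold the reward after learning (no value update applied).
omission_rpe = 0.0 + 0.95 * V[5] - V[4]

print("first trial  cue / reward RPE:", cue_rpe[0], reward_rpe[0])
print("last trial   cue / reward RPE:", cue_rpe[-1], reward_rpe[-1])
print("omission RPE after learning:  ", omission_rpe)
```

Under these assumptions the first trial shows an RPE only at reward delivery, while after training the RPE at reward time shrinks toward zero and a positive RPE appears at cue onset (approaching `gamma ** n_steps`). Withholding the reward after learning produces a negative RPE, mirroring the positive and negative RPE cases listed under Schultz's findings.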