[G1-Sync] Manual knowledge update

Antigravity Agent
2026-05-10 22:08:15 +09:00
parent 21ac3ed255
commit 504fd5fb42
3011 changed files with 380280 additions and 206977 deletions
+157 -40
@@ -2,62 +2,179 @@
id: wiki-2026-0508-rl-neuroscience
title: RL Neuroscience
category: 10_Wiki/Topics
status: needs_review
status: verified
canonical_id: self
aliases: [P-Reinforce-AI-003]
aliases: [Reinforcement Learning Neuroscience, Computational Neuroscience of RL, Dopamine RPE]
duplicate_of: none
source_trust_level: A
confidence_score: 0.98
tags: [ai, rl, neuroscience, brain]
confidence_score: 0.85
verification_status: applied
tags: [reinforcement-learning, neuroscience, dopamine, computational-neuroscience]
raw_sources: []
last_reinforced: 2026-04-20
github_commit: batch-reinforce-04
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
language: python
framework: numpy
---
# RL_Neuroscience (Computational Reinforcement Learning)
# RL Neuroscience
## 📌 One-Line Insight (The Karpathy Summary)
> The pinnacle of computational neuroscience: uncovering the nature of intelligence through the convergence of the biological mechanisms of reward learning and machine learning algorithms.
## One line
> **"Dopamine = reward prediction error (RPE)"**. The single-cell recordings of Schultz 1997 confirmed a brain analogue of TD learning. The basal ganglia act as an actor-critic, and the prefrontal cortex supports model-based planning. As of 2026, the dopamine population code lends support to distributional RL (Dabney 2020), and deep RL ↔ neuroscience remains an active bridge.
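## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** A cyclic optimization pattern that uses reward signals obtained from interaction with the environment to update the policy and the value function.
- **Details:**
- Demonstrates the mathematical correspondence between TD (temporal difference) learning and dopamine signals.
- Analyzes the brain's processing pathways for model-based vs model-free learning.
- Mimics the prefrontal function that balances exploration and exploitation.
## Core
## ⚠️ Contradictions & Updates
- **Conflict with past data:** Extends the simple conditioned-reflex model to a "computational agent" model that predicts future value.
- **Policy change:** Placed at the top as the theoretical basis for the core logic (Self-[[Optimization|Optimization]]) of the P-Reinforce engine.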
### Key findings
- **Dopamine = RPE** (Schultz, Dayan, Montague 1997): the firing of VTA / SNc dopamine neurons encodes the TD error δ = R + γV(s') - V(s).
- **Phasic vs tonic**: a phasic burst signals a positive RPE and a dip a negative RPE; tonic levels relate to uncertainty / motivation.
- **Distributional dopamine** (Dabney, Kurth-Nelson et al. 2020, Nature): different DA neurons encode different quantiles of the return distribution.
- **Actor-critic in the basal ganglia**: striatum (D1 direct = go, D2 indirect = no-go) = actor, dopamine = critic signal.
- **Model-based control in PFC + hippocampus**: replay, planning, successor representation.
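
The three phasic cases (burst, no response, dip) follow directly from the TD error δ = R + γV(s') - V(s). The sketch below works through them with hypothetical values chosen only for illustration; the function name `rpe` and the numbers are assumptions, not taken from any dataset.

```python
def rpe(reward, v_next, v_current, gamma=0.9):
    """TD error: delta = R + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# Hypothetical values, chosen only to illustrate the three phasic cases.
unexpected_reward = rpe(reward=1.0, v_next=0.0, v_current=0.0)  # positive RPE: burst
predicted_reward = rpe(reward=1.0, v_next=0.0, v_current=1.0)   # zero RPE: no response
omitted_reward = rpe(reward=0.0, v_next=0.0, v_current=1.0)     # negative RPE: dip

print(unexpected_reward, predicted_reward, omitted_reward)  # 1.0 0.0 -1.0
```

A fully predicted reward produces no error, which is why dopamine cannot be read as a plain reward signal.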
## 🔗 Knowledge Links (Graph)
- **Parent:** 10_Wiki/💡 Topics/AI
- **Related:** [[Dopamine|Dopamine]], [[Operant_Conditioning|Operant_Conditioning]], [[Reinforcement-Learning|Reinforcement-Learning]]
- **Raw Source:** 00_Raw/2026-04-20/[[Computational Neuroscience of Reinforcement Learning|Computational Neuroscience of Reinforcement Learning]].md
### Brain ↔ RL mapping
| Brain | RL concept |
|---|---|
| VTA / SNc dopamine | TD error δ |
| Striatum (D1/D2) | actor / policy |
| Ventral striatum | state value V(s) |
| OFC | expected outcome / Q(s,a) |
| dlPFC | working memory / model-based |
| Hippocampus | successor representation, replay |
| Anterior cingulate | exploration / volatility |
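
To make the mapping concrete, here is a minimal actor-critic sketch on a two-armed bandit, where the action preferences stand in for the striatal actor and a scalar value estimate stands in for the critic. The task, the reward probabilities, the learning rates, and the function name `actor_critic_bandit` are all illustrative assumptions.

```python
import numpy as np

def actor_critic_bandit(p_reward=(0.2, 0.8), n_trials=2000,
                        alpha_critic=0.1, alpha_actor=0.1, seed=0):
    """Softmax actor plus scalar critic on a one-step bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    prefs = np.zeros(len(p_reward))  # actor: action preferences
    value = 0.0                      # critic: value of the single state
    for _ in range(n_trials):
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()
        action = rng.choice(len(p_reward), p=probs)
        reward = float(rng.random() < p_reward[action])
        # One-step task with no successor state, so delta reduces to R - V
        delta = reward - value
        value += alpha_critic * delta
        # Gradient of log softmax: one-hot(action) - probs
        grad = -probs
        grad[action] += 1.0
        prefs += alpha_actor * delta * grad
    return prefs, value

prefs, value = actor_critic_bandit()
print(prefs, value)
```

The same scalar `delta` trains both the critic and the actor, which is the sense in which dopamine is described as the critic signal.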
## 🤖 LLM Usage Hints (How to Use This Knowledge)
### Model-free vs model-based
- **Model-free** (habit, dorsolateral striatum): TD, slow, cached.
- **Model-based** (goal-directed, dorsomedial striatum + PFC): plan, fast adapt, costly.
- **Arbitrator** (Daw 2005): uncertainty-weighted blend; habits dominate once behavior is well trained.
**When to use this knowledge:**
- *(TODO)*
### Applications
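
One way to picture an uncertainty-weighted blend is inverse-variance weighting of the two controllers' value estimates. This is a simplified sketch under that assumption, not the exact arbitration scheme of Daw 2005; `arbitrate` and all numbers are hypothetical.

```python
def arbitrate(q_mf, q_mb, var_mf, var_mb):
    """Blend two value estimates, trusting the less uncertain one more."""
    w_mb = var_mf / (var_mf + var_mb)  # model-based weight in [0, 1]
    return w_mb * q_mb + (1.0 - w_mb) * q_mf, w_mb

# Early in training the cached (model-free) estimate is still noisy.
q_early, w_early = arbitrate(q_mf=0.2, q_mb=0.8, var_mf=1.0, var_mb=0.25)
# After extensive training the cached estimate is precise, so habits dominate.
q_late, w_late = arbitrate(q_mf=0.7, q_mb=0.8, var_mf=0.05, var_mb=0.25)

print(w_early, w_late)
```

With these illustrative variances the model-based weight drops from 0.8 to about 0.17 as the model-free estimate becomes more certain.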
1. Computational psychiatry (addiction, depression, OCD as RL dysfunction).
2. Drug action modeling (cocaine, SSRI, ketamine).
3. Brain-inspired RL (distributional, hierarchical, replay).
4. Neural prosthetics (BCI with RL decoding).
**When not to use it:**
- *(TODO)*
## 💻 Patterns
## 🧪 Validation Status (Validation)
### Dopamine simulation with TD learning
```python
import numpy as np
- **Information status:** needs_review
- **Source trust level:** A
- **Review reason:** *(P-Reinforce Phase 1 auto-normalization. Body needs verification.)*
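def td_value(rewards, V, cs_time, gamma=0.9, alpha=0.1):
    """Run one trial of tabular TD(0) and update V in place.

    V[t] is the value of the state at time step t. States before cs_time
    are never updated, because the cue arrives unpredictably.
    """
    rpes = np.zeros(len(rewards))
    for t in range(len(rewards) - 1):
        # RPE on arriving at step t+1: the simulated dopamine signal
        rpe = rewards[t + 1] + gamma * V[t + 1] - V[t]
        if t >= cs_time:
            V[t] += alpha * rpe
        rpes[t + 1] = rpe
    return V, rpes
## 🧬 Duplicate Check
# Cue-reward conditioning in the style of Schultz 1997
cs_time, reward_time, n_steps = 3, 7, 10
rewards = np.zeros(n_steps)
rewards[reward_time] = 1.0  # reward at t=7; the CS at t=3 carries no reward
V = np.zeros(n_steps)       # values persist across trials
trials = []
for trial in range(100):
    V, rpes = td_value(rewards, V, cs_time)
    trials.append(rpes)
# Early trials: phasic burst at the reward (t=7)
# Late trials: the burst shifts to the CS (t=3), i.e. prediction-error transfer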
```
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (auto-normalization)
- **Reason:** Phase 1 normalization: old template / missing fields filled in.
### Distributional TD (Dabney 2020, neural)
```python
import numpy as np

# Each "DA neuron" has its own quantile tau_i in (0, 1) and asymmetric scaling
def quantile_td(returns, taus, lr=0.05):
    taus = np.asarray(taus, dtype=float)
    Q = np.zeros(len(taus))
    for r in returns:
        for i, tau in enumerate(taus):
            delta = r - Q[i]
            # Asymmetric step: +lr*tau on a positive RPE, -lr*(1 - tau) on a
            # negative one. Stepping on the sign only targets quantiles;
            # multiplying by delta instead would target expectiles.
            Q[i] += lr * (tau if delta > 0 else -(1 - tau))
    return Q  # distribution-encoding population
```
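
As a sanity check on the idea that a population of asymmetric learners can encode a distribution, the sketch below learns three quantiles of a synthetic reward distribution and compares them with `np.quantile`. It uses the sign-only (pinball-loss) update, under which the estimates settle near quantiles; scaling the step by the error magnitude would target expectiles instead. The uniform reward distribution, the step size, and the name `learn_quantiles` are assumptions made for illustration.

```python
import numpy as np

def learn_quantiles(samples, taus, lr=0.01):
    """Online quantile estimates, one per tau, via sign-based asymmetric steps."""
    taus = np.asarray(taus, dtype=float)
    q = np.zeros(len(taus))
    for r in samples:
        # Step up by lr*tau when the sample exceeds the estimate,
        # step down by lr*(1 - tau) otherwise.
        q += lr * np.where(r > q, taus, taus - 1.0)
    return q

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 10.0, size=20000)  # synthetic rewards
taus = [0.1, 0.5, 0.9]

learned = learn_quantiles(samples, taus)
empirical = np.quantile(samples, taus)
print(learned, empirical)
```

Optimistic units (high tau) settle near the top of the distribution and pessimistic units near the bottom, so the population as a whole carries more than the mean.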
## 🕓 Changelog
### Successor representation
```python
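import numpy as np

def successor_repr(transitions, n_states, gamma=0.9, lr=0.1):
    # transitions: iterable of (s, s_next) state-index pairs
    M = np.zeros((n_states, n_states))
    identity = np.eye(n_states)
    for s, sp in transitions:
        # TD update on the row of the state just left
        M[s] += lr * (identity[s] + gamma * M[sp] - M[s])
    return M  # hippocampal SR (Stachenfeld 2017)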
```
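
For a known transition matrix the successor representation also has a closed form, M = (I - γT)⁻¹, and state values follow as V = M r. The sketch below checks this on a hypothetical three-state chain; the chain, the reward vector, and the variable names are assumptions for illustration.

```python
import numpy as np

gamma = 0.9
# Hypothetical chain: 0 -> 1 -> 2, with state 2 absorbing.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

# Closed-form SR: expected discounted future occupancy of each state
M = np.linalg.inv(np.eye(3) - gamma * T)

reward = np.array([0.0, 0.0, 1.0])  # reward collected while in state 2
V = M @ reward
print(V)

# Revaluation: a new reward vector gives new values without relearning M
V_revalued = M @ np.array([0.0, 1.0, 0.0])
print(V_revalued)
```

Separating occupancy (M) from reward (r) is what lets values adapt quickly when rewards change while the transition structure stays the same.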
| Date | Change | Handling | Trust level |
|------|-----------|-----------|--------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |
### Two-step task (Daw 2011 model-based vs model-free)
```python
# Stage 1: choice A leads to S2_left with p=0.7 and S2_right with p=0.3
# Stage 2: reward probability varies
# Model-free: stay if rewarded, regardless of transition type
# Model-based: stay if rewarded after a common transition or unrewarded
#              after a rare one; otherwise switch
def two_step_choice(prev_choice, prev_reward, prev_common, w_mb=0.5):
    # w_mb is the model-based weight in [0, 1]
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if bool(prev_reward) == bool(prev_common) else -1
    score = (1 - w_mb) * mf_pref + w_mb * mb_pref
    # Ties (score == 0) resolve to switching
    return prev_choice if score > 0 else 1 - prev_choice
```
### Volatility-weighted learning rate (Behrens 2007)
```python
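import numpy as np

# ACC tracks volatility: higher volatility -> higher learning rate.
# Rolling RPE variance is a crude proxy, not the Bayesian learner of the paper.
def volatility_lr(rpes, base_lr=0.05, window=10):
    recent = np.asarray(rpes[-window:], dtype=float)
    vol = recent.var() if recent.size else 0.0  # rolling variance
    return base_lr * (1 + vol)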
```
### TD addiction model (Redish 2004)
```python
import numpy as np

# Cocaine imposes an RPE floor: the drug-induced RPE cannot be predicted away
def cocaine_td(rewards, drug_mask, gamma=0.9, alpha=0.1, drug_floor=1.0, V=None):
    # Pass V back in on later episodes so values accumulate across episodes
    V = np.zeros(len(rewards)) if V is None else V
    for t in range(len(rewards) - 1):
        delta = rewards[t] + gamma * V[t + 1] - V[t]
        if drug_mask[t]:
            delta = max(delta, drug_floor)  # always-positive RPE -> compulsion
        V[t] += alpha * delta
    return V
```
## Decision criteria
| Situation | Approach |
|---|---|
| Modeling phasic DA | classic TD with γ ≈ 0.9 |
| Modeling DA population variance | distributional TD with quantiles |
| Modeling habits vs goals | hybrid MF + MB with arbitrator |
| Modeling replay | SR + offline updates |
| Computational psychiatry | param fit per subject (hBayesDM, JAGS) |
| Drug / lesion effect | parameter perturbation (lower α, biased ε) |
**Default**: start with a single-RPE TD model. Use distributional TD for a modern fit to population DA. Use SR or an MB-MF arbitrator when prefrontal-hippocampal richness is needed.
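
Per-subject parameter fitting can be sketched without any dedicated toolbox. The example below simulates one synthetic subject on a two-armed bandit and recovers the learning rate and inverse temperature by grid-search maximum likelihood. It is a plain-NumPy stand-in, not the hBayesDM or JAGS workflow; the task, the true parameters, the grid ranges, and the function names are all assumptions.

```python
import numpy as np

def simulate(alpha, beta, p_reward=(0.2, 0.8), n_trials=500, seed=0):
    """Synthetic subject: Q-learning with a softmax (logistic) choice rule."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    choices, rewards = [], []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        c = int(rng.random() < p1)
        r = float(rng.random() < p_reward[c])
        q[c] += alpha * (r - q[c])
        choices.append(c)
        rewards.append(r)
    return np.array(choices), np.array(rewards)

def neg_log_lik(alpha, beta, choices, rewards):
    """Negative log-likelihood of the observed choices under the same model."""
    q = np.zeros(2)
    nll = 0.0
    for c, r in zip(choices, rewards):
        p1 = 1.0 / (1.0 + np.exp(-beta * (q[1] - q[0])))
        p_choice = p1 if c == 1 else 1.0 - p1
        nll -= np.log(p_choice + 1e-12)
        q[c] += alpha * (r - q[c])
    return nll

choices, rewards = simulate(alpha=0.3, beta=3.0)

alphas = np.linspace(0.05, 0.95, 19)
betas = np.linspace(0.5, 8.0, 16)
grid = np.array([[neg_log_lik(a, b, choices, rewards) for b in betas]
                 for a in alphas])
i, j = np.unravel_index(np.argmin(grid), grid.shape)
alpha_hat, beta_hat = alphas[i], betas[j]
print(alpha_hat, beta_hat)
```

A drug or lesion effect would then be modeled as a shift in the fitted parameters between conditions, for example a lower `alpha_hat`.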
## 🔗 Graph
- Parent: [[Reinforcement-Learning]] · [[Computational-Neuroscience]]
- Variants: [[Distributional-RL]] · [[Successor-Representation]] · [[Model-Based-RL]]
- Applications: [[Computational-Psychiatry]] · [[Brain-Inspired-AI]]
- Adjacent: [[Dopamine]] · [[Basal-Ganglia]] · [[Hippocampal-Replay]] · [[Bayesian-Brain]]
## 🤖 LLM Usage
**When**: literature digest (Schultz, Dayan, Niv, Daw papers), TD / SR sim scaffolding, hypothesis generation for fitting tasks.
**When not**: empirical claims about specific brain areas; verify these against primary sources. LLMs occasionally mix up model-based and model-free terminology.
## ❌ Anti-patterns
- **DA = reward**: wrong. DA encodes RPE, so only unpredicted reward evokes a burst.
- **Single-RPE for all DA**: the distributional account is the newer view.
- **Equating the brain with deep RL**: deep nets are inspired by the brain, not identical to it. The brain is sample-efficient, cortical, and multi-system.
- **Ignoring tonic DA**: motivation / vigor is separate from phasic RPE.
- **Behaviorism only**: ignores neural data; the path from brain to behavior is multi-level.
## 🧪 Validation / Duplicates
- Verified (Schultz 1997, Sutton & Barto 2018 ch 15, Dabney 2020 Nature, Daw 2011, Niv 2009 review, Stachenfeld 2017 SR).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TD/distributional/SR/two-step patterns + brain-RL mapping |