


RL Neuroscience

One-line summary

"Dopamine = reward prediction error (RPE)." The single-cell recordings summarized in Schultz 1997 supported TD learning as a computational account of phasic dopamine firing. The basal ganglia map onto an actor-critic architecture, and the prefrontal cortex onto model-based planning. As of 2026, the distributional RL account (Dabney 2020), in which the dopamine population encodes a distribution over returns, remains an active bridge between deep RL and neuroscience.

Key points

Key findings

Dopamine = RPE (Schultz, Dayan, Montague 1997): firing of VTA / SNc dopamine neurons encodes the TD error δ = R + γV(s') − V(s).
Phasic vs tonic: phasic burst = positive RPE, dip = negative RPE; tonic = uncertainty / motivation.
Distributional dopamine (Dabney/Kurth-Nelson 2020 Nature): different DA neurons encode different parts of the return distribution, from pessimistic to optimistic (quantile-like).
Basal ganglia as actor-critic: striatum (D1 direct = go, D2 indirect = no-go) = actor, dopamine = critic signal.
PFC + hippocampus for model-based control: replay, planning, successor representation.
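A minimal sketch of the RPE term for the three classic conditions: unexpected reward, fully predicted reward, and omitted reward. The function name `rpe` and all numeric values are illustrative assumptions, not data from the cited recordings.

```python
def rpe(reward, v_next, v_current, gamma=0.9):
    """TD error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# Terminal step in every case, so V(s') = 0 (illustrative values).
unexpected = rpe(reward=1.0, v_next=0.0, v_current=0.0)  # no prediction yet: burst
predicted = rpe(reward=1.0, v_next=0.0, v_current=1.0)   # fully predicted: no response
omitted = rpe(reward=0.0, v_next=0.0, v_current=1.0)     # reward withheld: dip

print(unexpected, predicted, omitted)
```

The sign of the result corresponds to the phasic burst (positive), baseline (zero), and dip (negative) described in the list.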

Brain ↔ RL mapping

Brain	RL concept
VTA / SNc dopamine	TD error δ
Striatum (D1/D2)	actor / policy
Ventral striatum	state value V(s)
OFC	expected outcome / Q(s,a)
dlPFC	working memory / model-based
Hippocampus	successor representation, replay
Anterior cingulate	exploration / volatility
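The actor / critic rows of the mapping can be sketched as a single-state actor-critic on a two-armed bandit. This is an illustrative toy, not a model of striatal circuitry: `run_actor_critic`, the reward probabilities, and the learning rates are all assumptions chosen for the example.

```python
import math
import random

def softmax(prefs):
    """Convert action preferences into choice probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_actor_critic(reward_probs, n_trials=2000, alpha_critic=0.1,
                     alpha_actor=0.1, seed=0):
    rng = random.Random(seed)
    value = 0.0                         # critic: V(s) for the single state
    prefs = [0.0] * len(reward_probs)   # actor: one preference per action
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        delta = reward - value          # TD error (no successor state here)
        value += alpha_critic * delta   # critic update
        # Actor update: delta > 0 strengthens the chosen action ("go"),
        # delta < 0 weakens it ("no-go").
        prefs[action] += alpha_actor * delta
    return value, prefs

value, prefs = run_actor_critic([0.2, 0.8])
print(value, prefs)
```

The same scalar `delta` trains both the value estimate and the policy, which is the sense in which dopamine is described as the critic's signal to the actor.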

Model-free vs model-based

Model-free (habit, dorsolateral striatum): TD, slow, cached.
Model-based (goal-directed, dorsomedial striatum + PFC): plan, fast adapt, costly.
Arbitrator (Daw 2005): uncertainty-weighted blend of the two systems; habits come to dominate once behavior is extensively trained.
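One way to sketch an uncertainty-weighted blend is inverse-variance weighting of the two value estimates. This weighting rule is an assumption chosen for illustration; the function name `arbitrate` and the variances below are hypothetical, and the published arbitration scheme differs in detail.

```python
def arbitrate(q_mf, var_mf, q_mb, var_mb):
    """Blend two value estimates, weighting each by its reliability (1 / variance)."""
    w_mb = (1.0 / var_mb) / (1.0 / var_mb + 1.0 / var_mf)
    return w_mb, w_mb * q_mb + (1.0 - w_mb) * q_mf

# Early in training: cached (model-free) values are still noisy.
w_early, q_early = arbitrate(q_mf=0.2, var_mf=1.0, q_mb=0.8, var_mb=0.25)

# After extensive training: cached values have become reliable.
w_late, q_late = arbitrate(q_mf=0.7, var_mf=0.05, q_mb=0.8, var_mb=0.25)

print(w_early, w_late)
```

As the model-free variance shrinks with training, the model-based weight falls, which reproduces the shift from goal-directed to habitual control.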

Applications

Computational psychiatry (addiction, depression, OCD as RL dysfunction).
Drug action modeling (cocaine, SSRI, ketamine).
Brain-inspired RL (distributional, hierarchical, replay).
Neural prosthetics (BCI with RL decoding).

💻 Patterns

TD-learning dopamine simulation

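import numpy as np

def td_trial(V, rewards, cs_t, gamma=0.9, alpha=0.1):
    """Run one trial of tabular TD(0); update V in place and return the RPEs."""
    T = len(rewards)
    rpes = np.zeros(T)
    for t in range(T):
        v_next = V[t + 1] if t + 1 < T else 0.0    # value after the trial ends is 0
        rpe = rewards[t] + gamma * v_next - V[t]   # dopamine-like signal
        rpes[t] = rpe
        if t >= cs_t:   # pre-CS steps stay at 0 because cue onset is unpredictable
            V[t] += alpha * rpe
    return rpes

# Cue-reward conditioning (Schultz 1997)
CS_T, REWARD_T = 3, 7
rewards = np.zeros(10)
rewards[REWARD_T] = 1.0     # the CS at t=3 is a cue, not a reward
V = np.zeros(10)            # V persists across trials so learning accumulates
trials = [td_trial(V, rewards, CS_T) for _ in range(300)]
# Early trials: positive RPE at the reward (index 7).
# Late trials: the RPE at the reward shrinks toward 0 and a positive RPE appears at
# cue onset (index 2, the transition into the CS step): prediction-error transfer.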

Distributional TD (Dabney 2020, neural)

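# Each "DA neuron" has its own asymmetry τᵢ ∈ (0, 1) that scales positive and negative RPEs differently
def quantile_td(returns, taus, lr=0.05):
    taus = np.asarray(taus, dtype=float)
    Q = np.zeros_like(taus)
    for r in returns:
        for i, tau in enumerate(taus):
            delta = r - Q[i]
            # Asymmetric update: positive RPEs weighted by tau, negative by (1 - tau).
            # Scaling delta itself converges to expectiles; use np.sign(delta)
            # in place of delta for strict quantile regression.
            Q[i] += lr * (tau if delta > 0 else (1 - tau)) * delta
    return Q   # population that jointly encodes the return distribution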
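A small usage sketch of the asymmetric update on a bimodal reward, written with the standard library only so it runs on its own. The helper `asymmetric_td`, the reward magnitudes, and the three asymmetry values are assumptions for illustration. For this two-point distribution the τ-expectile works out to 1 + 8τ, so the three units should settle near 1.8, 5.0, and 8.2, up to sampling noise.

```python
import random

def asymmetric_td(rewards, taus, lr=0.02):
    """Each unit scales positive RPEs by tau and negative RPEs by (1 - tau)."""
    estimates = [0.0] * len(taus)
    for r in rewards:
        for i, tau in enumerate(taus):
            delta = r - estimates[i]
            scale = tau if delta > 0 else (1.0 - tau)
            estimates[i] += lr * scale * delta
    return estimates

rng = random.Random(0)
# Bimodal reward: 1.0 or 9.0 with equal probability (illustrative magnitudes).
rewards = [rng.choice([1.0, 9.0]) for _ in range(5000)]
pessimistic, neutral, optimistic = asymmetric_td(rewards, taus=[0.1, 0.5, 0.9])

print(pessimistic, neutral, optimistic)
```

A single symmetric unit would report only the mean (about 5); the spread across units is what carries information about the shape of the distribution.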

Successor representation

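def successor_repr(transitions, n_states, gamma=0.9, alpha=0.1):
    # transitions: iterable of (s, s_next) integer state pairs
    M = np.zeros((n_states, n_states))
    I = np.eye(n_states)
    for s, sp in transitions:
        # TD update of row s toward one-hot(s) + gamma * M[s_next]
        M[s] += alpha * (I[s] + gamma * M[sp] - M[s])
    return M   # hippocampal SR (Stachenfeld 2017)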
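A sketch of why the SR is useful: once M is learned, values follow from V = M · r, so moving the reward changes V without relearning M. The three-state chain, the helper names `learn_sr` and `values`, and the pass count are assumptions for this example; it uses plain lists so it runs without NumPy.

```python
def learn_sr(transitions, n_states, gamma=0.9, alpha=0.1, n_passes=2000):
    """TD-learn the successor matrix M by replaying a fixed set of transitions."""
    M = [[0.0] * n_states for _ in range(n_states)]
    for _ in range(n_passes):
        for s, s_next in transitions:
            for j in range(n_states):
                onehot = 1.0 if j == s else 0.0
                M[s][j] += alpha * (onehot + gamma * M[s_next][j] - M[s][j])
    return M

def values(M, reward):
    """V = M . r: expected discounted future reward from each state."""
    return [sum(m * r for m, r in zip(row, reward)) for row in M]

transitions = [(0, 1), (1, 2), (2, 2)]   # chain 0 -> 1 -> 2, state 2 absorbing
M = learn_sr(transitions, n_states=3)

v_before = values(M, [0.0, 0.0, 1.0])    # reward in state 2
v_after = values(M, [0.0, 1.0, 0.0])     # reward moved to state 1, M unchanged

print(v_before, v_after)
```

Here a reward in a state is counted from the moment the agent occupies it, so the absorbing state with reward 1 converges to 1 / (1 − γ) = 10.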

Two-step task (Daw 2011 model-based vs model-free)

# Stage 1: choice A → S2_left with p = 0.7 (common), S2_right with p = 0.3 (rare)
# Stage 2: reward varies
# Model-free: stay if rewarded, regardless of transition type
# Model-based: stay if rewarded after a common transition or unrewarded after a rare one
def two_step_choice(prev_choice, prev_reward, prev_common, w_mb=0.5):
    # w_mb is the model-based weight in [0, 1]
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if bool(prev_reward) == bool(prev_common) else -1
    score = (1 - w_mb) * mf_pref + w_mb * mb_pref
    # A tie (score == 0, e.g. w_mb = 0.5 with conflicting preferences) resolves to a switch
    return prev_choice if score > 0 else 1 - prev_choice
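The two strategies leave different stay / switch signatures across the four (reward, transition) conditions. The sketch below enumerates them for a pure model-free agent (`w_mb = 0`) and a pure model-based agent (`w_mb = 1`); `stay_score` is a hypothetical helper that mirrors the scoring rule above.

```python
def stay_score(prev_reward, prev_common, w_mb):
    """Positive score means repeat the previous first-stage choice."""
    mf_pref = 1 if prev_reward else -1
    mb_pref = 1 if prev_reward == prev_common else -1
    return (1 - w_mb) * mf_pref + w_mb * mb_pref

# (rewarded, common transition) for the previous trial
conditions = [(True, True), (True, False), (False, True), (False, False)]

pattern_mf = [stay_score(r, c, w_mb=0.0) > 0 for r, c in conditions]
pattern_mb = [stay_score(r, c, w_mb=1.0) > 0 for r, c in conditions]

print(pattern_mf)  # main effect of reward only
print(pattern_mb)  # reward x transition interaction
```

The two agents disagree only after rare transitions, which is why those trials are the diagnostic ones in this task.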

Volatility-weighted learning rate (Behrens 2007)

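# The ACC tracks volatility; higher volatility → higher learning rate
# (variance of recent RPEs is used here as a simple heuristic proxy for volatility)
def volatility_lr(rpes, base_lr=0.05, window=10):
    recent = np.asarray(rpes[-window:], dtype=float)
    vol = recent.var() if recent.size > 1 else 0.0   # rolling variance of recent RPEs
    return base_lr * (1 + vol)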

TD addiction model (Redish 2004)

# Simplified sketch: cocaine imposes a floor on the RPE, so the drug RPE cannot be predicted away
def cocaine_td(rewards, drug_mask, gamma=0.9, alpha=0.1, drug_floor=1.0):
    V = np.zeros(len(rewards), dtype=float)
    for t in range(len(rewards)):
        v_next = V[t + 1] if t + 1 < len(rewards) else 0.0   # terminal value is 0
        delta = rewards[t] + gamma * v_next - V[t]
        if drug_mask[t]:
            delta = max(delta, drug_floor)   # RPE stays positive → value keeps growing → compulsion
        V[t] += alpha * delta
    return V
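The effect of the floor shows up over repeated exposures. A single-state sketch, assuming equal nominal reward for the natural and drug cases: `learn_value` and the floor of 0.2 are illustrative choices, not fitted parameters.

```python
def learn_value(reward, n_trials, alpha=0.1, drug_floor=None):
    """Single-state value learning over repeated exposures."""
    v = 0.0
    history = []
    for _ in range(n_trials):
        delta = reward - v
        if drug_floor is not None:
            delta = max(delta, drug_floor)   # RPE cannot fall below the floor
        v += alpha * delta
        history.append(v)
    return history

natural = learn_value(reward=1.0, n_trials=200)
drug = learn_value(reward=1.0, n_trials=200, drug_floor=0.2)

print(natural[-1], drug[-1])
```

The natural-reward value saturates at the reward magnitude because its RPE decays to zero, while the drug value keeps increasing by `alpha * drug_floor` per exposure.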

Decision criteria

Situation	Approach
Modeling phasic DA	classic TD with γ ≈ 0.9
Modeling DA population variance	distributional TD with quantiles
Modeling habits vs goals	hybrid MF + MB with arbitrator
Modeling replay	SR + offline updates
Computational psychiatry	param fit per subject (hBayesDM, JAGS)
Drug / lesion effect	parameter perturbation (lower α, biased ε)

Default: start with single-RPE TD. Use distributional TD to fit modern population-level dopamine data. Use SR or an MB-MF arbitrator when the richness of prefrontal-hippocampal computation is needed.
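For the per-subject fitting row, a minimal sketch of what gets fitted: the negative log-likelihood of a softmax Q-learner on one subject's choices, minimized by grid search. This is plain maximum likelihood, a simpler technique swapped in for illustration; it is not the hierarchical Bayesian estimation that hBayesDM or JAGS perform. The function names, the toy data, and the grids are all assumptions.

```python
import math

def neg_log_lik(choices, rewards, alpha, beta):
    """Negative log-likelihood of a two-armed softmax Q-learner."""
    q = [0.0, 0.0]
    nll = 0.0
    for c, r in zip(choices, rewards):
        # Softmax probability of the chosen action (two actions).
        p_c = 1.0 / (1.0 + math.exp(-beta * (q[c] - q[1 - c])))
        nll -= math.log(p_c)
        q[c] += alpha * (r - q[c])   # update only the chosen action
    return nll

def grid_fit(choices, rewards, alphas, betas):
    """Return the (alpha, beta) pair with the lowest negative log-likelihood."""
    return min(((a, b) for a in alphas for b in betas),
               key=lambda ab: neg_log_lik(choices, rewards, *ab))

# Toy data for one hypothetical subject (choice index, reward 0/1).
choices = [0, 0, 1, 0, 0, 0, 1, 0]
rewards = [1, 1, 0, 1, 0, 1, 0, 1]
alphas = [0.1, 0.3, 0.5, 0.7]
betas = [0.5, 1.0, 2.0, 5.0]
best_alpha, best_beta = grid_fit(choices, rewards, alphas, betas)

print(best_alpha, best_beta)
```

A drug or lesion effect can then be expressed as a shift in the fitted parameters between conditions or groups.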

🔗 Graph

Parent: Reinforcement-Learning · Computational-Neuroscience-RL
Variants: Distributional-RL
Adjacent: Dopamine · Basal-Ganglia · Bayesian-Brain

🤖 LLM usage

When to use: literature digests (Schultz, Dayan, Niv, Daw papers), TD / SR simulation scaffolding, hypothesis generation for fitting tasks. When not to use: empirical claims about specific brain areas; verify these against the primary source. LLMs occasionally mix up model-based and model-free terminology.

❌ Anti-patterns

DA = reward: wrong. DA signals RPE, so only unpredicted rewards trigger a burst.
Single-RPE for all DA: the distributional account is the newer view.
Equating the brain with deep RL: deep nets are brain-inspired, not identical. The brain is sample-efficient, cortical, and multi-system.
Ignoring tonic DA: motivation / vigor is separate from phasic RPE.
Behaviorism only: ignoring neural data misses that the path from brain to behavior spans multiple levels.

🧪 Verification / duplicates

Verified (Schultz 1997, Sutton & Barto 2018 ch 15, Dabney 2020 Nature, Daw 2011, Niv 2009 review, Stachenfeld 2017 SR).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup — TD/distributional/SR/two-step patterns + brain-RL mapping
