Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00

Neurobiology of Reward

One-line summary

"Dopamine is not the reward itself; it signals reward prediction error." The mesolimbic pathway (VTA → NAc) encodes the difference between expected and actual outcomes, as reported by Schultz (1997). This is the biological counterpart of the error signals used in modern RL (TD-learning and, by extension, RLHF).

Key points

Core circuit

VTA (ventral tegmental area): source of the dopamine neurons.
NAc (nucleus accumbens): encodes reward salience.
PFC (prefrontal cortex): value-based decision-making.
Amygdala: encodes valence (positive/negative).

RPE (Reward Prediction Error)

RPE = actual_reward - expected_reward.
Positive RPE → dopamine burst → the action is reinforced.
Negative RPE → dopamine dip → the action is weakened.
Zero RPE (fully predicted reward) → firing stays at baseline, so there is no learning signal.
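
The three cases above can be seen in a minimal sketch. It assumes a single cue followed by a fixed reward and a simple running-average prediction (a Rescorla-Wagner-style rule). The function name `simulate_prediction` and the learning rate are illustrative assumptions, not part of this note.

```python
def simulate_prediction(reward=1.0, alpha=0.5, trials=20):
    """Track the RPE while a fixed reward gradually becomes predicted."""
    expected = 0.0
    rpes = []
    for _ in range(trials):
        rpe = reward - expected   # RPE = actual_reward - expected_reward
        expected += alpha * rpe   # move the prediction toward the outcome
        rpes.append(rpe)
    return expected, rpes

expected, rpes = simulate_prediction()
print(f"first RPE: {rpes[0]:.3f}")          # reward is unexpected: positive RPE
print(f"last RPE:  {rpes[-1]:.6f}")         # reward is predicted: RPE near zero
omission_rpe = 0.0 - expected               # the expected reward is withheld
print(f"omission RPE: {omission_rpe:.3f}")  # negative RPE (dopamine dip)
```

The same reward produces a large signal early and almost none late, which is why a fully predicted reward carries no learning signal.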

Applications

RL algorithms: the TD error in TD-learning has the same form as the RPE (reward plus discounted next-state value, minus the current prediction).
RLHF: the reward model acts as a proxy for a human-preference RPE.
Addiction research: hijacked dopamine signaling → compulsive behavior.
UX design: variable reward schedules (the slot machine effect).

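💻 Patterns

TD-learning (Sutton & Barto; the RL analog of the biology)

# Temporal Difference learning: the RPE is the update signal
import numpy as np  # used by the dopamine simulation further down

def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """V[s] ← V[s] + α(r + γV[s'] - V[s])"""
    # TD error = RPE: what happened (r + γV[s']) minus what was predicted (V[s])
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe  # nudge the prediction by a fraction of the error
    return V, rpe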
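
As a usage sketch, the update can be run over a tiny episode to watch value propagate backward from the reward to the cue. The three-state chain (`cue`, `delay`, `end`) and the trial count are hypothetical choices for illustration; `td_update` is repeated here so the block runs on its own.

```python
def td_update(V, state, next_state, reward, alpha=0.1, gamma=0.99):
    """V[s] <- V[s] + alpha * (r + gamma * V[s'] - V[s])"""
    rpe = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * rpe
    return V, rpe

# Hypothetical chain: cue -> delay -> end, with reward on leaving "delay".
V = {"cue": 0.0, "delay": 0.0, "end": 0.0}
episode = [("cue", "delay", 0.0), ("delay", "end", 1.0)]

first_rpes, last_rpes = None, None
for trial in range(500):
    rpes = []
    for s, s_next, r in episode:
        V, rpe = td_update(V, s, s_next, r)
        rpes.append(rpe)
    if trial == 0:
        first_rpes = rpes
    last_rpes = rpes

print("trial 0 RPEs:  ", first_rpes)
print("trial 499 RPEs:", [round(x, 4) for x in last_rpes])
print("learned values:", {k: round(v, 3) for k, v in V.items()})
```

On the first trial the only error occurs at reward delivery. After training, the cue state already carries the discounted value of the reward, and the error at reward time has shrunk toward zero.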

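Dopamine neuron simulation

import numpy as np

def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy model of the phasic firing rate, after Schultz (1997)."""
    rpe = actual_r - predicted_r
    # Exponential scaling is a modelling choice: rpe > 0 gives a burst above
    # baseline, rpe < 0 a dip below it, and rpe == 0 returns the baseline rate.
    return baseline * np.exp(rpe)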
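
A short usage sketch of the three regimes follows. It uses `math.exp` in place of `np.exp` so that it runs without NumPy; the reward values of 0.0 and 1.0 are arbitrary illustrative inputs.

```python
import math

def dopamine_response(predicted_r, actual_r, baseline=1.0):
    """Toy phasic firing rate: baseline scaled by exp(RPE)."""
    rpe = actual_r - predicted_r
    return baseline * math.exp(rpe)

burst = dopamine_response(predicted_r=0.0, actual_r=1.0)  # unexpected reward
dip = dopamine_response(predicted_r=1.0, actual_r=0.0)    # omitted reward
flat = dopamine_response(predicted_r=1.0, actual_r=1.0)   # predicted reward

print(f"burst: {burst:.3f}, dip: {dip:.3f}, flat: {flat:.3f}")
```

One reason an exponential may be a reasonable toy choice is that the output stays positive: a firing rate can drop below baseline but not below zero, so dips are bounded while bursts are not.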

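RLHF reward model (modern bridge)

# trl + transformers. Sketch against the legacy trl PPO API (roughly trl <= 0.11);
# newer trl releases changed the PPOTrainer constructor, so check your installed version.
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The reward model is a learned approximation of human preference (loosely, of human RPE)
model_name = "meta-llama/Llama-3.1-8B"
config = PPOConfig(model_name=model_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)
# In this API, rewards from reward_fn are passed per step:
#   trainer.step(query_tensors, response_tensors, rewards)
# The reward signal drives the policy update, a loose analog of the dopamine update.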
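
To make "learned approximation of human preference" concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used to train RLHF reward models. It is not taken from this note; the function name `preference_loss` and the scalar scores are illustrative assumptions.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a stable form."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)) rewritten to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for a human-preferred and a rejected answer.
agrees = preference_loss(r_chosen=2.0, r_rejected=0.0)     # model agrees with the human
undecided = preference_loss(r_chosen=0.0, r_rejected=0.0)  # model is indifferent
disagrees = preference_loss(r_chosen=0.0, r_rejected=2.0)  # model disagrees

print(f"agrees: {agrees:.4f}, undecided: {undecided:.4f}, disagrees: {disagrees:.4f}")
```

The loss is largest when the model's ranking contradicts the human's. In that limited sense the training signal resembles a prediction error on preferences, though the analogy to dopamine is loose.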

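Variable reward schedule (UX)

import random

def variable_reward(action_count):
    """Intermittent reinforcement: produces behavior that is highly resistant to extinction."""
    # action_count is unused: every action is an independent 30% draw
    if random.random() < 0.3:  # 30% chance of reward
        return "reward"
    return "no_reward"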
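
The following sketch shows what makes the schedule "variable": the long-run reward rate is stable, but the number of actions between rewards is unpredictable. This variant takes an explicit seeded `random.Random` and a probability `p`, both assumptions added here for reproducibility.

```python
import random

def variable_reward(rng, p=0.3):
    """One independent draw: reward with probability p."""
    return "reward" if rng.random() < p else "no_reward"

rng = random.Random(0)  # seeded for a repeatable run
outcomes = [variable_reward(rng) for _ in range(10_000)]
rate = outcomes.count("reward") / len(outcomes)

# Count the number of actions needed to reach each reward.
gaps, since_last = [], 0
for outcome in outcomes:
    since_last += 1
    if outcome == "reward":
        gaps.append(since_last)
        since_last = 0

print(f"empirical reward rate: {rate:.3f}")
print(f"actions per reward: min={min(gaps)}, max={max(gaps)}")
```

Because the next reward can never be fully predicted from the count of past actions, the RPE on each rewarded action stays positive instead of decaying to zero.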

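Aversive learning (negative valence)

def negative_rpe_update(V, s, s_, r, alpha=0.1, gamma=0.99):
    """Amygdala-mediated aversive learning: the same TD rule with a negative reward."""
    # gamma added so the error matches td_update; r is typically negative here
    rpe = r + gamma * V[s_] - V[s]
    V[s] += alpha * rpe
    return V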

Decision criteria

Question	Answer
Is dopamine pleasure?	No: it is an RPE signal (wanting ≠ liking)
Is the RL reward the same as dopamine?	As a functional analog, yes: dopamine tracks the TD error rather than the raw reward (Schultz)
Is addiction an excess of dopamine?	No: it is dysregulated RPE / hijacked salience
Is RLHF brain-like?	Loosely, at the level of reward-driven updates (policy update)

Default: remember the dissociation between dopamine ("wanting" / RPE) and opioids ("liking").

🔗 Graph

Parent: Neuroscience · Reinforcement-Learning
Variants: Dopamine-Hypothesis · Wanting-vs-Liking
Applications: RLHF · TD-Learning · Addiction
Adjacent: Operant-Conditioning · Habit-Formation

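🤖 LLM usage

When to use: building intuition for reward modeling, debugging RLHF reward shaping, explaining motivation frameworks. When not to use: clinical psychiatry, which is specialist territory.

❌ Anti-patterns

Dopamine = pleasure: a popular myth. In reality it tracks RPE / wanting.
More dopamine = better: excess dopamine signaling is implicated in schizophrenia, while too little produces the Parkinson's off-state, so the relationship is not "more is better".
Reward hacking: an RL agent exploits the reward signal; the brain analog is addiction.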

🧪 Verification / Duplicates

Verified (Schultz 1997 Science; Berridge & Robinson 1998 wanting/liking; Sutton & Barto RL Book 2018 2e).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: RPE biology + RL bridge + RLHF analog
