---
id: wiki-2026-0508-reward-prediction-error
title: Reward Prediction Error
category: 10_Wiki/Topics
status: needs_review
canonical_id: self
aliases: [P-Reinforce-AUTO-RWPE-001]
duplicate_of: none
source_trust_level: A
confidence_score: 0.99
tags: [auto-reinforced, neuroscience, machine-learning, Dopamine, Reinforcement-Learning]
raw_sources: []
last_reinforced: 2026-04-20
github_commit: pending
inferred_by: Claude Opus 4.7 (auto-normalize 2026-05-08)
---

# [[Reward Prediction Error|Reward Prediction Error]]

## 📌 One-Line Insight (The Karpathy Summary)
> "The engine that drives learning: an intelligence algorithm shared by the brain and AI that computes the 'difference' between the reward that was expected and the reward actually received, then revises future behavior in proportion to that gap."

## 📖 Structured Knowledge (Synthesized Content)
Reward prediction error (RPE) is the core signal a learning system uses to update its current policy.

1. **Mathematical definition (TD Error)**:
    * $RPE = (\text{actual reward} + \text{estimated future value}) - \text{previously expected value}$
    * **Positive Error (+)**: The outcome is better than expected. The probability of the action is raised.
    * **Negative Error (-)**: The outcome is worse than expected. The probability of the action is lowered.
2. **Neuroscientific implementation (dopamine)**:
    * Midbrain dopamine neurons are reported to encode RPE (Schultz's research).
    * Dopamine firing bursts when an unexpected reward arrives, stays near baseline when a reward arrives as predicted, and drops sharply when a predicted reward fails to arrive.
3. **Role in reinforcement learning**:
    * Q-Learning, Actor-Critic, and many other modern RL algorithms optimize their weights in the direction that drives this error toward zero.

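
The three points above can be illustrated with a small simulation. This is a minimal sketch, assuming the standard TD(0) form $\delta = r + \gamma V(s') - V(s)$ for the word formula; the discount `gamma`, the `learning_rate`, and the helper names `td_error` and `update_value` are illustrative choices, not taken from this note. It models a single cue followed by a reward of 1.0 and reports the error in the three cases described for dopamine neurons: unexpected reward, fully predicted reward, and omitted reward.

```python
def td_error(reward, next_value, current_value, gamma=1.0):
    """RPE as a TD error: (reward + gamma * next_value) - current_value."""
    return reward + gamma * next_value - current_value


def update_value(current_value, rpe, learning_rate=0.1):
    """Move the prediction toward the outcome by a fraction of the error."""
    return current_value + learning_rate * rpe


# Hypothetical one-step task: a cue is followed by a reward of 1.0 and the
# trial ends, so the value of what comes after the reward is 0.
value = 0.0  # before learning, the cue predicts nothing

# Case 1: reward arrives before anything has been learned (unexpected).
first_rpe = td_error(reward=1.0, next_value=0.0, current_value=value)

# Learning: repeat the trial, shrinking the error a little each time.
for _ in range(200):
    rpe = td_error(reward=1.0, next_value=0.0, current_value=value)
    value = update_value(value, rpe)

# Case 2: the same reward, now fully predicted by the learned value.
expected_rpe = td_error(reward=1.0, next_value=0.0, current_value=value)

# Case 3: the predicted reward is withheld.
omitted_rpe = td_error(reward=0.0, next_value=0.0, current_value=value)

print(f"unexpected reward: {first_rpe:+.3f}")
print(f"expected reward:   {expected_rpe:+.3f}")
print(f"omitted reward:    {omitted_rpe:+.3f}")
```

Under these assumptions the error is large and positive at first, shrinks toward zero as the prediction converges, and turns negative when the predicted reward is withheld. The same error term is what Q-Learning and Actor-Critic style updates reduce, although real implementations estimate values with function approximators rather than a single scalar.
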
## ⚠️ Contradictions & Updates
- **Conflict with past data**: It was once believed that reward itself causes learning. Current research instead indicates that what triggers synaptic plasticity is not the reward itself but the unpredicted difference in reward (the error).
- **Policy change (RL Update)**: In addiction and gambling policy, technology regulation is being strengthened so that it does more than block the behavior: it targets "variable reward" designs (the slot machine mechanism) that disrupt the brain's RPE system with misleading signals.

## 🔗 Knowledge Links (Graph)
- [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], [[Neurobiology-of-Reward|Neurobiology-of-Reward]], [[Probability Theory|Probability Theory]], [[Performance systems]], [[Ps-Reinforce|Ps-Reinforce]]
- **Modern Tech/Tools**: [[Deep Q-Networks (DQN)|Deep Q-Networks (DQN)]], Dopamine level monitoring, [[Behavior|Behavior]]al RL models.

---

## 🤖 LLM Usage Hints (How to Use This Knowledge)
**When to use this knowledge:**
- *(TODO)*

**When not to use it:**
- *(TODO)*

## 🧪 Validation Status (Validation)
- **Information status:** needs_review
- **Source trust level:** A
- **Reason for review:** *(P-Reinforce Phase 1 automatic normalization. Body text needs verification.)*

## 🧬 Duplicate Check (Duplicate Check)
- **Existing similar documents:** *(TODO: see the indexer cluster report)*
- **Handling:** UPDATE (automatic normalization)
- **Reason:** Phase 1 normalization: old template updated and missing fields filled in.

## 🕓 Changelog (Changelog)
| Date | Change | Handling | Trust level |
|------|--------|----------|-------------|
| 2026-05-08 | P-Reinforce Phase 1 normalization (frontmatter + header standardization) | UPDATE | A |