--- id: AI-RL-TD-001 category: "10_Wiki/πŸ’‘ Topics/AI" confidence_score: 1.0 tags: [ai, reinforcement-learning, td-learning, temporal-difference, q-learning, sarsa, bellman-equation, machine-learning] last_reinforced: 2026-04-26 --- # Temporal Difference Learning (μ‹œκ°„μ°¨ ν•™μŠ΅) ## πŸ“Œ ν•œ 쀄 톡찰 (The Karpathy Summary) > "κ²°κ³Όκ°€ λ‚˜μ˜¬ λ•ŒκΉŒμ§€ κΈ°λ‹€λ¦¬λŠ” 인내 λŒ€μ‹ , λ§€ μˆœκ°„μ˜ 예츑이 λ‹€μŒ μˆœκ°„μ˜ μ‹€μ œμ™€ μ–Όλ§ˆλ‚˜ λ‹€λ₯Έμ§€(TD Error)λ₯Ό κ³„μ‚°ν•˜μ—¬ μ§€λŠ₯을 μ‹€μ‹œκ°„μœΌλ‘œ κ΅μ •ν•˜λΌ" β€” κ°•ν™”ν•™μŠ΅μ—μ„œ μ—ν”Όμ†Œλ“œκ°€ λλ‚˜μ§€ μ•Šμ•„λ„ ν˜„μž¬μ˜ 예츑치λ₯Ό λ°”νƒ•μœΌλ‘œ κ°€μΉ˜ ν•¨μˆ˜λ₯Ό μ—…λ°μ΄νŠΈν•˜λŠ” λΆ€νŠΈμŠ€νŠΈλž©(Bootstrapping) 기반의 ν•™μŠ΅ 방법둠. ## πŸ“– κ΅¬μ‘°ν™”λœ 지식 (Synthesized Content) - **μΆ”μΆœλœ νŒ¨ν„΄:** "Prediction-Correction Loop with Immediate Feedback" β€” λͺ¬ν…ŒμΉ΄λ₯Όλ‘œ λ°©μ‹μ²˜λŸΌ λκΉŒμ§€ 가보지 μ•Šκ³ λ„, λ‹€μŒ μƒνƒœμ˜ 보상과 μ˜ˆμƒ κ°€μΉ˜λ₯Ό 'μ°Έμ‘°κ°’'으둜 μ‚Όμ•„ ν˜„μž¬μ˜ κ°€μΉ˜ μΆ”μ •μΉ˜λ₯Ό μ‘°κΈˆμ”© μˆ˜μ •ν•΄ λ‚˜κ°€λŠ” νŒ¨ν„΄. - **핡심 λ©”μ»€λ‹ˆμ¦˜:** - **TD Error:** $r + \gamma V(s') - V(s)$. 즉, 'μ‹€μ œ 보상 + λ‹€μŒ μƒνƒœμ˜ μ˜ˆμƒ κ°€μΉ˜'와 'ν˜„μž¬ μ˜ˆμƒ κ°€μΉ˜'의 차이. - **Bootstrapping:** μžμ‹ μ˜ 이전 예츑치λ₯Ό μ‚¬μš©ν•˜μ—¬ ν˜„μž¬μ˜ 예츑치λ₯Ό μ—…λ°μ΄νŠΈν•˜λŠ” 방식. - **Algorithm Types:** 온-ν΄λ¦¬μ‹œ 방식인 SARSA와 μ˜€ν”„-ν΄λ¦¬μ‹œ 방식인 Q-Learning이 λŒ€ν‘œμ μž„. - **의의:** ν•™μŠ΅ 효율이 맀우 λ†’κ³  ν™˜κ²½κ³Όμ˜ μƒν˜Έμž‘μš© 쀑에 μ‹€μ‹œκ°„μœΌλ‘œ 배울 수 μžˆμ–΄, λ°”λ‘‘(AlphaGo)λΆ€ν„° λ‘œλ΄‡ μ œμ–΄κΉŒμ§€ ν˜„λŒ€ κ°•ν™”ν•™μŠ΅μ˜ 거의 λͺ¨λ“  성곡 μ‚¬λ‘€μ˜ 기술적 쀑좔가 됨. ## ⚠️ λͺ¨μˆœ 및 μ—…λ°μ΄νŠΈ (Contradictions & RL Update) - **κ³Όκ±° λ°μ΄ν„°μ™€μ˜ 좩돌:** μ΄ˆκΈ°μ—λŠ” λ‹¨μˆœνžˆ 보상을 κ·ΉλŒ€ν™”ν•˜λŠ” μˆ˜λ‹¨μœΌλ‘œλ§Œ μ—¬κ²¨μ‘ŒμœΌλ‚˜, μ΄μ œλŠ” μΈκ°„μ˜ λ‡Œ 속 λ„νŒŒλ―Ό 체계가 TD Error와 μœ μ‚¬ν•˜κ²Œ λ™μž‘ν•œλ‹€λŠ” λ‡Œκ³Όν•™μ  발견과 κ²°ν•©ν•˜μ—¬ 'μ§€λŠ₯의 보편적 ν•™μŠ΅ 원리'둜 μž¬ν•΄μ„λ˜κ³  있음. - **μ •μ±… λ³€ν™”:** Antigravity ν”„λ‘œμ νŠΈλŠ” μ—μ΄μ „νŠΈμ˜ μž‘μ—… μ „λž΅ μ΅œμ ν™” μ‹œ, μž₯기적인 성곡 ν™•λ₯ μ„ λ§€ 단계 μ˜ˆμΈ‘ν•˜κ³  μˆ˜μ •ν•˜κΈ° μœ„ν•΄ κ³ λ„ν™”λœ TD ν•™μŠ΅ μ•Œκ³ λ¦¬μ¦˜μ„ μ˜μ‚¬κ²°μ • μ—”μ§„μ˜ ν•΅μ‹¬μœΌλ‘œ ν™œμš©ν•¨. ## πŸ”— 지식 μ—°κ²° (Graph) - [[Reinforcement-Learning]], [[Reward-Shaping-in-RL]], Bellman-Equation-Foundations, Deep-Learning-Foundations - **Raw Source:** 10_Wiki/Topics/AI/Temporal-Difference-Learning.md