---
id: TD-LEARN-001
category: "[[10_Wiki/💡 Topics/AI]]"
confidence_score: 1.0
tags: [reinforcement-learning, ai, temporal-difference, bellman-equation, machine-learning]
last_reinforced: 2026-04-26
---

# [[Temporal Difference Learning (TD 학습)]]

## 📌 One-Line Insight (The Karpathy Summary)

> "You don't have to reach the end: use the future one step ahead to correct the present." TD learning is a core principle of reinforcement learning. Instead of waiting for an episode to finish, it learns online from the difference (the TD error) between the current estimate and the next step's reward plus the next state's estimate.

## 📖 Structured Knowledge (Synthesized Content)

- **Extracted pattern:** A bootstrapping pattern that combines the strengths of Monte Carlo methods (which need complete episodes of experience) and dynamic programming (which needs a model of the environment). It updates the value function from experience as it arrives, without a model of the environment.
- **Details:**
  - **TD Error:** $\delta_t = \underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{target}} - \underbrace{V(S_t)}_{\text{current estimate}}$. The goal is to reduce this error.
  - **Bootstrapping:** Updating one estimate on the basis of another estimate, here $V(S_t)$ from $V(S_{t+1})$.
  - **TD(0):** The most basic form, which uses information from only the immediate next step.
  - **TD($\lambda$):** Takes a weighted average of information from several steps ahead, balancing learning efficiency against stability (eligibility traces).

## ⚠️ Contradictions & RL Update

- **Conflict with earlier knowledge:** Early models could learn only after an episode had completed. TD learning removes that limitation and lays the groundwork for improving behavior in real time, even in continuing (non-episodic) tasks.
- **Policy change:** The autonomous learning agent in the Antigravity project applies the TD learning principle: during long task sequences it updates its estimate of each step's likelihood of success in real time while searching for the best path.

## 🔗 Knowledge Links (Graph)

- [[Reinforcement-Learning]], [[Q-Learning-Foundations]], [[Bellman-Equation]], [[Monte-Carlo-Methods]]
- **Raw Source:** [[10_Wiki/Topics/AI/Temporal-Difference-Learning.md]]
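
The TD error and the TD(0) update described in this note can be sketched in a few lines of tabular Python. This is a minimal illustration, not part of the original note: the helper names `td0_update` and `td0_random_walk`, the five-state random-walk environment, and the step size `alpha` are all assumptions chosen for the example. Each step applies $V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t$ immediately, without waiting for the episode to end.

```python
import random


def td0_update(V, s, reward, s_next, alpha, gamma):
    """Apply one tabular TD(0) update to V[s] in place and return the TD error."""
    # TD error = target (reward + discounted next estimate) - current estimate.
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


def td0_random_walk(num_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate state values of a hypothetical 5-state random walk with TD(0).

    States 1..5 are non-terminal; 0 and 6 are terminal with value 0.
    Stepping into state 6 yields reward 1, every other step yields 0.
    """
    rng = random.Random(seed)
    n_states = 5
    V = [0.0] + [0.5] * n_states + [0.0]  # terminal values stay at 0
    for _ in range(num_episodes):
        s = 3  # every episode starts in the middle state
        while 0 < s < n_states + 1:
            s_next = s + rng.choice((-1, 1))  # move left or right with equal probability
            reward = 1.0 if s_next == n_states + 1 else 0.0
            # Bootstrapping: V[s] is corrected using the estimate V[s_next].
            td0_update(V, s, reward, s_next, alpha, gamma)
            s = s_next
    return V[1:n_states + 1]


if __name__ == "__main__":
    estimates = td0_random_walk()
    print([round(v, 2) for v in estimates])
```

For this particular walk the true values are 1/6, 2/6, ..., 5/6, so the printed estimates should land near those fractions. With a constant step size they keep fluctuating around the true values rather than converging exactly.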