---
id: P-REINFORCE-AI-BELLMAN
category: "[[10_Wiki/💡 Topics/AI]]"
confidence_score: 0.99
tags: [Bellman Equation, RL, Dynamic Programming, MDP]
last_reinforced: 2026-04-20
---

# [[Bellman-Equation]] (Bellman Equation)

## 📌 One-Line Insight (The Karpathy Summary)

> "Today's value is today's reward plus the expected value of tomorrow."
> A recursive formula that pulls a complicated future back into the present.

## 📖 Structured Knowledge (Synthesized Content)

- **Principle of Optimality**:
  - A principle formulated by Richard Bellman: if a whole path is optimal, then every sub-path along it must also be optimal. This is what lets a large problem be split into smaller subproblems, and it gave rise to dynamic programming (DP).
- **MDP (Markov Decision Process)**:
  - A mathematical framework for finding a policy that maximizes reward, under the assumption that the distribution over future states depends only on the current state and the chosen action.
- **Foundation of Q-Learning**:
  - The state-action value function $Q(s, a)$ is updated toward the Bellman target $r + \gamma \max_{a'} Q(s', a')$, which incrementally improves the agent's value estimates.

## ⚠️ Contradictions and Updates (RL Update)

- The Bellman equation works cleanly when the state is fully observable. Under partial observability (POMDP), or when the state space is too large to tabulate, exact solutions are usually impractical, so the value function is approximated, commonly with deep networks as in DQN.

## 🔗 Knowledge Links (Graph)

- Related: [[Reinforcement Learning]], [[Deep-Learning-Basics]]
- Foundation: [[Information Theory]]
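
## 🧪 Example Sketch (Bellman Backup)

The Bellman recursion and the Q-learning update described in this note can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and does not come from the note: the two-state toy MDP, the discount factor `gamma = 0.9`, and the helper names `bellman_backup`, `value_iteration`, and `q_learning_update` are hypothetical. It uses only the Python standard library.

```python
# Toy MDP (hypothetical): 2 states, 2 actions (0 = stay, 1 = move).
# P[s][a][s2] is the probability of landing in s2 after taking a in s.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: stay -> 0, move -> 1
    [[0.0, 1.0], [1.0, 0.0]],  # from state 1: stay -> 1, move -> 0
]
# R[s][a] is the expected immediate reward; only "stay in state 1" pays.
R = [[0.0, 0.0], [1.0, 0.0]]


def bellman_backup(V, P, R, s, gamma):
    """Return (max_a Q(s, a), argmax_a Q(s, a)) for one state,
    where Q(s, a) = R[s][a] + gamma * sum_s2 P[s][a][s2] * V[s2]."""
    best_value, best_action = float("-inf"), None
    for a in range(len(R[s])):
        q_sa = R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
        if q_sa > best_value:
            best_value, best_action = q_sa, a
    return best_value, best_action


def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Apply the Bellman optimality backup to every state until values settle."""
    V = [0.0] * len(R)
    for _ in range(max_iters):
        V_new = [bellman_backup(V, P, R, s, gamma)[0] for s in range(len(R))]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [bellman_backup(V, P, R, s, gamma)[1] for s in range(len(R))]
    return V, policy


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q[s][a] toward the Bellman target."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q


V, policy = value_iteration(P, R, gamma=0.9)
print(V, policy)  # V is approximately [9.0, 10.0]; policy is [1, 0]

# Sample-based counterpart: one observed transition (s=0, a=1, r=0, s'=1).
Q = [[0.0, 0.0], [10.0, 0.0]]
q_learning_update(Q, s=0, a=1, r=0.0, s_next=1, done=False)
print(Q[0][1])  # approximately 0.9, i.e. 0.1 * (0 + 0.9 * 10 - 0)
```

In this toy setup, staying in state 1 is worth $1 / (1 - 0.9) = 10$, and state 0 is worth $0.9 \times 10 = 9$ because it is one discounted step away. That is the "today's reward plus tomorrow's expected value" recursion in miniature. Value iteration needs the full model (`P`, `R`), while the Q-learning step uses only a single sampled transition.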