---
id: P-REINFORCE-AI-MARKOV
category: "10_Wiki/💡 Topics/AI"
confidence_score: 0.99
tags: [AI, ReinforcementLearning, MDP, Mathematics]
last_reinforced: 2026-04-20
---

# [[Markov-Decision-Process (MDP)|Markov-Decision-Process (MDP)]]

## 📌 One-Line Insight (The Karpathy Summary)

> "Don't ask about the past; only who I am right now determines the future."

The mathematical model that defines the world of reinforcement learning: a standard framework for sequential decision-making built from five elements (states, actions, transition probabilities, rewards, and a discount factor).

## 📖 Structured Knowledge (Synthesized Content)

- **Markov Property**: The assumption that knowing the current state ($S_t$) is sufficient to predict the future, i.e. $P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, \dots, S_t, A_t)$. (All past history is taken to be already summarized in the current state.)
- **Five Components**:
  - **$S$ (State)**: The situation the agent is in.
  - **$A$ (Action)**: The choices available to the agent.
  - **$P$ (Transition Probability)**: The probability of moving to each next state, given the current state and the chosen action.
  - **$R$ (Reward)**: The reward received as a result of the transition.
  - **$\gamma$ (Discount Factor)**: How much a future reward is worth today, with $\gamma \in [0, 1]$.
- **Objective**: Find the optimal policy ($\pi$) that maximizes the expected discounted cumulative reward (Return), $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$.

## ⚠️ Contradictions and Updates (RL Update)

- In many real-world problems the agent cannot observe the full current state (e.g. a card game in which the opponent's hand is hidden), so the Markov property does not hold for what the agent actually sees. The **POMDP** (Partially Observable MDP), which assumes the state is only partially observed, is used as the more realistic model in these cases. This is directly relevant to LLM agents, which must infer the underlying state from whatever is available in their context.

## 🔗 Knowledge Links (Graph)

- Related: [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]], [[Bellman-Equation|Bellman-Equation]]
- Complexity: POMDP (Partially Observable MDP)
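
## 🧪 Example Sketch (Toy MDP)

To make the five components and the objective concrete, here is a minimal sketch of a two-state MDP solved with value iteration, one common way to compute an optimal policy when $P$ and $R$ are known. Everything specific in it is an assumption made for illustration: the state names (`low`, `high`), the action names (`wait`, `charge`), the probabilities, the rewards, and the helper names `q_value` and `value_iteration` do not come from this note.

```python
# Toy MDP for illustration only: all names and numbers below are made up.
STATES = ["low", "high"]       # S: situations the agent can be in
ACTIONS = ["wait", "charge"]   # A: choices available in every state
GAMMA = 0.9                    # gamma: discount factor

# P[(s, a)] -> list of (probability, next_state); each list sums to 1.
P = {
    ("low", "wait"): [(1.0, "low")],
    ("low", "charge"): [(0.8, "high"), (0.2, "low")],
    ("high", "wait"): [(0.7, "high"), (0.3, "low")],
    ("high", "charge"): [(1.0, "high")],
}

# R[(s, a)] -> immediate reward for taking action a in state s.
R = {
    ("low", "wait"): 0.0,
    ("low", "charge"): -1.0,
    ("high", "wait"): 2.0,
    ("high", "charge"): 0.0,
}


def q_value(s, a, V):
    """Expected return of taking a in s, then following the values in V."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for p, s2 in P[(s, a)])


def value_iteration(tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            break
    # The policy pi picks, in each state, the action with the highest value.
    policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
    return V, policy


V, policy = value_iteration()
print(policy)
print({s: round(v, 3) for s, v in V.items()})
```

Note that the next-state distribution in `P` depends only on the current state and action, never on how the agent got there, which is the Markov property in code form. With these made-up numbers the resulting policy charges in `low` and waits in `high`; changing `GAMMA` or the rewards can change that answer.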