---
id: RL-MDP-001
category: "10_Wiki/💡 Topics/AI"
confidence_score: 1.0
tags: [ai, reinforcement-learning, mdp, decision-making, bellman-equation, optimization]
last_reinforced: 2026-04-26
---

# Markov Decision Process (MDP)

## 📌 One-Line Insight (The Karpathy Summary)
> "Quantify every interaction in the world as a cycle of states, actions, and rewards, and design the optimal scenario that maximizes future value." A mathematical framework that a decision-maker uses to find the best policy in an uncertain environment.

## 📖 Structured Knowledge (Synthesized Content)
- **Extracted pattern:** "Sequential Decision Modeling": a dynamic programming pattern that builds on the Markov property (future outcomes depend only on the current state and the action chosen) to compute and optimize the long-term gain of each choice.
- **Five components (S, A, P, R, $\gamma$):**
  - **State (S):** The set of environment states the agent observes.
  - **Action (A):** The set of actions available to the agent.
  - **Transition Probability (P):** The probability of moving to a given next state when a particular action is taken in the current state.
  - **Reward (R):** The immediate feedback received as a result of an action.
  - **Discount Factor ($\gamma$):** The rate, between 0 and 1, that determines the present value of future rewards.
- **Significance:** The theoretical foundation that defines what reinforcement learning algorithms (Q-Learning, Policy Gradient, etc.) should learn to optimize.

## ⚠️ Contradictions & RL Update
- **Conflict with past data:** The belief that every environment can be fully described as an MDP no longer holds. Extending to more complex models such as the POMDP (Partially Observable MDP), which reflects a real world where observations are incomplete, has become essential.
- **Policy change:** The Antigravity agent's autonomous problem-solving logic defines the current situation as an MDP state, predicts the knowledge-reinforcement outcome (Reward) of each tool use (Action), and searches for the optimal path.

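
The five components (S, A, P, R, $\gamma$) listed under Synthesized Content can be made concrete with a small worked example. For a finite MDP, the optimal state value satisfies the Bellman optimality equation:

$$V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma V^*(s')\bigr]$$

The sketch below applies value iteration to this equation. The two-state MDP (`low`/`high` states, `wait`/`work` actions) and all of its probabilities and rewards are hypothetical, invented only for illustration; they do not come from this note or from any particular library.

```python
# Minimal value-iteration sketch on a hypothetical two-state MDP.
# All states, actions, probabilities, and rewards are illustrative assumptions.

STATES = ["low", "high"]   # S: set of states
ACTIONS = ["wait", "work"]  # A: set of actions
GAMMA = 0.9                 # gamma: discount factor

# P and R combined: (state, action) -> list of (next_state, probability, reward)
TRANSITIONS = {
    ("low", "wait"): [("low", 1.0, 0.0)],
    ("low", "work"): [("high", 0.8, 1.0), ("low", 0.2, 0.0)],
    ("high", "wait"): [("high", 0.5, 0.0), ("low", 0.5, 0.0)],
    ("high", "work"): [("high", 1.0, 1.0)],
}


def q_value(state, action, values):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(
        prob * (reward + GAMMA * values[next_state])
        for next_state, prob, reward in TRANSITIONS[(state, action)]
    )


def value_iteration(tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until values change by less than tol."""
    values = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_values = {
            s: max(q_value(s, a, values) for a in ACTIONS) for s in STATES
        }
        delta = max(abs(new_values[s] - values[s]) for s in STATES)
        values = new_values
        if delta < tol:
            break
    # Greedy policy: in each state, pick the action with the highest Q-value.
    policy = {
        s: max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in STATES
    }
    return values, policy


V, policy = value_iteration()
print(V)
print(policy)
```

In this toy setup, `work` in the `high` state earns a reward of 1 forever, so its value converges toward $1 / (1 - \gamma) = 10$. Lowering `GAMMA` shrinks the weight of future rewards and with it all state values.
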
## 🔗 Knowledge Links (Graph)
- [[Reinforcement-Learning|Reinforcement-Learning]], [[Markov-Chain-Monte-Carlo|Markov-Chain-Monte-Carlo]], Expected-Utility-Theory, [[Bellman-Equation|Bellman-Equation]]
- **Raw Source:** 10_Wiki/Topics/AI/Markov-Decision-Process-MDP.md