--- category: Unified tags: [auto-consolidated, technical-documentation] title: [[Markov Decision Process (MDP)|Markov Decision Process (MDP)]] last_updated: 2026-05-02 --- # [[Markov Decision Process (MDP)|Markov Decision Process (MDP)]] ## πŸ“Œ Brief Summary > 지식 μš”μ•½ 정보 μΆ”μΆœ 쀑... --- > "κ³Όκ±°λŠ” 묻지 λ§ˆμ„Έμš”, ν˜„μž¬μ˜ λ‚΄ λͺ¨μŠ΅μ΄ 미래λ₯Ό κ²°μ •ν•  λΏμž…λ‹ˆλ‹€." κ°•ν™”ν•™μŠ΅μ˜ 세계λ₯Ό μ •μ˜ν•˜λŠ” μˆ˜ν•™μ  λͺ¨λΈλ‘œ, μƒνƒœ, 행동, 보상, 전이 ν™•λ₯  λ„€ κ°€μ§€ μš”μ†Œλ‘œ 이루어진 μ˜μ‚¬κ²°μ •μ˜ ν‘œμ€€ ν”„λ ˆμž„μ›Œν¬λ‹€. --- > "μ„Έμƒμ˜ λͺ¨λ“  μƒν˜Έμž‘μš©μ„ μƒνƒœ, 행동, λ³΄μƒμ˜ μˆœν™˜μœΌλ‘œ μˆ˜μΉ˜ν™”ν•˜κ³ , 미래 κ°€μΉ˜λ₯Ό κ·ΉλŒ€ν™”ν•˜λŠ” 졜적의 μ‹œλ‚˜λ¦¬μ˜€λ₯Ό μ„€κ³„ν•˜λΌ" β€” μ˜μ‚¬κ²°μ •μžκ°€ λΆˆν™•μ‹€ν•œ ν™˜κ²½ μ†μ—μ„œ μ΅œμ„ μ˜ μ •μ±…(Policy)을 μ°ΎκΈ° μœ„ν•΄ μ‚¬μš©ν•˜λŠ” μˆ˜ν•™μ  ν”„λ ˆμž„μ›Œν¬. ## πŸ“– Core Content λ³Έλ¬Έ ꡬ쑰화 μž‘μ—… 쀑... --- - **Markov Property**: ν˜„μž¬ μƒνƒœ($S_t$)만 μ•Œλ©΄ 미래λ₯Ό μ˜ˆμΈ‘ν•˜λŠ” 데 μΆ©λΆ„ν•˜λ‹€λŠ” κ°€μ •. (과거의 λͺ¨λ“  νžˆμŠ€ν† λ¦¬λŠ” ν˜„μž¬ μƒνƒœμ— 이미 ν•¨μΆ•λ˜μ–΄ μžˆλ‹€κ³  믿음) - **Five Components**: - **$S$ ([[State|State]])**: μ—μ΄μ „νŠΈκ°€ μ²˜ν•œ 상황. - **$A$ (Action)**: μ—μ΄μ „νŠΈκ°€ ν•  수 μžˆλŠ” 선택. - **$P$ (Transition Probability)**: νŠΉμ • 행동 μ‹œ λ‹€μŒ μƒνƒœλ‘œ 갈 ν™•λ₯ . - **$R$ (Reward)**: 결과에 λ”°λ₯Έ 보상. - **$\gamma$ (Discount Factor)**: 미래의 보상을 ν˜„μž¬ μ–Όλ§ˆμ˜ κ°€μΉ˜λ‘œ μΉ  것인가. - **Objective**: λˆ„μ  λ³΄μƒμ˜ ν•©(Return)을 μ΅œλŒ€ν™”ν•˜λŠ” 졜적의 μ •μ±…($\pi$)을 μ°ΎλŠ” 것. --- - **μΆ”μΆœλœ νŒ¨ν„΄:** "Sequential Decision Modeling" β€” 미래의 κ²°κ³Όκ°€ 였직 ν˜„μž¬μ˜ μƒνƒœμ™€ μ„ νƒμ—λ§Œ μ˜μ‘΄ν•œλ‹€λŠ” 마λ₯΄μ½”ν”„ μ„±μ§ˆ(Markov Property)을 λ°”νƒ•μœΌλ‘œ, λ§€ μˆœκ°„μ˜ 선택이 κ°€μ Έμ˜¬ μž₯기적인 이득을 κ³„μ‚°ν•˜κ³  μ΅œμ ν™”ν•˜λŠ” 동적 ν”„λ‘œκ·Έλž˜λ° νŒ¨ν„΄. - **5λŒ€ ꡬ성 μš”μ†Œ (S, A, P, R, $\gamma$):** - **[[State|State]] (S):** μ—μ΄μ „νŠΈκ°€ κ΄€μ°°ν•˜λŠ” ν™˜κ²½μ˜ μƒνƒœ. - **Action (A):** μ—μ΄μ „νŠΈκ°€ ν•  수 μžˆλŠ” ν–‰λ™μ˜ μ§‘ν•©. - **Transition Probability (P):** νŠΉμ • 행동 μ‹œ λ‹€μŒ μƒνƒœλ‘œ λ„˜μ–΄κ°ˆ ν™•λ₯ . - **Reward (R):** ν–‰λ™μ˜ 결과둜 λ°›λŠ” 즉각적인 ν”Όλ“œλ°±. - **Discount Factor ($\gamma$):** 미래 λ³΄μƒμ˜ ν˜„μž¬ κ°€μΉ˜λ₯Ό κ²°μ •ν•˜λŠ” λΉ„μœ¨. - **의의:** κ°•ν™”ν•™μŠ΅ μ•Œκ³ λ¦¬μ¦˜(Q-Learning, Policy Gradient λ“±)이 무엇을 λͺ©ν‘œλ‘œ ν•™μŠ΅ν•΄μ•Ό ν•˜λŠ”μ§€ μ •μ˜ν•˜λŠ” 이둠적 ν† λŒ€. ## βš–οΈ Trade-offs & Caveats - **κ³Όκ±° λ°μ΄ν„°μ™€μ˜ 좩돌:** μžλ™ν™” 엔진에 μ˜ν•΄ λ§€ν•‘λœ μ§€μ‹μœΌλ‘œ, μΆ”ν›„ μ •λ°€ 검증 ν•„μš”. - **μ •μ±… λ³€ν™”:** Graphics & Performance λΆ„μ•Όμ˜ μžλ™ μžμ‚°ν™” μˆ˜ν–‰. --- - ν˜„μ‹€μ˜ λ§Žμ€ λ¬Έμ œλŠ” 'ν˜„μž¬ μƒνƒœ'만으둜 νŒλ‹¨ν•˜κΈ° λΆˆμΆ©λΆ„ν•˜λ‹€(예: μΉ΄λ“œ κ²Œμž„μ—μ„œ μƒλŒ€μ˜ 패λ₯Ό λͺ¨λ₯Ό λ•Œ). 이λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ μƒνƒœκ°€ λΆ€λΆ„μ μœΌλ‘œλ§Œ κ΄€μ°°λœλ‹€λŠ” μ „μ œμ˜ **[[POMDP|POMDP]]**(Partially Observable MDP)κ°€ 더 ν˜„μ‹€μ μΈ λͺ¨λΈλ‘œ μ‚¬μš©λ˜λ©°, μ΄λŠ” LLM μ—μ΄μ „νŠΈμ˜ μ»¨ν…μŠ€νŠΈ μΆ”λ‘  μ„±λŠ₯과도 μ§κ²°λœλ‹€. --- - **κ³Όκ±° λ°μ΄ν„°μ™€μ˜ 좩돌:** λͺ¨λ“  ν™˜κ²½μ΄ MDP둜 μ™„λ²½νžˆ μ„€λͺ… κ°€λŠ₯ν•˜λ‹€λŠ” λ―ΏμŒμ—μ„œ λ²—μ–΄λ‚˜, 관츑이 λΆˆμ™„μ „ν•œ ν˜„μ‹€ 세계λ₯Ό λ°˜μ˜ν•œ [[POMDP|POMDP]](Partially Observable MDP) λ“± 더 λ³΅μž‘ν•œ λͺ¨λΈλ‘œμ˜ ν™•μž₯이 ν•„μˆ˜μ μ΄ 됨. - **μ •μ±… λ³€ν™”:** Antigravity μ—μ΄μ „νŠΈμ˜ 자율적 문제 ν•΄κ²° λ‘œμ§μ€ ν˜„μž¬ 상황을 MDP μƒνƒœλ‘œ μ •μ˜ν•˜κ³ , 각 도ꡬ μ‚¬μš©(Action)이 κ°€μ Έμ˜¬ 지식 κ°•ν™” κ²°κ³Ό(Reward)λ₯Ό μ˜ˆμΈ‘ν•˜μ—¬ 졜적의 경둜λ₯Ό 탐색함. ## πŸ”— Knowledge Connections - Raw Source: 00_Raw/2026-04-20/Markov Decision Process (MDP).md --- --- - Related: [[Reinforcement Learning (RL)|Reinforcement Learning (RL)]] , [[Bellman-Equation|Bellman-Equation]] - Complexity: POMDP (λΆ€λΆ„ κ΄€μΈ‘ κ°€λŠ₯ MDP) --- - [[Reinforcement-Learning|Reinforcement-Learning]], [[Markov-Chain-Monte-Carlo|Markov-Chain-Monte-Carlo]], Expected-Utility-Theory, [[Bellman-Equation|Bellman-Equation]] - **Raw Source:** 10_Wiki/Topics/AI/Markov-Decision-Process-MDP.md