---
id: wiki-2026-0508-pomdp
title: POMDP
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Partially-Observable-MDP, Partially-Observable-Markov-Decision-Process]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, planning, belief-state, pomdp, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-pomdp_py
---

# POMDP

## One-line summary

> **"MDP + observation noise"**. A POMDP is the mathematical framework for decision-making when the agent cannot observe the state directly and only receives noisy observations. It is defined by the tuple `⟨S, A, T, R, Ω, O, γ⟩`. The agent acts while maintaining a belief state (a distribution over states). It is the standard model for dialogue, robotics, medical, and game-AI problems.

## Core concepts

### Definition

- **S**: state space (hidden).
- **A**: action space.
- **T(s'|s,a)**: transition model.
- **R(s,a)**: reward function.
- **Ω**: observation space.
- **O(o|s',a)**: observation model.
- **γ ∈ [0,1)**: discount factor.

### Belief state

- `b(s) = P(s | history)` is a sufficient statistic of the history.
- Update: `b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s)`.
- A POMDP is an MDP over the belief space, which is continuous and high-dimensional.

### Solver families

1. **Exact**: value iteration over beliefs, using the piecewise-linear and convex (PWLC) value function. Tractable only for tiny S.
2. **Point-based** (PBVI, SARSOP, Perseus): sample a set of beliefs and run backups only at those points.
3. **Online MCTS**: POMCP (Silver 2010), DESPOT. Suited to large state spaces and online planning.
4. **Deep RL**: DRQN, R2D2, Dreamer (the RNN state acts as a latent belief). The modern default.
5. **Bayes-Adaptive**: BAMCP, which also learns the dynamics.

### POMDP vs MDP

- MDP: full observability, policy `π(s) → a`.
- POMDP: policy `π(b) → a` or `π(history) → a`.
- **Pitfall**: training an MDP policy directly on observations is wrong, because a single observation does not satisfy the Markov property.

### Applications

1. Dialogue systems: the user's goal is hidden.
2. Robotics: sensor noise and occlusion.
3. Medical treatment: the patient's state is inferred from labs and symptoms.
4. Game AI: hidden information such as fog of war (StarCraft, Poker, [[Operation- Western Sun]]).
5. Autonomous driving: pedestrian intent.
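
### Belief update in matrix form (worked example)

The belief update rule `b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s)` can be written as one matrix-vector product followed by an element-wise weighting. The sketch below is a minimal illustration, not a reference implementation: the function name `belief_update`, the array layouts `T_a[s, s']` and `O_a[s', o]`, and the two-state sensor with 85% accuracy are assumptions chosen for the example.

```python
import numpy as np

def belief_update(b, T_a, O_a, o):
    """Compute b'(s') ∝ O(o|s',a) * Σ_s T(s'|s,a) b(s) for one fixed action a.

    b:   shape (|S|,), current belief
    T_a: shape (|S|, |S|), T_a[s, s'] = T(s'|s,a)   (assumed layout)
    O_a: shape (|S|, |Ω|), O_a[s', o] = O(o|s',a)   (assumed layout)
    o:   index of the received observation
    """
    predicted = T_a.T @ b            # prediction: Σ_s T(s'|s,a) b(s)
    unnorm = O_a[:, o] * predicted   # correction: weight by observation likelihood
    z = unnorm.sum()                 # normalizer, equals P(o | b, a)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnorm / z

# Hypothetical two-state example: the action leaves the state unchanged,
# and the sensor reports the true state with probability 0.85.
T_listen = np.eye(2)
O_listen = np.array([[0.85, 0.15],
                     [0.15, 0.85]])

b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, T_listen, O_listen, o=0)  # about [0.85, 0.15]
b2 = belief_update(b1, T_listen, O_listen, o=0)  # about [0.97, 0.03]
print(b1, b2)
```

Two consistent observations move the belief from 0.5 to about 0.97, which is why repeated sensing can be worth its cost before committing to an irreversible action.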
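## 💻 Patterns

### Tiger problem (canonical POMDP)

```python
# States:  TL (tiger behind left door), TR (tiger behind right door)
# Actions: OL (open left), OR (open right), LISTEN
# Obs:     HL (hear left), HR (hear right); 85% accurate after LISTEN

S = ["TL", "TR"]
A = ["OL", "OR", "LISTEN"]
O = ["HL", "HR"]

def T(s, a):
    """Return the next-state distribution as {s': prob}."""
    if a in ("OL", "OR"):
        return {"TL": 0.5, "TR": 0.5}  # opening a door resets the problem
    return {s: 1.0}                     # listening leaves the tiger in place

def R(s, a):
    if a == "LISTEN":
        return -1
    tiger_door = "OL" if s == "TL" else "OR"
    return -100 if a == tiger_door else 10

def O_model(o, s, a):
    """Return O(o | s', a), where s is the state reached after taking a."""
    if a != "LISTEN":
        return 0.5  # observations are uninformative after opening a door
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15
```

### Belief update (Bayes filter)

```python
def update_belief(b, a, o, S, T, O_model):
    """Exact Bayes filter: b'(s') ∝ O(o|s',a) * Σ_s T(s'|s,a) b(s)."""
    b_new = {}
    for sp in S:
        prior = sum(T(s, a).get(sp, 0.0) * b[s] for s in S)  # prediction step
        b_new[sp] = O_model(o, sp, a) * prior                 # correction step
    Z = sum(b_new.values())  # normalizer, equals P(o | b, a)
    if Z == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / Z for s, p in b_new.items()}
```

### Particle filter (continuous / large S)

```python
import numpy as np

class ParticleBelief:
    def __init__(self, particles):
        self.p = list(particles)

    def update(self, a, o, sample_T, O_model):
        # Propagate each particle through the transition model and weight it
        # by the likelihood of the received observation.
        new = []
        for s in self.p:
            sp = sample_T(s, a)
            w = O_model(o, sp, a)
            new.append((sp, w))
        ws = np.array([w for _, w in new], dtype=float)
        total = ws.sum()
        if total == 0:
            raise ValueError("all particles have zero weight (particle deprivation)")
        ws /= total
        # Resample with replacement in proportion to the weights.
        idx = np.random.choice(len(new), size=len(new), p=ws)
        self.p = [new[i][0] for i in idx]
```

### POMCP (online MCTS on history)

```python
import math, random
from collections import defaultdict

class POMCP:
    """Simplified sketch: no rollout policy and no particle reinvigoration."""

    def __init__(self, gen, actions, c=1.0, gamma=0.95):
        self.gen = gen                # generator: (s, a) -> (s', o, r)
        self.actions = list(actions)
        self.c, self.gamma = c, gamma
        self.N = defaultdict(int)     # visit count per (history, action)
        self.V = defaultdict(float)   # running mean return per (history, action)

    def search(self, belief, depth=20, sims=500):
        # belief is a list of state particles sampled from the current belief.
        for _ in range(sims):
            s = random.choice(belief)
            self._sim(s, (), depth)
        return max(self.actions, key=lambda a: self.V[((), a)])

    def _ucb(self, h):
        # Try every action once, then pick by UCB1 on the history node h.
        untried = [a for a in self.actions if self.N[(h, a)] == 0]
        if untried:
            return random.choice(untried)
        n_h = sum(self.N[(h, a)] for a in self.actions)
        return max(self.actions, key=lambda a: self.V[(h, a)]
                   + self.c * math.sqrt(math.log(n_h) / self.N[(h, a)]))

    def _sim(self, s, h, d):
        if d == 0:
            return 0.0
        a = self._ucb(h)
        sp, o, r = self.gen(s, a)
        R = r + self.gamma * self._sim(sp, h + (a, o), d - 1)
        self.N[(h, a)] += 1
        self.V[(h, a)] += (R - self.V[(h, a)]) / self.N[(h, a)]
        return R
```

### DRQN (Deep RL with recurrent belief)

```python
import torch, torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, obs_dim, n_act, hidden=128):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.q = nn.Linear(hidden, n_act)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0: (1, batch, hidden) or None
        x = self.enc(obs_seq).relu()
        h, hN = self.gru(x, h0)   # the GRU hidden state acts as a learned belief
        return self.q(h), hN      # Q-values per time step, final hidden state
```

### pomdp_py (library)

```python
import pomdp_py

# Fragment: assumes `agent` has already been built for the problem
# (its belief and its policy, transition, observation, and reward models).
planner = pomdp_py.POMCP(max_depth=20, num_sims=1000,
                         discount_factor=0.95, exploration_const=50)
action = planner.plan(agent)
```

## Decision criteria

| Problem size | Solver |
|---|---|
| \|S\| < 20 | exact / SARSOP |
| \|S\| < 10⁴, offline | point-based (SARSOP) |
| large S, online | POMCP / DESPOT |
| raw observations (images) | DRQN / Dreamer |
| unknown dynamics | Bayes-Adaptive / model-based RL |

**Default**: SARSOP for tabular problems, Dreamer-V3 for pixel inputs.

## 🔗 Graph

- Parent: [[MDP]] · [[Reinforcement-Learning]] · [[Decision-Theory]]
- Applications: [[Robotics]] · [[Operation- Western Sun]]
- Adjacent: [[MCTS]]

## 🤖 LLM usage

**When to use**: framing a partial-observability problem, designing a belief state, recommending a solver.

**When not to use**: fully observable environments, where an MDP is sufficient.

## ❌ Anti-patterns

- **Treating observations as state**: violates the Markov property; the policy can only patch over it with hacks such as frame stacking.
- **Dropping the belief at test time**: training on beliefs but passing raw observations at deployment.
- **Exact solver on large S**: the PWLC representation explodes; use a point-based solver instead.
- **No exploration in POMCP**: c=0 makes the search greedy and the belief collapses.

## 🧪 Verification / duplicates

- Verified (Kaelbling 1998, Silver 2010 POMCP, Hafner 2023 Dreamer-V3).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: definition + solver family + Tiger/POMCP/DRQN |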