"매 MDP + observation noise". POMDP 는 agent 가 state 를 직접 관측하지 못하고 noisy observation 만 받는 경우의 decision-making 수학 framework — tuple <S, A, T, R, Ω, O, γ>. 매 belief state (state 위 distribution) 를 유지하며 행동, dialogue / robotics / medical / game-AI 의 standard model.
Key points
Definition
S: state space (hidden).
A: action space.
T(s'|s,a): transition.
R(s,a): reward.
Ω: observation space.
O(o|s',a): observation model.
γ ∈ [0,1): discount.
Belief state
b(s) = P(s | history), sufficient statistic of history.
update: b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s).
POMDP = MDP on belief space (continuous, high-dim).
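The update rule above can be sketched for a small discrete POMDP. This is a minimal illustration, not a reference implementation: the dictionary layout of T and O, the function name belief_update, and the two-state "tiger"-style numbers (listening is 85% accurate and does not change the state) are all assumptions made for the example.

```python
def belief_update(b, a, o, T, O):
    """Return b' with b'(s') proportional to O(o|s',a) * sum_s T(s'|s,a) * b(s).

    b: dict state -> probability
    T: dict (s, a) -> dict s' -> probability   (assumed layout)
    O: dict (s', a) -> dict o -> probability   (assumed layout)
    """
    states = list(b)
    unnormalized = {}
    for sp in states:
        # Prediction step: probability of landing in s' before seeing o.
        predicted = sum(T[(s, a)][sp] * b[s] for s in states)
        # Correction step: weight by the likelihood of the observation.
        unnormalized[sp] = O[(sp, a)][o] * predicted
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {sp: p / z for sp, p in unnormalized.items()}


# Hypothetical two-state example: the hidden state is which side the tiger is on.
STATES = ("left", "right")
T = {(s, "listen"): {sp: 1.0 if sp == s else 0.0 for sp in STATES} for s in STATES}
O = {
    ("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
    ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", T, O)
print(b1)
```

Starting from a uniform belief, one "hear-left" observation shifts the belief to roughly 0.85 / 0.15, which is the observation accuracy, as expected when the prior is uniform and the state does not change.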
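Solver families
Exact: value iteration over beliefs, using the fact that the finite-horizon value function is piecewise-linear and convex (PWLC) in b; tractable only for very small S.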
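The PWLC property means the value function can be stored as a finite set of vectors (often called alpha vectors), with V(b) = max over α of Σ_s α(s) b(s). A small sketch of evaluating such a representation follows; the alpha vectors and their numbers are hypothetical, chosen only to show the shape of the computation, and are not the output of any solver.

```python
def value(b, alphas):
    """Evaluate V(b) = max_alpha sum_s alpha(s) * b(s) for a belief b."""
    return max(sum(alpha[s] * b[s] for s in b) for alpha in alphas)


# Hypothetical alpha vectors over two hidden states.
alphas = [
    {"left": 10.0, "right": -100.0},   # a plan that pays off only if the state is "left"
    {"left": -100.0, "right": 10.0},   # a plan that pays off only if the state is "right"
    {"left": -1.0, "right": -1.0},     # an information-gathering plan with a small fixed cost
]

v_uncertain = value({"left": 0.5, "right": 0.5}, alphas)
v_certain = value({"left": 1.0, "right": 0.0}, alphas)
print(v_uncertain, v_certain)
```

Each alpha vector is linear in b, and the maximum of linear functions is convex, which is why the value at the uncertain belief sits below the values at the certain ones.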
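Online: POMCP (Monte Carlo tree search over action-observation histories, with the belief represented by sampled states). Simplified sketch:

    import math, random
    from collections import defaultdict

    class POMCP:
        """Simplified POMCP-style search: UCB1 at every node, no rollout policy."""

        def __init__(self, gen, actions, c=1.0, gamma=0.95):
            self.gen = gen                # generator: (s, a) -> (s', o, r)
            self.actions = list(actions)  # finite action set, assumed to be supplied by the caller
            self.c, self.gamma = c, gamma
            self.N = defaultdict(int)     # visit count per (history, action)
            self.V = defaultdict(float)   # running mean return per (history, action)

        def search(self, belief, depth=20, sims=500):
            # belief: non-empty list of sampled states (particles)
            for _ in range(sims):
                s = random.choice(belief)
                self._sim(s, (), depth)
            return max(self.actions, key=lambda a: self.V[((), a)])

        def _ucb(self, h):
            # UCB1 over actions at history h (assumed selection rule); try each action once first
            untried = [a for a in self.actions if self.N[(h, a)] == 0]
            if untried:
                return random.choice(untried)
            n_h = sum(self.N[(h, a)] for a in self.actions)
            return max(self.actions, key=lambda a: self.V[(h, a)]
                       + self.c * math.sqrt(math.log(n_h) / self.N[(h, a)]))

        def _sim(self, s, h, d):
            if d == 0:
                return 0.0
            a = self._ucb(h)
            sp, o, r = self.gen(s, a)
            R = r + self.gamma * self._sim(sp, h + (a, o), d - 1)
            self.N[(h, a)] += 1
            self.V[(h, a)] += (R - self.V[(h, a)]) / self.N[(h, a)]  # incremental mean
            return R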