2nd/10_Wiki/Topics/AI_and_ML/POMDP.md
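koriweb d8a80f6272 chore(wiki): canonical normalization of dangling links (768 files / 1,200 links)
Replaced [[wikilinks]] that differed only in spelling (notation variants) with the canonical title of the target document,
reconnecting 1,200 broken links. Only normalized title/filename matches were applied; alias matching was
excluded because of the over-merge risk (ambiguity guard). Originals are backed up in _link_reconcile_backup/.
Tool: Datacollect/scripts/link_reconcile_apply.mjs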

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 12:24:15 +09:00


id: wiki-2026-0508-pomdp
title: POMDP
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: Partially-Observable-MDP, Partially-Observable-Markov-Decision-Process
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: reinforcement-learning, planning, belief-state, pomdp, decision-making
raw_sources:
last_reinforced: 2026-05-10
github_commit: pending
tech_stack: language python, framework pytorch-pomdp_py

POMDP

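One-line summary

"MDP + observation noise". A POMDP is the mathematical decision-making framework for the case where the agent cannot observe the state directly and receives only noisy observations: the tuple <S, A, T, R, Ω, O, γ>. The agent maintains a belief state (a distribution over states) and acts on it. It is a standard model for dialogue, robotics, medical decision-making, and game AI.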

Key points

Definition

  • S: state space (hidden).
  • A: action space.
  • T(s'|s,a): transition.
  • R(s,a): reward.
  • Ω: observation space.
  • O(o|s',a): observation model.
  • γ ∈ [0,1): discount.

Belief state

  • b(s) = P(s | history), sufficient statistic of history.
  • update: b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s).
  • POMDP = MDP on belief space (continuous, high-dim).
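
The update rule above follows from Bayes' rule plus the fact that the observation depends only on the new state and the action. A short derivation, using only the symbols from the definition (the normalizer P(o | b, a) is named here for convenience and is not part of the tuple):

```latex
\begin{align*}
b'(s') &= P(s' \mid b, a, o)
        = \frac{P(o \mid s', a)\, P(s' \mid b, a)}{P(o \mid b, a)} \\
       &= \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
               {\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)} \\[4pt]
\rho(b, a) &= \sum_{s \in S} b(s)\, R(s, a)
\end{align*}
```

The last line is the expected reward of the belief-space MDP: once b is treated as the state, ρ plays the role that R plays in an ordinary MDP.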

Solver families

  1. Exact: value iteration on belief (PWLC), tractable only for tiny S.
  2. Point-based (PBVI, SARSOP, Perseus): sample beliefs, backup.
  3. Online MCTS: POMCP (Silver 2010), DESPOT; for large state spaces and online planning.
  4. Deep RL: DRQN, R2D2, Dreamer (latent belief = RNN state); the modern default.
  5. Bayes-Adaptive: BAMCP, which also learns the dynamics.
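
The PWLC property behind items 1 and 2 means the value function can be stored as a set of α-vectors, with V(b) = max over α of α·b. A minimal sketch, assuming horizon-1 α-vectors built from the Tiger rewards used in the Patterns section (the names `value` and `ALPHAS` are illustrative, not from any library):

```python
def value(belief, alpha_vectors):
    """Return (best value, best action) for a belief under a PWLC value function."""
    def dot(alpha):
        return sum(a_i * b_i for a_i, b_i in zip(alpha, belief))
    action = max(alpha_vectors, key=lambda a: dot(alpha_vectors[a]))
    return dot(alpha_vectors[action]), action

# Horizon-1 alpha vectors for Tiger: one per action, entries are R(s, a) for s in (TL, TR).
ALPHAS = {
    "LISTEN": (-1.0, -1.0),
    "OL": (-100.0, 10.0),
    "OR": (10.0, -100.0),
}

v_uniform, a_uniform = value((0.5, 0.5), ALPHAS)   # unsure: listening is best
v_sure, a_sure = value((0.02, 0.98), ALPHAS)       # tiger almost surely right: open left
```

Exact solvers enumerate every useful α-vector, which is what blows up; point-based solvers keep only the vectors that are best at a sampled set of beliefs.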

vs MDP

  • MDP: full observability, policy π(s) → a.
  • POMDP: policy π(b) → a or π(history) → a.
  • Pitfall: training an MDP policy directly on observations is wrong, because a single observation is not a Markov state.
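
The difference between π(o) and π(b) can be made concrete with the Tiger numbers from the Patterns section (85% listen accuracy, +10 / -100 for opening a door). This is a sketch under those assumptions; the 0.95 threshold and the function names are hypothetical choices for illustration:

```python
def expected_open_reward(p_safe):
    """Expected reward of opening a door that is tiger-free with probability p_safe."""
    return p_safe * 10 + (1 - p_safe) * (-100)

def obs_policy(o):
    """Memoryless policy pi(o): trusts a single noisy observation."""
    return "OR" if o == "HL" else "OL"

def belief_policy(b_tl, threshold=0.95):
    """Belief policy pi(b), where b_tl = P(tiger is left)."""
    if b_tl >= threshold:
        return "OR"          # tiger almost surely left: open right
    if b_tl <= 1 - threshold:
        return "OL"          # tiger almost surely right: open left
    return "LISTEN"          # not confident enough yet

# One listen is right 85% of the time, so opening immediately loses on average.
after_one_listen = expected_open_reward(0.85)       # about -6.5
# Two agreeing listens give belief 0.85^2 / (0.85^2 + 0.15^2), about 0.97.
b_two = 0.85**2 / (0.85**2 + 0.15**2)
after_two_listens = expected_open_reward(b_two)     # about 6.68
```

The memoryless policy cannot tell "heard left once" from "heard left twice", which is exactly the information the belief carries.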

Applications

  1. dialogue system — user goal hidden.
  2. robotics — sensor noise, occlusion.
  3. medical treatment — patient state from labs/symptoms.
  4. game AI — fog-of-war (StarCraft, Poker, Operation- Western Sun).
  5. autonomous driving — pedestrian intent.

💻 Patterns

Tiger problem (canonical POMDP)

# States: tiger_left, tiger_right
# Actions: open_left, open_right, listen
# Obs: hear_left, hear_right (85% accurate after listen)
import numpy as np

S = ["TL", "TR"]
A = ["OL", "OR", "LISTEN"]
O = ["HL", "HR"]

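def T(s, a):
    # Transition model: returns {next_state: probability}.
    if a in ("OL", "OR"):
        return {"TL": 0.5, "TR": 0.5}   # opening a door resets the tiger uniformly
    return {s: 1.0}                     # listening leaves the state unchanged

def R(s, a):
    # Listening costs 1; opening the tiger's door costs 100; the other door pays 10.
    return {"LISTEN": -1,
            "OL": -100 if s == "TL" else 10,
            "OR": -100 if s == "TR" else 10}[a]

def O_model(o, s, a):
    # Observation model O(o | s', a); only LISTEN is informative.
    if a != "LISTEN":
        return 0.5
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15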

Belief update (Bayes filter)

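def update_belief(b, a, o, S, T, O_model):
    # Bayes filter: b'(s') ∝ O(o|s',a) * sum_s T(s'|s,a) * b(s)
    b_new = {}
    for sp in S:
        prior = sum(T(s, a).get(sp, 0) * b[s] for s in S)   # prediction step
        b_new[sp] = O_model(o, sp, a) * prior                # correction step
    Z = sum(b_new.values())   # normalizer P(o | b, a)
    if Z == 0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / Z for s, p in b_new.items()}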
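
A worked run of the Bayes filter on the Tiger problem may help check an implementation by hand. The block below is a self-contained sketch that restates compact versions of the Tiger models, so the helper bodies here are condensed assumptions and not the exact functions above:

```python
S = ["TL", "TR"]

def T(s, a):
    # Opening a door resets the tiger; listening keeps the state.
    return {"TL": 0.5, "TR": 0.5} if a in ("OL", "OR") else {s: 1.0}

def O_model(o, s, a):
    if a != "LISTEN":
        return 0.5
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15

def update_belief(b, a, o):
    unnorm = {sp: O_model(o, sp, a) * sum(T(s, a).get(sp, 0.0) * b[s] for s in S)
              for sp in S}
    Z = sum(unnorm.values())
    return {s: p / Z for s, p in unnorm.items()}

b0 = {"TL": 0.5, "TR": 0.5}
b1 = update_belief(b0, "LISTEN", "HL")   # P(TL) = 0.85
b2 = update_belief(b1, "LISTEN", "HL")   # P(TL) is about 0.97
b3 = update_belief(b2, "LISTEN", "HR")   # one contradicting observation: back to 0.85
b4 = update_belief(b3, "OL", "HL")       # opening a door resets the belief to 0.5 / 0.5
```

Because the two listen outcomes are symmetric, one contradicting observation exactly cancels one agreeing observation, which is a handy sanity check.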

Particle filter (continuous / large S)

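import numpy as np

class ParticleBelief:
    def __init__(self, particles): self.p = list(particles)
    def update(self, a, o, sample_T, O_model):
        # propagate each particle through the dynamics, weight it by the observation likelihood
        new = []
        for s in self.p:
            sp = sample_T(s, a)
            w = O_model(o, sp, a)
            new.append((sp, w))
        ws = np.array([w for _, w in new], dtype=float)
        if ws.sum() == 0:
            # no particle explains o; fall back to uniform weights (one possible choice)
            ws = np.ones(len(new))
        ws = ws / ws.sum()
        # resample with replacement, proportional to weight
        idx = np.random.choice(len(new), len(new), p=ws)
        self.p = [new[i][0] for i in idx]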
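
To see that weighting plus resampling approximates the exact Bayes update, the sketch below runs a single LISTEN step of the Tiger problem on 10,000 particles. It is a standard-library variant written for illustration (`random.choices` instead of NumPy); the function names are assumptions, not part of the class above:

```python
import random

def listen_likelihood(o, s):
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15

def particle_update(particles, o, rng):
    # LISTEN does not move the tiger, so the propagation step is the identity here.
    weights = [listen_likelihood(o, s) for s in particles]
    # Resample with replacement, proportional to weight.
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = ["TL"] * 5000 + ["TR"] * 5000        # uniform prior as particles
particles = particle_update(particles, "HL", rng)
frac_tl = particles.count("TL") / len(particles)  # close to the exact posterior 0.85
```

The estimate is noisy: with N particles the error shrinks roughly as 1/sqrt(N), so small particle sets can drift away from the true belief.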

POMCP (online MCTS on history)

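import math, random
from collections import defaultdict

class POMCP:
    # Simplified sketch: no rollout policy and no per-node particle sets.
    def __init__(self, gen, actions, c=1.0, gamma=0.95):
        self.gen = gen      # generator: (s, a) -> (s', o, r)
        self.actions = list(actions)
        self.c, self.gamma = c, gamma
        self.N = defaultdict(int)     # visit count per (history, action)
        self.V = defaultdict(float)   # running mean return per (history, action)
    def search(self, belief, depth=20, sims=500):
        # belief: list of state particles for the current (root) history
        for _ in range(sims):
            s = random.choice(belief)
            self._sim(s, (), depth)
        return max(self.actions, key=lambda a: self.V[((), a)])
    def _ucb(self, h):
        # try every action once, then maximize the UCB1 score
        untried = [a for a in self.actions if self.N[(h, a)] == 0]
        if untried:
            return random.choice(untried)
        n_h = sum(self.N[(h, a)] for a in self.actions)
        return max(self.actions, key=lambda a: self.V[(h, a)]
                   + self.c * math.sqrt(math.log(n_h) / self.N[(h, a)]))
    def _sim(self, s, h, d):
        if d == 0: return 0
        a = self._ucb(h)
        sp, o, r = self.gen(s, a)
        R = r + self.gamma * self._sim(sp, h + (a, o), d - 1)
        self.N[(h, a)] += 1
        self.V[(h, a)] += (R - self.V[(h, a)]) / self.N[(h, a)]
        return R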
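
POMCP never needs explicit T or O tables, only a black-box simulator with the `(s, a) -> (s', o, r)` signature noted in the class. A minimal sketch of such a generator for the Tiger problem, assuming the same numbers as the Tiger block (the name `tiger_gen` and the `rng` parameter are hypothetical):

```python
import random

def tiger_gen(s, a, rng=random):
    """Sample (next_state, observation, reward) for the Tiger problem."""
    if a == "LISTEN":
        # State is unchanged; the observation is correct with probability 0.85.
        correct = "HL" if s == "TL" else "HR"
        wrong = "HR" if s == "TL" else "HL"
        o = correct if rng.random() < 0.85 else wrong
        return s, o, -1
    # Opening a door: penalty if the tiger is behind it, then the problem resets.
    tiger_door = "OL" if s == "TL" else "OR"
    r = -100 if a == tiger_door else 10
    sp = rng.choice(["TL", "TR"])
    o = rng.choice(["HL", "HR"])   # uninformative observation after opening
    return sp, o, r

# A short random rollout through the simulator.
rng = random.Random(0)
state, total = "TL", 0
for _ in range(5):
    action = rng.choice(["OL", "OR", "LISTEN"])
    state, obs, reward = tiger_gen(state, action, rng)
    total += reward
```

A function of this shape is what the `gen` argument of the class above expects; the initial belief is then just a list of sampled states such as `["TL", "TR"]`.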

DRQN (Deep RL with recurrent belief)

import torch, torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, obs_dim, n_act, hidden=128):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.q = nn.Linear(hidden, n_act)
    def forward(self, obs_seq, h0=None):
        x = self.enc(obs_seq).relu()
        h, hN = self.gru(x, h0)
        return self.q(h), hN

pomdp_py (library)

import pomdp_py
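# Define the problem as a pomdp_py.POMDP (agent + environment); POMCP expects
# the agent's belief to be a particle representation. Then: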
planner = pomdp_py.POMCP(max_depth=20, num_sims=1000,
                        discount_factor=0.95, exploration_const=50)
action = planner.plan(agent)

Decision criteria

Problem size | Solver
S
S
large S, online | POMCP / DESPOT
raw obs (image) | DRQN / Dreamer
unknown dynamics | Bayes-Adaptive / model-based RL

Default: SARSOP for tabular problems, Dreamer-V3 for pixel observations.

🔗 Graph

🤖 LLM usage

When: framing partial-observability problems, belief-state design, solver recommendation. When not: fully observable environments, where an MDP is enough.

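Anti-patterns

  • Treating obs as state: violates the Markov property; the policy can only hack around it with frame stacking.
  • Forgetting the belief at test time: the policy receives the belief during training but raw observations at deployment.
  • Exact solver on large S: the PWLC representation explodes; use a point-based solver instead.
  • No exploration in POMCP: c=0 makes the search greedy and the belief collapses.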
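
The last item can be illustrated with the UCB1 score that POMCP uses for action selection. In this sketch the statistics are made-up numbers chosen for illustration: LISTEN has a single unlucky sample, OL has been tried 99 times.

```python
import math

def ucb1(q, n_parent, n_action, c):
    """Mean value plus an exploration bonus that shrinks with the visit count."""
    return q + c * math.sqrt(math.log(n_parent) / n_action)

def select(stats, c):
    """stats maps action -> (mean return, visit count)."""
    n_parent = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb1(stats[a][0], n_parent, stats[a][1], c))

stats = {"LISTEN": (-6.0, 1), "OL": (-5.0, 99)}   # hypothetical node statistics
greedy_pick = select(stats, c=0.0)     # "OL": the single bad LISTEN sample is never revisited
explore_pick = select(stats, c=50.0)   # "LISTEN": the bonus forces a second look
```

The exploration constant should be on the scale of the returns; with Tiger rewards spanning -100 to +10, a value of c near 1 behaves almost like c=0.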

🧪 Verification / duplicates

  • Verified (Kaelbling 1998, Silver 2010 POMCP, Hafner 2023 Dreamer-V3).
  • Trust level A.

🕓 Changelog

Date | Change
2026-05-08 | Phase 1
2026-05-10 | Manual cleanup: definition + solver family + Tiger/POMCP/DRQN