---
id: wiki-2026-0508-pomdp
title: POMDP
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Partially-Observable-MDP, Partially-Observable-Markov-Decision-Process]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, planning, belief-state, pomdp, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-pomdp_py
---

# POMDP

## In one line
> **"MDP + observation noise"**. A POMDP is the mathematical framework for decision-making when the agent cannot observe the state directly and receives only noisy observations; it is the tuple `<S, A, T, R, Ω, O, γ>`. The agent maintains a belief state (a distribution over states) and acts on it. It is the standard model for dialogue, robotics, medical treatment, and game AI.

## Core

### Definition
- **S**: state space (hidden).
- **A**: action space.
- **T(s'|s,a)**: transition model.
- **R(s,a)**: reward.
- **Ω**: observation space.
- **O(o|s',a)**: observation model.
- **γ ∈ [0,1)**: discount factor.
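
The tuple above can be written down as a small container, which also makes the probabilistic constraints explicit. This is a minimal sketch, not a library API: the class name `POMDPSpec`, its field names, `validate`, and the two-state toy model are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class POMDPSpec:
    """Hypothetical container for the tuple <S, A, T, R, Ω, O, γ>."""
    states: Sequence[str]                      # S (hidden)
    actions: Sequence[str]                     # A
    observations: Sequence[str]                # Ω
    T: Callable[[str, str], Dict[str, float]]  # T(s, a) -> {s': prob}
    R: Callable[[str, str], float]             # R(s, a)
    O: Callable[[str, str, str], float]        # O(o, s', a) -> prob
    gamma: float = 0.95                        # γ in [0, 1)

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless T and O are distributions and γ is in [0, 1)."""
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")
        for a in self.actions:
            for s in self.states:
                if abs(sum(self.T(s, a).values()) - 1.0) > tol:
                    raise ValueError(f"T(.|{s},{a}) does not sum to 1")
                total = sum(self.O(o, s, a) for o in self.observations)
                if abs(total - 1.0) > tol:
                    raise ValueError(f"O(.|{s},{a}) does not sum to 1")


# Toy model (assumed for illustration): the state never changes and the
# observation reports it correctly 80% of the time.
toy = POMDPSpec(
    states=["s0", "s1"],
    actions=["wait"],
    observations=["o0", "o1"],
    T=lambda s, a: {s: 1.0},
    R=lambda s, a: 0.0,
    O=lambda o, sp, a: 0.8 if o[1] == sp[1] else 0.2,
    gamma=0.9,
)
toy.validate()  # passes silently when the model is well-formed
```

Checking the model once up front is cheap and catches the most common modelling slip, an observation or transition row that does not sum to 1.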

### Belief state
- `b(s) = P(s | history)`: a sufficient statistic of the history.
- Update after taking action `a` and observing `o`: `b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s)`.
- A POMDP is an MDP over the belief space, which is continuous with dimension `|S| - 1`.
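
To make the "MDP over the belief space" view concrete, the sketch below computes the two quantities a belief-space planner needs besides the update itself: the expected reward of a belief, `ρ(b, a) = Σ_s b(s) R(s, a)`, and the probability of each observation, `P(o | b, a)`. The model values are those of the Tiger problem in the Patterns section; the function names `belief_reward` and `obs_prob` are hypothetical.

```python
S = ["TL", "TR"]          # tiger left / tiger right
OBS = ["HL", "HR"]        # hear left / hear right


def T(s, a):
    # opening a door resets the tiger; listening leaves it in place
    return {"TL": 0.5, "TR": 0.5} if a in ("OL", "OR") else {s: 1.0}


def R(s, a):
    return {"LISTEN": -1.0,
            "OL": -100.0 if s == "TL" else 10.0,
            "OR": -100.0 if s == "TR" else 10.0}[a]


def O_model(o, sp, a):
    if a != "LISTEN":
        return 0.5
    return 0.85 if (o, sp) in (("HL", "TL"), ("HR", "TR")) else 0.15


def belief_reward(b, a):
    """rho(b, a): reward of the belief MDP, the expectation of R under b."""
    return sum(b[s] * R(s, a) for s in S)


def obs_prob(b, a, o):
    """P(o | b, a): the normaliser of the belief update."""
    return sum(O_model(o, sp, a) * sum(T(s, a).get(sp, 0.0) * b[s] for s in S)
               for sp in S)


uniform = {"TL": 0.5, "TR": 0.5}
print(belief_reward(uniform, "LISTEN"))   # -1.0
print(belief_reward(uniform, "OL"))       # -45.0
print(obs_prob(uniform, "LISTEN", "HL"))  # approximately 0.5
```

Under the uniform belief, opening either door has an expected reward of -45, which is why listening (-1) is the sensible first move.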

### Solver families
1. **Exact**: value iteration over beliefs, exploiting the piecewise-linear convex (PWLC) value function; tractable only for tiny S.
2. **Point-based** (PBVI, SARSOP, Perseus): sample a set of beliefs and run backups only at those points.
3. **Online MCTS**: POMCP (Silver 2010), DESPOT; online planning for large state spaces.
4. **Deep RL**: DRQN, R2D2, Dreamer (latent belief = RNN state); the modern default.
5. **Bayes-Adaptive**: BAMCP; additionally learns the dynamics.
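
PWLC means the value function can be stored as a finite set of alpha-vectors, with `V(b) = max_α Σ_s α(s) b(s)`. As a small illustration (not a solver), the sketch below evaluates the horizon-1 value function of the Tiger problem from the Patterns section, where each action contributes one alpha-vector equal to its immediate reward. The function name `best_alpha` is hypothetical.

```python
# Horizon-1 alpha-vectors for Tiger: alpha_a(s) = R(s, a).
ALPHAS = {
    "LISTEN": {"TL": -1.0, "TR": -1.0},
    "OL": {"TL": -100.0, "TR": 10.0},
    "OR": {"TL": 10.0, "TR": -100.0},
}


def best_alpha(b, alphas=ALPHAS):
    """Return (action, value) maximising the dot product alpha . b."""
    scored = {a: sum(alpha[s] * b[s] for s in b) for a, alpha in alphas.items()}
    action = max(scored, key=scored.get)
    return action, scored[action]


print(best_alpha({"TL": 0.5, "TR": 0.5}))    # listening wins under uncertainty
print(best_alpha({"TL": 0.02, "TR": 0.98}))  # opening the left door wins
```

Exact value iteration grows this set at every backup, which is where the blow-up comes from; point-based methods only keep vectors that are maximal at the sampled beliefs.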

### vs MDP
- MDP: full observability, policy `π(s) → a`.
- POMDP: policy `π(b) → a` or `π(history) → a`.
- **Pitfall**: training an MDP-style policy directly on observations, `π(o) → a`, is wrong: a single observation is not Markov, so the policy cannot tell apart states that produce the same observation.
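
A short calculation shows what the pitfall costs. Using the Tiger numbers from the Patterns section (85% listening accuracy, +10 / -100 for opening a door), a policy that reacts only to the latest observation acts on at most one observation's worth of evidence, while a policy with memory can wait for agreement. The helper names below are hypothetical, and the sketch assumes a uniform prior and observations that all point the same way.

```python
def posterior_after(n, acc=0.85):
    """P(tiger left | n 'hear left' observations in a row), uniform prior."""
    return acc ** n / (acc ** n + (1 - acc) ** n)


def expected_open_reward(p_left, r_safe=10.0, r_tiger=-100.0):
    """Expected reward of opening the right door when P(tiger left) = p_left."""
    return p_left * r_safe + (1 - p_left) * r_tiger


for n in (1, 2, 3):
    p = posterior_after(n)
    print(n, round(p, 4), round(expected_open_reward(p), 2))
```

With one observation the expected reward of opening is -6.5, worse than listening again; from two agreeing observations onward it is positive. Only a policy over beliefs or histories can exploit that.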

### Applications
1. Dialogue systems: the user's goal is hidden.
2. Robotics: sensor noise, occlusion.
3. Medical treatment: patient state inferred from labs and symptoms.
4. Game AI: fog of war (StarCraft, Poker, [[Operation- Western Sun]]).
5. Autonomous driving: pedestrian intent.

## 💻 Patterns

### Tiger problem (canonical POMDP)
```python
# States: tiger_left (TL), tiger_right (TR)
# Actions: open_left (OL), open_right (OR), listen (LISTEN)
# Obs: hear_left (HL), hear_right (HR); 85% accurate after LISTEN

S = ["TL", "TR"]
A = ["OL", "OR", "LISTEN"]
O = ["HL", "HR"]

def T(s, a):
    """Transition model: returns {next_state: probability}."""
    if a in ("OL", "OR"):
        return {"TL": 0.5, "TR": 0.5}  # opening a door resets the problem
    return {s: 1.0}                    # listening leaves the tiger in place

def R(s, a):
    """Reward: -1 to listen, -100 for the tiger's door, +10 for the other door."""
    return {"LISTEN": -1,
            "OL": -100 if s == "TL" else 10,
            "OR": -100 if s == "TR" else 10}[a]

def O_model(o, s, a):
    """Observation model O(o | s', a), where s is the state after the action."""
    if a != "LISTEN":
        return 0.5  # observations after opening a door carry no information
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15
```

### Belief update (Bayes filter)
```python
def update_belief(b, a, o, S, T, O_model):
    """One Bayes-filter step: predict with T, then correct with O(o | s', a)."""
    b_new = {}
    for sp in S:
        prior = sum(T(s, a).get(sp, 0.0) * b[s] for s in S)  # predicted P(s')
        b_new[sp] = O_model(o, sp, a) * prior
    Z = sum(b_new.values())  # normaliser, equal to P(o | b, a)
    if Z == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / Z for s, p in b_new.items()}
```
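
As a usage sketch, the filter can be traced over a short Tiger episode: two "hear left" observations, one contradicting "hear right", then opening a door. The block repeats compact versions of the Tiger model and the filter so that it runs on its own; the episode itself is made up for illustration.

```python
def T(s, a):
    return {"TL": 0.5, "TR": 0.5} if a in ("OL", "OR") else {s: 1.0}


def O_model(o, sp, a):
    if a != "LISTEN":
        return 0.5
    return 0.85 if (o, sp) in (("HL", "TL"), ("HR", "TR")) else 0.15


def update_belief(b, a, o, S, T, O_model):
    b_new = {sp: O_model(o, sp, a) * sum(T(s, a).get(sp, 0.0) * b[s] for s in S)
             for sp in S}
    Z = sum(b_new.values())
    return {s: p / Z for s, p in b_new.items()}


S = ["TL", "TR"]
b = {"TL": 0.5, "TR": 0.5}
trace = []
episode = [("LISTEN", "HL"), ("LISTEN", "HL"), ("LISTEN", "HR"), ("OL", "HL")]
for a, o in episode:
    b = update_belief(b, a, o, S, T, O_model)
    trace.append(round(b["TL"], 4))
print(trace)  # [0.85, 0.9698, 0.85, 0.5]
```

The contradicting observation cancels exactly one of the agreeing ones, and opening a door returns the belief to uniform because the problem resets.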

### Particle filter (continuous / large S)
```python
import numpy as np

class ParticleBelief:
    """Belief represented by a list of sampled states (unweighted particles)."""

    def __init__(self, particles):
        self.p = list(particles)

    def update(self, a, o, sample_T, O_model):
        # propagate each particle through the dynamics, weight by the observation
        new = []
        for s in self.p:
            sp = sample_T(s, a)
            w = O_model(o, sp, a)
            new.append((sp, w))
        # resample in proportion to the weights
        ws = np.array([w for _, w in new], dtype=float)
        if ws.sum() == 0:
            raise ValueError("all particles have zero weight for this observation")
        ws = ws / ws.sum()
        idx = np.random.choice(len(new), len(new), p=ws)
        self.p = [new[i][0] for i in idx]
```
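
A quick sanity check for any particle filter is to run it on a problem whose exact belief is known. The sketch below does this for Tiger, where one "hear left" from a uniform prior should give `P(TL) = 0.85`. It is a function-style restatement of the same propagate, weight, resample steps using only the standard library (`random.choices` instead of NumPy); `particle_update`, `sample_T`, and the particle count are assumptions made for the example.

```python
import random


def sample_T(s, a, rng):
    # sample a next state: doors reset the tiger, listening keeps it in place
    return rng.choice(["TL", "TR"]) if a in ("OL", "OR") else s


def O_model(o, sp, a):
    if a != "LISTEN":
        return 0.5
    return 0.85 if (o, sp) in (("HL", "TL"), ("HR", "TR")) else 0.15


def particle_update(particles, a, o, rng):
    """Propagate, weight by the observation likelihood, then resample."""
    proposed = [sample_T(s, a, rng) for s in particles]
    weights = [O_model(o, sp, a) for sp in proposed]
    if sum(weights) == 0:
        raise ValueError("all particles have zero weight")
    return rng.choices(proposed, weights=weights, k=len(proposed))


rng = random.Random(0)  # fixed seed so the run is repeatable
particles = [rng.choice(["TL", "TR"]) for _ in range(20000)]
particles = particle_update(particles, "LISTEN", "HL", rng)
frac_tl = particles.count("TL") / len(particles)
print(round(frac_tl, 2))  # close to the exact posterior 0.85
```

The estimate carries sampling noise, so it should be compared to the exact value with a tolerance rather than for equality.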
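
### POMCP (online MCTS on history)
```python
import math
import random
from collections import defaultdict

class POMCP:
    """Simplified POMCP sketch: UCB1 over history nodes; untried actions are
    picked at random in place of a separate rollout policy."""

    def __init__(self, gen, actions, c=1.0, gamma=0.95):
        self.gen = gen               # black-box generator: (s, a) -> (s', o, r)
        self.actions = list(actions)
        self.c, self.gamma = c, gamma
        self.N = defaultdict(int)    # visit count per (history, action)
        self.V = defaultdict(float)  # running mean return per (history, action)

    def search(self, belief, depth=20, sims=500):
        # belief: list of state particles; each simulation starts from one sample
        for _ in range(sims):
            s = random.choice(belief)
            self._sim(s, (), depth)
        return max(self.actions, key=lambda a: self.V[((), a)])

    def _ucb(self, h):
        # try every action once, then maximise mean value plus exploration bonus
        untried = [a for a in self.actions if self.N[(h, a)] == 0]
        if untried:
            return random.choice(untried)
        n_h = sum(self.N[(h, a)] for a in self.actions)
        return max(self.actions, key=lambda a: self.V[(h, a)]
                   + self.c * math.sqrt(math.log(n_h) / self.N[(h, a)]))

    def _sim(self, s, h, d):
        if d == 0:
            return 0.0
        a = self._ucb(h)
        sp, o, r = self.gen(s, a)
        R = r + self.gamma * self._sim(sp, h + (a, o), d - 1)
        self.N[(h, a)] += 1
        self.V[(h, a)] += (R - self.V[(h, a)]) / self.N[(h, a)]
        return R
```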
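
POMCP never needs the explicit `T` and `O` tables, only a black-box generator `(s, a) -> (s', o, r)`. As an illustration, here is what such a generator could look like for the Tiger problem defined earlier. The function name `tiger_gen` and its optional `rng` parameter are hypothetical.

```python
import random


def tiger_gen(s, a, rng=random):
    """Sample one step of Tiger: returns (next_state, observation, reward)."""
    if a == "LISTEN":
        correct, wrong = ("HL", "HR") if s == "TL" else ("HR", "HL")
        o = correct if rng.random() < 0.85 else wrong  # 85% accurate hearing
        return s, o, -1.0                              # tiger stays in place
    tiger_door = "OL" if s == "TL" else "OR"
    r = -100.0 if a == tiger_door else 10.0
    sp = rng.choice(["TL", "TR"])                      # problem resets
    o = rng.choice(["HL", "HR"])                       # uninformative observation
    return sp, o, r


rng = random.Random(0)
print(tiger_gen("TL", "LISTEN", rng))
print(tiger_gen("TL", "OR", rng))
```

A planner can then be driven by passing `tiger_gen` as the generator and a list of sampled states (for example, an even mix of `"TL"` and `"TR"`) as the root belief.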

### DRQN (Deep RL with recurrent belief)
```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, obs_dim, n_act, hidden=128):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.q = nn.Linear(hidden, n_act)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0: (1, batch, hidden) or None
        x = self.enc(obs_seq).relu()
        h, hN = self.gru(x, h0)  # the GRU state plays the role of the belief
        return self.q(h), hN     # Q-values per step: (batch, time, n_act)
```

### pomdp_py (library)
```python
import pomdp_py

# Define the problem first (a pomdp_py.POMDP with its agent and environment);
# `agent` below is that problem's agent, with a particle belief for POMCP.
planner = pomdp_py.POMCP(max_depth=20, num_sims=1000,
                         discount_factor=0.95, exploration_const=50)
action = planner.plan(agent)
```

## Decision criteria
| Problem setting | Solver |
|---|---|
| \|S\| < 20 | exact / SARSOP |
| \|S\| < 10⁴, offline | point-based (SARSOP) |
| large S, online | POMCP / DESPOT |
| raw observations (images) | DRQN / Dreamer |
| unknown dynamics | Bayes-Adaptive / model-based RL |

**Default**: SARSOP for tabular problems, Dreamer-V3 for pixel observations.
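
The table can be read as a small decision procedure. The sketch below encodes it as a function; the name `recommend_solver`, its parameters, and the order in which the rows are checked are assumptions, since the table does not say which row wins when several apply or what to do for a large offline problem (this sketch falls back to online planning there).

```python
def recommend_solver(n_states=None, online=False, raw_obs=False,
                     known_dynamics=True):
    """Map a problem description to a solver family from the table.

    Assumed precedence: unknown dynamics, then raw observations, then size.
    n_states=None means the state space is too large to enumerate.
    """
    if not known_dynamics:
        return "Bayes-Adaptive / model-based RL"
    if raw_obs:
        return "DRQN / Dreamer"
    if n_states is not None and n_states < 20:
        return "exact / SARSOP"
    if n_states is not None and n_states < 10 ** 4 and not online:
        return "point-based (SARSOP)"
    return "POMCP / DESPOT"  # large or online problems


print(recommend_solver(n_states=2))                 # Tiger-sized problem
print(recommend_solver(n_states=5000))              # mid-sized, offline
print(recommend_solver(n_states=None, online=True)) # large, online
print(recommend_solver(raw_obs=True))               # pixels
```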

## 🔗 Graph
- Parents: [[MDP]] · [[Reinforcement-Learning]] · [[Decision Theory]]
- Applications: [[Robotics]] · [[Operation- Western Sun]]
- Adjacent: [[MCTS]]

## 🤖 LLM usage
**When**: framing a partial-observability problem, designing the belief state, recommending a solver.
**When not**: fully observable environments, where a plain MDP is enough.

## ❌ Anti-patterns
- **Treat obs as state**: violates the Markov assumption; the policy can only patch over it with hacks such as frame stacking.
- **Forget belief in test**: the policy is trained on beliefs but fed raw observations at deployment.
- **Exact solver on large S**: the PWLC representation explodes; use a point-based solver instead.
- **No exploration in POMCP**: `c=0` makes the search greedy, and the belief collapses.
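
The effect of the exploration constant is easy to see on a single node. The sketch below applies the UCB1 rule to made-up statistics in which one action has been tried ten times and another only once; the function name `ucb_pick` and the numbers are hypothetical.

```python
import math


def ucb_pick(stats, c):
    """stats: {action: (mean_value, visit_count)}; returns the UCB1 choice."""
    total = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: stats[a][0]
               + c * math.sqrt(math.log(total) / stats[a][1]))


# One well-explored action and one that was tried a single time.
stats = {"LISTEN": (1.0, 10), "OL": (0.5, 1)}

print(ucb_pick(stats, c=0.0))  # LISTEN: pure greedy, never revisits OL
print(ucb_pick(stats, c=2.0))  # OL: the bonus favours the under-visited action
```

With `c=0` the single early sample of `OL` is treated as final, so a noisy first estimate is never corrected.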

## 🧪 Verification / duplicates
- Verified (Kaelbling 1998, Silver 2010 POMCP, Hafner 2023 Dreamer-V3).
- Trust level A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: definition + solver family + Tiger/POMCP/DRQN |