---
id: wiki-2026-0508-pomdp
title: POMDP
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Partially-Observable-MDP, Partially-Observable-Markov-Decision-Process]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, planning, belief-state, pomdp, decision-making]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch-pomdp_py
---
# POMDP
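## In one line
> **"MDP + observation noise."** A POMDP is the mathematical framework for decision-making when the agent cannot observe the state directly and receives only noisy observations. It is defined by the tuple `<S, A, T, R, Ω, O, γ>`. The agent acts on a belief state (a distribution over states) that it maintains over time. POMDPs are the standard model in dialogue systems, robotics, medical decision-making, and game AI.
## Core
### Definition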
- **S**: state space (hidden).
- **A**: action space.
- **T(s'|s,a)**: transition.
- **R(s,a)**: reward.
- **Ω**: observation space.
- **O(o|s',a)**: observation model.
- **γ ∈ [0,1)**: discount.
### Belief state
- `b(s) = P(s | history)`, sufficient statistic of history.
- update: `b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s)`.
- POMDP = MDP on belief space (continuous, high-dim).
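
The update rule above is Bayes' rule with the normalizer left implicit. Written out in full, using only the symbols from the definition (a standard derivation, included here as a reading aid):

```latex
b'(s') = P(s' \mid b, a, o)
       = \frac{O(o \mid s', a)\,\sum_{s \in S} T(s' \mid s, a)\, b(s)}
              {\sum_{s'' \in S} O(o \mid s'', a)\,\sum_{s \in S} T(s'' \mid s, a)\, b(s)}
```

The denominator is `P(o | b, a)`, the probability of the observation under the current belief. It must be non-zero for the update to be defined.
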
### Solver families
1. **Exact**: value iteration over beliefs, exploiting the piecewise-linear convex (PWLC) value function; tractable only for tiny S.
2. **Point-based** (PBVI, SARSOP, Perseus): sample a set of beliefs and perform backups only at those points.
3. **Online MCTS**: POMCP (Silver 2010), DESPOT; online planning for large state spaces.
4. **Deep RL**: DRQN, R2D2, Dreamer (the RNN state acts as a latent belief); the modern default.
5. **Bayes-Adaptive**: BAMCP; additionally learns the dynamics.
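
To make the PWLC idea concrete: a finite-horizon value function can be stored as a set of alpha vectors, and `V(b)` is the maximum of their dot products with `b`. The following is a minimal sketch for the horizon-1 Tiger problem, assuming the rewards from the Tiger pattern on this page. The names `ALPHAS` and `value` are illustrative.

```python
# Horizon-1 alpha vectors for Tiger, one per action, indexed by state [TL, TR].
# Each entry is the immediate reward R(s, a), assumed from the Tiger pattern.
ALPHAS = {
    "LISTEN": [-1.0, -1.0],
    "OL": [-100.0, 10.0],   # open left: bad if the tiger is on the left
    "OR": [10.0, -100.0],   # open right: bad if the tiger is on the right
}

def value(b):
    """Return (V(b), best action), where V(b) = max over alpha of alpha . b."""
    scores = {a: sum(x * p for x, p in zip(alpha, b)) for a, alpha in ALPHAS.items()}
    best = max(scores, key=scores.get)
    return scores[best], best

print(value([0.5, 0.5]))    # uncertain belief: listening is best
print(value([0.02, 0.98]))  # tiger almost surely on the right: open left
```

Exact solvers grow this set of vectors at every backup, which is why they stay tractable only for tiny S.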
### vs MDP
- MDP: full observability, policy `π(s) → a`.
- POMDP: policy `π(b) → a` or `π(history) → a`.
- **Pitfall**: training an MDP-style policy directly on observations is wrong, because a single observation is not a Markov state.
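
As an illustration of `π(b) → a`, a hand-written threshold policy for the Tiger problem could look like the sketch below. The threshold 0.95 and the name `belief_policy` are assumptions for illustration, not an optimal policy.

```python
def belief_policy(b, threshold=0.95):
    """Map a belief over {"TL", "TR"} to an action.

    Listen until one hypothesis is confident enough, then open the other door.
    """
    if b["TL"] >= threshold:
        return "OR"   # tiger believed to be on the left, so open right
    if b["TR"] >= threshold:
        return "OL"   # tiger believed to be on the right, so open left
    return "LISTEN"   # still too uncertain: gather information

print(belief_policy({"TL": 0.5, "TR": 0.5}))    # LISTEN
print(belief_policy({"TL": 0.97, "TR": 0.03}))  # OR
```

A reactive policy that maps only the latest observation to an action cannot express "listen until confident", which is the point of conditioning on `b`.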
### Applications
1. dialogue systems: the user's goal is hidden.
2. robotics: sensor noise, occlusion.
3. medical treatment: patient state inferred from labs and symptoms.
4. game AI: hidden information (StarCraft fog-of-war, poker, [[Operation- Western Sun]]).
5. autonomous driving: pedestrian intent.
## 💻 Patterns
### Tiger problem (canonical POMDP)
```python
# States: tiger_left, tiger_right
# Actions: open_left, open_right, listen
# Obs: hear_left, hear_right (85% accurate after listen)
S = ["TL", "TR"]
A = ["OL", "OR", "LISTEN"]
O = ["HL", "HR"]

def T(s, a):
    """Transition distribution P(s' | s, a) as a dict."""
    if a in ("OL", "OR"):
        return {"TL": 0.5, "TR": 0.5}  # opening a door resets the tiger
    return {s: 1.0}                    # listening leaves the state unchanged

def R(s, a):
    """Reward for taking action a in hidden state s."""
    return {"LISTEN": -1,
            "OL": -100 if s == "TL" else 10,
            "OR": -100 if s == "TR" else 10}[a]

def O_model(o, s, a):
    """Observation likelihood P(o | s', a)."""
    if a != "LISTEN":
        return 0.5  # uninformative after opening a door
    correct = (o == "HL" and s == "TL") or (o == "HR" and s == "TR")
    return 0.85 if correct else 0.15
```
### Belief update (Bayes filter)
```python
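def update_belief(b, a, o, S, T, O_model):
    """Exact Bayes filter: b'(s') ∝ O(o|s',a) * sum_s T(s'|s,a) b(s)."""
    b_new = {}
    for sp in S:
        prior = sum(T(s, a).get(sp, 0) * b[s] for s in S)  # prediction step
        b_new[sp] = O_model(o, sp, a) * prior              # correction step
    Z = sum(b_new.values())  # normalizer P(o | b, a)
    if Z == 0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / Z for s, p in b_new.items()}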
```
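
A worked example helps check the filter by hand. The sketch below is self-contained: it restates the Tiger models in compact form and folds them into a three-argument `update_belief`, a simplification of the signature above. It applies two consecutive "hear left" observations to a uniform belief.

```python
# Self-contained worked example: two consecutive "hear left" observations in Tiger.
S = ["TL", "TR"]

def T(s, a):
    # listening keeps the state; opening a door resets it uniformly
    return {s: 1.0} if a == "LISTEN" else {"TL": 0.5, "TR": 0.5}

def O_model(o, sp, a):
    if a != "LISTEN":
        return 0.5
    correct = (o == "HL" and sp == "TL") or (o == "HR" and sp == "TR")
    return 0.85 if correct else 0.15

def update_belief(b, a, o):
    unnorm = {sp: O_model(o, sp, a) * sum(T(s, a).get(sp, 0.0) * b[s] for s in S)
              for sp in S}
    Z = sum(unnorm.values())
    return {s: p / Z for s, p in unnorm.items()}

b0 = {"TL": 0.5, "TR": 0.5}
b1 = update_belief(b0, "LISTEN", "HL")
b2 = update_belief(b1, "LISTEN", "HL")
print(round(b1["TL"], 4), round(b2["TL"], 4))  # 0.85 0.9698
```

The second value is `0.85² / (0.85² + 0.15²)`: each consistent observation multiplies the odds by 0.85 / 0.15.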
### Particle filter (continuous / large S)
```python
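import numpy as np

class ParticleBelief:
    def __init__(self, particles):
        self.p = list(particles)

    def update(self, a, o, sample_T, O_model):
        # propagate each particle and weight it by the observation likelihood
        new = []
        for s in self.p:
            sp = sample_T(s, a)
            w = O_model(o, sp, a)
            new.append((sp, w))
        # resample in proportion to the weights
        ws = np.array([w for _, w in new], dtype=float)
        if ws.sum() == 0:
            raise ValueError("all particles have zero weight (particle depletion)")
        ws = ws / ws.sum()
        idx = np.random.choice(len(new), len(new), p=ws)
        self.p = [new[i][0] for i in idx]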
```
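
A usage sketch of the same propagate, weight, resample loop on the Tiger problem. It uses only the standard library (`random.choices` instead of NumPy) so that it runs on its own. The helper names `pf_update`, `sample_T`, and `obs_lik` are illustrative.

```python
import random

def sample_T(s, a, rng):
    # listening keeps the state; opening a door resets it uniformly
    return s if a == "LISTEN" else rng.choice(["TL", "TR"])

def obs_lik(o, sp, a):
    if a != "LISTEN":
        return 0.5
    correct = (o == "HL" and sp == "TL") or (o == "HR" and sp == "TR")
    return 0.85 if correct else 0.15

def pf_update(particles, a, o, rng):
    """One particle-filter step: propagate, weight, resample."""
    moved = [sample_T(s, a, rng) for s in particles]
    weights = [obs_lik(o, sp, a) for sp in moved]
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
particles = [rng.choice(["TL", "TR"]) for _ in range(1000)]
for _ in range(3):  # hear the tiger on the left three times
    particles = pf_update(particles, "LISTEN", "HL", rng)
frac_left = particles.count("TL") / len(particles)
print(frac_left)  # close to the exact posterior 0.85**3 / (0.85**3 + 0.15**3)
```

With enough particles the empirical fraction tracks the exact Bayes filter. With too few, one hypothesis can vanish entirely and never return.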
### POMCP (online MCTS on history)
```python
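import math, random
from collections import defaultdict

# Simplified sketch: no rollout policy and no per-node particle sets.
class POMCP:
    def __init__(self, gen, actions, c=1.0, gamma=0.95):
        self.gen = gen               # generator: (s, a) -> (s', o, r)
        self.actions = list(actions)
        self.c, self.gamma = c, gamma
        self.N = defaultdict(int)    # visit counts per (history, action)
        self.V = defaultdict(float)  # running mean return per (history, action)

    def search(self, belief, depth=20, sims=500):
        # belief is a list of state particles; the root history is ()
        for _ in range(sims):
            s = random.choice(belief)
            self._sim(s, (), depth)
        return max(self.actions, key=lambda a: self.V[((), a)])

    def _ucb(self, h):
        # UCB1: try every action once, then trade off value and exploration
        untried = [a for a in self.actions if self.N[(h, a)] == 0]
        if untried:
            return random.choice(untried)
        n_h = sum(self.N[(h, a)] for a in self.actions)
        return max(self.actions, key=lambda a: self.V[(h, a)]
                   + self.c * math.sqrt(math.log(n_h) / self.N[(h, a)]))

    def _sim(self, s, h, d):
        if d == 0:
            return 0.0
        a = self._ucb(h)
        sp, o, r = self.gen(s, a)
        R = r + self.gamma * self._sim(sp, h + (a, o), d - 1)
        self.N[(h, a)] += 1
        self.V[(h, a)] += (R - self.V[(h, a)]) / self.N[(h, a)]
        return R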
```
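
The class above only needs a black-box generator `(s, a) -> (s', o, r)`. For the Tiger problem such a generator might look like the following sketch. The name `tiger_gen` is an assumption, as is the convention of emitting the observation from the post-transition state; both are chosen to match the Tiger pattern on this page.

```python
import random

def tiger_gen(s, a, rng=random):
    """Sample (next_state, observation, reward) for the Tiger problem."""
    # reward depends on the current hidden state and the action
    if a == "LISTEN":
        r = -1
    elif (a == "OL" and s == "TL") or (a == "OR" and s == "TR"):
        r = -100   # opened the tiger's door
    else:
        r = 10     # opened the safe door
    # transition: opening a door resets the tiger, listening changes nothing
    sp = s if a == "LISTEN" else rng.choice(["TL", "TR"])
    # observation: 85% accurate after listening, uninformative otherwise
    truth = "HL" if sp == "TL" else "HR"
    other = "HR" if truth == "HL" else "HL"
    if a == "LISTEN":
        o = truth if rng.random() < 0.85 else other
    else:
        o = rng.choice(["HL", "HR"])
    return sp, o, r

print(tiger_gen("TL", "LISTEN", random.Random(0)))
```

Because the planner only ever samples from the generator, the explicit `T` and `O` tables are not needed at planning time.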
### DRQN (Deep RL with recurrent belief)
```python
import torch, torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, obs_dim, n_act, hidden=128):
        super().__init__()
        self.enc = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.q = nn.Linear(hidden, n_act)

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); h0: (1, batch, hidden) or None
        x = self.enc(obs_seq).relu()
        h, hN = self.gru(x, h0)  # h summarizes the history at each step
        return self.q(h), hN     # Q-values per step, final hidden state
```
### pomdp_py (library)
```python
import pomdp_py
# Define PomdpProblem, then:
planner = pomdp_py.POMCP(max_depth=20, num_sims=1000,
                         discount_factor=0.95, exploration_const=50)
action = planner.plan(agent)
```
## Decision criteria
| Problem size | Solver |
|---|---|
| \|S\| < 20 | exact / SARSOP |
| \|S\| < 10⁴, offline | point-based (SARSOP) |
| large S, online | POMCP / DESPOT |
| raw observations (images) | DRQN / Dreamer |
| unknown dynamics | Bayes-Adaptive / model-based RL |

**Default**: SARSOP for tabular problems, Dreamer-V3 for pixel observations.
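
The table can be read as a simple decision procedure. The following is a rough sketch. The function name `choose_solver`, its parameters, and the order in which the rows are checked are illustrative assumptions rather than hard rules.

```python
def choose_solver(n_states=None, online=False, raw_obs=False, known_dynamics=True):
    """Suggest a solver family following the decision table (heuristic only)."""
    if not known_dynamics:
        return "Bayes-Adaptive / model-based RL"
    if raw_obs:
        return "DRQN / Dreamer"
    if n_states is not None and n_states < 20:
        return "exact / SARSOP"
    if n_states is not None and n_states < 10**4 and not online:
        return "point-based (SARSOP)"
    return "POMCP / DESPOT"  # fallback for large or online problems

print(choose_solver(n_states=2))                    # exact / SARSOP
print(choose_solver(n_states=5000))                 # point-based (SARSOP)
print(choose_solver(n_states=10**6, online=True))   # POMCP / DESPOT
```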
## 🔗 Graph
- Parents: [[MDP]] · [[Reinforcement-Learning]] · [[Decision Theory]]
- Applications: [[Robotics]] · [[Operation- Western Sun]]
- Adjacent: [[MCTS]]
## 🤖 LLM usage
**When**: framing partial-observability problems, designing belief states, recommending a solver.
**When not**: fully observable environments, where an MDP is sufficient.
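## ❌ Anti-patterns
- **Treating observations as state**: violates the Markov property; frame stacking is only a partial workaround.
- **Dropping the belief at test time**: the policy is trained on beliefs but fed raw observations at deployment.
- **Exact solver on large S**: the number of PWLC alpha vectors explodes; use a point-based solver instead.
- **No exploration in POMCP**: `c=0` makes action selection greedy, and the belief collapses.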
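
To see why `c=0` is a problem, compare UCB1 scores with and without the exploration bonus. This is a minimal sketch, assuming the standard UCB1 form `V + c * sqrt(ln N_h / N_a)`. The values and counts are made up for illustration.

```python
import math

def ucb_scores(values, counts, c):
    """UCB1 score per action: mean value plus exploration bonus."""
    total = sum(counts.values())
    return {a: values[a] + c * math.sqrt(math.log(total) / counts[a])
            for a in values}

values = {"LISTEN": -1.0, "OL": -1.5}   # hypothetical running means
counts = {"LISTEN": 95, "OL": 5}        # OL has barely been tried

greedy = ucb_scores(values, counts, c=0.0)
explore = ucb_scores(values, counts, c=2.0)
print(max(greedy, key=greedy.get))    # LISTEN: the rarely tried action is never revisited
print(max(explore, key=explore.get))  # OL: the bonus pulls the search back to it
```
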
## 🧪 Verification / duplicates
- Verified (Kaelbling 1998, Silver 2010 POMCP, Hafner 2023 Dreamer-V3).
- Trust level A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup — definition + solver family + Tiger/POMCP/DRQN |