---
id: wiki-2026-0508-cipomdps
title: CIPOMDPs
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [CI-POMDP, Communicative Interactive POMDP, Cooperative Inverse POMDP]
duplicate_of: none
source_trust_level: A
confidence_score: 0.85
verification_status: applied
tags: [rl, pomdp, multi-agent, alignment]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pomdp-py/pettingzoo
---

# CIPOMDPs

## One-line summary

> **"Cooperative Inverse Partially Observable MDP: a human and an agent share one reward function, and its parameter is hidden from the agent."**

A POMDP extension of CIRL (Hadfield-Menell et al., 2016). It is the backbone of assistance games and of formal treatments of alignment, and as of 2026 it serves as a theoretical frame for LLM agent assistance.

## Core concepts

### Definition

- **State** s ∈ S, **Actions** (a^H, a^R) for the human and the robot.
- **Reward parameter** θ ∈ Θ: known to the human, unknown to the robot.
- **Reward** r(s, a^H, a^R; θ): shared by both agents.
- **Observations** o^H, o^R: each agent observes the state only partially.
- **Goal**: maximize E[Σ r(s, a^H, a^R; θ)]. The robot must infer θ while acting.

### Properties

- **Active learning**: the robot takes information-gathering actions.
- **Off-switch problem**: the robot's uncertainty about θ induces corrigibility.
- **Reward hacking immunity** (in theory): because θ is unknown, the robot has no fixed proxy to over-optimize.

### Applications

1. Assistance games (cleaning, cooking robot).
2. RLHF formalization: preferences are treated as evidence about the reward.
3. Multi-agent communication (CIPOMDPs with messages).

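
The definition above can be made concrete with a small sketch. It is only an illustration under stated assumptions: `ToyCIPOMDP`, `discounted_return`, and `expected_return` are hypothetical names, the discount factor `gamma` is added here (the goal above is written without one), and the trajectory is fixed by hand, so the expectation is taken over θ only and not over transitions or observations.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

import numpy as np

Step = Tuple[int, int, int]  # (state, human action, robot action)


@dataclass(frozen=True)
class ToyCIPOMDP:
    """Hypothetical container for the pieces of the definition used here."""
    theta_grid: Sequence[float]                      # candidate values of theta
    reward: Callable[[int, int, int, float], float]  # shared r(s, a_H, a_R; theta)
    gamma: float = 0.95                              # discount factor (assumption)


def discounted_return(model: ToyCIPOMDP, trajectory: Sequence[Step], theta: float) -> float:
    """Sum of gamma^t * r(s_t, a_H_t, a_R_t; theta) along one fixed trajectory."""
    return sum(
        model.gamma ** t * model.reward(s, a_H, a_R, theta)
        for t, (s, a_H, a_R) in enumerate(trajectory)
    )


def expected_return(model: ToyCIPOMDP, trajectory: Sequence[Step],
                    b_theta: Sequence[float]) -> float:
    """Robot's view of the objective: average the return over its belief on theta."""
    b = np.asarray(b_theta, dtype=float)
    returns = np.array([discounted_return(model, trajectory, th)
                        for th in model.theta_grid])
    return float(b @ returns)


# Toy reward: the pair earns theta whenever the robot copies the human's action.
model = ToyCIPOMDP(
    theta_grid=[-1.0, 2.0],
    reward=lambda s, a_H, a_R, theta: theta if a_H == a_R else 0.0,
    gamma=0.5,
)
trajectory = [(0, 1, 1), (0, 0, 1), (0, 1, 1)]  # actions match at t=0 and t=2
value = expected_return(model, trajectory, b_theta=[0.25, 0.75])
print(value)
```

The human would evaluate the same trajectory with `discounted_return` at the true θ; the robot can only use `expected_return` under its belief, which is why reducing uncertainty about θ has value for it.
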
## 💻 Patterns

### CIPOMDP belief update

```python
import numpy as np

def update_theta_belief(b_theta, s, a_H, theta_grid, Q_star, actions_H, beta=1.0):
    """Bayesian update of the belief over theta after observing human action a_H.

    Q_star(s, a, theta) and actions_H are supplied by the caller.
    """
    idx = list(actions_H).index(a_H)
    likelihoods = np.empty(len(theta_grid))
    for i, theta in enumerate(theta_grid):
        # Boltzmann human: P(a_H | s, theta) ∝ exp(beta * Q*(s, a_H; theta))
        logits = beta * np.array([Q_star(s, a, theta) for a in actions_H])
        probs = np.exp(logits - logits.max())  # subtract max for numerical stability
        likelihoods[i] = probs[idx] / probs.sum()
    posterior = np.asarray(b_theta, dtype=float) * likelihoods
    return posterior / posterior.sum()
```

### Robot policy (expected utility over θ)

```python
def robot_action(s, b_theta, theta_grid, actions_R, V_pi, gamma=0.95):
    # Pick a^R maximizing expected return under the belief over theta.
    # V_pi(s, a_R, theta, gamma) is a caller-supplied value estimate for
    # taking a_R in s when the true parameter is theta.
    best_a, best_eu = None, -np.inf
    for a_R in actions_R:
        eu = sum(
            b * V_pi(s, a_R, theta, gamma)
            for b, theta in zip(b_theta, theta_grid)
        )
        if eu > best_eu:
            best_a, best_eu = a_R, eu
    return best_a
```

### Off-switch game

```python
def off_switch_decision(b_theta, theta_grid, action_value, switch_off_value=0.0):
    # Expected value of acting immediately, without consulting the human.
    act_value = sum(
        b * action_value(theta) for b, theta in zip(b_theta, theta_grid)
    )
    # Value of deferring, assuming a rational human who lets the action proceed
    # only when it beats switching off.
    defer_value = sum(
        b * max(action_value(theta), switch_off_value)
        for b, theta in zip(b_theta, theta_grid)
    )
    # Deferring never does worse than acting; it is strictly better whenever the
    # belief puts mass on some theta for which the action is harmful.
    if defer_value > act_value:
        return "wait_for_human"
    return "act"
```

### Active query (info gain)

```python
def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def best_query(b_theta, theta_grid, candidate_queries, possible_responses, P_response):
    # P_response(r, q, theta) is a caller-supplied response model P(r | q, theta).
    b_theta = np.asarray(b_theta, dtype=float)

    def expected_info_gain(q):
        H_post = 0.0
        for r in possible_responses:
            lik = np.array([P_response(r, q, theta) for theta in theta_grid])
            p_r = float(b_theta @ lik)  # marginal P(r | q) under the belief
            if p_r > 0:
                H_post += p_r * entropy(b_theta * lik / p_r)
        return entropy(b_theta) - H_post

    return max(candidate_queries, key=expected_info_gain)
```

### Boltzmann human model

```python
def boltzmann_human(s, theta, Q_star, actions_H, beta=1.0):
    # Softmax over human actions; subtracting the max logit avoids overflow.
    logits = beta * np.array([Q_star(s, a, theta) for a in actions_H])
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```

## Decision criteria

| Situation | Approach |
|---|---|
| Small θ space | Exact belief update + value iteration |
| Large θ space | Particle filter + POMCP/POMCPOW |
| Continuous θ | Variational + amortized inference |
| Real human | Boltzmann + irrationality terms (myopia, bias) |
| Communication | CIPOMDPs with message channel |

**Default**: particle-filter belief + MCTS robot policy.

## 🔗 Graph

- Parent: [[POMDP]]
- Applications: [[RLHF]]
- Adjacent: [[AI Safety and Alignment]] · [[Theory of Mind]]

## 🤖 LLM usage

**When to use**: theoretical justification for assistant-agent design; specifying how an agent should handle ambiguity.
**When not to use**: deployment-ready code for small tactical decisions.

## ❌ Anti-patterns

- **Maximize-best-guess θ**: take the expected utility over Θ instead of optimizing for the maximum-likelihood θ.
- **Rational human assumption**: real humans are noisy and biased; use Boltzmann plus bias models.
- **Static θ**: preferences drift; model θ as non-stationary.
- **Ignoring corrigibility**: premature certainty about θ is dangerous.

## 🧪 Verification / duplicates

- Verified (Hadfield-Menell et al. NeurIPS 2016, Russell *Human Compatible* 2019).
- Trust level A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: formal CIRL/CIPOMDP with code |