"매 Cooperative Inverse Partially Observable MDP — 매 human + agent 의 shared reward, 매 reward function 의 hidden parameter." 매 Hadfield-Menell et al. (CIRL 2016) 의 POMDP extension — 매 assistance games, alignment formalization 의 backbone — 매 2026 에 LLM agent assistance 의 theoretical frame.
Core ideas
Definition
State s ∈ S; joint actions (a^H, a^R) for the human and the robot.
Reward parameter θ ∈ Θ: known to the human, unknown to the robot.
Reward r(s, a^H, a^R; θ): shared by both agents.
Observations o^H, o^R: each agent sees the state only partially.
Goal: maximize E[Σ r(s, a^H, a^R; θ)]. The robot must infer θ while it acts.
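The inference half of the goal can be sketched as a Bayesian update over a discrete grid of candidate θ values. This is only an illustration under an assumption the notes do not state: the human is Boltzmann-rational, choosing a^H with probability proportional to exp(β · Q(s, a^H; θ)). The names `update_belief`, `q_human`, `human_actions`, and `beta` are hypothetical.

```python
import numpy as np

def update_belief(b_theta, theta_grid, s, a_H, q_human, human_actions, beta=1.0):
    """Posterior over theta after observing the human take a_H in state s.

    q_human(s, a, theta) is an assumed human action-value function.
    """
    b_theta = np.asarray(b_theta, dtype=float)
    likelihood = np.empty(len(theta_grid))
    for i, theta in enumerate(theta_grid):
        q = np.array([q_human(s, a, theta) for a in human_actions])
        # Softmax over human actions; subtract the max for numerical stability
        p = np.exp(beta * (q - q.max()))
        p /= p.sum()
        likelihood[i] = p[human_actions.index(a_H)]
    posterior = b_theta * likelihood
    return posterior / posterior.sum()

# Toy example: theta is the index of the action the human prefers
theta_grid = [0, 1]
human_actions = [0, 1]
q_human = lambda s, a, theta: 1.0 if a == theta else 0.0
posterior = update_belief([0.5, 0.5], theta_grid, s=None, a_H=1,
                          q_human=q_human, human_actions=human_actions, beta=2.0)
```

In the toy example, observing the human pick action 1 moves the belief from 50/50 to roughly 12/88 in favor of θ = 1. A larger `beta` models a more reliable human and makes each observation more informative.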
Properties
Active learning: the robot takes information-gathering actions.
Off-switch problem: the robot's uncertainty about θ induces corrigibility.
Reward hacking immunity (in theory): because θ is unknown, the robot does not over-optimize a proxy.
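The active-learning property can be illustrated with a value-of-information comparison: act now under the current belief, or pay a cost to query the human first. The sketch assumes a single query that reveals θ exactly, which is a simplification not made in the notes. `value_of_asking`, `utility`, and `query_cost` are hypothetical names.

```python
import numpy as np

def value_of_asking(b_theta, utility, query_cost=0.0):
    """Compare acting immediately with asking the human first.

    utility[i][j] is the return of robot action j when theta is the
    i-th grid value. The query is assumed to reveal theta exactly.
    """
    b = np.asarray(b_theta, dtype=float)
    U = np.asarray(utility, dtype=float)
    # Best single action under the current belief
    act_now = float((b @ U).max())
    # Best action for each theta, averaged over the belief, minus the cost
    ask_first = float(b @ U.max(axis=1)) - query_cost
    return act_now, ask_first

b_theta = [0.5, 0.5]
utility = [[1.0, -1.0],
           [-1.0, 1.0]]
act_now, ask_first = value_of_asking(b_theta, utility, query_cost=0.2)
```

Here acting immediately is worth 0 in expectation, while asking is worth 0.8 after the query cost, so the information-gathering action is preferred. Once the belief concentrates on one θ, the gap shrinks to zero and the query cost is no longer justified.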
Applications
Assistance games (cleaning and cooking robots).
RLHF formalization: preferences serve as evidence about the reward.
Multi-agent communication (CIPOMDPs with messages).
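The RLHF reading, where a preference is evidence about the reward, can be sketched as a belief update with a Bradley-Terry (logistic) likelihood, the preference model commonly used in RLHF. The linear `reward_fn` and the names `preference_update`, `traj_a`, and `traj_b` are hypothetical and chosen only for illustration.

```python
import numpy as np

def preference_update(b_theta, theta_grid, traj_a, traj_b, reward_fn):
    """Posterior over theta after the human prefers traj_a to traj_b."""
    b = np.asarray(b_theta, dtype=float)
    diff = np.array([reward_fn(traj_a, th) - reward_fn(traj_b, th)
                     for th in theta_grid])
    # Bradley-Terry: P(a preferred to b | theta) = sigmoid(R(a) - R(b))
    likelihood = 1.0 / (1.0 + np.exp(-diff))
    post = b * likelihood
    return post / post.sum()

# Toy reward: theta scales the summed features of a trajectory
theta_grid = [-1.0, 0.0, 1.0]
reward_fn = lambda traj, theta: theta * sum(traj)
posterior = preference_update([1/3, 1/3, 1/3], theta_grid,
                              traj_a=[1.0, 1.0], traj_b=[0.0, 0.0],
                              reward_fn=reward_fn)
```

A single preference for the higher-feature trajectory shifts mass toward θ = 1 without ruling out the other values, which is the sense in which a preference counts as evidence and not as a label.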
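```python
import numpy as np

def robot_action(s, b_theta, theta_grid, actions_R, V_pi, gamma=0.95):
    # Pick a^R maximizing expected return under the belief b_theta.
    # actions_R (the robot's action set) and V_pi(s, a_R, theta, gamma)
    # (the value of a_R when the reward parameter is theta) are passed in
    # by the caller.
    best_a, best_eu = None, -np.inf
    for a_R in actions_R:
        eu = sum(b * V_pi(s, a_R, theta, gamma)
                 for b, theta in zip(b_theta, theta_grid))
        if eu > best_eu:
            best_a, best_eu = a_R, eu
    return best_a
```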
Off-switch game
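```python
def off_switch_decision(b_theta, theta_grid, action_value, uncertainty_bonus,
                        switch_off_value=0):
    # Robot defers to the human if it is uncertain about the reward.
    # uncertainty_bonus(b_theta) is a caller-supplied function that scores
    # how uncertain the belief is (for example, its entropy).
    expected_action_value = sum(b * action_value(theta)
                                for b, theta in zip(b_theta, theta_grid))
    # If the human can correct the robot, deferring dominates when uncertain.
    if expected_action_value < switch_off_value + uncertainty_bonus(b_theta):
        return "wait_for_human"
    return "act"
```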
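Why deferring dominates under uncertainty can be checked numerically. The sketch below compares the robot's three options in a one-shot off-switch game. It assumes a fully rational human who knows θ and lets the action proceed only when its true value is positive, and it assumes deferring is free. `off_switch_values` and `action_values` are hypothetical names.

```python
import numpy as np

def off_switch_values(b_theta, action_values):
    """Expected value of each robot option in a one-shot off-switch game.

    action_values[i] is the value of the action under the i-th theta.
    """
    b = np.asarray(b_theta, dtype=float)
    u = np.asarray(action_values, dtype=float)
    return {
        "act": float(b @ u),                     # act without asking
        "off": 0.0,                              # switch itself off
        "defer": float(b @ np.maximum(u, 0.0)),  # human vetoes when u < 0
    }

values = off_switch_values([0.5, 0.5], [2.0, -1.0])
best = max(values, key=values.get)
```

With this belief, acting is worth 0.5 and deferring is worth 1.0, so the robot prefers to leave the off-switch in the human's hands. Since E[max(u, 0)] is never below max(E[u], 0), deferring is at least as good as either alternative under these assumptions, and the advantage vanishes only when the belief leaves no doubt about the sign of u.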