"Jointly train a policy (actor) and a value estimator (critic)." Actor-critic is a hybrid RL family that reduces the high variance of policy gradients by using a critic (V or Q) as a baseline. It is the backbone of the modern landscape: PPO (Atari, locomotion, RLHF), SAC (continuous control), IMPALA/Ape-X (distributed), and GRPO (LLM RL post-training, e.g. DeepSeek, 2024-2026).
Core ideas
Motivation
REINFORCE (pure policy gradient): ∇log π(a|s) · R. High variance and slow to learn.
Value-only (DQN): discrete actions only, and no stochastic policy.
Actor-critic: ∇log π(a|s) · A(s,a), where A = Q − V is the advantage. Reduces variance and supports continuous actions.
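To make the variance claim concrete, here is a minimal sketch on a two-action bandit, small enough to compute the mean and variance of the single-sample gradient estimate exactly. The function name `grad_stats`, the sigmoid parameterization, and the reward values are illustrative assumptions, not taken from any library.

```python
def grad_stats(p, rewards, baseline):
    """Exact mean and variance of the single-sample estimate
    g = d/dtheta log pi(a) * (R_a - baseline)
    for a two-action bandit with pi(a=0) = p = sigmoid(theta)."""
    probs = [p, 1.0 - p]
    score = [1.0 - p, -p]  # d/dtheta log pi(a) for a = 0 and a = 1
    g = [score[a] * (rewards[a] - baseline) for a in range(2)]
    mean = sum(probs[a] * g[a] for a in range(2))
    var = sum(probs[a] * (g[a] - mean) ** 2 for a in range(2))
    return mean, var


p = 0.5
rewards = [10.0, 11.0]  # hypothetical rewards with a large shared offset
value = sum(pi * r for pi, r in zip([p, 1.0 - p], rewards))  # V = E[R]

mean_raw, var_raw = grad_stats(p, rewards, baseline=0.0)    # REINFORCE: weight by R
mean_adv, var_adv = grad_stats(p, rewards, baseline=value)  # weight by A = R - V
```

Both estimators have the same expected gradient, but subtracting V removes the shared reward offset, so the variance drops (to zero in this toy case). A learned critic plays the role of `value` when V is not known in closed form.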
Advantage estimation
Monte Carlo: A = G_t − V(s). Unbiased, high variance.
TD(0): A = r + γV(s') − V(s). Biased, low variance.
GAE (Generalized Advantage Estimation): a λ-weighted blend of the two (λ = 0 recovers TD(0), λ = 1 recovers Monte Carlo). The modern default.
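The three estimators above can be sketched with one backward recursion over a rollout, using the TD residual δ_t = r_t + γV(s_{t+1}) − V(s_t). This is a dependency-free sketch; the function name `compute_gae` and its argument layout are assumptions for illustration.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """GAE over one rollout of length T.

    rewards, values, dones: lists of length T, with values[t] = V(s_t).
    last_value: V(s_T), used to bootstrap past the end of the rollout.
    Returns (advantages, returns), where returns[t] = advantages[t] + values[t].
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        not_done = 1.0 - float(dones[t])  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, True]

# lam = 1 (with gamma = 1) reduces to Monte Carlo: A_t = G_t - V(s_t)
adv_mc, _ = compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=1.0)
# lam = 0 reduces to TD(0): A_t = delta_t
adv_td, _ = compute_gae(rewards, values, 0.0, dones, gamma=1.0, lam=0.0)
```

Intermediate values of `lam` trade bias against variance; 0.95 is a commonly used setting, not a requirement.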
Algorithm zoo
A2C / A3C (2016): synchronous / asynchronous parallel actors.
PPO (2017): clipped probability ratio. The industry default: robust and simple.
SAC (2018): entropy-regularized, off-policy, for continuous actions.
TD3 (2018): twin Q + delayed policy updates. A fix for DDPG.
IMPALA (2018): V-trace correction for distributed off-policy learning.
GRPO (DeepSeek 2024): group-relative advantage. Used for LLM RL post-training; a critic-free variant.
DPO / IPO / KTO (2023-2024): preference-based; the critic is implicit.
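The "clipped ratio" in the PPO entry above can be sketched as follows. This is a plain-Python illustration of the clipped surrogate only (no value loss or entropy bonus); the name `ppo_clip_loss` and the list-based interface are assumptions, and a real implementation would operate on tensors.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over samples.

    logp_new / logp_old: log pi(a|s) under the current and the
    data-collecting policy; advantages: e.g. GAE estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Ratio 1.5 with a positive advantage is capped at 1 + clip_eps
loss_up = ppo_clip_loss([math.log(1.5)], [0.0], [1.0])
# Ratio 0.5 with a negative advantage is floored at 1 - clip_eps
loss_down = ppo_clip_loss([math.log(0.5)], [0.0], [-1.0])
```

The clip removes the incentive to move the ratio further than `clip_eps` from 1 in the direction the advantage favors, which is what keeps repeated minibatch updates on the same rollout stable.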
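```python
import torch

# SAC losses: twin Q critics + auto-tuned entropy temperature alpha.
# Assumes the critics, target critics (q1_t, q2_t), log_alpha and the batch are defined.
alpha = log_alpha.exp().detach()
q_target = r + gamma * (1 - d) * (torch.min(q1_t(s2, a2), q2_t(s2, a2)) - alpha * logp_a2)
q1_loss = ((q1(s, a) - q_target.detach()) ** 2).mean()
q2_loss = ((q2(s, a) - q_target.detach()) ** 2).mean()  # train both critics
pi_loss = (alpha * logp - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()
```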
GRPO (LLM RL post-training, 2024-2026)
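```python
# Sample a group of K completions per prompt. No critic: the group mean is the baseline.
def grpo_advantage(rewards):
    # rewards: torch.Tensor of shape (B, K)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    return (rewards - mean) / std  # normalized advantage

# Simplified loss: -E[ A * log π(y|x) ] + β * KL(π || π_ref)
# (the published objective uses a PPO-style clipped ratio in place of the plain log-prob term)
```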
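A worked example for a single group may help when checking the normalization by hand. This is a dependency-free sketch; `grpo_advantage_group` is a hypothetical helper, and it uses the sample standard deviation (n − 1 denominator), which is also the default of `torch.std`.

```python
import statistics


def grpo_advantage_group(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's K sampled completions.

    Requires K >= 2: statistics.stdev raises StatisticsError for one sample.
    """
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std (n - 1 denominator)
    return [(r - mean) / (std + eps) for r in rewards]


# Binary correctness rewards for K = 4 completions of one prompt
adv_mixed = grpo_advantage_group([1.0, 0.0, 0.0, 1.0])  # roughly +0.866 / -0.866
# Every completion got the same reward: all advantages are zero
adv_flat = grpo_advantage_group([1.0, 1.0, 1.0])
```

The second case shows a practical consequence of the group baseline: a prompt whose samples all succeed or all fail contributes no policy-gradient signal.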
DPO (preference-only, no reward model, no critic)
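```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # w = winner (preferred), l = loser; inputs are sequence log-probs
    # under the policy (logp_*) and the frozen reference model (ref_logp_*)
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```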
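For a quick sanity check of the formula, the same loss can be evaluated for a single preference pair in plain Python. The helper name `dpo_loss_single` and the log-probability values are illustrative assumptions.

```python
import math


def dpo_loss_single(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) pair, without tensors."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    x = beta * margin
    # -log(sigmoid(x)) = log(1 + exp(-x)), in a numerically stable form
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: margin is 0, so the loss is log 2
loss_init = dpo_loss_single(-10.0, -12.0, -10.0, -12.0)
# Policy raised the winner and lowered the loser relative to the reference
loss_better = dpo_loss_single(-9.0, -13.0, -10.0, -12.0)
# Policy moved the wrong way
loss_worse = dpo_loss_single(-11.0, -11.0, -10.0, -12.0)
```

At the start of training, when the policy is a copy of the reference, every pair therefore contributes log 2 ≈ 0.693, which is a useful value to look for in the first logged step.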
Decision criteria

| Situation | Algorithm |
| --- | --- |
| Discrete action, on-policy | PPO |
| Continuous control, sample-efficient | SAC |
| Massive parallel sim | IMPALA / Ape-X |
| LLM RLHF (with reward model) | PPO, shifting toward GRPO |
| LLM preference data only | DPO / IPO / KTO |
| Sparse reward, exploration-hard | PPO + RND/ICM |
| Offline data only | CQL / IQL (offline RL) |
Defaults: SAC for robotics, PPO for games and simulation, GRPO or DPO for LLM post-training.