---
id: wiki-2026-0508-actor-critic-models
title: Actor-Critic Models
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Actor-Critic, A2C, A3C, PPO family, Policy + Value Methods]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [reinforcement-learning, actor-critic, ppo, a3c, sac, rlhf]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch/CleanRL/TorchRL
---

# Actor-Critic Models

## One-line summary

> **"Jointly train a policy (actor) and a value estimator (critic)."** Actor-critic methods are a hybrid RL family: they reduce the high variance of policy gradients by using a critic (V or Q) as a baseline. They form the backbone of the modern landscape: PPO (Atari, locomotion, RLHF), SAC (continuous control), IMPALA/Ape-X (distributed), GRPO (LLM RL post-training, popularized by DeepSeek, 2024-2026).

## Core concepts

### Motivation

- **REINFORCE (pure policy gradient)**: ∇log π(a|s) · R; high variance, slow to learn.
- **Value-only (DQN)**: discrete actions only, no stochastic policy.
- **Actor-critic**: ∇log π(a|s) · A(s,a), where A = Q − V (the advantage); lower variance, and it handles continuous actions.

### Advantage estimation

- **Monte Carlo**: A = G_t − V(s); unbiased, high variance.
- **TD(0)**: A = r + γV(s') − V(s); biased, low variance.
- **GAE (Generalized Advantage Estimation)**: a λ-weighted blend of the two; the modern default.

### Algorithm zoo

- **A2C / A3C** (2016): synchronous / asynchronous parallel actors.
- **PPO** (2017): clipped probability ratio; the industry default, robust and simple.
- **SAC** (2018): entropy-regularized, off-policy, continuous control.
- **TD3**: twin Q-networks + delayed policy updates; a fix for DDPG.
- **IMPALA**: distributed off-policy training with V-trace correction.
- **GRPO** (DeepSeek 2024): group-relative advantage; a critic-free variant for LLM RL post-training.
- **DPO / IPO / KTO** (2023-2024): preference-based; no explicit critic (the reward is implicit in the policy).

### Applications

1. Games (Atari, StarCraft II, Dota 2 with OpenAI Five).
2. Robotics (locomotion, manipulation; SAC is the default).
3. LLM RLHF post-training (shift from PPO to GRPO / DPO over 2024-2026).
4. Recommendation (counterfactual policy learning).
5. Trading / market-making (risk-adjusted reward).
6. Autonomous driving sim-to-real.

## 💻 Patterns

### PPO core (CleanRL-style)

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        # Shared trunk feeding both the policy head and the value head
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std
        self.v = nn.Linear(64, 1)

    def forward(self, x):
        h = self.shared(x)
        # For multi-dimensional actions, sum log_prob over the last dim
        return Normal(self.mu(h), self.log_std.exp()), self.v(h).squeeze(-1)


def ppo_loss(logp_new, logp_old, adv, value, ret, ent, clip=0.2, vc=0.5, ec=0.01):
    ratio = (logp_new - logp_old).exp()
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip, 1 + clip) * adv
    pi_loss = -torch.min(surr1, surr2).mean()  # clipped surrogate objective
    v_loss = ((value - ret) ** 2).mean()
    return pi_loss + vc * v_loss - ec * ent.mean()  # entropy bonus encourages exploration
```

### GAE

```python
import torch


def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: float tensors of length T
    # values: length T + 1 (the last entry is the bootstrap value of the final state)
    adv = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterm = 1.0 - dones[t]  # cut the bootstrap at episode boundaries
        delta = rewards[t] + gamma * values[t + 1] * nonterm - values[t]
        last = delta + gamma * lam * nonterm * last
        adv[t] = last
    return adv
```

### SAC update (continuous control)

```python
# Fragment: q1, q2, their targets q1_t, q2_t, log_alpha, and the batch (s, a, r, s2, d) are assumed to exist.
# Twin Q + auto-tuned entropy temperature alpha (alpha = log_alpha.exp().detach())
with torch.no_grad():  # a2, logp_a2: sampled from the current policy at s2
    q_target = r + gamma * (1 - d) * (torch.min(q1_t(s2, a2), q2_t(s2, a2)) - alpha * logp_a2)
q1_loss = ((q1(s, a) - q_target) ** 2).mean()
q2_loss = ((q2(s, a) - q_target) ** 2).mean()
# a_pi, logp: reparameterized sample from the policy at s
pi_loss = (alpha * logp - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()
```

### GRPO (LLM RL post-training, 2024-2026)

```python
# K samples per prompt and no critic: the group mean serves as the baseline
def grpo_advantage(rewards):  # rewards: (B, K)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    return (rewards - mean) / std  # normalized advantage

# Simplified loss: -E[ A * log π(y|x) ] + β * KL(π || π_ref)
# (the published objective also applies a PPO-style clipped ratio)
```

### DPO (preference-only, no reward model, no critic)

```python
import torch.nn.functional as F


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # w = winner (preferred), l = loser
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```

## Decision criteria

| Situation | Algorithm |
|---|---|
| Discrete action, on-policy | PPO |
| Continuous control, sample-efficient | SAC |
| Massive parallel sim | IMPALA / Ape-X |
| LLM RLHF (with reward model) | PPO, shifting to GRPO |
| LLM preference data only | DPO / IPO / KTO |
| Sparse reward, exploration-hard | PPO + RND/ICM |
| Offline data only | CQL / IQL (offline RL) |

**Defaults**: robotics: SAC. Games/sim: PPO. LLM post-training: GRPO or DPO.

## 🔗 Graph

- Parents: [[Reinforcement Learning]] · [[Policy Gradient Methods]]
- Variants: [[PPO]] · [[A3C]] · [[GRPO]]
- Applications: [[RLHF]]
- Adjacent: [[GAE]] · [[DPO]]

## 🤖 LLM usage

**When**: LLM RLHF / RLAIF post-training (PPO/GRPO); reviewing RL agent code.
**When not**: supervised data is abundant and the task is simple; try SFT first.

## ❌ Anti-patterns

- **No advantage normalization**: PPO becomes unstable; normalize per batch.
- **Shared trunk too large**: the actor and critic interfere with each other; prefer separate actor and critic networks for large models.
- **Skipping reward scaling**: the value loss explodes; normalize with a running mean/std.
- **Reusing a PPO batch for more than 10 epochs**: the data drifts off-policy and the ratio explodes; use 4-10 epochs only.
- **Leaving the critic frozen**: the bootstrapped value targets go stale; update it jointly with the actor.
- **GRPO with K=2**: the group baseline is too noisy; use K≥4 (typically 8-16).

## 🧪 Verification / duplicates

- Verified (Sutton & Barto 2nd ed., Ch. 13; Schulman et al., PPO, 2017; Haarnoja et al., SAC, 2018; DeepSeek-Math, GRPO, 2024; Rafailov et al., DPO, 2023; CleanRL implementations).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: PPO/SAC/GRPO/DPO 2026 landscape + working code |
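
## 💻 Supplement: normalization sketch

Two of the anti-patterns listed earlier (no advantage normalization, skipped reward scaling) come down to a few lines of arithmetic. The block below is a minimal sketch of both, written in plain Python so it runs without PyTorch; the same operations apply elementwise to tensors. The names `normalize_advantages` and `RunningMeanStd` are hypothetical helpers introduced here for illustration, not part of CleanRL or TorchRL, and the choice to divide rewards by the running std without subtracting the mean is an assumption, not something the note prescribes.

```python
import math


def normalize_advantages(adv, eps=1e-8):
    """Per-batch normalization: subtract the batch mean, divide by the batch std."""
    n = len(adv)
    mean = sum(adv) / n
    var = sum((a - mean) ** 2 for a in adv) / n  # population variance of the batch
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in adv]


class RunningMeanStd:
    """Hypothetical helper: streaming mean/variance via Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # too little data: leave values unscaled
        return math.sqrt(self.m2 / self.count)

    def scale(self, x, eps=1e-8):
        # Assumption: divide by the running std only, so the reward keeps its sign
        return x / (self.std + eps)
```

A typical use would be to call `update` on every reward as it is collected and feed `scale(reward)` to the learner, then apply `normalize_advantages` once per minibatch just before computing the policy loss.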