Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

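Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: about 2,400 broken-link fixes, direct redirects, and concept mappings
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections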

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00


Actor-Critic Models

In one line

"Jointly train a policy (actor) and a value estimator (critic)." Actor-critic is a hybrid RL family that reduces the high variance of the policy gradient by using the critic's estimate (V or Q) as a baseline. It is the backbone of the modern landscape: PPO (Atari, locomotion, RLHF), SAC (continuous control), IMPALA/Ape-X (distributed), and GRPO (LLM RL post-training, DeepSeek 2024-2026).

Core concepts

Motivation

REINFORCE (pure policy gradient): ∇log π(a|s) · R. High variance and slow to learn.
Value-only (DQN): discrete actions only, and no stochastic policy.
Actor-critic: ∇log π(a|s) · A(s,a), where A = Q − V is the advantage. Lower variance, and it handles continuous actions.
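
The variance argument can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-armed bandit with Gaussian rewards (the arm means, noise scale, sample count, and the function name `grad_samples` are illustrative choices, not from the source) and compares per-sample REINFORCE gradient estimates with and without a mean-reward baseline, which stands in for a learned critic.

```python
import torch

def grad_samples(use_baseline, n=20000, seed=0):
    """Per-sample policy-gradient estimates for a 2-armed bandit (toy assumption)."""
    g = torch.Generator().manual_seed(seed)
    logits = torch.zeros(2)                       # uniform softmax policy
    probs = torch.softmax(logits, dim=0)
    arm_means = torch.tensor([10.0, 11.0])        # hypothetical reward means
    a = torch.multinomial(probs, n, replacement=True, generator=g)
    r = arm_means[a] + torch.randn(n, generator=g)
    # Gradient of log softmax(logits)[a] w.r.t. logits is onehot(a) - probs.
    score = torch.nn.functional.one_hot(a, 2).float() - probs
    # The batch mean is a crude baseline; a critic would supply V(s) instead.
    baseline = r.mean() if use_baseline else 0.0
    return score * (r - baseline).unsqueeze(1)    # shape (n, 2)

plain = grad_samples(use_baseline=False)
centered = grad_samples(use_baseline=True)
print("mean grad:", plain.mean(0), centered.mean(0))  # similar expected direction
print("variance :", plain.var(0), centered.var(0))    # baseline version is far smaller
```

Both estimators point in roughly the same direction on average; only the spread of the individual samples changes, which is the effect the critic baseline is meant to provide.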

Advantage estimation

Monte Carlo: A = G_t − V(s). Unbiased, high variance.
TD(0): A = r + γV(s') − V(s). Biased, low variance.
GAE (Generalized Advantage Estimation): a λ-weighted blend of the two; the modern default.
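
Written out, the three estimators are points on one family. With the TD residual δ_t, GAE is an exponentially weighted sum of residuals; λ = 0 recovers TD(0) and λ = 1 recovers the Monte Carlo estimate. The notation follows the bullets above, and the infinite sums ignore finite-horizon truncation.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l} \\
\lambda = 0 &:\quad \hat{A}_t = \delta_t \\
\lambda = 1 &:\quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^{l} r_{t+l} - V(s_t) = G_t - V(s_t)
\end{aligned}
```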

Algorithm zoo

A2C / A3C (2016): synchronous / asynchronous parallel actors.
PPO (2017): clipped probability ratio; the industry default because it is robust and simple.
SAC (2018): entropy-regularized, off-policy, continuous actions.
TD3: twin Q-networks plus delayed policy updates; a fix for DDPG.
IMPALA: distributed off-policy training with V-trace correction.
GRPO (DeepSeek 2024): group-relative advantage for LLM RL post-training; a critic-free variant.
DPO / IPO / KTO (2023-2024): preference-based; the critic is implicit.
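
To make the TD3 entry concrete, here is a minimal sketch of its critic target: the minimum over twin target Q-networks, evaluated at a noise-smoothed target action. The function name `td3_target`, the callables `actor_t`, `q1_t`, `q2_t`, and the default noise settings are assumptions for illustration; the delayed policy update itself (updating the actor less often than the critics) is not shown.

```python
import torch

def td3_target(r, d, s2, actor_t, q1_t, q2_t, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Clipped double-Q target with target policy smoothing (TD3-style sketch)."""
    with torch.no_grad():                          # targets carry no gradient
        a2 = actor_t(s2)
        noise = (torch.randn_like(a2) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (a2 + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(q1_t(s2, a2), q2_t(s2, a2))  # pessimistic estimate
        return r + gamma * (1.0 - d) * q_min

# Stand-in "networks" (plain functions) so the sketch runs without training code.
actor_t = lambda s: torch.tanh(s[:, :1])
q1_t = lambda s, a: s.sum(dim=1) + a.squeeze(1)
q2_t = lambda s, a: s.sum(dim=1) - a.squeeze(1)

s2 = torch.zeros(4, 3)
r = torch.ones(4)
d = torch.tensor([0.0, 0.0, 1.0, 1.0])             # last two transitions are terminal
y = td3_target(r, d, s2, actor_t, q1_t, q2_t)
```

For terminal transitions the target collapses to the reward alone; for the others, taking the minimum of the two Q estimates counteracts the overestimation that DDPG suffers from.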

Applications

Games (Atari, StarCraft II, Dota 2 with OpenAI Five).
Robotics (locomotion and manipulation; SAC is the default).
LLM RLHF post-training (shift from PPO to GRPO / DPO over 2024-2026).
Recommendation (counterfactual policy learning).
Trading / market-making (risk-adjusted reward).
Autonomous driving (sim-to-real).

💻 Patterns

PPO core (CleanRL-style)

import torch, torch.nn as nn
from torch.distributions import Normal

class ActorCritic(nn.Module):
    """Gaussian policy and state-value head on a shared MLP trunk."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
        self.mu = nn.Linear(64, act_dim)                    # action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent log std
        self.v = nn.Linear(64, 1)                           # state value V(s)

    def forward(self, x):
        h = self.shared(x)
        return Normal(self.mu(h), self.log_std.exp()), self.v(h).squeeze(-1)

def ppo_loss(logp_new, logp_old, adv, value, ret, ent, clip=0.2, vc=0.5, ec=0.01):
    # logp_* are per-sample log-probs (summed over action dims); logp_old comes from the rollout.
    ratio = (logp_new - logp_old).exp()
    surr1 = ratio * adv
    surr2 = ratio.clamp(1 - clip, 1 + clip) * adv
    pi_loss = -torch.min(surr1, surr2).mean()    # clipped surrogate objective
    v_loss = ((value - ret) ** 2).mean()         # critic regression toward returns
    return pi_loss + vc * v_loss - ec * ent.mean()  # entropy bonus encourages exploration
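
A hedged usage sketch for the PPO core above: the model and loss are repeated (condensed) so the block runs on its own, and random tensors stand in for a real rollout. The batch size, learning rate, epoch count, and gradient-clipping threshold are illustrative assumptions, not values from the source.

```python
import torch, torch.nn as nn
from torch.distributions import Normal

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.v = nn.Linear(64, 1)

    def forward(self, x):
        h = self.shared(x)
        return Normal(self.mu(h), self.log_std.exp()), self.v(h).squeeze(-1)

def ppo_loss(logp_new, logp_old, adv, value, ret, ent, clip=0.2, vc=0.5, ec=0.01):
    ratio = (logp_new - logp_old).exp()
    pi_loss = -torch.min(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv).mean()
    return pi_loss + vc * ((value - ret) ** 2).mean() - ec * ent.mean()

torch.manual_seed(0)
model = ActorCritic(obs_dim=4, act_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Placeholder rollout: random tensors stand in for environment data.
obs = torch.randn(32, 4)
with torch.no_grad():
    dist, _ = model(obs)
    act = dist.sample()
    logp_old = dist.log_prob(act).sum(-1)        # sum over action dimensions
adv, ret = torch.randn(32), torch.randn(32)
adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # per-batch advantage normalization

for _ in range(4):                               # a few epochs over the same batch
    dist, value = model(obs)
    logp_new = dist.log_prob(act).sum(-1)
    loss = ppo_loss(logp_new, logp_old, adv, value, ret, dist.entropy().sum(-1))
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    opt.step()
```

In a real training loop `adv` and `ret` would come from an advantage estimator such as GAE, and the batch would be split into minibatches per epoch.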

GAE

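def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: length T. values: length T + 1 (values[T] bootstraps the final state).
    adv = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterm = 1.0 - dones[t]   # 0 at episode end: cuts bootstrap and accumulation
        delta = rewards[t] + gamma * values[t+1] * nonterm - values[t]   # TD residual
        last = delta + gamma * lam * nonterm * last
        adv[t] = last
    return adv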
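
A small worked call makes the indexing convention visible. The function is repeated so the block is self-contained; note the assumption that `values` holds T + 1 entries, the last one bootstrapping the state after the final step. With zero value estimates and λ = 1, the advantages reduce to discounted returns, which is easy to verify by hand.

```python
import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values has one more entry than rewards: values[-1] bootstraps the final state.
    adv = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterm = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterm - values[t]
        last = delta + gamma * lam * nonterm * last
        adv[t] = last
    return adv

rewards = torch.tensor([1.0, 1.0, 1.0])
values = torch.zeros(4)                  # T + 1 value estimates
dones = torch.tensor([0.0, 0.0, 1.0])    # the episode ends at the last step
adv = gae(rewards, values, dones, gamma=0.5, lam=1.0)
returns = adv + values[:-1]              # regression targets for the critic
print(adv)  # 1 + 0.5 + 0.25 = 1.75, then 1.5, then 1.0
```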

SAC update (continuous control)

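# Twin Q-networks plus an auto-tuned entropy temperature alpha (alpha = log_alpha.exp().detach()).
# Assumes (a2, logp_a2) are sampled from pi(.|s2) and (a_pi, logp) from pi(.|s) via reparameterization.
with torch.no_grad():  # the bootstrap target must not carry gradients
    q_target = r + gamma * (1 - d) * (torch.min(q1_t(s2, a2), q2_t(s2, a2)) - alpha * logp_a2)
q1_loss = ((q1(s, a) - q_target) ** 2).mean()
q2_loss = ((q2(s, a) - q_target) ** 2).mean()
pi_loss = (alpha * logp - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
alpha_loss = -(log_alpha * (logp + target_entropy).detach()).mean()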
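
The update above uses `logp` and `logp_a2` without showing where they come from. One common choice in SAC implementations is a tanh-squashed Gaussian with a change-of-variables correction; the sketch below assumes that parameterization, and the helper name `sample_squashed` and the `eps` stabilizer are hypothetical.

```python
import torch
from torch.distributions import Normal

def sample_squashed(mu, log_std, eps=1e-6):
    """Reparameterized tanh-Gaussian sample and its log-probability (sketch)."""
    dist = Normal(mu, log_std.exp())
    u = dist.rsample()        # rsample keeps gradients flowing to mu and log_std
    a = torch.tanh(u)         # squash the action into (-1, 1)
    # Change of variables: log pi(a) = log N(u) - sum_i log(1 - tanh(u_i)^2)
    logp = dist.log_prob(u).sum(-1) - torch.log(1.0 - a.pow(2) + eps).sum(-1)
    return a, logp

torch.manual_seed(0)
mu = torch.zeros(5, 3, requires_grad=True)   # batch of 5 states, 3 action dimensions
log_std = torch.zeros(5, 3)
a, logp = sample_squashed(mu, log_std)
logp.sum().backward()                        # gradients reach the policy parameters
```

Because the sample is reparameterized, the policy loss can be differentiated through `a_pi` into the actor, which is what the `pi_loss` line relies on.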

GRPO (LLM RL post-training, 2024-2026)

# K sampled completions per prompt; no critic, the group mean serves as the baseline
def grpo_advantage(rewards):  # rewards: (B, K)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    return (rewards - mean) / std  # group-normalized advantage

# Simplified loss: -E[ A * log π(y|x) ] + β * KL(π || π_ref)
# (the DeepSeekMath objective replaces log π with a PPO-style clipped probability ratio)
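
The loss comment above can be expanded into a runnable sketch. This version assumes sequence-level log-probabilities of shape (B, K) rather than per-token terms, a PPO-style clipped ratio, and the non-negative KL estimate r − log r − 1 with r = π_ref/π. The function name `grpo_loss` and the defaults for `clip` and `beta` are illustrative assumptions.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, adv, clip=0.2, beta=0.04):
    """Sequence-level GRPO-style loss (sketch, not a reference implementation)."""
    ratio = (logp_new - logp_old).exp()
    surr = torch.min(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    log_r = logp_ref - logp_new               # log(pi_ref / pi)
    kl = log_r.exp() - log_r - 1.0            # always >= 0, zero when pi == pi_ref
    return -(surr - beta * kl).mean()

# Two prompts, K = 4 sampled completions each, binary rewards (made-up values).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
adv = (rewards - rewards.mean(1, keepdim=True)) / (rewards.std(1, keepdim=True) + 1e-8)

logp = torch.full((2, 4), -5.0, requires_grad=True)
loss = grpo_loss(logp, logp.detach(), logp.detach(), adv)
loss.backward()
```

On the first step the policy equals both the sampling policy and the reference, so the loss value is near zero, yet the gradient still pushes up the log-probability of completions with positive advantage.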

DPO (preference-only, no reward model, no critic)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # w = winner (preferred), l = loser
    return -torch.nn.functional.logsigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).mean()
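
Two quick checks illustrate how the DPO loss behaves. The function is repeated so the block is self-contained, and the log-probability values are made up for illustration. When the policy equals the reference the margin is zero and the loss is log 2; raising the preferred response's log-probability relative to the rejected one lowers the loss.

```python
import math
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Inputs are summed sequence log-probs under the policy and the frozen reference.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

ref_w, ref_l = torch.tensor([-12.0]), torch.tensor([-10.0])
at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)                # policy == reference
improved = dpo_loss(ref_w + 5.0, ref_l - 5.0, ref_w, ref_l)   # margin of +10
print(at_init.item(), math.log(2))   # both are about 0.6931
print(improved.item())               # smaller than the value at initialization
```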

Decision criteria

Situation	Algorithm
Discrete action, on-policy	PPO
Continuous control, sample-efficient	SAC
Massive parallel sim	IMPALA / Ape-X
LLM RLHF (with reward model)	PPO, shifting to GRPO
LLM preference data only	DPO / IPO / KTO
Sparse reward, exploration-hard	PPO + RND/ICM
Offline data only	CQL / IQL (offline RL)

Defaults: SAC for robotics, PPO for games and simulation, GRPO or DPO for LLM post-training.

🔗 Graph

Parent: Reinforcement Learning · Policy Gradient Methods
Variants: PPO · A3C · GRPO
Applications: RLHF
Adjacent: GAE · DPO

🤖 LLM usage

When to use: LLM RLHF / RLAIF post-training (PPO/GRPO), RL agent code review. When not to use: supervised data is abundant and the task is simple; try SFT first.

❌ Anti-patterns

No advantage normalization: PPO becomes unstable. Normalize per batch.
Shared trunk too large: the actor and critic interfere with each other. For large models, prefer a separate actor and critic.
Skipping reward scaling: the value loss explodes. Normalize with a running mean/std.
Reusing a PPO rollout for more than 10 epochs: the data drifts off-policy and the ratio explodes. Keep to 4-10 epochs.
Leaving the critic frozen: the bootstrapped value goes stale. Update actor and critic jointly.
GRPO with K=2: the baseline is too noisy. Use K≥4 (typically 8-16).
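
The first and third anti-patterns have short mechanical fixes. The sketch below shows per-batch advantage normalization and a running mean/std tracker for rewards; the names `normalize_advantages` and `RunningMeanStd`, the initial count, and the synthetic reward stream are assumptions for illustration, and the update uses the standard parallel mean/variance merge.

```python
import torch

def normalize_advantages(adv, eps=1e-8):
    """Per-batch advantage normalization to zero mean and unit std."""
    return (adv - adv.mean()) / (adv.std() + eps)

class RunningMeanStd:
    """Tracks mean and variance across batches with a parallel merge update."""
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        b_mean, b_var, b_n = x.mean().item(), x.var(unbiased=False).item(), x.numel()
        delta = b_mean - self.mean
        total = self.count + b_n
        # Combine the two sums of squared deviations, then renormalize.
        m2 = self.var * self.count + b_var * b_n + delta ** 2 * self.count * b_n / total
        self.mean += delta * b_n / total
        self.var, self.count = m2 / total, total

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / (self.var + eps) ** 0.5

torch.manual_seed(0)
rms = RunningMeanStd()
for _ in range(50):                                  # synthetic reward batches
    rms.update(100.0 + 10.0 * torch.randn(256))
scaled = rms.normalize(torch.tensor([100.0, 110.0]))  # roughly 0 and 1
adv = normalize_advantages(torch.randn(64) * 7 + 3)
```

Normalizing rewards with running statistics keeps the critic's regression targets in a stable range even when the raw reward scale is large.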

🧪 Verification / duplicates

Verified (Sutton & Barto 2nd ed Ch 13; Schulman et al. PPO 2017; Haarnoja SAC 2018; DeepSeekMath GRPO 2024; Rafailov DPO 2023; CleanRL implementations).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: PPO/SAC/GRPO/DPO 2026 landscape + working code
