| id | title | category | status | canonical_id | duplicate_of | aliases | source_trust_level | confidence_score | verification_status | tags | last_reinforced | github_commit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| wiki-2026-0508-policy-gradient-methods | Policy Gradient Methods | 10_Wiki/Topics | duplicate | policy-optimization | Policy-Optimization | | A | 0.9 | redirected | duplicate, reinforcement-learning, policy-gradient | 2026-05-10 | pending |
Policy Gradient Methods
This page is a duplicate of Policy-Optimization and redirects to that canonical page.
Key Summary (PG-specific aspects)
- Policy gradient: ∇θ J(θ) = E[∇θ log πθ(a|s) · A(s, a)]. This is the foundational identity.
- The REINFORCE → A2C → TRPO → PPO → GRPO → DPO lineage is covered in Policy-Optimization.
- Vanilla PG suffers from high variance; a baseline and GAE mitigate it.
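
The identity and the variance point above can be checked numerically. The following is a minimal sketch, not taken from the canonical page: it assumes a hypothetical two-armed bandit with a softmax policy and fixed rewards, and every name in it (`pg_samples`, `exact_grad`, the reward values) is illustrative. It compares single-sample estimates of ∇ log π · A with and without a baseline equal to the policy's expected reward.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(probs, action):
    """Gradient of log pi(action) w.r.t. each logit of a softmax policy."""
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def pg_samples(logits, rewards, baseline, n, rng):
    """Draw n single-sample estimates of dJ/d(logit_0)."""
    probs = softmax(logits)
    samples = []
    for _ in range(n):
        action = rng.choices(range(len(probs)), weights=probs)[0]
        advantage = rewards[action] - baseline
        samples.append(grad_log_pi(probs, action)[0] * advantage)
    return samples


def exact_grad(logits, rewards):
    """Closed form of dJ/d(logit_0) for J = sum_a pi(a) * r(a)."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return probs[0] * (rewards[0] - expected_reward)


# Hypothetical setup: two actions with fixed (noise-free) rewards.
rng = random.Random(0)
logits = [0.5, 0.0]
rewards = [10.0, 11.0]
probs = softmax(logits)
value = sum(p * r for p, r in zip(probs, rewards))  # baseline candidate

exact = exact_grad(logits, rewards)
plain = pg_samples(logits, rewards, 0.0, 20000, rng)
with_baseline = pg_samples(logits, rewards, value, 20000, rng)

print("exact gradient      :", exact)
print("no baseline    mean :", statistics.mean(plain),
      "variance:", statistics.variance(plain))
print("with baseline  mean :", statistics.mean(with_baseline),
      "variance:", statistics.variance(with_baseline))
```

Both estimators should average to roughly the same gradient, since subtracting an action-independent baseline leaves the expectation unchanged, while the baseline version should show a much smaller variance. GAE is not covered by this sketch.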
🔗 Graph
🕓 Change History
| Date | Change |
| --- | --- |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Marked as duplicate; redirected to the canonical document |