---
id: wiki-2026-0508-eligibility-traces
title: Eligibility Traces
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [eligibility trace, lambda return, TD-lambda, n-step bootstrapping, GAE]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [reinforcement-learning, eligibility-traces, td-learning, credit-assignment, gae, ppo]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / Stable-Baselines3 / CleanRL
---

# Eligibility Traces

## In one line

> **"Midway between TD(0) and Monte Carlo."** The parameter λ ∈ [0, 1] sets the bias-variance trade-off. A canonical algorithm from Sutton and Barto. Its modern form is GAE (Generalized Advantage Estimation), the standard advantage estimator in PPO. An efficient mechanism for credit assignment.

## Core ideas

### Motivation

- **TD(0)**: 1-step bootstrap (low variance, high bias).
- **Monte Carlo**: full return (high variance, no bias).
- **TD(λ)**: λ-weighted average of the two (the sweet spot).

### Forward view

```
G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) G_t^(n)
```

A geometric weighting of the n-step returns G_t^(n).

### Backward view (eligibility trace)

- Every state s carries a trace e(s).
- Visiting a state raises its trace.
- Every step, all traces decay by γλ.
- The TD error δ updates each state's value in proportion to its trace.

```
e_t(s) = γλ e_{t-1}(s) + 1[S_t = s]     (accumulating)
e_t(S_t) = 1                            (replacing: the visited state is set to 1, the others still decay by γλ)
V(s) ← V(s) + α δ_t e_t(s)              (for every s)
```

### Variants

- **TD(0)**: λ = 0.
- **TD(1)**: λ = 1, approximately Monte Carlo.
- **TD(λ)**: anything in between.
- **Watkins Q(λ)**: off-policy; traces are reset after an exploratory action.
- **GAE(γ, λ)**: the modern policy-gradient form.

### Modern form: GAE

```
δ_t = r_t + γ V(s_{t+1}) - V(s_t)
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
```

### Applications

1. **TD(λ) prediction**: value learning.
2. **Sarsa(λ)**: on-policy control.
3. **Q(λ)**: off-policy control.
4. **GAE in PPO/A2C**: modern actor-critic.
5. **Replay buffer**: trace replay.
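
The forward view and the GAE formula above describe the same quantity from two angles: when the λ-return is computed with the recursion `G_t = r_t + γ((1 - λ) V(s_{t+1}) + λ G_{t+1})`, subtracting `V(s_t)` gives the GAE advantage. The sketch below checks this numerically. It assumes a single rollout with no terminal state inside it; the names `lambda_returns`, `gae`, and `last_value` are illustrative and do not come from any library.

```python
import numpy as np

def lambda_returns(rewards, values, last_value, gamma, lam):
    """Forward-view lambda-return, computed backward in time.

    Assumption: values[t] = V(s_t) and last_value = V(s_T) for one rollout.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    G = np.zeros(len(rewards))
    g_next = last_value  # beyond the horizon, fall back on the bootstrap value
    for t in reversed(range(len(rewards))):
        G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * g_next)
        g_next = G[t]
    return G

def gae(rewards, values, last_value, gamma, lam):
    """Backward-accumulated sum of (gamma * lam)^l * delta_{t+l}."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values  # one-step TD errors
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3])
G = lambda_returns(rewards, values, last_value=0.0, gamma=0.9, lam=0.8)
A = gae(rewards, values, last_value=0.0, gamma=0.9, lam=0.8)
print(np.allclose(G - values, A))
```

Setting `lam=0.0` in `gae` reduces it to the one-step TD errors, and `lam=1.0` in `lambda_returns` reduces it to the discounted Monte Carlo return, which matches the TD(0) and TD(1) entries in the variant list.
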
## 💻 Patterns

### TD(λ) (Sutton-Barto, accumulating trace)

```python
import numpy as np

class TDLambda:
    """Tabular TD(lambda) prediction with accumulating traces."""

    def __init__(self, n_states, alpha=0.1, gamma=0.99, lam=0.9):
        self.V = np.zeros(n_states)  # state-value estimates
        self.E = np.zeros(n_states)  # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def reset_trace(self):
        self.E[:] = 0

    def step(self, s, r, s_next, done):
        # TD error; a terminal state contributes no bootstrap value
        delta = r + (0 if done else self.gamma * self.V[s_next]) - self.V[s]
        self.E[s] += 1  # accumulating trace
        self.V += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam  # decay for the next step
        if done:
            self.reset_trace()
```

### Replacing trace

```python
# Alternative to TDLambda.step; written as a method of the same class.
def replacing_trace_update(self, s, r, s_next, done):
    delta = r + (0 if done else self.gamma * self.V[s_next]) - self.V[s]
    self.E *= self.gamma * self.lam
    self.E[s] = 1  # replace, do not accumulate
    self.V += self.alpha * delta * self.E
    if done:
        self.E[:] = 0  # clear traces at the episode boundary
```

### Sarsa(λ)

```python
import numpy as np

class SarsaLambda:
    def __init__(self, n_s, n_a, alpha=0.1, gamma=0.99, lam=0.9, eps=0.1):
        self.Q = np.zeros((n_s, n_a))
        self.E = np.zeros((n_s, n_a))
        self.alpha, self.gamma, self.lam, self.eps = alpha, gamma, lam, eps

    def act(self, s):
        # epsilon-greedy action selection
        if np.random.rand() < self.eps:
            return np.random.randint(self.Q.shape[1])
        return int(self.Q[s].argmax())

    def update(self, s, a, r, s_next, a_next, done):
        # On-policy target: bootstrap from the action actually taken next
        delta = r + (0 if done else self.gamma * self.Q[s_next, a_next]) - self.Q[s, a]
        self.E[s, a] += 1
        self.Q += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam
        if done:
            self.E[:] = 0
```

### Watkins Q(λ)

```python
# Method for a SarsaLambda-style agent. a_next is the action actually taken
# in s_next (ignored when done); it decides whether the trace is cut.
def q_lambda_update(self, s, a, r, s_next, a_next, done):
    a_star = self.Q[s_next].argmax()  # greedy action used in the target
    delta = r + (0 if done else self.gamma * self.Q[s_next, a_star]) - self.Q[s, a]
    # The next action is exploratory if it is not one of the greedy actions
    exploratory = (not done) and self.Q[s_next, a_next] < self.Q[s_next, a_star]
    self.E[s, a] += 1
    self.Q += self.alpha * delta * self.E
    if done or exploratory:
        self.E[:] = 0  # cut the trace
    else:
        self.E *= self.gamma * self.lam
```

### GAE (PyTorch)

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """PPO-style advantage estimation for one rollout of 1-D float tensors.

    dones[t] is 1.0 if step t ended the episode, else 0.0.
    last_value is V(s_T); leave it at 0.0 when the rollout ends in a terminal state.
    """
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns
```

### Lambda choice (typical)

```python
# GAE lambda presets (rules of thumb, not guarantees)
LAM_CONSERVATIVE = 0.95  # PPO default, stable
LAM_AGGRESSIVE = 0.99    # closer to Monte Carlo, more variance
LAM_BIASED = 0.9         # closer to TD(0), more bias

# Task-dependent choice; `task` is any object exposing these boolean flags
def choose_lambda(task):
    if task.episodes_short:
        return 0.95
    if task.sparse_reward:
        return 0.99  # long-range credit assignment
    if task.dense_reward:
        return 0.9
    return LAM_CONSERVATIVE  # fallback when no flag is set
```

### N-step return

```python
import numpy as np

def n_step_return(rewards, values, n, gamma):
    """Forward-view n-step return; values[t] is V(s_t) for a single episode."""
    T = len(rewards)
    returns = np.zeros(T, dtype=float)  # float even if rewards are integers
    for t in range(T):
        G = 0.0
        for k in range(min(n, T - t)):  # stop at the end of the episode
            G += gamma**k * rewards[t + k]
        if t + n < len(values):
            G += gamma**n * values[t + n]  # bootstrap from V(s_{t+n})
        returns[t] = G
    return returns
```

### True online TD(λ)

```python
# Dutch trace (van Seijen), tabular case. Method for a TDLambda-style agent
# that also stores self.V_old = 0.0 at the start of every episode.
def true_online_step(self, s, r, s_next, done):
    v = self.V[s]
    v_next = 0.0 if done else self.V[s_next]
    delta = r + self.gamma * v_next - v
    e_s = self.E[s]  # trace of the current state before decay
    self.E *= self.gamma * self.lam
    self.E[s] += 1 - self.alpha * self.gamma * self.lam * e_s  # dutch trace
    self.V += self.alpha * (delta + v - self.V_old) * self.E
    self.V[s] -= self.alpha * (v - self.V_old)
    self.V_old = v_next  # value of the next state before this update
    if done:
        self.E[:] = 0
        self.V_old = 0.0
```

## Decision criteria

| Situation | Approach |
|---|---|
| Tabular RL | TD(λ) replacing |
| Linear function approx | True online TD(λ) |
| DRL actor-critic | GAE λ=0.95 |
| Sparse reward | λ → 1 (Monte Carlo-like) |
| Dense reward | λ → 0 (TD-like) |
| Off-policy | Watkins Q(λ) or V-trace |

**Default**: for modern DRL, GAE(γ=0.99, λ=0.95). For tabular problems, TD(λ) with replacing traces.
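
The "sparse reward, λ → 1" row can be made concrete with a toy experiment: a left-to-right chain whose only reward arrives on the final transition. The sketch below is a minimal illustration, assuming replacing traces, a single episode, and zero-initialized values; `run_chain_episode` and its defaults are hypothetical names chosen for this example.

```python
import numpy as np

def run_chain_episode(n_states=5, alpha=0.5, gamma=1.0, lam=0.9):
    """One pass through states 0..n_states-1; reward 1 only on the last step."""
    V = np.zeros(n_states)
    E = np.zeros(n_states)
    for s in range(n_states):
        done = s == n_states - 1
        r = 1.0 if done else 0.0
        v_next = 0.0 if done else V[s + 1]
        delta = r + gamma * v_next - V[s]
        E *= gamma * lam  # decay every trace
        E[s] = 1.0        # replacing trace for the visited state
        V += alpha * delta * E
    return V

print(run_chain_episode(lam=0.0))  # only the last state learns anything
print(run_chain_episode(lam=0.9))  # credit reaches every visited state
```

With `lam=0.0` a single episode moves only the value of the final state, so the reward would need many more episodes to propagate back to the start. With `lam=0.9` every visited state receives a share of the terminal TD error, scaled by how long ago it was visited.
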
## 🔗 Graph

- Parent: [[Reinforcement-Learning]] · [[TD-Learning]]
- Variants: [[TD-Lambda]] · [[GAE]]
- Applications: [[PPO]] · [[A2C]] · [[Actor-Critic]]
- Adjacent: [[Bias-Variance-Trade-off]] · [[Credit-Assignment]]

## 🤖 LLM usage

**When to use**: credit assignment in RL, actor-critic methods, sparse rewards.

**When not to use**: deterministic supervised learning, one-step bandits.

## ❌ Anti-patterns

- **Always λ=1**: high variance.
- **Always λ=0**: high bias; fails on long-horizon tasks.
- **Forgetting the trace reset**: traces must be cleared at every episode boundary.
- **GAE without a value baseline**: the advantages come out wrong.
- **Looping in the wrong direction**: running the GAE recursion forward in time; it must run in reverse.

## 🧪 Verification / Duplicates

- Verified (Sutton-Barto Ch12, Schulman GAE 2016, PPO 2017).
- Trust level: A.

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-04-26 | RL-ELIG auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TD(λ) + GAE + forward / backward / Sarsa / Watkins / true online code |