---
id: wiki-2026-0508-eligibility-traces
title: Eligibility Traces
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [eligibility trace, lambda return, TD-lambda, n-step bootstrapping, GAE]
duplicate_of: none
source_trust_level: A
confidence_score: 0.96
verification_status: applied
tags: [reinforcement-learning, eligibility-traces, td-learning, credit-assignment, gae, ppo]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / Stable-Baselines3 / CleanRL
---
# Eligibility Traces
## One-line summary
> **"The middle ground between TD(0) and Monte Carlo."** λ ∈ [0, 1] sets the bias-variance trade-off. The canonical algorithm family from Sutton and Barto. Modern form: GAE (Generalized Advantage Estimation), the standard in PPO. An efficient mechanism for credit assignment.
## Core ideas
### Motivation
- **TD(0)**: 1-step bootstrap (low variance, high bias).
- **Monte Carlo**: full return (high variance, no bias).
- **TD(λ)**: λ-weighted average of the two (the sweet spot).
### Forward view
```
G_t^λ = (1 - λ) Σ_{n=1}^{∞} λ^(n-1) G_t^(n)
```
A geometrically weighted average of the n-step returns G_t^(n): the weights (1 - λ) λ^(n-1) sum to 1.
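As a rough illustration of the forward view, the sketch below computes the λ-return for one finite episode. It assumes `values[k]` holds V(S_k) for the non-terminal states and that the terminal state has value 0, so all weight left over after the last bootstrapping n-step return goes to the full Monte Carlo return. The names `n_step_return_at` and `lambda_return` are hypothetical helpers, not part of any library.

```python
def n_step_return_at(rewards, values, t, n, gamma):
    """G_t^(n): up to n discounted rewards, then a bootstrap from V if S_{t+n} is non-terminal."""
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:  # S_{t+n} is still a non-terminal state
        G += gamma ** n * values[t + n]
    return G


def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return G_t^lambda for one finite episode."""
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):  # n-step returns that still bootstrap
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return_at(rewards, values, t, n, gamma)
    # The remaining weight lam^(T-t-1) goes to the full (Monte Carlo) return
    G_lam += lam ** (T - t - 1) * n_step_return_at(rewards, values, t, T - t, gamma)
    return G_lam


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
g_td = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.0)   # equals the 1-step return
g_mid = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.5)
g_mc = lambda_return(rewards, values, t=0, gamma=0.9, lam=1.0)   # equals the Monte Carlo return
```

With λ = 0 only the 1-step return keeps any weight, and with λ = 1 only the full return does, which matches the two endpoints listed under Motivation.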
### Backward view (eligibility trace)
- Each state s keeps a trace e(s).
- Visiting a state raises its trace.
- Every trace decays by γλ at each step.
- The TD error δ updates every state in proportion to its trace.
```
e_t(s) = γλ e_{t-1}(s) + 1[S_t = s]    (accumulating; a replacing trace sets e_t(S_t) = 1 instead)
V(s) ← V(s) + α δ_t e_t(s)
```
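To make the trace dynamics concrete, here is a small sketch that tracks the trace vector over a short visit sequence, once with accumulating and once with replacing traces. The helper `trace_history` and the toy visit sequence are illustrative assumptions.

```python
import numpy as np


def trace_history(visits, n_states, gamma, lam, replacing=False):
    """Return the trace vector after each step of a state-visit sequence."""
    e = np.zeros(n_states)
    history = []
    for s in visits:
        e *= gamma * lam      # every trace decays by gamma * lambda
        if replacing:
            e[s] = 1.0        # replacing: the visited state's trace is set to 1
        else:
            e[s] += 1.0       # accumulating: 1 is added on top of the decayed value
        history.append(e.copy())
    return np.array(history)


visits = [0, 0, 1]  # state 0 is visited twice in a row, then state 1
acc = trace_history(visits, n_states=2, gamma=1.0, lam=0.5)
rep = trace_history(visits, n_states=2, gamma=1.0, lam=0.5, replacing=True)
# acc rows: [1, 0], [1.5, 0], [0.75, 1]
# rep rows: [1, 0], [1.0, 0], [0.5, 1]
```

The two variants only differ when a state is revisited before its trace has decayed away: the accumulating trace can exceed 1, the replacing trace cannot.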
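### Variants
- **TD(0)**: λ = 0.
- **TD(1)**: ≈ Monte Carlo.
- **TD(λ)**: anything in between.
- **Watkins Q(λ)**: off-policy; traces are reset whenever an exploratory (non-greedy) action is taken.
- **GAE(γ, λ)**: the same idea applied to advantage estimation in modern policy-gradient methods.
### Modern form: GAE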
```
A_t^GAE = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
δ_t = r_t + γV(s_{t+1}) - V(s_t)
```
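The two limits of the GAE formula can be checked numerically. The sketch below is a minimal NumPy version for a single trajectory with no episode boundary inside it; `gae` and its `last_value` argument (the value of the state after the final step) are assumptions made for this illustration. With λ = 0 the advantage reduces to the 1-step TD error, and with λ = 1 it telescopes to the Monte Carlo return minus the value baseline.

```python
import numpy as np


def gae(rewards, values, last_value, gamma, lam):
    """GAE over one trajectory; last_value is V of the state after the final step."""
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    deltas = rewards + gamma * next_values - values   # delta_t for every step
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                      # A_t = delta_t + gamma * lam * A_{t+1}
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv, deltas


rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3])
gamma = 0.9

adv_td, deltas = gae(rewards, values, last_value=0.0, gamma=gamma, lam=0.0)
adv_mc, _ = gae(rewards, values, last_value=0.0, gamma=gamma, lam=1.0)

# Discounted Monte Carlo returns of the same episode (terminal value 0)
mc_returns = np.array([1.0 + gamma**2 * 2.0, gamma * 2.0, 2.0])
```

Here `adv_td` equals `deltas` and `adv_mc` equals `mc_returns - values`, which is the bias-variance dial of TD(λ) expressed on advantages.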
### Applications
1. **TD(λ) prediction**: value learning.
2. **Sarsa(λ)**: on-policy control.
3. **Q(λ)**: off-policy control.
4. **GAE in PPO/A2C**: modern actor-critic.
5. **Replay buffer**: trace replay.
## 💻 Patterns
### TD(λ) (Sutton-Barto, accumulating trace)
```python
import numpy as np


class TDLambda:
    """Tabular TD(λ) prediction with accumulating traces."""

    def __init__(self, n_states, alpha=0.1, gamma=0.99, lam=0.9):
        self.V = np.zeros(n_states)  # state-value estimates
        self.E = np.zeros(n_states)  # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def reset_trace(self):
        self.E[:] = 0

    def step(self, s, r, s_next, done):
        # TD error; a terminal next state contributes no bootstrap value
        delta = r + (0 if done else self.gamma * self.V[s_next]) - self.V[s]
        self.E[s] += 1  # accumulating trace
        self.V += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam  # decay all traces
        if done:
            self.reset_trace()
```
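As a usage sketch, the same accumulating-trace update can be run on a 5-state random walk in the style of the Sutton-Barto example: start in the middle, step left or right with equal probability, and receive reward 1 only when exiting on the right. The environment, the hyperparameters, and the function name `td_lambda_random_walk` are assumptions chosen for illustration. The update is inlined so that the block runs on its own.

```python
import numpy as np


def td_lambda_random_walk(n_episodes=5000, alpha=0.02, gamma=1.0, lam=0.8, seed=0):
    """Estimate state values of a 5-state random walk with tabular TD(lambda)."""
    rng = np.random.default_rng(seed)
    n_states = 5
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        E = np.zeros(n_states)  # traces are reset at every episode start
        s = 2                   # start in the middle state
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0  # reward only on the right exit
            v_next = 0.0 if done else V[s_next]
            delta = r + gamma * v_next - V[s]
            E[s] += 1.0                 # accumulating trace
            V += alpha * delta * E
            E *= gamma * lam
            if done:
                break
            s = s_next
    return V


V = td_lambda_random_walk()
true_V = np.arange(1, 6) / 6.0  # analytic values 1/6, 2/6, ..., 5/6
max_error = np.max(np.abs(V - true_V))
```

With a constant step size the estimates keep fluctuating around the analytic values rather than converging exactly, so `max_error` should be small but not zero.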
### Replacing trace
```python
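# Method of TDLambda: same TD error, but with a replacing trace
def replacing_trace_update(self, s, r, s_next, done):
    delta = r + (0 if done else self.gamma * self.V[s_next]) - self.V[s]
    self.E *= self.gamma * self.lam
    self.E[s] = 1  # replace, do not accumulate
    self.V += self.alpha * delta * self.E
    if done:
        self.reset_trace()  # traces must not leak across episodes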
```
### Sarsa(λ)
```python
class SarsaLambda:
    """Tabular Sarsa(λ) with accumulating traces and an ε-greedy policy."""

    def __init__(self, n_s, n_a, alpha=0.1, gamma=0.99, lam=0.9, eps=0.1):
        self.Q = np.zeros((n_s, n_a))
        self.E = np.zeros((n_s, n_a))
        self.alpha, self.gamma, self.lam, self.eps = alpha, gamma, lam, eps

    def act(self, s):
        # Explore with probability eps, otherwise act greedily
        if np.random.rand() < self.eps:
            return np.random.randint(self.Q.shape[1])
        return self.Q[s].argmax()

    def update(self, s, a, r, s_next, a_next, done):
        # On-policy target: uses the action actually chosen in s_next
        delta = r + (0 if done else self.gamma * self.Q[s_next, a_next]) - self.Q[s, a]
        self.E[s, a] += 1
        self.Q += self.alpha * delta * self.E
        self.E *= self.gamma * self.lam
        if done:
            self.E[:] = 0
```
### Watkins Q(λ)
```python
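# Method of a tabular agent whose self.Q and self.E are shaped like SarsaLambda's.
# a_next is the action the behaviour policy actually takes in s_next.
def q_lambda_update(self, s, a, r, s_next, a_next, done):
    a_star = self.Q[s_next].argmax()  # greedy action used for the target
    delta = r + (0 if done else self.gamma * self.Q[s_next, a_star]) - self.Q[s, a]
    self.E[s, a] += 1
    self.Q += self.alpha * delta * self.E
    # If the next action is exploratory (non-greedy), cut the traces
    exploratory = self.Q[s_next, a_next] != self.Q[s_next, a_star]
    if done or exploratory:
        self.E[:] = 0
    else:
        self.E *= self.gamma * self.lam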
```
### GAE (PyTorch)
```python
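import torch


def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Standard PPO advantage estimation over one rollout.

    rewards, values, dones: 1-D float tensors of equal length (dones holds 0.0 or 1.0).
    last_value: value of the state after the final step; keep 0 if the rollout
    ends in a terminal state.
    """
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):  # must run backward in time
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]  # cuts bootstrap and recursion at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # targets for the value function
    return advantages, returns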
```
### Lambda choice (typical)
```python
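# GAE
LAM_CONSERVATIVE = 0.95  # PPO default; stable
LAM_AGGRESSIVE = 0.99    # closer to Monte Carlo, more variance
LAM_BIASED = 0.9         # closer to TD(0), more bias


# Task-dependent heuristic; `task` is any object exposing these boolean flags
def choose_lambda(task):
    if task.episodes_short:
        return LAM_CONSERVATIVE
    if task.sparse_reward:
        return LAM_AGGRESSIVE  # long-range credit assignment
    if task.dense_reward:
        return LAM_BIASED
    return LAM_CONSERVATIVE  # fall back to the default instead of returning None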
```
### N-step return
```python
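import numpy as np


def n_step_return(rewards, values, n, gamma):
    """Forward-view n-step return for every step of one episode.

    Assumes the episode terminates after the last reward, so no bootstrap
    value is added once t + n runs past the end.
    """
    returns = np.zeros(len(rewards), dtype=float)
    for t in range(len(rewards)):
        G = 0.0
        for k in range(n):
            if t + k < len(rewards):
                G += gamma**k * rewards[t + k]
        if t + n < len(values):
            G += gamma**n * values[t + n]  # bootstrap from V(S_{t+n})
        returns[t] = G
    return returns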
```
### True online TD(λ)
```python
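# Dutch trace (van Seijen); tabular case, so the feature vector is one-hot.
# Assumes self.V_old is initialised to 0.0 next to self.V and self.E.
def true_online_step(self, s, r, s_next, done):
    v = self.V[s]  # read before V is modified
    v_next = 0.0 if done else self.V[s_next]
    delta = r + self.gamma * v_next - v
    e_s = self.E[s]  # trace of the current state before decay
    self.E *= self.gamma * self.lam
    self.E[s] += 1 - self.alpha * self.gamma * self.lam * e_s
    self.V += self.alpha * (delta + v - self.V_old) * self.E
    self.V[s] -= self.alpha * (v - self.V_old)
    self.V_old = v_next  # 0.0 at episode end, ready for the next episode
    if done:
        self.E[:] = 0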
```
## Decision criteria
| Situation | Approach |
|---|---|
| Tabular RL | TD(λ) replacing |
| Linear function approx | True online TD(λ) |
| DRL actor-critic | GAE λ=0.95 |
| Sparse reward | λ → 1 (Monte Carlo-like) |
| Dense reward | λ → 0 (TD-like) |
| Off-policy | Watkins Q(λ) or V-trace |
**Default**: modern DRL = GAE(γ=0.99, λ=0.95); tabular = TD(λ) with replacing traces.
## 🔗 Graph
- Parent: [[Reinforcement-Learning]] · [[TD-Learning]]
- Variants: [[TD-Lambda]] · [[GAE]]
- Applications: [[PPO]] · [[A2C]] · [[Actor-Critic]]
- Adjacent: [[Bias-Variance-Trade-off]] · [[Credit-Assignment]]
## 🤖 LLM usage
**When to use**: RL credit assignment, actor-critic methods, sparse rewards.
**When not to use**: deterministic supervised learning, 1-step bandits.
## ❌ Anti-patterns
- **Always λ=1**: high variance.
- **Always λ=0**: high bias; fails on long-horizon tasks.
- **Forgetting to reset traces**: traces must be zeroed at every episode boundary.
- **GAE without a value baseline**: the advantage estimates come out wrong.
- **Looping in the wrong direction**: the GAE recursion must run backward in time, not forward.
## 🧪 Verification / duplicates
- Verified (Sutton-Barto Ch12, Schulman GAE 2016, PPO 2017).
- Trust level: A.
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-04-26 | RL-ELIG auto |
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: TD(λ) + GAE + forward / backward / Sarsa / Watkins / true online code |