Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization
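Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Normalized link names: fixed broken links, pointed links directly at redirect targets, mapped concepts (~2,400 changes)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections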

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:52:15 +09:00

---
id: wiki-2026-0508-credit-assignment
title: Credit Assignment Problem
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [credit assignment, temporal credit, structural credit, backpropagation, GAE, PRM, attribution]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [reinforcement-learning, credit-assignment, backpropagation, gae, prm, attribution, multi-agent, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / JAX / RL libs
---
# Credit Assignment Problem
## In one line
> **"Who, or what, earned the reward?"** Attributing the final reward of a long sequence to its individual steps. Fundamental to RL, and the essence of backprop in deep learning. Modern instances: GAE, PRM, RLHF, multi-agent.
## Core types
### Temporal Credit Assignment
- A sequence of actions leads to one final reward.
- Which action was decisive?
- The central challenge in RL.
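
A minimal sketch of the temporal case, assuming a single episode whose only nonzero reward arrives at the last step. The function name `discounted_returns` and the discount value are illustrative choices, not something this note prescribes. The discounted return-to-go is the simplest way to spread a final reward back over earlier steps.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse reward: only the last step is rewarded.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(episode_rewards)
```

Every earlier step receives a geometrically smaller share of the final reward, regardless of whether the action taken there actually mattered. That indiscriminate spreading is the gap the methods in this note try to close.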
### Structural Credit Assignment
- A layered NN produces an error.
- Which weight or neuron should be adjusted?
- Solved by backprop.
### Multi-agent Credit
- N agents share one collective reward.
- What did each individual contribute?
## Solutions
### Backpropagation (structural)
- Chain rule.
- A gradient for each layer.
- Rumelhart, Hinton, and Williams (1986).
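
To make the chain-rule step concrete, here is a hand-written backward pass for a two-layer ReLU network in NumPy, checked against a finite difference. The network shape, seed, and target value are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
target = np.array([0.5])
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 1))

def forward(W1, W2):
    z = x @ W1              # pre-activation, shape (3,)
    h = np.maximum(z, 0.0)  # ReLU
    y = h @ W2              # output, shape (1,)
    loss = np.mean((y - target) ** 2)
    return z, h, y, loss

z, h, y, loss = forward(W1, W2)

# Chain rule, applied layer by layer from the loss backwards.
dy = 2.0 * (y - target) / y.size   # dL/dy
dW2 = np.outer(h, dy)              # dL/dW2, shape (3, 1)
dh = W2 @ dy                       # dL/dh, shape (3,)
dz = dh * (z > 0)                  # ReLU passes gradient only where z > 0
dW1 = np.outer(x, dz)              # dL/dW1, shape (2, 3)

# Finite-difference check on a single weight.
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (forward(W1_shift, W2)[3] - loss) / eps
```

Each entry of `dW1` and `dW2` is the credit (or blame) assigned to one weight for the final error, which is what an autograd `backward()` call computes automatically.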
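### TD Learning (temporal)
- Bootstrapping: update a value estimate toward the reward plus the estimated value of the next state.
- See [[Computational-Neuroscience-RL]].
### Eligibility Trace
- Keeps a decaying trace of past states and actions.
- TD(λ).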
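
A small sketch of what the trace buys, assuming a hypothetical 5-state chain walked left to right with reward 1 on the final transition. The chain, step size, and helper name `run_episode` are made up for this example.

```python
import numpy as np

def run_episode(lam, n_states=5, alpha=0.5, gamma=1.0):
    """One left-to-right pass along a chain; reward 1 on the final transition."""
    V = np.zeros(n_states + 1)   # last index is the terminal state, value 0
    e = np.zeros(n_states + 1)   # eligibility trace per state
    for s in range(n_states):
        reward = 1.0 if s == n_states - 1 else 0.0
        td_error = reward + gamma * V[s + 1] - V[s]
        e *= gamma * lam         # decay all traces
        e[s] += 1.0              # mark the state just visited
        V += alpha * td_error * e
    return V[:n_states]

v_td0 = run_episode(lam=0.0)     # no trace: only the last state learns
v_trace = run_episode(lam=0.9)   # trace: every visited state gets some credit
```

After one episode, TD(0) has updated only the state next to the reward, while TD(λ) with λ = 0.9 has already pushed a decayed share of the credit back to the first state.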
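### GAE (Generalized Advantage Estimation)
- Schulman et al., 2015.
- λ controls the bias-variance trade-off.
- The standard advantage estimator in PPO.
- $A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.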
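
The two ends of the trade-off can be checked numerically. The sketch below assumes a 3-step episode with made-up critic values and a terminal value of 0. With λ = 0 the estimator collapses to the one-step TD error (low variance, biased by the critic); with λ = 1 it becomes the Monte Carlo return minus the value baseline (unbiased, high variance).

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """values has len(rewards) + 1 entries; the last is the bootstrap value."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.2, 0.4, 0.7, 0.0])   # hypothetical critic outputs; terminal value 0
gamma = 0.9

adv_td = gae(rewards, values, gamma, lam=0.0)   # one-step TD errors
adv_mc = gae(rewards, values, gamma, lam=1.0)   # Monte Carlo return minus V(s_t)

deltas = rewards + gamma * values[1:] - values[:-1]
mc_returns = np.array([gamma**2, gamma, 1.0])   # discounted return from each step
```

Values of λ in between interpolate between these two estimators.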
### Hindsight Experience Replay (HER)
- Reuses a failed trajectory by relabeling it with a different goal.
### Reward Shaping
- Adds dense intermediate rewards.
- Caution: careless shaping can make an unintended behavior optimal.
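
One known-safe form is potential-based shaping (Ng et al., 1999), where the bonus is $\gamma\Phi(s') - \Phi(s)$. The sketch below uses hypothetical potentials along one trajectory to show why it is safe: the discounted sum of the bonuses telescopes, so it depends only on the start and end states and not on the path taken.

```python
def shaping_terms(potentials, gamma):
    """F_t = gamma * phi(s_{t+1}) - phi(s_t) for consecutive states."""
    return [gamma * potentials[t + 1] - potentials[t]
            for t in range(len(potentials) - 1)]

# Hypothetical potentials, e.g. phi(s) = -distance_to_goal(s), along one trajectory.
potentials = [-3.0, -2.0, -2.5, -1.0, 0.0]
gamma = 0.9
terms = shaping_terms(potentials, gamma)

# The discounted sum telescopes to gamma^T * phi(s_T) - phi(s_0).
T = len(terms)
discounted_sum = sum(gamma**t * f for t, f in enumerate(terms))
closed_form = gamma**T * potentials[-1] - potentials[0]
```

A hand-tuned bonus without this structure has no such guarantee, which is where unintended optima come from.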
### Process Reward Model (PRM, modern)
- Grades every step.
- Examples: OpenAI math, DeepSeek-Prover.
- Finer-grained than an outcome reward.
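
A toy illustration of the extra resolution, assuming two hypothetical solutions that both reach the correct answer. The step scores are invented, and collapsing them with `min` or a product is one possible aggregation choice, not a rule stated by this note.

```python
import math

# Hypothetical step-level scores in [0, 1], as a PRM might emit them.
candidates = {
    "solution_a": [0.9, 0.8, 0.95],
    "solution_b": [0.9, 0.2, 0.95],   # one weak intermediate step
}
outcome_reward = {"solution_a": 1.0, "solution_b": 1.0}  # both reach the right answer

def aggregate(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

ranked = sorted(candidates, key=lambda k: aggregate(candidates[k]), reverse=True)
weakest_step = {k: min(range(len(v)), key=v.__getitem__) for k, v in candidates.items()}
```

The outcome reward cannot separate the two solutions, while the step scores both rank them and point at the step that deserves the blame.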
### Counterfactual (multi-agent)
- COMA (Counterfactual Multi-Agent Policy Gradients).
- Vary one agent's action while holding the others fixed to isolate that agent's contribution.
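
A worked two-agent example of the counterfactual idea, assuming a made-up joint Q-function in which only agent 0's action affects the team value, and a uniform policy for both agents. All names here are hypothetical.

```python
actions = [0, 1]

def team_q(joint_action):
    """Hypothetical joint Q: the team value depends only on agent 0's action."""
    return 1.0 if joint_action[0] == 1 else 0.0

uniform_policy = {a: 0.5 for a in actions}

def counterfactual_advantage(joint_action, agent_idx):
    """Joint Q minus the expected Q when only this agent's action is resampled."""
    baseline = 0.0
    for alt in actions:
        swapped = list(joint_action)
        swapped[agent_idx] = alt
        baseline += uniform_policy[alt] * team_q(tuple(swapped))
    return team_q(joint_action) - baseline

joint = (1, 1)
adv_agent0 = counterfactual_advantage(joint, 0)   # 1.0 - 0.5
adv_agent1 = counterfactual_advantage(joint, 1)   # 1.0 - 1.0
```

Both agents see the same team value of 1.0, but only agent 0 receives a positive advantage. Agent 1's action made no difference, so its credit is zero.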
### Attention attribution (LLM)
- Uses attention scores as an attribution signal.
- Related attribution methods: SHAP, integrated gradients.
## Applications
1. **Game AI**: chess / Go (long horizon).
2. **Robotics**: sparse rewards.
3. **LLM RLHF**: token-level reward.
4. **Multi-agent**: cooperative tasks.
5. **Medical**: long-term outcomes.
6. **Finance**: portfolios.
## 💻 Patterns
### Backpropagation (structural)
```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
target = torch.tensor([1.0])   # placeholder target so the example runs
W1 = torch.randn(2, 3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
h = torch.relu(x @ W1)
y = h @ W2
loss = (y - target).pow(2).mean()
loss.backward()   # computes the gradient (credit) for W1 and W2
print(W1.grad)    # contribution of each first-layer weight to the loss
```
### TD(0) (temporal)
```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.95):
    """One TD(0) step: move V[state] toward the bootstrapped target."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return V
```
### TD(λ) with eligibility trace
```python
import numpy as np

class TDLambda:
    def __init__(self, n_states, alpha=0.1, gamma=0.95, lam=0.9):
        self.V = np.zeros(n_states)
        self.e = np.zeros(n_states)   # eligibility trace per state
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, state, reward, next_state):
        td_error = reward + self.gamma * self.V[next_state] - self.V[state]
        self.e *= self.gamma * self.lam            # decay all traces
        self.e[state] += 1                         # bump the trace of the visited state
        self.V += self.alpha * td_error * self.e   # update in proportion to the trace
```
### GAE (PPO standard)
```python
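import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage for every step of a single episode.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0 if terminal).
    """
    advantages = np.zeros(len(rewards), dtype=float)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        advantages[t] = last_gae = delta + gamma * lam * last_gae
    return advantages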
```
### Hindsight Experience Replay
```python
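from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state goal")

def her_relabel(trajectory, goal_extractor, reached):
    """Relabel a failed trajectory as a success for the goal it actually reached.

    `reached(state, goal)` is an assumed helper that returns True when
    `state` satisfies `goal`.
    """
    new_trajectories = [trajectory]
    # Use the state the episode actually ended in as the substitute goal.
    final_state = trajectory[-1].next_state
    new_goal = goal_extractor(final_state)
    relabeled = []
    for t in trajectory:
        new_reward = 1.0 if reached(t.next_state, new_goal) else -0.01
        relabeled.append(Transition(t.state, t.action, new_reward, t.next_state, new_goal))
    new_trajectories.append(relabeled)
    return new_trajectories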
```
### Reward shaping (caution)
```python
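def shaped_reward(state, action, next_state, environment_reward, distance_to_goal, gamma=0.99):
    """Potential-based shaping (Ng et al., 1999), which preserves the optimal policy.

    `environment_reward` and `distance_to_goal` are assumed callables supplied
    by the task; `gamma` should match the agent's discount factor.
    """
    base_reward = environment_reward(state, action, next_state)
    phi = lambda s: -distance_to_goal(s)   # potential: closer to the goal is higher
    shaping = gamma * phi(next_state) - phi(state)
    return base_reward + shaping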
```
### Process Reward Model (PRM)
```python
def prm_train(model, trajectories, optimizer, label_step):
    """Supervised training on per-step quality labels.

    `model.loss`, `label_step`, and the trajectory fields are assumed interfaces.
    """
    for traj in trajectories:
        for step in traj.steps:
            # The label comes from a human annotator or an automatic verifier.
            quality = label_step(step.state, step.action, step.reasoning)
            loss = model.loss(step, quality)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Inference: score each partial generation with the PRM and keep the best beams.
def search_with_prm(prompt, prm, generate_n, beam=4, depth=10):
    """`generate_n(text, n)` is an assumed sampler that returns n continuations."""
    candidates = [prompt]
    for _ in range(depth):
        all_candidates = []
        for c in candidates:
            for cont in generate_n(c, n=beam * 2):
                score = prm.score(c + cont)
                all_candidates.append((c + cont, score))
        all_candidates.sort(key=lambda x: -x[1])
        candidates = [c for c, _ in all_candidates[:beam]]
    return candidates[0]
```
### COMA (multi-agent counterfactual)
```python
def coma_advantage(joint_actions, q_function, agent_idx, action_space, policy):
    """One agent's contribution = joint Q minus a counterfactual baseline.

    `policy[agent_idx][a]` is assumed to be the probability that the agent
    picks action `a`.
    """
    actual_q = q_function(joint_actions)
    # Baseline: expected Q when only this agent's action is resampled from its policy.
    counterfactual_q = 0.0
    for alt_action in action_space:
        alt = list(joint_actions)
        alt[agent_idx] = alt_action
        counterfactual_q += q_function(alt) * policy[agent_idx][alt_action]
    return actual_q - counterfactual_q
```
### Attention-based attribution
```python
import torch

def attention_attribution(model, input_ids, target_token_idx):
    """Rough contribution of each input token to one output position.

    Assumes a model that accepts `output_attentions=True` and returns `.attentions`.
    """
    output = model(input_ids, output_attentions=True)
    attentions = output.attentions  # tuple of n_layers tensors, each (batch, heads, seq, seq)
    # Average over layers, batch, and heads to get a single (seq, seq) map.
    avg = torch.stack(attentions).mean(dim=(0, 1, 2))
    return avg[target_token_idx]  # shape (seq,): attention from the target position to each input
```
### RLHF token-level credit
```python
def rlhf_token_advantage(generated_tokens, reward_model, process_reward_model=None):
    """Distribute a sequence-level reward over tokens."""
    if process_reward_model is not None:
        # Better: step-level scores from a PRM (`score_each` is an assumed method).
        return process_reward_model.score_each(generated_tokens)
    # Simple fallback: spread the final reward uniformly over all tokens (weak signal).
    final_reward = reward_model(generated_tokens)
    return [final_reward / len(generated_tokens)] * len(generated_tokens)
```
## Decision criteria
| Situation | Approach |
|---|---|
| Long horizon | TD + GAE |
| Sparse reward | HER + reward shaping |
| Math / multi-step | PRM |
| Deep NN | Backprop |
| Multi-agent | COMA / counterfactual |
| LLM RLHF | PRM > outcome reward |
| Interpretability | SHAP / attention |
**Default**: GAE (PPO) + PRM (LLM math/code).
## 🔗 Graph
- Parents: [[Reinforcement-Learning]] · [[Optimization]] · [[Deep-Learning]]
- Variants: [[데이터_사이언스_및_ML_엔지니어링|Backpropagation]] · [[TD-Learning]] · [[GAE]] · [[HER]] · [[PRM]]
- Applications: [[PPO]] · [[RLHF]] · [[Best-of-N_Sampling]] · [[Multi-agent-System|Multi-Agent-Systems]]
- Adjacent: [[Computational-Neuroscience-RL]] · [[Bayesian-Brain-Hypothesis]] · [[Bias-Correction-Algorithm]] · [[Causal-Inference]]
## 🤖 LLM usage
**When to use**: RL design, RLHF / PRM, multi-agent systems, attribution.
**When not to use**: supervised IID learning (a different paradigm).
## ❌ Anti-patterns
- **Outcome reward only (long horizon)**: sparse signal.
- **Careless reward shaping**: unintended optimum.
- **No eligibility trace** (long horizon): slow learning.
- **Noisy PRM labels**: wrong attribution.
- **One shared team reward with no per-agent credit (multi-agent)**: lazy agents.
## 🧪 Verification / duplicates
- Verified (Schulman GAE, Andrychowicz HER, OpenAI PRM, Foerster COMA).
- Trust level: A.
- Related: [[Reinforcement-Learning]] · [[Computational-Neuroscience-RL]] · [[RLHF]] · [[Causal-Inference]] · [[Best-of-N_Sampling]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: temporal/structural/multi-agent + GAE / HER / PRM / COMA code |