---
id: wiki-2026-0508-credit-assignment
title: Credit Assignment Problem
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [credit assignment, temporal credit, structural credit, backpropagation, GAE, PRM, attribution]
duplicate_of: none
source_trust_level: A
confidence_score: 0.93
verification_status: applied
tags: [reinforcement-learning, credit-assignment, backpropagation, gae, prm, attribution, multi-agent, llm]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: PyTorch / JAX / RL libs
---
# Credit Assignment Problem
## In one line
> **"Who, or what, earned the reward?"** Attributing the final reward of a long sequence back to the individual steps that produced it. A fundamental problem in RL, and the essence of backprop in deep learning. Modern forms: GAE, PRM, RLHF, multi-agent.
## Core types
### Temporal Credit Assignment
- A sequence of actions → one final reward.
- Which action decided the outcome?
- The central challenge in RL.
### Structural Credit Assignment
- A layered NN → one error signal.
- Which weight / neuron should be adjusted?
- Solved by backprop.
### Multi-agent Credit
- N agents → one collective reward.
- What did each individual agent contribute?
## Solutions
### Backpropagation (structural)
- Chain rule.
- A gradient for each layer.
- Rumelhart, Hinton & Williams, 1986.
### TD Learning (temporal)
- Bootstrapping: each value estimate is updated toward the reward plus the next state's estimate.
- See [[Computational-Neuroscience-RL]].
### Eligibility Trace
- Keeps a decaying trace of past actions.
- TD(λ).
### GAE (Generalized Advantage Estimation)
- Schulman et al., 2015.
- Bias-variance trade-off, controlled by λ.
- The standard advantage estimator in PPO.
- $A_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$
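The sum above is written in terms of the TD residual $\delta_t$, which the list does not define. A sketch of the usual definition, assuming $V$ is the learned value function and $r_t$ the reward at step $t$, together with the backward recursion it implies for a finite episode (taking $A_T = 0$ past the last step):

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad A_t = \delta_t + \gamma\lambda\, A_{t+1}
$$

With $\lambda = 0$ this reduces to the one-step TD advantage $A_t = \delta_t$ (lower variance, more bias); with $\lambda = 1$ it becomes the discounted return minus the baseline $V(s_t)$ (less bias, higher variance).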
### Hindsight Experience Replay (HER)
- Reuses a failed trajectory by relabeling it with a different goal.
### Reward Shaping
- Adds dense intermediate rewards.
- Caution: avoid creating an unintended optimum.
### Process Reward Model (PRM, modern)
- Grades every step.
- Examples: OpenAI Math, DeepSeek-Prover.
- Finer-grained than an outcome reward.
### Counterfactual (multi-agent)
- COMA (Counterfactual Multi-Agent Policy Gradients).
- Vary one agent's action while holding the others fixed → that agent's contribution.
### Attention attribution (LLM)
- Attribution from attention scores.
- SHAP, integrated gradients.
## Applications
1. **Game AI**: chess / Go (long horizon).
2. **Robotics**: sparse rewards.
3. **LLM RLHF**: token-level reward.
4. **Multi-agent**: cooperative tasks.
5. **Medical**: long-term outcomes.
6. **Finance**: portfolios.
## 💻 Patterns
### Backpropagation (structural)
```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
W1 = torch.randn(2, 3, requires_grad=True)
W2 = torch.randn(3, 1, requires_grad=True)
target = torch.tensor([1.0])  # illustrative target; the original snippet left it undefined

h = torch.relu(x @ W1)
y = h @ W2
loss = (y - target).pow(2).mean()
loss.backward()  # computes the gradient (credit) for W1 and W2
print(W1.grad)   # each weight's contribution to the loss
```
### TD(0) (temporal)
```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.95):
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return V
```
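As a hedged illustration of how TD(0) moves credit backward one state per sweep, the sketch below runs a tabular update on a hypothetical 3-state chain (0 → 1 → 2 → terminal) where only the last transition pays a reward. The chain, the step size, and the names `td0_update` and `run_chain` are assumptions made for this example.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.95):
    """One tabular TD(0) update; V is a list of state values."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return V


def run_chain(n_sweeps, alpha=0.5, gamma=1.0):
    """Hypothetical chain 0 -> 1 -> 2 -> terminal; only the last step pays 1.0."""
    V = [0.0, 0.0, 0.0, 0.0]  # index 3 is the terminal state; its value stays 0
    episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]
    for _ in range(n_sweeps):
        for state, reward, next_state in episode:
            td0_update(V, state, reward, next_state, alpha, gamma)
    return V


after_one = run_chain(1)
after_three = run_chain(3)
print(after_one)    # [0.0, 0.0, 0.5, 0.0]
print(after_three)  # [0.125, 0.5, 0.875, 0.0]
```

After one sweep only the state adjacent to the reward has changed; state 0 receives no credit until the third sweep.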
### TD(λ) with eligibility trace
```python
import numpy as np

class TDLambda:
    def __init__(self, n_states, alpha=0.1, gamma=0.95, lam=0.9):
        self.V = np.zeros(n_states)
        self.e = np.zeros(n_states)  # eligibility trace per state
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def update(self, state, reward, next_state):
        td_error = reward + self.gamma * self.V[next_state] - self.V[state]
        self.e *= self.gamma * self.lam           # decay all traces
        self.e[state] += 1                        # bump the trace of the visited state
        self.V += self.alpha * td_error * self.e  # update in proportion to the trace
```
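To make the effect of the trace concrete, here is a minimal pure-Python sketch on the same kind of hypothetical 3-state chain, run for a single episode with and without a trace. The function name `td_lambda_episode` and all numbers are assumptions for illustration, and the sketch uses accumulating traces.

```python
def td_lambda_episode(episode, n_states, alpha=0.5, gamma=1.0, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating traces."""
    V = [0.0] * n_states
    e = [0.0] * n_states
    for state, reward, next_state in episode:
        td_error = reward + gamma * V[next_state] - V[state]
        # Decay every trace, then bump the state just visited
        e = [gamma * lam * x for x in e]
        e[state] += 1.0
        # Every state is updated in proportion to its trace
        V = [v + alpha * td_error * x for v, x in zip(V, e)]
    return V


# Hypothetical chain 0 -> 1 -> 2 -> terminal (index 3); reward only at the end
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, 3)]
with_trace = td_lambda_episode(episode, n_states=4, lam=0.9)
no_trace = td_lambda_episode(episode, n_states=4, lam=0.0)
print(with_trace)
print(no_trace)
```

With `lam=0.9` all three visited states receive credit within the first episode, weighted by how recently they were visited; with `lam=0.0` the update collapses to TD(0) and only the last state changes.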
### GAE (PPO standard)
```python
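import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage for every step of a single episode.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the final state (0.0 if the episode terminated).
    """
    rewards = np.asarray(rewards, dtype=np.float64)  # avoid integer truncation
    advantages = np.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        advantages[t] = last_gae = delta + gamma * lam * last_gae
    return advantages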
```
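The backward loop is a rearrangement of the explicit sum $\sum_l (\gamma\lambda)^l \delta_{t+l}$, truncated at the end of the episode. The pure-Python sketch below checks that equivalence on a made-up three-step episode; the function names and the reward/value numbers are assumptions chosen only for the check.

```python
def gae_recursive(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion; values has len(rewards) + 1 entries (last = bootstrap)."""
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    return advantages


def gae_explicit(rewards, values, gamma=0.99, lam=0.95):
    """Direct evaluation of sum_l (gamma*lam)^l * delta_{t+l} up to the episode end."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]


rewards = [0.0, 0.0, 1.0]      # made-up sparse-reward episode
values = [0.2, 0.4, 0.7, 0.0]  # made-up critic outputs; final 0.0 = terminal state
rec = gae_recursive(rewards, values)
exp = gae_explicit(rewards, values)
print(rec)
print(exp)
```

The recursion gives the same numbers in O(T) time instead of O(T²), which is presumably why implementations prefer it.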
### Hindsight Experience Replay
```python
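from collections import namedtuple

# Assumed transition layout; the original note leaves Transition undefined.
Transition = namedtuple("Transition", "state action reward next_state goal")

def her_relabel(trajectory, goal_extractor, reached):
    """Relabel a failed trajectory as a success for the goal it actually reached.

    `reached(state, goal)` is a caller-supplied predicate returning a bool.
    """
    new_trajectories = [trajectory]
    # Use the state the episode actually ended in as the substitute goal
    final_state = trajectory[-1].next_state
    new_goal = goal_extractor(final_state)
    relabeled = []
    for t in trajectory:
        new_reward = 1.0 if reached(t.next_state, new_goal) else -0.01
        relabeled.append(Transition(t.state, t.action, new_reward, t.next_state, new_goal))
    new_trajectories.append(relabeled)
    return new_trajectories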
```
### Reward shaping (caution)
```python
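def shaped_reward(state, action, next_state, environment_reward, distance_to_goal, gamma=0.99):
    """Base reward plus potential-based shaping (Ng et al., 1999), the safe form.

    `environment_reward` and `distance_to_goal` are caller-supplied callables.
    """
    base_reward = environment_reward(state, action, next_state)
    # Potential: the closer to the goal, the higher the potential
    phi = lambda s: -distance_to_goal(s)
    shaping = gamma * phi(next_state) - phi(state)
    return base_reward + shaping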
```
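One way to see why the potential-based form is the safe one: summed along a trajectory with discounting, the shaping terms telescope, so the total bonus depends only on the start and end states and not on the path taken. The sketch below checks this numerically; the potential values and helper names are hypothetical.

```python
def shaping_terms(potentials, gamma):
    """F_t = gamma * phi(s_{t+1}) - phi(s_t) for each transition."""
    return [gamma * potentials[t + 1] - potentials[t] for t in range(len(potentials) - 1)]


def discounted_sum(xs, gamma):
    """sum_t gamma^t * x_t"""
    return sum(gamma ** t * x for t, x in enumerate(xs))


gamma = 0.9
potentials = [-3.0, -2.0, -2.5, -1.0, 0.0]  # hypothetical phi(s_0..s_4), e.g. negative distance
terms = shaping_terms(potentials, gamma)
total = discounted_sum(terms, gamma)
# Telescoping: everything cancels except the two endpoints
expected = gamma ** 4 * potentials[4] - potentials[0]
print(total, expected)
```

Here the detour through `-2.5` does not change the total, which is the intuition behind the policy-invariance result of Ng et al.; an arbitrary distance bonus without the `gamma * phi(next) - phi(current)` structure carries no such guarantee.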
### Process Reward Model (PRM)
```python
def prm_train(model, optimizer, trajectories, label_step):
    """Supervised training on a quality label for each step.

    `model.loss` and `label_step` are placeholders for your own model API
    and your human / verifier labeling function.
    """
    for traj in trajectories:
        for step in traj.steps:
            # Human or verifier label for this step
            quality = label_step(step.state, step.action, step.reasoning)
            optimizer.zero_grad()
            loss = model.loss(step, quality)
            loss.backward()
            optimizer.step()


# Inference: score each generation step with the PRM and keep the best beams.
# `generate_n(text, n=...)` is a caller-supplied sampler returning n continuations.
def search_with_prm(prompt, prm, generate_n, beam=4, depth=10):
    candidates = [prompt]
    for _ in range(depth):
        all_candidates = []
        for c in candidates:
            for cont in generate_n(c, n=beam * 2):
                score = prm.score(c + cont)
                all_candidates.append((c + cont, score))
        all_candidates.sort(key=lambda x: -x[1])
        candidates = [c for c, _ in all_candidates[:beam]]
    return candidates[0]
```
### COMA (multi-agent counterfactual)
```python
def coma_advantage(joint_actions, q_function, agent_idx, action_space, policy):
    """One agent's contribution = joint Q minus a counterfactual baseline.

    `policy[agent_idx][a]` is the probability that the agent picks action `a`.
    """
    actual_q = q_function(joint_actions)
    # Baseline: expected Q when only agent_idx's action is resampled from its policy
    counterfactual_q = 0.0
    for alt_action in action_space:
        alt = list(joint_actions)
        alt[agent_idx] = alt_action
        counterfactual_q += q_function(alt) * policy[agent_idx][alt_action]
    return actual_q - counterfactual_q
```
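A small worked example may help show what the counterfactual baseline buys. In the sketch below the team Q-function is a hypothetical toy in which only agent 0's action matters, and both agents follow assumed uniform policies; none of these values come from the COMA paper.

```python
def counterfactual_advantage(joint_actions, q_function, agent_idx, action_space, policy):
    """Q(joint action) minus the expected Q when only agent_idx's action is resampled."""
    baseline = 0.0
    for alt_action in action_space:
        alt = list(joint_actions)
        alt[agent_idx] = alt_action
        baseline += policy[agent_idx][alt_action] * q_function(tuple(alt))
    return q_function(tuple(joint_actions)) - baseline


# Hypothetical team Q: only agent 0's action matters (1 = "work", 0 = "idle").
def team_q(joint):
    return 10.0 if joint[0] == 1 else 0.0


action_space = [0, 1]
policy = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]  # assumed uniform policies
joint = (1, 1)
adv_agent0 = counterfactual_advantage(joint, team_q, 0, action_space, policy)
adv_agent1 = counterfactual_advantage(joint, team_q, 1, action_space, policy)
print(adv_agent0, adv_agent1)  # 5.0 0.0
```

Both agents see the same team value of 10, but agent 1's advantage is zero because changing its action never changes Q. A shared reward alone would have credited it equally.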
### Attention-based attribution
```python
import torch

def attention_attribution(model, input_ids, target_token_idx):
    """Contribution of each input token to one specific output position.

    Assumes a Hugging Face-style model that accepts `output_attentions=True`.
    """
    output = model(input_ids, output_attentions=True)
    attentions = output.attentions  # tuple of n_layers tensors, each (batch, n_heads, seq, seq)
    # Average over layers, batch, and heads to get a single (seq, seq) map
    avg = torch.stack(attentions).mean(dim=(0, 1, 2))
    return avg[target_token_idx]  # shape (seq,): attention from the target position to each input
```
### RLHF token-level credit
```python
def rlhf_token_advantage(generated_tokens, reward_model, process_reward_model=None):
    """Distribute the reward across tokens.

    `process_reward_model.score_each` is a placeholder API, not a real library call.
    """
    if process_reward_model is not None:
        # Better: step-level scores from a PRM
        return process_reward_model.score_each(generated_tokens)
    # Simple fallback: spread the final reward evenly over all tokens (inefficient)
    final_reward = reward_model(generated_tokens)
    return [final_reward / len(generated_tokens)] * len(generated_tokens)
```
## Decision criteria
| Situation | Approach |
|---|---|
| Long horizon | TD + GAE |
| Sparse reward | HER + reward shaping |
| Math / multi-step | PRM |
| Deep NN | Backprop |
| Multi-agent | COMA / counterfactual |
| LLM RLHF | PRM > outcome reward |
| Interpretability | SHAP / attention |
**Default**: GAE (PPO) + PRM (LLM math/code).
## 🔗 Graph
- Parents: [[Reinforcement-Learning]] · [[Optimization]] · [[Deep Learning]]
- Variants: [[데이터 사이언스 및 ML 엔지니어링|Backpropagation]] · [[TD-Learning]] · [[GAE]] · [[HER]] · [[PRM]]
- Applications: [[PPO]] · [[RLHF]] · [[Best-of-N_Sampling]] · [[Multi-agent-System|Multi-Agent-Systems]]
- Adjacent: [[Computational-Neuroscience-RL]] · [[Bayesian-Brain-Hypothesis]] · [[Bias-Correction-Algorithm]] · [[Causal-Inference]]
## 🤖 LLM usage
**When**: RL design, RLHF / PRM, multi-agent systems, attribution.
**When not**: supervised learning on IID data (a different paradigm).
## ❌ Anti-patterns
- **Outcome reward only (long horizon)**: the signal is too sparse.
- **Careless reward shaping**: creates an unintended optimum.
- **No eligibility trace** (long horizon): slow learning.
- **Noisy PRM labels**: credit goes to the wrong step.
- **Sharing a single reward across individual agents (multi-agent)**: lazy agents.
## 🧪 Verification / duplicates
- Verified (Schulman GAE, Andrychowicz HER, OpenAI PRM, Foerster COMA).
- Trust level: A.
- Related: [[Reinforcement-Learning]] · [[Computational-Neuroscience-RL]] · [[RLHF]] · [[Causal-Inference]] · [[Best-of-N_Sampling]].
## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: temporal/structural/multi-agent + GAE / HER / PRM / COMA code |