---
id: wiki-2026-0508-self-play-자기-대결-기반-강화학습
title: Self-Play (Reinforcement Learning via Self-Competition)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Self-Play RL, AlphaZero Style, Self-Play Reasoning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reinforcement-learning, self-play, alphazero, mcts, llm-reasoning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Self-Play (Reinforcement Learning via Self-Competition)

## One-Line Summary
> **"An agent plays against itself, so a curriculum emerges automatically with no dependence on human data."** Lineage: TD-Gammon (1992) → AlphaGo Zero (2017) → AlphaZero / MuZero → OpenAI Five → LLM reasoning self-play in 2024-2026 (Self-Rewarding LM, SPIN, AlphaProof). The key point: the opponent is the agent itself, so difficulty is matched automatically.

## Core Ideas

### Why self-play?
- The ground-truth reward (winning the game) is cheap to obtain.
- The curriculum is emergent: as the agent gets stronger, so does its opponent.
- Data efficient: a single game tree yields many trajectories.
- Superhuman performance is achievable: there is no dependence on a human skill ceiling.

### Core algorithms
- **AlphaZero**: MCTS + neural net (policy + value).
- **MuZero**: learned dynamics; model-based, with no game rules needed.
- **PSRO** (Policy-Space Response Oracles): a league of policies, grown by repeatedly adding a best response to the current population.
- **Fictitious self-play**: best-respond to the average opponent over history.
- **AlphaStar (StarCraft)**: a league with main agents + exploiters.

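The "average opponent over history" idea behind fictitious self-play can be illustrated with a toy sketch. This is classic tabular fictitious play on rock-paper-scissors, not the neural variants used at scale; the payoff matrix, starting history, and iteration count are illustrative assumptions.

```python
import numpy as np

# PAYOFF[i, j] = payoff of action i against action j (0=rock, 1=paper, 2=scissors)
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]])

def fictitious_self_play(iterations=20000):
    """Repeatedly best-respond to the empirical mix of one's own past actions."""
    counts = np.array([1.0, 0.0, 0.0])  # arbitrary start: one past game of "rock"
    for _ in range(iterations):
        expected = PAYOFF @ counts         # payoff of each action vs the historical mix
        counts[np.argmax(expected)] += 1   # play the best response and record it
    return counts / counts.sum()

avg_strategy = fictitious_self_play()
print(avg_strategy)  # the empirical mix drifts toward the uniform strategy
```

Each individual best response is a pure action, yet the historical average approaches the mixed equilibrium, which is why the average policy (not the latest one) is the object of interest.
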
### Stability tricks
- Past-self snapshots: build the opponent pool from checkpoints.
- League diversity: main + counter + exploiter agents.
- PFSP (prioritized fictitious self-play): hard opponents get sampling priority.
- Latency-style: opponent strength is kept under control.

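A minimal sketch of the past-self snapshot trick, assuming checkpoints are kept in a list ordered oldest to newest. The function name, the checkpoint names, and the 0.8 split between the latest and older snapshots are illustrative assumptions, not values taken from this note.

```python
import random

def sample_opponent(checkpoints, latest_prob=0.8, rng=random):
    """Return the newest checkpoint with probability latest_prob, else a uniformly chosen older one."""
    if not checkpoints:
        raise ValueError("opponent pool is empty")
    if len(checkpoints) == 1 or rng.random() < latest_prob:
        return checkpoints[-1]
    return rng.choice(checkpoints[:-1])

pool = ["ckpt_000", "ckpt_010", "ckpt_020"]  # hypothetical checkpoint names, oldest first
rng = random.Random(0)
picks = [sample_opponent(pool, rng=rng) for _ in range(1000)]
```

Mixing in older snapshots keeps the learner from forgetting how to beat strategies it has already moved past.
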
### LLM reasoning self-play (2024-2026)
- **Self-Rewarding LM** (Yuan 2024): LLM-as-judge + DPO.
- **SPIN** (Chen 2024): self-play between the current model and its previous iterate.
- **AlphaProof** (DeepMind 2024): IMO silver-medal level; self-play with a Lean prover.
- **STaR / V-STaR**: reasoning-trace generation + filtering as self-play.
- **R1-style**: verifiable reward (math, code) → RL self-improvement.

### Applications
1. Board games: AlphaGo, AlphaZero. (Stockfish is a search-based engine rather than a self-play RL system.)
2. RTS / MOBA: AlphaStar, OpenAI Five.
3. LLM reasoning: o1, R1, AlphaProof patterns.
4. Robotics policy learning: sim-to-real with self-play opponents.

## 💻 Patterns

### MCTS + neural net (AlphaZero-style)
```python
import math
import numpy as np

# Assumed environment helpers (not defined here): step(state, action) -> next state,
# terminal(state) -> bool, reward(state) -> outcome for the player to move at `state`.
# Assumed network: net(state) -> (priors over actions, scalar value for the player to move).
# Every action is treated as legal; mask the priors first if that does not hold.

class Node:
    def __init__(self, prior=0.0):
        self.children = {}
        self.N, self.W, self.P = 0, 0.0, prior  # visit count, total value, prior

    def Q(self):
        return self.W / max(self.N, 1)  # mean value; 0 for unvisited nodes


def select(node, c=1.5):
    """Pick the action that maximizes the PUCT score Q + U."""
    total = sum(child.N for child in node.children.values())
    best, best_score = None, -math.inf
    for a, child in node.children.items():
        u = c * child.P * math.sqrt(total) / (1 + child.N)
        score = child.Q() + u
        if score > best_score:
            best_score, best = score, a
    return best


def mcts(state, net, sims=800):
    root = Node()
    p, _ = net(state)
    for a, prior in enumerate(p):
        root.children[a] = Node(prior)
    for _ in range(sims):
        path, s = [root], state
        # Selection: descend until reaching a node with no children
        while path[-1].children:
            a = select(path[-1])
            path.append(path[-1].children[a])
            s = step(s, a)
        if not terminal(s):
            # Expansion + evaluation by the network
            p, v = net(s)
            for a, prior in enumerate(p):
                path[-1].children[a] = Node(prior)
        else:
            v = reward(s)
        # Backup: each node stores value from the viewpoint of the player who moved
        # into it, so flip the sign before every update, starting at the leaf.
        for n in reversed(path):
            v = -v
            n.N += 1
            n.W += v
    visits = np.array([child.N for child in root.children.values()])
    return visits / visits.sum()
```

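The selection rule computed in `select` can be written as a formula. Here $N(s,a)$, $Q(s,a) = W(s,a)/N(s,a)$ and $P(s,a)$ correspond to a child node's `N`, `Q()` and `P`, and $c$ is the exploration constant passed to `select`. This restates the code above; it is not quoted from a specific paper.

$$
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c \, P(s,a) \, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right]
$$

The second term shrinks as a child accumulates visits, so search effort shifts from high-prior moves toward moves with high measured value.
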
### Self-play data generation
```python
def play_game(net, mcts_sims=800, temp=1.0):
    state, history = init_state(), []  # init_state() is an assumed environment helper
    while not terminal(state):
        pi = mcts(state, net, mcts_sims)
        if temp > 0:
            # Sharpen or flatten the visit distribution, then sample from it
            w = pi ** (1 / temp)
            a = np.random.choice(len(pi), p=w / w.sum())
        else:
            a = int(pi.argmax())  # greedy play
        history.append((state, pi))
        state = step(state, a)
    z = reward(state)  # outcome for the player to move at the terminal state
    T = len(history)
    # The player to move at step i matches the terminal mover iff (T - i) is even
    return [(s, pi, z * (-1) ** (T - i)) for i, (s, pi) in enumerate(history)]
```

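The temperature step in `play_game` can be isolated to see what it does to the MCTS visit distribution. A small sketch; the helper name `apply_temperature` and the example numbers are illustrative assumptions.

```python
import numpy as np

def apply_temperature(pi, temp):
    """Turn MCTS visit frequencies into a move distribution at temperature temp."""
    pi = np.asarray(pi, dtype=float)
    if temp <= 0:
        # Limit temp -> 0: all mass on the most-visited move
        out = np.zeros_like(pi)
        out[pi.argmax()] = 1.0
        return out
    w = pi ** (1.0 / temp)
    return w / w.sum()

visits = np.array([0.6, 0.3, 0.1])
same = apply_temperature(visits, 1.0)    # unchanged
sharp = apply_temperature(visits, 0.5)   # squares the weights, favoring the top move
greedy = apply_temperature(visits, 0.0)  # one-hot on the argmax
```

A common pattern is to sample with `temp=1.0` early in a game for exploration and switch to greedy play later; the exact schedule is a tuning choice.
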
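### Training loop
```python
import random
import numpy as np
import torch

# Assumes `net` maps a batch of state tensors to (action probabilities, values),
# `optim` is a torch optimizer over net's parameters, and states are tensors.
buffer = []
for iteration in range(1000):
    for _ in range(100):
        buffer.extend(play_game(net))
    buffer = buffer[-500_000:]  # keep only the most recent samples
    for _ in range(1000):
        batch = random.sample(buffer, min(256, len(buffer)))
        s, pi, z = zip(*batch)
        pi = torch.as_tensor(np.stack(pi), dtype=torch.float32)
        z = torch.as_tensor(z, dtype=torch.float32)
        p_pred, v_pred = net(torch.stack(s))
        value_loss = ((v_pred.squeeze(-1) - z) ** 2).mean()
        policy_loss = -(pi * torch.log(p_pred + 1e-8)).sum(-1).mean()
        loss = value_loss + policy_loss
        optim.zero_grad()
        loss.backward()
        optim.step()
    if iteration % 10 == 0:
        save_checkpoint(net, iteration)  # assumed helper
```
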
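The slice-based ring buffer in the training loop can also be expressed with `collections.deque`, which drops the oldest samples automatically. A minimal sketch; the `ReplayBuffer` class is a hypothetical helper, and integers stand in for `(state, pi, z)` tuples.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of training samples."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # oldest items are evicted when full

    def extend(self, samples):
        self.data.extend(samples)

    def sample(self, batch_size, rng=random):
        # Sample distinct positions, then look the items up
        k = min(batch_size, len(self.data))
        return [self.data[i] for i in rng.sample(range(len(self.data)), k)]

    def __len__(self):
        return len(self.data)

buf = ReplayBuffer(capacity=5)
buf.extend(range(8))   # integers stand in for (state, pi, z) tuples
batch = buf.sample(3)
```

The buffer size trades freshness against diversity: too small and training overfits to the latest policy's games, too large and stale games from much weaker policies dominate.
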
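### League / PSRO opponent pool
```python
from collections import defaultdict
import numpy as np

class League:
    def __init__(self):
        self.agents = []
        # wins[learner][opp] and games[learner][opp]; agents must be hashable (e.g. string ids)
        self.wins = defaultdict(lambda: defaultdict(int))
        self.games = defaultdict(lambda: defaultdict(int))

    def add(self, agent):
        self.agents.append(agent)

    def record(self, learner, opponent, learner_won):
        self.games[learner][opponent] += 1
        self.wins[learner][opponent] += int(learner_won)

    def sample_opponent(self, learner, mode="pfsp"):
        # PFSP: prioritize opponents the learner rarely beats; unplayed pairs count as 0.5
        winrates = np.array([
            self.wins[learner][a] / self.games[learner][a] if self.games[learner][a] else 0.5
            for a in self.agents
        ])
        priorities = (1 - winrates) ** 2 + 1e-6  # epsilon avoids an all-zero distribution
        idx = np.random.choice(len(self.agents), p=priorities / priorities.sum())
        return self.agents[idx]
```
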
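The PFSP weighting itself can be factored out and compared across weighting functions. A sketch assuming two common shapes: a "hard" weighting `(1 - x) ** p` that targets opponents the learner loses to, and a "variance" weighting `x * (1 - x)` that targets evenly matched opponents. The function name, parameter names, and epsilon are illustrative assumptions.

```python
import numpy as np

def pfsp_probs(winrates, weighting="hard", p=2.0, eps=1e-6):
    """Map the learner's win rate against each opponent to sampling probabilities."""
    x = np.asarray(winrates, dtype=float)
    if weighting == "hard":
        w = (1.0 - x) ** p   # focus on opponents the learner loses to
    elif weighting == "variance":
        w = x * (1.0 - x)    # focus on opponents of similar strength
    else:
        raise ValueError(f"unknown weighting: {weighting}")
    w = w + eps              # keep every opponent reachable
    return w / w.sum()

probs = pfsp_probs([0.9, 0.5, 0.1])                      # hardest opponent gets most mass
even = pfsp_probs([0.9, 0.5, 0.1], weighting="variance")  # 50% opponent gets most mass
```
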
### LLM self-play (SPIN-style)
```python
import torch
import torch.nn.functional as F

# Player = current LLM (model_t), opponent = previous iterate (model_t_minus_1).
# .generate() and .logprob() are placeholder methods, not a real library API.
def spin_step(model_t, model_t_minus_1, prompts, gold_responses, beta=0.1):
    losses = []
    for p, gold in zip(prompts, gold_responses):
        with torch.no_grad():
            opponent_resp = model_t_minus_1.generate(p)
            ref_chosen = model_t_minus_1.logprob(p, gold)
            ref_rejected = model_t_minus_1.logprob(p, opponent_resp)
        # DPO-style: prefer gold over the opponent's response, measured as
        # log-ratios against the previous iterate
        chosen = model_t.logprob(p, gold) - ref_chosen
        rejected = model_t.logprob(p, opponent_resp) - ref_rejected
        losses.append(-F.logsigmoid(beta * (chosen - rejected)))
    return torch.stack(losses).mean()
```

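To make the DPO-style loss in `spin_step` concrete, here is the per-pair computation on plain floats. It follows the SPIN formulation in which both log-probabilities are measured as log-ratios against the previous iterate; the function name and the example log-probabilities are made up for illustration.

```python
import math

def spin_pair_loss(logp_gold, logp_opp, ref_logp_gold, ref_logp_opp, beta=0.1):
    """Logistic loss on the margin between the gold and opponent log-ratios."""
    margin = beta * ((logp_gold - ref_logp_gold) - (logp_opp - ref_logp_opp))
    # Numerically stable -log(sigmoid(margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At the start of an iteration the model equals the previous iterate: margin is 0
start = spin_pair_loss(-12.0, -11.0, -12.0, -11.0)
# After training, gold became more likely and the opponent's response less likely
later = spin_pair_loss(-10.0, -15.0, -12.0, -11.0)
```

The loss starts at `log(2)` and falls only as the model separates gold responses from its own earlier generations, which is the self-play signal.
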
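### Verifiable-reward self-improvement (R1-style)
```python
# verify_math() and update_policy_grpo() are assumed helpers; model.generate() is a placeholder.
# Despite the function name, the reward comes from a verifier, not from human feedback.
def rlhf_self_play(model, math_problems, group_size=16):
    groups = []
    for prob in math_problems:
        traces = [model.generate(prob, temperature=1.0) for _ in range(group_size)]
        rewards = [verify_math(t, prob.answer) for t in traces]  # 0/1 from the verifier
        groups.append((prob, traces, rewards))  # keep samples grouped per problem
    # GRPO: advantages are computed relative to each problem's own group of samples
    update_policy_grpo(model, groups)
```
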
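The "group-relative advantage" that GRPO relies on can be sketched in a few lines: rewards are standardized within the group of traces sampled for the same problem. The function name and the epsilon are illustrative assumptions; the policy update itself is out of scope here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one problem's group of sampled traces."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# 8 traces for one problem, 3 of them verified correct
adv = group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0])
# A group where every trace has the same reward carries no learning signal
flat = group_relative_advantages([1, 1, 1, 1])
```

Because the baseline is the group mean, no separate value network is needed, and problems that are always solved or never solved contribute zero advantage.
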
## Decision Criteria
| Situation | Approach |
|---|---|
| Perfect-information game | AlphaZero (MCTS + NN) |
| Imperfect information | PSRO / fictitious self-play |
| Ladder strategies (RTS) | League with exploiters (AlphaStar) |
| LLM reasoning (verifiable) | R1-style RL with a verifier |
| LLM general | SPIN / Self-Rewarding LM |

**Default**: perfect information → AlphaZero pattern; LLM reasoning with a verifier → GRPO/R1; general LLM → SPIN.

## 🔗 Graph
- Parent: [[Reinforcement Learning]]
- Variant: [[GRPO]]
- Application: [[DeepSeek R1]]

## 🤖 LLM Usage
**When to use**: a verifiable reward (math, code, theorem proving), a ground-truth game outcome, or a goal of superhuman capability.
**When not to use**: the reward is unverifiable (creative writing, opinion), or the task is single-agent with no opponent.

## ❌ Anti-patterns
- **No opponent diversity**: training collapses to a single strategy; a league is required.
- **Static opponent**: the learner overfits to a fixed pattern.
- **No verification (LLM)**: invites reward hacking; SPIN alone is weak.
- **Catastrophic forgetting**: ignoring the past-snapshot pool leads to strategy cycles.

## 🧪 Verification / Duplicates
- Verified (Silver et al. AlphaZero 2017, AlphaStar Nature 2019, SPIN ICML 2024, AlphaProof DeepMind 2024).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added AlphaZero, league, LLM self-play (SPIN, R1) |