---
id: wiki-2026-0508-self-play-자기-대결-기반-강화학습
title: Self-Play (Self-Play-Based Reinforcement Learning)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [Self-Play RL, AlphaZero Style, Self-Play Reasoning]
duplicate_of: none
source_trust_level: A
confidence_score: 0.9
verification_status: applied
tags: [reinforcement-learning, self-play, alphazero, mcts, llm-reasoning]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: python
  framework: pytorch
---

# Self-Play (Self-Play-Based Reinforcement Learning)

## In One Line
> **"An agent plays against itself, which generates a curriculum automatically and removes the dependence on human data."** Lineage: TD-Gammon (1992) → AlphaGo Zero (2017) → AlphaZero/MuZero → OpenAI Five → LLM reasoning self-play in 2024-2026 (Self-Rewarding LM, SPIN, AlphaProof). The key idea: the opponent is the agent itself, so difficulty tracks the agent's own strength.

## Core

### Why self-play?
- Ground-truth reward (a game win) is cheap to obtain.
- The curriculum is emergent: as the agent gets stronger, so does its opponent.
- Data efficient: a single game tree yields many trajectories.
- Superhuman play is achievable: there is no dependence on a human performance ceiling.

### Core algorithms
- **AlphaZero**: MCTS guided by a neural net with policy and value heads.
- **MuZero**: learned dynamics; model-based, so the search needs no game rules.
- **PSRO** (Policy-Space Response Oracles): maintains a league of policies.
- **Fictitious self-play**: best-responds to the average opponent over history.
- **AlphaStar (StarCraft)**: a league of main agents plus exploiters.

### Stability tricks
- Past-self snapshots: build the opponent pool from earlier checkpoints.
- League diversity: main, counter, and exploiter agents.
- PFSP (prioritized fictitious self-play): hard opponents are sampled with higher priority.
- Latency-style: opponent strength is controlled.

### LLM reasoning self-play (2024-2026)
- **Self-Rewarding LM** (Yuan 2024): LLM-as-judge plus DPO.
- **SPIN** (Chen 2024): self-play between the current model and its previous iteration.
- **AlphaProof** (DeepMind 2024): IMO silver-medal level; self-play-style RL with a Lean prover.
- **STaR / V-STaR**: reasoning-trace generation plus filtering, used as a self-play-style improvement loop.
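- **R1-style**: verifiable reward (math, code) → RL self-improvement.

### Applications
1. Board games: AlphaGo, AlphaZero, Leela Chess Zero.
2. RTS / MOBA: AlphaStar, OpenAI Five.
3. LLM reasoning: o1, R1, AlphaProof patterns.
4. Robotics policy learning: sim-to-real with self-play opponents.

## 💻 Patterns

### MCTS + neural net (AlphaZero-style)
```python
import math
import numpy as np

# Assumed two-player game API (not defined here): init_state(), step(s, a),
# terminal(s), and reward(s), which returns the outcome from the viewpoint
# of the player to move at s. net(s) is assumed to return
# (prior over actions, scalar value for the player to move at s).

class Node:
    def __init__(self, prior=0.0):
        self.children = {}
        self.N, self.W, self.P = 0, 0.0, prior

    def Q(self):
        # Mean value from the viewpoint of the player who moved into this node.
        return self.W / max(self.N, 1)

def select(node, c=1.5):
    # PUCT: exploit Q, explore in proportion to the prior and parent visits.
    total = sum(child.N for child in node.children.values())
    best, best_score = None, -math.inf
    for a, child in node.children.items():
        # +1 keeps the prior term non-zero before any child has been visited.
        u = c * child.P * math.sqrt(total + 1) / (1 + child.N)
        score = child.Q() + u
        if score > best_score:
            best_score, best = score, a
    return best

def mcts(state, net, sims=800):
    root = Node()
    p, _ = net(state)
    for a, prior in enumerate(p):
        root.children[a] = Node(prior)
    for _ in range(sims):
        path, s = [root], state
        while path[-1].children:
            a = select(path[-1])
            path.append(path[-1].children[a])
            s = step(s, a)
        if not terminal(s):
            p, v = net(s)
            for a, prior in enumerate(p):
                path[-1].children[a] = Node(prior)
        else:
            v = reward(s)
        # v is for the player to move at the leaf; each node stores the value
        # for the player who moved into it, so flip the sign before backing up.
        v = -v
        for n in reversed(path):
            n.N += 1; n.W += v; v = -v
    visits = np.array([child.N for child in root.children.values()])
    return visits / visits.sum()
```

### Self-play data generation
```python
def play_game(net, mcts_sims=800, temp=1.0):
    state, history = init_state(), []
    while not terminal(state):
        pi = mcts(state, net, mcts_sims)
        if temp > 0:
            probs = pi ** (1 / temp)
            a = np.random.choice(len(pi), p=probs / probs.sum())
        else:
            a = int(pi.argmax())
        history.append((state, pi))
        state = step(state, a)
    z = reward(state)  # viewpoint of the player to move at the final state
    T = len(history)
    # The player to move at step i is the final player to move iff (T - i) is even.
    return [(s, pi, z * (-1) ** (T - i)) for i, (s, pi) in enumerate(history)]
```

### Training loop
```python
import random
import numpy as np
import torch

# net, optim, and save_checkpoint are assumed to be defined elsewhere.
buffer = []
for iteration in range(1000):
    games = [play_game(net) for _ in range(100)]
    for game in games:
        buffer.extend(game)
    buffer = buffer[-500_000:]  # keep only the most recent samples
    for _ in range(1000):
        batch = random.sample(buffer, min(256, len(buffer)))
        s, pi, z = zip(*batch)
        # Assumes states are tensors and net returns (move probabilities, value).
        pi = torch.as_tensor(np.stack(pi), dtype=torch.float32)
        z = torch.as_tensor(z, dtype=torch.float32)
        p_pred, v_pred = net(torch.stack(s))
        value_loss = ((v_pred.reshape(-1) - z) ** 2).mean()
        policy_loss = -(pi * torch.log(p_pred + 1e-8)).sum(-1).mean()
        loss = value_loss + policy_loss
        optim.zero_grad(); loss.backward(); optim.step()
    if iteration % 10 == 0:
        save_checkpoint(net, iteration)
```

### League / PSRO opponent pool
```python
from collections import defaultdict
import numpy as np

class League:
    def __init__(self):
        self.agents = []
        # wins[learner][opponent] and games[learner][opponent]; agents must be hashable.
        self.wins = defaultdict(lambda: defaultdict(int))
        self.games = defaultdict(lambda: defaultdict(int))

    def add(self, agent):
        self.agents.append(agent)

    def record(self, learner, opponent, learner_won):
        self.games[learner][opponent] += 1
        self.wins[learner][opponent] += int(learner_won)

    def sample_opponent(self, learner):
        # PFSP: weight each opponent by (1 - win rate)^2 so hard opponents come up more often.
        winrates = np.array([self.wins[learner][a] / max(1, self.games[learner][a])
                             for a in self.agents])
        priorities = (1 - winrates) ** 2
        if priorities.sum() == 0:  # learner beats everyone: fall back to uniform
            priorities = np.ones(len(self.agents))
        idx = np.random.choice(len(self.agents), p=priorities / priorities.sum())
        return self.agents[idx]
```

### LLM self-play (SPIN-style)
```python
import torch
import torch.nn.functional as F

# player_t = current LLM, opponent_{t-1} = previous LLM.
# generate() and logprob() are assumed wrapper methods; logprob returns a scalar tensor.
def spin_step(model_t, model_t_minus_1, prompts, gold_responses, beta=0.1):
    losses = []
    for p, gold in zip(prompts, gold_responses):
        with torch.no_grad():
            opponent_resp = model_t_minus_1.generate(p)
            # The previous model also serves as the DPO-style reference.
            ref_chosen = model_t_minus_1.logprob(p, gold)
            ref_rejected = model_t_minus_1.logprob(p, opponent_resp)
        # DPO-style: prefer the gold response over the opponent's response.
        chosen_logp = model_t.logprob(p, gold)
        rejected_logp = model_t.logprob(p, opponent_resp)
        margin = (chosen_logp - ref_chosen) - (rejected_logp - ref_rejected)
        losses.append(-F.logsigmoid(beta * margin))
    return torch.stack(losses).mean()
```

### Verifiable-reward self-improvement (R1-style)
```python
# verify_math and update_policy_grpo are assumed helpers defined elsewhere.
def rlvr_self_play(model, math_problems, group_size=16):
    groups = []
    for prob in math_problems:
        traces = [model.generate(prob, temperature=1.0) for _ in range(group_size)]
        rewards = [verify_math(t, prob.answer) for t in traces]  # 0/1
        # Keep each problem's samples together: GRPO needs the group intact.
        groups.append((prob, traces, rewards))
    # GRPO: advantages are computed relative to each problem's own group.
    update_policy_grpo(model, groups)
```

## Decision Criteria

| Situation | Approach |
|---|---|
| Perfect-information game | AlphaZero (MCTS + NN) |
| Imperfect information | PSRO / fictitious self-play |
| Ladder strategies (RTS) | League with exploiters (AlphaStar) |
| LLM reasoning (verifiable) | R1-style RL with a verifier |
| LLM general | SPIN / Self-Rewarding LM |

**Default**: perfect information →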
AlphaZero pattern, LLM reasoning with a verifier → GRPO/R1, general LLM → SPIN.

## 🔗 Graph
- Parents: [[Reinforcement Learning]] · [[Multi-Agent RL]]
- Variants: [[AlphaZero]] · [[MuZero]] · [[PSRO]] · [[SPIN]] · [[GRPO]]
- Applications: [[AlphaGo]] · [[AlphaStar]] · [[OpenAI Five]] · [[AlphaProof]] · [[DeepSeek R1]]
- Adjacent: [[Monte Carlo Tree Search]] · [[Game Theory]] · [[Curriculum Learning]]

## 🤖 LLM Usage
**When**: the reward is verifiable (math, code, theorem proving), there is a ground-truth game outcome, or the goal is superhuman capability.
**When not**: the reward is unverifiable (creative writing, opinion), or the task is single-agent with no opponent.

## ❌ Anti-patterns
- **No opponent diversity**: training collapses to a single strategy; a league is required.
- **Static opponent**: the learner overfits to a fixed pattern.
- **No verification (LLM)**: invites reward hacking; SPIN alone is weak.
- **Catastrophic forgetting**: ignoring the past-snapshot pool leads to strategy cycles.

## 🧪 Verification / Duplicates
- Verified against Silver et al. (AlphaZero, 2017), AlphaStar (Nature, 2019), SPIN (ICML 2024), and AlphaProof (DeepMind, 2024).
- Trust level: A.

## 🕓 Changelog
| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: added AlphaZero, league, LLM self-play (SPIN, R1) |
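
The R1-style pattern above mentions a GRPO group-relative advantage without showing it. A minimal sketch follows, assuming each prompt's rewards are normalized against the mean and standard deviation of its own sampled group. `group_relative_advantages` and `eps` are hypothetical names for illustration, not part of any library.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one prompt's rewards against its own sampled group.

    Hypothetical helper: rewards is the list of 0/1 verifier scores
    for the traces sampled from a single prompt.
    """
    r = np.asarray(rewards, dtype=np.float64)
    # eps avoids division by zero when every trace gets the same reward.
    return (r - r.mean()) / (r.std() + eps)

# Two correct and two incorrect traces for the same prompt.
adv = group_relative_advantages([1, 0, 0, 1])
```

When every trace in a group gets the same reward, all advantages are zero, so that prompt contributes no gradient signal. This is one reason the sampling temperature and group size matter in this setup.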