Antigravity Agent 504fd5fb42 [G1-Sync] Manual knowledge update

2026-05-10 22:08:15 +09:00


Self-Play (reinforcement learning by playing against oneself)

In one line

"An agent plays against itself, which generates a curriculum automatically and removes the dependence on human data." Lineage: TD-Gammon (1992) → AlphaGo Zero (2017) → AlphaZero/MuZero → OpenAI Five → 2024-2026 LLM reasoning self-play (Self-Rewarding LM, SPIN, AlphaProof). Key point: the opponent is the agent itself, so difficulty is matched automatically.

Key ideas

Why self-play?

The ground-truth reward (winning the game) is cheap to compute.
The curriculum is emergent: as the agent gets stronger, so does its opponent.
Data efficient: a single game tree yields many trajectories.
Superhuman play is achievable: performance does not depend on a human ceiling.

Core algorithms

AlphaZero: MCTS + neural net (policy + value).
MuZero: learned dynamics; model-based, with no game rules needed.
PSRO (Policy-Space Response Oracles): a league of policies.
Fictitious self-play: respond to the average opponent over history.
AlphaStar (StarCraft): a league with main agents + exploiters.
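
The "average opponent over history" idea behind fictitious self-play can be illustrated on rock-paper-scissors. The following is a minimal sketch of classical fictitious play, not code from any of the systems listed above; `PAYOFF`, `best_response`, and `fictitious_play` are names chosen here for illustration. Each player best-responds to the empirical action counts of the other, and the empirical frequencies drift toward the uniform equilibrium.

```python
# Classical fictitious play on rock-paper-scissors (illustrative sketch).
# PAYOFF[a][b] is the payoff to the player choosing a against b.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock, paper, scissors
    [1, 0, -1],   # paper
    [-1, 1, 0],   # scissors
]

def best_response(opponent_counts):
    """Action with the highest expected payoff against the opponent's empirical mix."""
    expected = [
        sum(PAYOFF[a][b] * opponent_counts[b] for b in range(3))
        for a in range(3)
    ]
    return max(range(3), key=lambda a: expected[a])

def fictitious_play(rounds=10000):
    # Seed each history with one (arbitrary) rock so the counts are non-empty.
    counts = [[1, 0, 0], [1, 0, 0]]
    for _ in range(rounds):
        a0 = best_response(counts[1])  # player 0 responds to player 1's history
        a1 = best_response(counts[0])  # and vice versa
        counts[0][a0] += 1
        counts[1][a1] += 1
    total = sum(counts[0])
    return [c / total for c in counts[0]]

freqs = fictitious_play()  # empirical frequencies of rock, paper, scissors
```

The actual play keeps cycling through rock, paper, and scissors in ever longer runs; it is the time-averaged strategy that approaches the equilibrium, which is why these methods train against the average rather than the latest opponent.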

Stability tricks

Past-self snapshots: build the opponent pool from checkpoints.
League diversity: main + counter + exploiter agents.
PFSP (prioritized fictitious self-play): hard opponents get sampling priority.
Latency-style: opponent strength is kept under control.
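
The past-self snapshot trick can be sketched as a bounded pool of frozen checkpoints. This is an illustrative sketch under stated assumptions, not a reference implementation: `SnapshotPool`, `max_size`, and `latest_prob` are hypothetical names, and the rule "play the latest snapshot half the time, otherwise a uniformly chosen older one" is just one plausible mixing scheme.

```python
import random
from collections import deque

class SnapshotPool:
    """Bounded pool of frozen past policies to sample opponents from."""

    def __init__(self, max_size=20, latest_prob=0.5, seed=None):
        self.snapshots = deque(maxlen=max_size)  # oldest entries fall off the left
        self.latest_prob = latest_prob
        self.rng = random.Random(seed)

    def add(self, policy):
        self.snapshots.append(policy)

    def sample(self):
        if not self.snapshots:
            raise ValueError("pool is empty; add a snapshot first")
        if len(self.snapshots) == 1 or self.rng.random() < self.latest_prob:
            return self.snapshots[-1]
        # Otherwise pick uniformly among the older snapshots.
        return self.rng.choice(list(self.snapshots)[:-1])

pool = SnapshotPool(max_size=3, latest_prob=0.5, seed=0)
for step_id in range(5):
    pool.add(f"ckpt-{step_id}")  # strings stand in for saved networks
opponents = [pool.sample() for _ in range(1000)]
```

Keeping older snapshots in rotation is what guards against the learner forgetting how to beat strategies it has already seen.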

LLM reasoning self-play (2024-2026)

Self-Rewarding LM (Yuan 2024): LLM-as-judge + DPO.
SPIN (Chen 2024): self-play between the current and the prior model.
AlphaProof (DeepMind 2024): IMO silver-medal level; self-play with a Lean prover.
STaR / V-STaR: reasoning-trace generation + filtering as self-play.
R1-style: verifiable reward (math, code) → RL self-improvement.
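
The "generate traces, then filter" step shared by the STaR-style methods can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: `toy_generate` replaces a real LLM call, and `collect_filtered_traces` is a hypothetical helper name. The point is only the filter: a trace is kept when its final answer matches the known answer.

```python
import random

def toy_generate(question, rng):
    """Stand-in for an LLM: returns (reasoning_trace, final_answer), sometimes wrong."""
    a, b = question
    answer = a + b + rng.choice([0, 0, 1, -1])  # right about half the time
    return f"{a} + {b} = {answer}", answer

def collect_filtered_traces(questions, gold_answers, samples_per_question=8, seed=0):
    """Sample several traces per question and keep those whose answer matches gold."""
    rng = random.Random(seed)
    kept = []
    for q, gold in zip(questions, gold_answers):
        for _ in range(samples_per_question):
            trace, answer = toy_generate(q, rng)
            if answer == gold:  # the filter: only verified traces survive
                kept.append((q, trace))
    return kept  # in a STaR-style loop these become fine-tuning data for the next round

questions = [(2, 3), (10, 7), (4, 4)]
gold = [5, 17, 8]
dataset = collect_filtered_traces(questions, gold)
```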

Applications

Board games: AlphaGo, AlphaZero, Stockfish.
RTS / MOBA: AlphaStar, OpenAI Five.
LLM reasoning: o1, R1, AlphaProof patterns.
Robotics policy learning: sim-to-real with self-play opponents.

💻 Patterns

MCTS + neural net (AlphaZero-style)

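import math
import numpy as np

# step(s, a), terminal(s) and reward(s) are environment helpers defined elsewhere.
# Assumed convention: net(s) returns (policy, value) and reward(s) returns the
# outcome, both from the perspective of the player to move at s.

class Node:
    def __init__(self, prior=0.0):
        self.children = {}  # action -> Node
        self.N, self.W, self.P = 0, 0.0, prior  # visit count, total value, prior
    def Q(self):
        return self.W / max(self.N, 1)

def select(node, c=1.5):
    # PUCT: choose the action that maximizes Q + U.
    total = sum(child.N for child in node.children.values())
    best, best_score = None, -math.inf
    for a, child in node.children.items():
        u = c * child.P * math.sqrt(total) / (1 + child.N)
        score = child.Q() + u
        if score > best_score:
            best_score, best = score, a
    return best

def mcts(state, net, sims=800):
    root = Node()
    p, _ = net(state)
    for a, prior in enumerate(p):  # assumes every action index is legal
        root.children[a] = Node(prior)
    for _ in range(sims):
        # Selection: descend until reaching a node with no children.
        path, s = [root], state
        while path[-1].children:
            a = select(path[-1])
            path.append(path[-1].children[a])
            s = step(s, a)
        # Expansion and evaluation.
        if not terminal(s):
            p, v = net(s)
            for a, prior in enumerate(p):
                path[-1].children[a] = Node(prior)
        else:
            v = reward(s)
        # Backup: a node's W is read by its parent in select(), so it must hold
        # the value for the player who moved into that node; flip the sign first.
        for n in reversed(path):
            v = -v
            n.N += 1
            n.W += v
    # Return the normalized root visit counts as the search policy.
    visits = np.array([child.N for child in root.children.values()])
    return visits / visits.sum()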
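
Written out, the score computed in `select` is the PUCT rule used in AlphaZero-style search. Here $N$, $W$, and $P$ correspond to the `Node` fields, and $c$ is the exploration constant (the `c` argument of `select`):

```latex
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c \, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right],
\qquad
Q(s,a) = \frac{W(s,a)}{\max\bigl(N(s,a),\, 1\bigr)}
```

The second term is large for moves with a high prior and few visits, and shrinks as a move is explored, so the search gradually shifts from following the network's prior to following the measured values.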

Self-play data generation

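def play_game(net, mcts_sims=800, temp=1.0):
    # init_state, terminal, step and reward are environment helpers defined elsewhere.
    state, history = init_state(), []
    while not terminal(state):
        pi = mcts(state, net, mcts_sims)
        if temp > 0:
            # Sample a move from the temperature-adjusted visit distribution.
            probs = pi ** (1 / temp)
            a = np.random.choice(len(pi), p=probs / probs.sum())
        else:
            a = int(pi.argmax())  # greedy: most-visited move
        history.append((state, pi))
        state = step(state, a)
    # Assumed: reward(state) is the outcome for the player to move at the terminal state.
    z = reward(state)
    T = len(history)
    # Position i shares that player's perspective when T - i is even.
    return [(s, pi, z * (-1) ** (T - i)) for i, (s, pi) in enumerate(history)]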
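
The effect of the `temp` parameter is easier to see on concrete numbers. The sketch below isolates the re-weighting step in plain Python; `apply_temperature` is a hypothetical helper name introduced here, not part of the code above.

```python
def apply_temperature(visit_probs, temp):
    """Re-weight MCTS visit probabilities: p_i ** (1 / temp), renormalized."""
    if temp <= 0:
        # Greedy limit: all mass on the most-visited move.
        best = max(range(len(visit_probs)), key=lambda i: visit_probs[i])
        return [1.0 if i == best else 0.0 for i in range(len(visit_probs))]
    powered = [p ** (1.0 / temp) for p in visit_probs]
    total = sum(powered)
    return [p / total for p in powered]

pi = [0.5, 0.3, 0.2]
same = apply_temperature(pi, 1.0)    # temp = 1 leaves the distribution unchanged
sharp = apply_temperature(pi, 0.5)   # squares each probability, then renormalizes
greedy = apply_temperature(pi, 0.0)  # deterministic argmax
```

Lower temperatures concentrate probability on the most-visited move, trading exploration for playing strength.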

Training loop

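import random
import numpy as np
import torch

# net (returns policy probabilities and values for a batch), optim and
# save_checkpoint are assumed to be defined elsewhere; states are assumed to be tensors.
buffer = []
for iteration in range(1000):
    for _ in range(100):
        buffer.extend(play_game(net))
    buffer = buffer[-500_000:]  # keep only the most recent positions
    for _ in range(1000):
        batch = random.sample(buffer, min(256, len(buffer)))
        s, pi, z = zip(*batch)
        pi = torch.as_tensor(np.stack(pi), dtype=torch.float32)
        z = torch.as_tensor(z, dtype=torch.float32)
        p_pred, v_pred = net(torch.stack(s))
        value_loss = ((v_pred.reshape(-1) - z) ** 2).mean()
        policy_loss = -(pi * torch.log(p_pred + 1e-8)).sum(-1).mean()
        loss = value_loss + policy_loss
        optim.zero_grad()
        loss.backward()
        optim.step()
    if iteration % 10 == 0:
        save_checkpoint(net, iteration)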
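
The `loss` line combines a value regression term and a policy cross-entropy term. Averaged over a batch of $B$ positions $(s_i, \boldsymbol{\pi}_i, z_i)$, with $\mathbf{p}_\theta$ and $v_\theta$ denoting the two network heads, it reads:

```latex
\mathcal{L}(\theta) = \frac{1}{B}\sum_{i=1}^{B}
\Bigl[\, \bigl(v_\theta(s_i) - z_i\bigr)^{2} \;-\; \boldsymbol{\pi}_i^{\top} \log \mathbf{p}_\theta(s_i) \,\Bigr]
```

The AlphaZero paper additionally includes an L2 weight penalty $c\,\lVert\theta\rVert^{2}$; the loop above omits it, and one way to add it would be through the optimizer's weight decay setting.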

League / PSRO opponent pool

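from collections import defaultdict
import numpy as np

class League:
    def __init__(self):
        self.agents = []
        # wins[learner][opponent] and games[learner][opponent]; agents must be hashable.
        self.wins = defaultdict(lambda: defaultdict(int))
        self.games = defaultdict(lambda: defaultdict(int))
    def add(self, agent):
        self.agents.append(agent)
    def record(self, learner, opponent, won):
        self.games[learner][opponent] += 1
        self.wins[learner][opponent] += int(won)
    def sample_opponent(self, learner, mode="pfsp"):
        if mode != "pfsp":  # fall back to uniform sampling
            return self.agents[np.random.randint(len(self.agents))]
        # Learner's win rate against each opponent; unplayed opponents count as 0.5.
        winrates = np.array([
            self.wins[learner][a] / self.games[learner][a] if self.games[learner][a] else 0.5
            for a in self.agents])
        # Prioritize hard opponents; the small constant avoids an all-zero vector.
        priorities = (1 - winrates) ** 2 + 1e-6
        idx = np.random.choice(len(self.agents), p=priorities / priorities.sum())
        return self.agents[idx]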
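
To make the `(1 - w) ** 2` weighting in `sample_opponent` concrete, the sketch below computes the sampling probabilities for three opponents the learner beats 90%, 50%, and 10% of the time. `pfsp_probabilities` is a hypothetical standalone helper; the `eps` term is an added assumption so that an opponent beaten every time still has a small chance of being drawn.

```python
def pfsp_probabilities(win_rates, power=2.0, eps=1e-6):
    """Sampling probabilities proportional to (1 - win_rate) ** power."""
    weights = [(1.0 - w) ** power + eps for w in win_rates]
    total = sum(weights)
    return [w / total for w in weights]

# Learner's win rate against three opponents: easy, even, hard.
probs = pfsp_probabilities([0.9, 0.5, 0.1])
```

With these numbers roughly three quarters of the games go to the hardest opponent and under one percent to the easiest, which is the intended "hard opponents get priority" behavior.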

LLM self-play (SPIN-style)

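import torch
import torch.nn.functional as F

# player = current LLM (model_t), opponent = previous LLM (model_t_minus_1).
# .generate(prompt) and .logprob(prompt, response) are assumed model interfaces;
# logprob is assumed to return a scalar tensor.
def spin_step(model_t, model_t_minus_1, prompts, gold_responses, beta=0.1):
    losses = []
    for p, gold in zip(prompts, gold_responses):
        with torch.no_grad():
            opponent_resp = model_t_minus_1.generate(p)
            ref_chosen = model_t_minus_1.logprob(p, gold)
            ref_rejected = model_t_minus_1.logprob(p, opponent_resp)
        # DPO-style: prefer the gold response over the opponent's, with both
        # log-probabilities measured relative to the previous model.
        chosen = model_t.logprob(p, gold) - ref_chosen
        rejected = model_t.logprob(p, opponent_resp) - ref_rejected
        losses.append(-F.logsigmoid(beta * (chosen - rejected)))
    return torch.stack(losses).mean()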
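
The logistic preference loss at the heart of this step can be checked on plain numbers. In the SPIN paper the two quantities being compared are log-ratios against the previous model, that is, log p_current(y|x) minus log p_previous(y|x) for the gold and for the self-generated response. The sketch below assumes that form; `logistic_preference_loss` is a hypothetical helper name, and `beta=0.1` mirrors the constant used in the code above.

```python
import math

def logistic_preference_loss(chosen_margin, rejected_margin, beta=0.1):
    """-log(sigmoid(beta * (chosen_margin - rejected_margin))), computed stably.

    Each margin is log p_current(y|x) - log p_previous(y|x) for that response.
    """
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow for large |x|.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

at_start = logistic_preference_loss(0.0, 0.0)   # current model equals previous model
improved = logistic_preference_loss(5.0, -5.0)  # gold up-weighted, own sample down-weighted
```

At the start of an iteration both margins are zero and the loss equals log 2; it falls as the current model raises the likelihood of gold responses and lowers that of its predecessor's samples.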

Verifiable-reward self-improvement (R1-style)

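# verify_math and update_policy_grpo are assumed helpers defined elsewhere.
def rlhf_self_play(model, math_problems, group_size=16):
    groups = []
    for prob in math_problems:
        # Sample a group of reasoning traces for the same problem.
        traces = [model.generate(prob, temperature=1.0) for _ in range(group_size)]
        rewards = [verify_math(t, prob.answer) for t in traces]  # 0/1 from the verifier
        # Keep each problem's samples together: GRPO normalizes rewards within a group.
        groups.append((prob, traces, rewards))
    # GRPO: group-relative advantage
    update_policy_grpo(model, groups)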
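
The group-relative advantage mentioned in the comment can be sketched in a few lines. This assumes the commonly described GRPO normalization, where each sample's reward is standardized against the other samples drawn for the same prompt; `group_relative_advantages` is a hypothetical helper name, and the use of the population standard deviation is an implementation choice made here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

mixed = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0])  # 2 of 8 samples correct
all_wrong = group_relative_advantages([0, 0, 0, 0])          # no learning signal
```

Correct samples in a mostly wrong group get a large positive advantage, while a group where every sample receives the same reward contributes nothing, which is why flattening the samples across problems would lose information.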

Decision criteria

Situation	Approach
Perfect info game	AlphaZero (MCTS + NN)
Imperfect info	PSRO / fictitious self-play
Ladder strategies (RTS)	League with exploiters (AlphaStar)
LLM reasoning (verifiable)	R1-style RL with verifier
LLM general	SPIN / Self-Rewarding LM

Defaults: perfect information → AlphaZero pattern; LLM reasoning with a verifier → GRPO/R1; general LLM → SPIN.

🔗 Graph

Parents: Reinforcement Learning · Multi-Agent RL
Variants: AlphaZero · MuZero · PSRO · SPIN · GRPO
Applications: AlphaGo · AlphaStar · OpenAI Five · AlphaProof · DeepSeek R1
Adjacent: Monte Carlo Tree Search · Game Theory · Curriculum Learning

🤖 LLM usage

When to use: the reward is verifiable (math, code, theorem proving), there is a ground-truth game outcome, or the goal is superhuman capability.
When not to use: the reward is unverifiable (creative writing, opinion), or the task is single-agent with no opponent.

❌ Anti-patterns

No opponent diversity: training collapses to a single strategy; a league is required.
Static opponent: the learner overfits to a fixed pattern.
No verification (LLM): invites reward hacking; SPIN alone is weak.
Catastrophic forgetting: ignoring the past-snapshot pool leads to strategy cycles.

🧪 Verification / duplicates

Verified (Silver et al. AlphaZero 2017, AlphaStar Nature 2019, SPIN ICML 2024, AlphaProof DeepMind 2024).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: added AlphaZero, league, LLM self-play (SPIN, R1)
