
Antigravity Agent f8b21af4be Wiki cleanup: error-doc removal, dedup merge, link normalization

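Large-scale cleanup of 10_Wiki/Topics:
- Removed 227 error-capture / unfinished stub documents
- Merged 43 cross-folder duplicate clusters (63 files → redirect)
- Link-name normalization: fixed broken links, pointed redirects directly at their targets, and mapped concepts (~2,400 items)
- Created 6 new category MOCs
- Removed 10,058 unresolved related-keyword links from Graph sections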

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-20 23:52:15 +09:00

7.4 KiB


id, title, category, status, canonical_id, aliases, duplicate_of, source_trust_level, confidence_score, verification_status, tags, raw_sources, last_reinforced, github_commit, tech_stack

title

GRPO (Group Relative Policy Optimization)

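One-line summary

"A critic-free variant of PPO that uses the mean reward of a group of sampled responses as the baseline." Introduced by DeepSeek (2024-2025). It enabled the reasoning training behind R1. A learned reward model is not required when the reward is verifiable (a rule-based reward is sufficient). Popular in modern RLHF and reasoning training.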

Core

vs PPO

PPO: needs a critic (value network) to estimate the baseline.
GRPO: uses the mean reward of the group's samples as the baseline.
Result: simpler (no value network to train) and strong on reasoning tasks.

Algorithm

For each prompt, sample G rollouts (different responses).
Score each rollout with the reward function.
advantage = (reward - group_mean) / group_std.
Optimize a PPO-style clipped objective using these advantages.
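The steps above can be traced on a single toy group. This is a minimal sketch using only the standard library; the reward values are made up, and the function names `group_advantages` and `clipped_objective` are illustrative, not taken from any library. It uses the sample standard deviation, which matches the default of `torch.Tensor.std()`.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    # A single-rollout group has no spread, so treat its std as 0
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# One prompt, G = 4 rollouts, binary correctness rewards (made-up values)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Correct rollouts get a positive advantage, incorrect ones a negative one,
# and the advantages sum to roughly zero within the group.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Clipping only limits how much a sample can gain: with a positive advantage the ratio is capped at `1 + clip_eps`, while a ratio that moves the wrong way on a negative-advantage sample is still penalized in full.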

Notable work

DeepSeek-Math (2024).
DeepSeek-R1 (2025): reasoning behavior emerged from RL training.

Applications

Math reasoning.
Code generation.
Tool use.
Long CoT.

💻 Patterns

Basic GRPO loop

import torch

def grpo_step(policy, ref_policy, optimizer, prompts, reward_fn,
              group_size=8, beta=0.04, eps=0.2, ppo_epochs=4):
    """One GRPO update.

    Assumes policy.generate(prompt, do_sample=True) returns a response and
    policy.log_prob(prompt, response) returns a log-prob tensor.
    """
    samples = []  # (prompt, response, advantage, old log-prob, reference log-prob)

    for prompt in prompts:
        # Sample G rollouts for this prompt and score each one
        rollouts = [policy.generate(prompt, do_sample=True) for _ in range(group_size)]
        rewards = torch.tensor([reward_fn(prompt, r) for r in rollouts], dtype=torch.float32)

        # Group baseline: normalize rewards within the group (needs group_size >= 2)
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Cache log-probs under the sampling policy and the frozen reference
        with torch.no_grad():
            for resp, a in zip(rollouts, adv):
                samples.append((prompt, resp, a,
                                policy.log_prob(prompt, resp),
                                ref_policy.log_prob(prompt, resp)))

    # PPO-style clipped update
    for _ in range(ppo_epochs):
        for prompt, resp, adv, log_old, log_ref in samples:
            log_new = policy.log_prob(prompt, resp)
            ratio = (log_new - log_old).exp()

            obj1 = ratio * adv
            obj2 = ratio.clamp(1 - eps, 1 + eps) * adv
            policy_loss = -torch.min(obj1, obj2).mean()

            # KL penalty against the reference policy, using the non-negative
            # estimator exp(d) - d - 1 with d = log_ref - log_new
            log_ratio_ref = log_ref - log_new
            kl = log_ratio_ref.exp() - log_ratio_ref - 1
            loss = policy_loss + beta * kl.mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Rule-based reward (math)

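def math_reward(prompt, response):
    """DeepSeek-style rule-based reward: extract the final answer and verify it.

    Assumes prompt is a dict with a 'solution' field, and that extract_answer
    and has_required_format are helpers defined elsewhere.
    """
    answer = extract_answer(response)
    expected = extract_answer(prompt['solution'])

    # A failed extraction must never count as a match
    correctness = 1.0 if answer is not None and answer == expected else 0.0
    format_bonus = 0.1 if has_required_format(response) else 0.0

    return correctness + format_bonus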
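The reward above relies on two helpers that the note does not define. One possible sketch follows, assuming answers are written either as `\boxed{...}` or inside `<answer>` tags; both helper bodies are hypothetical, and the regex deliberately ignores nested braces such as `\boxed{\frac{1}{2}}`.

```python
import re

# Hypothetical answer conventions: \boxed{...} or <answer>...</answer>
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(text):
    """Return the last boxed or tagged answer with spaces removed, or None."""
    matches = BOXED.findall(text) or ANSWER_TAG.findall(text)
    if not matches:
        return None
    return matches[-1].strip().replace(" ", "")

def has_required_format(text):
    """True if the response contains exactly one boxed final answer."""
    return len(BOXED.findall(text)) == 1
```

Returning `None` on a failed extraction lets the caller distinguish "no answer found" from a wrong answer, which matters because two failed extractions would otherwise compare equal.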

TRL implementation

from trl import GRPOTrainer, GRPOConfig

# correctness_reward, format_reward, and ds (the training dataset) are
# assumed to be defined elsewhere.
trainer = GRPOTrainer(
    model='Qwen/Qwen2.5-7B',
    reward_funcs=[correctness_reward, format_reward],
    args=GRPOConfig(
        output_dir='out',
        num_generations=8,  # group size
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=5e-6,
        max_prompt_length=512,
        max_completion_length=1024,
        beta=0.04,  # KL penalty coefficient
    ),
    train_dataset=ds,
)
trainer.train()
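TRL calls each entry in `reward_funcs` with batched keyword arguments rather than a single (prompt, response) pair, and expects one float per completion. The sketch below assumes the standard (non-conversational) dataset format, where completions are plain strings, and a hypothetical dataset column named `solution`. Check the expected signature against the TRL version you have installed.

```python
import re

ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def correctness_reward(completions, solution, **kwargs):
    """1.0 when the <answer> payload equals the reference solution, else 0.0.

    `solution` is assumed to be a dataset column forwarded as a keyword argument.
    """
    rewards = []
    for completion, expected in zip(completions, solution):
        match = ANSWER_TAG.search(completion)
        answer = match.group(1).strip() if match else None
        rewards.append(1.0 if answer == expected.strip() else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """0.5 when both the think and answer tag pairs are present, else 0.0."""
    required = ("<think>", "</think>", "<answer>", "</answer>")
    return [0.5 if all(tag in c for tag in required) else 0.0 for c in completions]
```

Accepting `**kwargs` keeps the functions tolerant of extra columns that the trainer may forward.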

Multi-objective reward

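def multi_reward(prompt, response):
    """Weighted sum of several reward signals.

    correctness, check_format, and check_reasoning_quality are assumed helpers.
    """
    rewards = {}
    rewards['correctness'] = correctness(prompt, response)
    rewards['format'] = check_format(response)
    # Prefer a length near 500. len() counts characters for a str response,
    # so pass token IDs (or tokenize first) to target ~500 tokens.
    rewards['length'] = -abs(len(response) - 500) / 1000
    rewards['cot_quality'] = check_reasoning_quality(response)

    weights = {'correctness': 1.0, 'format': 0.1, 'length': 0.05, 'cot_quality': 0.3}
    return sum(rewards[k] * weights[k] for k in rewards)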

Reasoning-focused (R1-style)

THINK_FORMAT = """
<think>
{reasoning}
</think>
<answer>
{answer}
</answer>
"""

def r1_format_reward(response):
    has_think = '<think>' in response and '</think>' in response
    has_answer = '<answer>' in response and '</answer>' in response
    return 0.5 if (has_think and has_answer) else 0
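A substring check like the one above also rewards responses whose tags are out of order or duplicated. A stricter variant is sketched here; the function name and regex are illustrative assumptions, not part of the R1 recipe.

```python
import re

# Tags must appear once each, in order, with non-empty content between them
R1_PATTERN = re.compile(
    r"\s*<think>\n?(?P<reasoning>.+?)\n?</think>\s*"
    r"<answer>\n?(?P<answer>.+?)\n?</answer>\s*",
    re.DOTALL,
)
TAGS = ("<think>", "</think>", "<answer>", "</answer>")

def strict_r1_format_reward(response):
    """Return 0.5 only for a single well-ordered think/answer pair."""
    if any(response.count(tag) != 1 for tag in TAGS):
        return 0.0
    return 0.5 if R1_PATTERN.fullmatch(response) else 0.0

good = "<think>\n2 + 2 = 4\n</think>\n<answer>\n4\n</answer>\n"
scrambled = "</answer><answer></think><think>"
print(strict_r1_format_reward(good), strict_r1_format_reward(scrambled))
```

The scrambled string contains all four tags, so the substring check would still pay out for it, while the strict version does not.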

Self-consistency (majority vote at eval)

from collections import Counter

def self_consistency_eval(model, prompt, n=16):
    """Sample n responses and return the most common extracted answer."""
    responses = [model.generate(prompt, do_sample=True) for _ in range(n)]
    answers = [extract_answer(r) for r in responses]
    # Majority vote over the extracted answers
    return Counter(answers).most_common(1)[0][0]
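When answer extraction can fail, the vote is safer if failed extractions are excluded and the level of agreement is reported alongside the winner. A small sketch, assuming failed extractions are represented as `None`; the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return (most common non-None answer, its share of all samples)."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(answers)

# Four samples: two agree, one disagrees, one failed extraction
answer, agreement = majority_vote(["4", "4", None, "5"])
```

A low agreement share is a useful signal that the prompt is hard or that the extraction rule is too brittle.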

KL control

def adaptive_beta(target_kl, current_kl, beta):
    if current_kl > 1.5 * target_kl: return beta * 1.5
    if current_kl < 0.5 * target_kl: return beta / 1.5
    return beta
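To see how this controller behaves over several updates, the sketch below feeds it a sequence of made-up KL measurements. The function is repeated so the snippet runs on its own; the target of 0.02 and the readings are illustrative values, not recommendations.

```python
def adaptive_beta(target_kl, current_kl, beta):
    # Tighten the penalty when KL overshoots, relax it when KL undershoots
    if current_kl > 1.5 * target_kl:
        return beta * 1.5
    if current_kl < 0.5 * target_kl:
        return beta / 1.5
    return beta

beta = 0.04
history = []
for measured_kl in [0.02, 0.05, 0.09, 0.025, 0.004]:  # made-up readings
    beta = adaptive_beta(target_kl=0.02, current_kl=measured_kl, beta=beta)
    history.append(beta)

# beta holds inside the band [0.5 * target, 1.5 * target], grows while KL
# overshoots, and shrinks again once KL falls below the band.
print([round(b, 4) for b in history])
```

The dead band between the two thresholds keeps beta from oscillating on every noisy KL estimate.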

Reward hacking detection

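import numpy as np

def detect_reward_hacking(rollouts, rewards):
    """Flag rollouts that earn a high reward but low judged quality.

    llm_judge_quality is an assumed helper returning a score in [0, 1].
    """
    high_reward = [r for r, score in zip(rollouts, rewards) if score > 0.9]
    if not high_reward:
        return None  # nothing to audit
    quality = [llm_judge_quality(r) for r in high_reward]
    if np.mean(quality) < 0.5:
        return 'WARN: high reward but low quality, possibly hacking'
    return None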

Process reward (PRM)

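def process_reward(steps):
    """Average step-level score; prm_score is an assumed process-reward-model helper."""
    if not steps:
        return 0.0  # avoid dividing by zero on an empty trace
    return sum(prm_score(step) for step in steps) / len(steps)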

Iterative training (R1-style)

def r1_pipeline(base_model, reasoning_data, dataset):
    """Simplified R1-style pipeline; sft, grpo, and filter_high_quality are assumed helpers."""
    # Stage 1: SFT on reasoning data
    sft_model = sft(base_model, reasoning_data)

    # Stage 2: GRPO with the rule-based math reward
    grpo_model = grpo(sft_model, dataset, math_reward)

    # Stage 3: rejection sampling, then SFT again on the high-quality samples
    rs_data = filter_high_quality(grpo_model.generate_many(dataset))
    final = sft(grpo_model, rs_data)

    return final

Decision criteria

Situation	Approach
Reasoning task	GRPO + rule reward
Preference align	DPO / PPO
Code	GRPO + execution reward
General chat	RLHF / DPO
Tool use	GRPO + success reward
Cost-aware	GRPO (no critic)

Default: for reasoning, use GRPO + rule-based reward + format reward + iterative training + KL control.

🔗 Graph

Parent: RLHF · Reinforcement-Learning
Variants: PPO · DPO
Applications: DeepSeek-R1
Adjacent: Fine-tuning · Foundation-Models

🤖 LLM usage

When to use: reasoning, math, and code; tasks with a verifiable reward. When not to use: subjective preference alignment (use DPO instead).

❌ Anti-patterns

No KL control: the policy drifts toward reward hacking.
Tiny group: noisy advantage estimates.
No rule for format: the model games the output format.
Single-objective: prone to reward hacking.
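One way to see the cost of a tiny group: with a binary reward, a group whose rollouts all receive the same score has zero advantage everywhere and contributes no gradient signal. The sketch below computes how often that happens, assuming independent rollouts that each succeed with probability `p_correct`; the function name and the 0.2 success rate are illustrative.

```python
def zero_signal_probability(p_correct, group_size):
    """Chance that every rollout in a group gets the same binary reward."""
    return p_correct ** group_size + (1 - p_correct) ** group_size

# Share of groups that carry no learning signal at a 20% success rate
for g in (2, 4, 8, 16):
    print(g, round(zero_signal_probability(0.2, g), 3))
```

At a 20% success rate, most groups of size 2 are uniform, and the share of uniform groups falls steadily as the group grows.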

🧪 Verification / duplicates

Verified (DeepSeek-Math 2024, DeepSeek-R1 2025, TRL docs).
Confidence: A.

🕓 Changelog

Date	Change
2026-05-08	Phase 1
2026-05-10	Manual cleanup: GRPO plus TRL / R1 / multi-reward / pipeline code
