id: ai-rlhf-dpo-basics
title: RLHF / DPO: Alignment Basics
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: ai, alignment, rlhf, vibe-coding
tech_stack: language: Python; applicable_to: AI
applied_in: (empty)
aliases: RLHF, DPO, alignment, reward model, preference learning, instruction tuning, RLAIF

RLHF / DPO Basics

A pretrained LLM is not helpful or harmless out of the box. Instruction tuning followed by RLHF or DPO uses human feedback to refine the model. This step is central to modern LLMs (GPT, Claude, Llama).

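📖 Key Concepts

  • Pretrain: next-token prediction on raw text.
  • SFT (Supervised Fine-Tuning): instruction → response.
  • Reward model: human comparisons → scalar score.
  • RLHF: PPO updates the model to raise the reward-model score.
  • DPO: optimizes directly on preference pairs, with no separate reward model (the more recent approach).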

💻 Code Patterns

Stage 1: Pretraining

Raw text (web, books, code) → predict next token.

Output: Base model (GPT, Llama).
- Large (7B to 70B+ parameters).
- Good at next-token prediction.
- Not "helpful": it only completes text.
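The pretraining objective above can be made concrete with a toy calculation. This is a minimal sketch in plain Python, assuming a 3-token vocabulary and made-up logits; real training computes the same average negative log-likelihood over billions of tokens on GPU tensors.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Average negative log-likelihood of the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens; two positions to predict (illustrative values).
logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
targets = [0, 2]
loss = next_token_loss(logits, targets)

# A model that guesses uniformly pays log(3) per position.
uniform_loss = next_token_loss([[0.0, 0.0, 0.0]] * 2, targets)
print(f"confident model: {loss:.3f}, uniform guess: {uniform_loss:.3f}")
```

A lower loss means the model put more probability on the token that actually came next, which is all a base model is trained to do.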

Stage 2: SFT (Supervised)

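from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# 'llama-7b' and 'instruction-data' are placeholders; substitute real Hub IDs.
model = AutoModelForCausalLM.from_pretrained('llama-7b')
tokenizer = AutoTokenizer.from_pretrained('llama-7b')

dataset = load_dataset('instruction-data', split='train')
# Each row: {'instruction': 'Translate to French', 'input': 'Hello', 'output': 'Bonjour'}

# Format one example as a single prompt + response string.
# (Renamed from `format` so it does not shadow the Python builtin.)
def format_example(ex):
    return f"### Instruction: {ex['instruction']}\n### Input: {ex['input']}\n### Response: {ex['output']}"

# Train (simplest setup). Keyword names differ between TRL releases
# (for example tokenizer= vs processing_class=); check your installed version.
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir='sft-out'),
    train_dataset=dataset,
    formatting_func=format_example,
)
trainer.train()

→ The model learns the "instruction → response" format.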

Stage 3: Reward model

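import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

# Humans compare two responses, A and B, to the same prompt.
# Dataset row: {prompt, response_a, response_b, preferred: 'a'}

# Reward model: a regression head that outputs one scalar per sequence.
# 'llama-7b' is a placeholder model ID.
reward_model = AutoModelForSequenceClassification.from_pretrained('llama-7b', num_labels=1)

# Train so that the chosen response scores higher than the rejected one.
def loss_fn(chosen_reward, rejected_reward):
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

→ The model maps prompt + response to a scalar score.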
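To see how the pairwise loss behaves, here is a minimal sketch in plain Python (no torch) that evaluates the same -log σ(chosen - rejected) expression on hand-picked scalar rewards. The reward values are illustrative, not taken from a trained model.

```python
import math

def pairwise_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def preference_probability(chosen_reward, rejected_reward):
    """Probability the reward model assigns to 'chosen beats rejected'."""
    return 1.0 / (1.0 + math.exp(-(chosen_reward - rejected_reward)))

correct_ranking = pairwise_loss(2.0, -1.0)   # chosen scored well above rejected
tie = pairwise_loss(0.5, 0.5)                # no preference expressed: log(2)
wrong_ranking = pairwise_loss(-1.0, 2.0)     # rejected scored higher

print(f"{correct_ranking:.3f} < {tie:.3f} < {wrong_ranking:.3f}")
```

Only the difference between the two scores matters, so the reward scale has no absolute meaning; this is one reason raw reward values are hard to compare across reward models.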

Stage 4: PPO (RLHF)

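# Loop:
# 1. Sample a prompt.
# 2. The policy (initialized from the SFT model) generates a response.
# 3. The reward model scores the response.
# 4. PPO updates the policy to increase the score.
# 5. A KL penalty keeps the policy from drifting too far from the SFT model.

from trl import PPOConfig, PPOTrainer

# sft_model, sft_model_frozen, reward_model, and tokenizer come from the earlier stages.
# value_model and prompt_dataset are assumed to be prepared separately.
# Keyword names follow recent TRL releases and have changed between versions
# (older releases took tokenizer= and used a manual step() loop).
trainer = PPOTrainer(
    args=PPOConfig(output_dir='ppo-out'),
    processing_class=tokenizer,
    model=sft_model,
    ref_model=sft_model_frozen,  # frozen reference for the KL term
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=prompt_dataset,
)
trainer.train()

→ Complex and often unstable to train. It also needs a lot of GPU memory, because the policy, reference, reward, and value models are all held at once.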
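Steps 3 to 5 of the loop combine into a single training signal. The sketch below shows one common way to write it: the reward-model score minus β times the summed log-probability ratio between the policy and the frozen reference. The function name and the numbers are hypothetical, and production implementations usually apply the penalty per token rather than per sequence.

```python
def shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty summed over response tokens."""
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * log_ratio

# Per-token log-probabilities of one sampled response (illustrative values).
ref_logprobs = [-0.9, -1.1, -0.8]       # frozen SFT reference
drifted_logprobs = [-0.2, -0.3, -0.1]   # policy is now far more confident

unchanged = shaped_reward(1.5, ref_logprobs, ref_logprobs)
drifted = shaped_reward(1.5, drifted_logprobs, ref_logprobs)
print(unchanged, round(drifted, 2))
```

With identical log-probabilities the penalty vanishes and the policy receives the raw score; the further the policy moves from the reference on its own samples, the more of the reward is taken away.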

DPO (Direct Preference Optimization)

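# A simpler alternative to PPO-based RLHF.
# No separate reward model: train directly on preference data.

from trl import DPOConfig, DPOTrainer

# Recent TRL releases take beta via DPOConfig; older releases accepted beta= on the trainer.
trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_frozen,
    args=DPOConfig(beta=0.1),  # add further training arguments as needed
    train_dataset=preference_dataset,
)
trainer.train()

→ DPO starts from the SFT model and applies a preference loss with an implicit KL constraint toward the reference model. Simpler and more stable than PPO.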

DPO loss (직관)

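# π(y) = probability the model being trained assigns to response y, given the prompt
# π_ref(y) = the same probability under the frozen SFT model

loss = -log(σ(β * (
    log(π(chosen) / π_ref(chosen))
    - log(π(rejected) / π_ref(rejected))
)))

# β = strength of the pull toward π_ref (typically 0.1 - 0.5)

→ The loss falls as the new model makes chosen more likely than rejected, relative to the reference model.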
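The formula can be checked numerically. This is a minimal sketch in plain Python that takes sequence log-probabilities directly (so log(π/π_ref) becomes a subtraction); the log-probability values are made up for illustration.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_log_ratio = policy_chosen_lp - ref_chosen_lp        # log(π/π_ref) for chosen
    rejected_log_ratio = policy_rejected_lp - ref_rejected_lp  # log(π/π_ref) for rejected
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))

# At initialization the policy equals the reference, so the loss is log(2).
start = dpo_loss(-12.0, -12.0, -12.0, -12.0)

# After some training: chosen became more likely, rejected less likely.
improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)

print(f"start={start:.4f}, improved={improved:.4f}")
```

Because both terms are ratios against π_ref, the loss only rewards movement relative to the SFT model, which is where the implicit KL constraint comes from.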

Preference dataset

{"prompt": "Tell me a joke", "chosen": "Why did the chicken...", "rejected": "I don't know"}
{"prompt": "Explain DPO", "chosen": "DPO is a method...", "rejected": "It's complex"}

→ Pairs are produced by human annotators or by an AI labeler.
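Human comparison records in the A/B layout from Stage 3 need to be converted into this prompt/chosen/rejected layout before DPO training. A minimal sketch, assuming the field names `response_a`, `response_b`, and `preferred` from the Stage 3 comment; the helper name is hypothetical.

```python
import json

def to_preference_pair(record):
    """Map an A/B comparison record to the prompt/chosen/rejected layout."""
    preferred = record["preferred"]
    if preferred not in ("a", "b"):
        raise ValueError(f"unexpected preferred value: {preferred!r}")
    other = "b" if preferred == "a" else "a"
    return {
        "prompt": record["prompt"],
        "chosen": record[f"response_{preferred}"],
        "rejected": record[f"response_{other}"],
    }

comparison = {
    "prompt": "Tell me a joke",
    "response_a": "I don't know",
    "response_b": "Why did the chicken...",
    "preferred": "b",
}
pair = to_preference_pair(comparison)
jsonl_line = json.dumps(pair)  # one line of a JSONL preference file
print(jsonl_line)
```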

Constitutional AI (Anthropic)

Step 1: SFT.
Step 2: An AI critic reviews the model's own responses (no human in the loop).
Step 3: The AI generates an improved response.
Step 4: DPO / RLHF on AI-generated preferences.

→ Known as "RLAIF" (RL from AI Feedback).

→ Cuts human labeling cost and scales better.
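The AI-labeling step can be sketched as a function that ranks candidate responses with a critic and keeps the best and worst as a preference pair. Everything here is hypothetical: `toy_critic` is a word-overlap stand-in for what would really be an LLM judge prompted with a set of principles, and it is not how any production system scores responses.

```python
def build_ai_preference(prompt, candidates, critic):
    """Rank candidates with an AI critic; keep the best and worst as a pair."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates, key=lambda response: critic(prompt, response))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

def toy_critic(prompt, response):
    """Hypothetical stand-in for an LLM judge: counts prompt words echoed in the response."""
    prompt_words = set(prompt.lower().split())
    return sum(1 for word in response.lower().split() if word in prompt_words)

pair = build_ai_preference(
    "Explain DPO",
    ["It's complex", "DPO is a method that trains directly on preference pairs."],
    toy_critic,
)
print(pair["chosen"])
```

The output has the same prompt/chosen/rejected shape as a human-labeled pair, so the downstream DPO or RLHF step does not need to know where the label came from.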

Instruction dataset

- Alpaca (52k self-instruct)
- Dolly (15k human)
- OpenAssistant (OASST1, about 160k messages)
- ShareGPT (real conversations)

→ Public datasets are a good starting point.

LoRA (parameter-efficient)

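from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
# base_model: a causal LM loaded beforehand, e.g. via AutoModelForCausalLM.from_pretrained.
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# → only about 1% of the parameters, or fewer, are trained.

# Train with any trainer built around `model`, as in the SFT stage.
trainer.train()

→ An alternative to full fine-tuning, which needs far more GPU memory.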
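The trainable-parameter share follows from simple arithmetic: a rank-r adapter on a d_in × d_out linear layer adds r × (d_in + d_out) parameters. The sketch below assumes a Llama-7B-like shape (32 layers, 4096 × 4096 `q_proj` and `v_proj`, about 6.74B total parameters); these figures are assumptions for illustration, not read from a checkpoint.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed Llama-7B-like shape.
hidden_size = 4096
num_layers = 32
total_params = 6.74e9
adapted_modules_per_layer = 2  # q_proj and v_proj, matching the config above

per_module = lora_param_count(hidden_size, hidden_size, rank=16)
lora_params = per_module * adapted_modules_per_layer * num_layers
fraction = lora_params / total_params

print(f"{lora_params:,} trainable parameters ({fraction:.2%} of the model)")
```

With only two target modules the share lands well under 1%; adding more target modules or a higher rank moves it toward the 1% figure.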

QLoRA

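import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# 'llama-70b' is a placeholder model ID.
model = AutoModelForCausalLM.from_pretrained(
    'llama-70b',
    quantization_config=bnb_config,
)

→ A 70B model's 4-bit weights take roughly 35 GB, so LoRA training on a single 40 GB A100 is borderline once activations and optimizer state are added. A 48 GB or 80 GB GPU is a safer target.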
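The memory claim reduces to bytes-per-parameter arithmetic. This minimal sketch counts weight storage only; quantization constants, activations, LoRA weights, and optimizer state come on top and are deliberately left out.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(params_70b, 4)    # 4-bit quantized weights

print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB")
```

Quantizing to 4 bits cuts weight memory by 4x relative to fp16, which is what brings a 70B model within reach of a single high-memory GPU at all.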

Eval

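# HellaSwag, MMLU, TruthfulQA, ARC
import lm_eval

# simple_evaluate takes a backend name (or an lm_eval LM wrapper), not a raw transformers model.
# 'llama-7b' is a placeholder model ID.
results = lm_eval.simple_evaluate(model='hf', model_args='pretrained=llama-7b', tasks=['mmlu'])

→ Standard benchmark suites.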

Human eval

Annotators compare model A against model B on:
- Helpfulness
- Honesty
- Harmlessness

→ "Win rate": the percentage of comparisons in which A beats B.
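Win rate is a simple tally over annotator judgments. A minimal sketch, assuming each judgment is recorded as "A", "B", or "tie", and that a tie counts as half a win for each side (one common convention, chosen here as an assumption).

```python
def win_rate(judgments, model="A"):
    """Share of comparisons won by `model`; ties count as half a win (an assumption)."""
    if not judgments:
        raise ValueError("no judgments to score")
    wins = sum(1.0 for j in judgments if j == model)
    ties = sum(0.5 for j in judgments if j == "tie")
    return (wins + ties) / len(judgments)

# Illustrative annotator verdicts for five prompts.
judgments = ["A", "A", "B", "tie", "A"]
rate_a = win_rate(judgments, "A")
rate_b = win_rate(judgments, "B")
print(f"A: {rate_a:.0%}, B: {rate_b:.0%}")
```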

Reward hacking

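An RLHF-trained model finds "tricks" that raise the reward without improving quality:
- Making every answer long (exploiting the annotator assumption that longer = better).
- Opening with "Sure!" / "Of course!".
- Refusing to answer, because a refusal looks safe.

→ The reward model is only a proxy for human preference, and calibrating it is hard.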
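One quick diagnostic for the length trick is to correlate response length with reward score on a sample of outputs. This sketch uses made-up numbers and a hand-written Pearson correlation; a high value is a warning sign worth investigating, not proof of reward hacking.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (response length in tokens, reward score) samples.
lengths = [10, 50, 120, 300]
rewards = [0.1, 0.4, 0.9, 1.6]
length_reward_corr = pearson(lengths, rewards)

print(f"length/reward correlation: {length_reward_corr:.2f}")
```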

KL penalty

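KL(π || π_ref) measures how far the policy has moved from the SFT model.

Too large: the model forgets SFT behavior (loses what it did well).
Too small: the policy barely updates.

β = 0.1 - 0.5 is typical.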
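For intuition about the size of the divergence, the sketch below computes exact KL(π || π_ref) on toy next-token distributions over three tokens. The distributions are invented; in real training the KL is estimated from sampled tokens rather than computed over the full vocabulary.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref = [0.5, 0.3, 0.2]            # frozen SFT distribution
small_drift = [0.55, 0.27, 0.18]  # policy nudged slightly
large_drift = [0.9, 0.05, 0.05]   # policy collapsed onto one token

kl_none = kl_divergence(ref, ref)
kl_small = kl_divergence(small_drift, ref)
kl_large = kl_divergence(large_drift, ref)
print(f"{kl_none:.4f} {kl_small:.4f} {kl_large:.4f}")
```

A larger β makes each unit of this divergence more expensive, so the same reward gain justifies less drift.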

IPO / KTO / ORPO (DPO alternatives)

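IPO (Identity Preference Optimization): a regularized objective intended to be more stable than DPO.
KTO (Kahneman-Tversky Optimization): needs only single responses labeled good or bad, not preference pairs.
ORPO (Odds Ratio Preference Optimization): combines SFT and preference optimization in one stage.

→ All are variants of DPO. New ones appear every year.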
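The difference between DPO and IPO shows up when both are written as functions of the same quantity h, the chosen-minus-rejected log-ratio gap. This is a sketch under the assumption that IPO uses a squared loss toward a target gap of 1/(2β), with β standing in for the regularization parameter; check the implementation you use for its exact parameterization.

```python
import math

def dpo_pair_loss(h, beta=0.1):
    """DPO: -log(sigmoid(beta * h)); keeps shrinking as the gap h grows."""
    return math.log1p(math.exp(-beta * h))

def ipo_pair_loss(h, beta=0.1):
    """IPO (as assumed here): squared distance of h from the target gap 1/(2*beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

target_gap = 1.0 / (2.0 * 0.1)  # 5.0 for beta = 0.1

for h in (0.0, target_gap, 20.0):
    print(f"h={h:5.1f}  dpo={dpo_pair_loss(h):.4f}  ipo={ipo_pair_loss(h):.2f}")
```

DPO always prefers a larger gap, which can push the policy to extremes on pairs it already gets right; the squared form stops rewarding the gap once it reaches the target, which is the stability argument behind IPO.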

Closed vs open model

Closed (GPT-4, Claude): RLHF plus undisclosed methods.
Open (Llama, Mistral): SFT + DPO is common.

→ For open models, the aligned release is the "instruct" version.
Llama-3-Instruct = Llama-3 + SFT + DPO + ...

Personal fine-tune

Use case:
- Domain-specific (legal, medical)
- Style (brand voice)
- Format (specific JSON)

Data:
- 100-1000 examples are often enough (with LoRA).
- 10000+ examples give strong results.

Cost

LoRA SFT (7B, 1k examples): $5-50.
QLoRA (70B): $50-500.
Full fine-tune: $1000-100k.
DPO + LoRA: 2-3x SFT.

→ LoRA covers nearly all (roughly 99%) of use cases.

When NOT fine-tune?

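✓ Generic task: prompt engineering is enough.
✓ Few-shot prompting works, or you have little data: stay with prompting.
✗ Very domain-specific task: this is the case where fine-tuning pays off.

Rule: try prompt + RAG first. Fine-tune only if that falls short.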

Tools

- Hugging Face TRL (DPO, PPO)
- Axolotl (config-based fine-tune)
- LLaMA-Factory
- Unsloth (fast LoRA)
- OpenAI / Anthropic fine-tune API (managed)

Hugging Face TRL

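from trl import DPOConfig, DPOTrainer, SFTConfig, SFTTrainer

# Recent TRL releases take these options via config objects;
# older releases accepted them as trainer keyword arguments.

# SFT
trainer = SFTTrainer(model=model, args=SFTConfig(dataset_text_field='text'), train_dataset=ds)
trainer.train()

# DPO
trainer = DPOTrainer(model=sft, ref_model=sft_frozen, args=DPOConfig(beta=0.1), train_dataset=pref_ds)
trainer.train()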

Axolotl (config-based)

# config.yml
base_model: meta-llama/Meta-Llama-3-8B
adapter: lora
lora_r: 16
datasets:
  - path: alpaca.jsonl
    type: alpaca
sequence_len: 2048

# shell
axolotl train config.yml

Multi-modal alignment

Vision-language models (LLaVA, Qwen-VL):
- Trained on image-text pairs
- DPO / RLHF with image inputs

→ Very complex. Open state-of-the-art recipes are still scarce.

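🤔 Decision Criteria

| Situation | Recommendation |
|---|---|
| Generic task | Prompt + RAG |
| Specific domain | LoRA SFT |
| Fine-grained alignment | DPO (after SFT) |
| Classic RLHF | PPO (complex) |
| Cost-sensitive | LoRA + QLoRA |
| Modern and simple | DPO (direct) |
| Large cluster available | Full fine-tune |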

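Anti-patterns

  • RLHF without SFT: SFT is the required baseline.
  • Expecting RLHF to fix everything: try prompt + RAG first.
  • Ignoring reward hacking: watch for every hack signal (length, stock openers, over-refusal).
  • Ignoring the KL term: the model forgets what SFT taught it.
  • Too little data: the model overfits.
  • No evaluation: you cannot tell whether anything improved.
  • Fine-tuning a public model on private data that later leaks: a privacy risk.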

🤖 LLM Usage Hints

  • DPO is simpler and more stable than RLHF with PPO.
  • LoRA + QLoRA is the answer to cost.
  • Constitutional AI / RLAIF is Anthropic's approach.
  • Try prompt + RAG first; fine-tune only if that falls short.

🔗 Related Documents