---
id: ai-rlhf-dpo-basics
title: RLHF / DPO alignment basics
category: Coding
status: draft
source_trust_level: B
verification_status: conceptual
created_at: 2026-05-09
updated_at: 2026-05-09
tags: [ai, alignment, rlhf, vibe-coding]
tech_stack: { language: "Python", applicable_to: ["AI"] }
applied_in: []
aliases: [RLHF, DPO, alignment, reward model, preference learning, instruction tuning, RLAIF]
---

# RLHF / DPO Basics

> A pretrained LLM is not helpful and harmless out of the box. **Instruction tuning + RLHF / DPO** use human feedback to refine the model. This is the core of modern LLMs (GPT, Claude, Llama).

## 📖 Key concepts

- Pretrain: next-token prediction on raw text.
- SFT (Supervised Fine-Tuning): instruction → response.
- Reward model: human comparisons → scalar score.
- RLHF: PPO updates the model so that its reward goes up.
- DPO: optimizes directly on preference data, with no separate reward model (the more recent approach).

## 💻 Code patterns

### Stage 1: Pretraining

```
Raw text (web, books, code) → predict next token.

Output: Base model (GPT, Llama).
- Large (7B-70B+)
- Good at next-token prediction.
- Not "helpful": it only completes text.
```

### Stage 2: SFT (Supervised)

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

# 'llama-7b' and 'instruction-data' are placeholders; substitute real Hub IDs.
model = AutoModelForCausalLM.from_pretrained('llama-7b')
tokenizer = AutoTokenizer.from_pretrained('llama-7b')

dataset = load_dataset('instruction-data', split='train')
# Each row: {'instruction': 'Translate to French', 'input': 'Hello', 'output': 'Bonjour'}

# Render one example as a single training string.
def format_example(ex):
    return (
        f"### Instruction: {ex['instruction']}\n"
        f"### Input: {ex['input']}\n"
        f"### Response: {ex['output']}"
    )

dataset = dataset.map(lambda ex: {'text': format_example(ex)})

# Train (simplest setup). Older TRL versions also need dataset_text_field='text'.
trainer = SFTTrainer(model=model, train_dataset=dataset)
trainer.train()
```

→ The model learns the "instruction → response" format.

### Stage 3: Reward model

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification

# Humans rate response A vs B.
# Dataset: {prompt, response_a, response_b, preferred: 'a'}

# Reward model: a regression head with a single scalar output
reward_model = AutoModelForSequenceClassification.from_pretrained('llama-7b', num_labels=1)

# Train so that the chosen reward exceeds the rejected reward
def loss_fn(chosen_reward, rejected_reward):
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

→ The model takes prompt + response and returns a score.
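
The pairwise objective used for the reward model can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch), assuming scalar rewards; `pairwise_reward_loss` is an illustrative name, not part of any library.

```python
import math

def pairwise_reward_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Return -log(sigmoid(chosen - rejected)) for one preference pair."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward pairs: (chosen, rejected)
for chosen, rejected in [(3.0, 1.0), (1.0, 1.0), (1.0, 3.0)]:
    loss = pairwise_reward_loss(chosen, rejected)
    print(f"chosen={chosen} rejected={rejected} loss={loss:.4f}")
```

When both rewards are equal the loss is ln 2 (about 0.693). It shrinks toward 0 as the chosen response is scored further above the rejected one, and grows roughly linearly when the ordering is wrong.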
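### Stage 4: PPO (RLHF)

```python
# Loop:
# 1. Sample a prompt
# 2. The SFT model generates a response
# 3. The reward model scores it
# 4. PPO updates the model so the score goes up
# 5. A KL penalty keeps the model from drifting too far from SFT

from trl import PPOTrainer

# Schematic only: PPOTrainer's constructor arguments have changed across
# TRL releases, so check the docs for your installed version.
trainer = PPOTrainer(
    model=sft_model,
    ref_model=sft_model_frozen,  # KL reference
    reward_model=reward_model,
    tokenizer=tokenizer,
)
trainer.train()
```

→ Very complex and unstable. Needs large GPUs.

### DPO (Direct Preference Optimization)

```python
# A simpler alternative to RLHF.
# No reward model: trains directly on preference data.

from transformers import TrainingArguments
from trl import DPOTrainer

# sft_model, sft_model_frozen and preference_dataset come from the earlier stages.
trainer = DPOTrainer(
    model=sft_model,
    ref_model=sft_model_frozen,
    args=TrainingArguments(...),
    train_dataset=preference_dataset,
    beta=0.1,  # newer TRL versions take beta through DPOConfig instead
)
trainer.train()
```

→ DPO starts from the SFT model and optimizes a preference loss whose reference-model term acts as an implicit KL constraint. Simple and stable.

### DPO loss (intuition)

```python
import torch.nn.functional as F

# logp_*     = log-probability of the full response under the new model (π)
# ref_logp_* = the same under the frozen SFT model (π_ref)
def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_logratio = logp_chosen - ref_logp_chosen        # log(π(chosen) / π_ref(chosen))
    rejected_logratio = logp_rejected - ref_logp_rejected  # log(π(rejected) / π_ref(rejected))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# β = KL strength (typically 0.1 - 0.5)
```

→ The new model learns to make chosen more likely than rejected, relative to the reference model.

### Preference dataset

```jsonl
{"prompt": "Tell me a joke", "chosen": "Why did the chicken...", "rejected": "I don't know"}
{"prompt": "Explain DPO", "chosen": "DPO is a method...", "rejected": "It's complex"}
```

→ The pairs are produced by humans or by an AI.

### Constitutional AI (Anthropic)

```
Step 1: SFT.
Step 2: An AI critic reviews the model's own responses (not a human).
Step 3: The AI generates a better response.
Step 4: DPO / RLHF on AI-generated preferences.
→ "RLAIF" (RL from AI Feedback).
```

→ Cuts human labeling cost. Scalable.

### Instruction dataset

```
- Alpaca (52k self-instruct)
- Dolly (15k human)
- OpenAssistant (~160k messages in OASST1)
- ShareGPT (real conversation)
→ Public datasets are a good starting point.
```

### LoRA (parameter-efficient)

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
)
model = get_peft_model(base_model, config)
# → Only the adapter weights train: about 1% of the parameters or less.
model.print_trainable_parameters()

# trainer is built around `model` as in the SFT stage.
trainer.train()
```

→ An alternative to full fine-tuning, which needs large GPUs.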
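
To see where the small trainable fraction in the LoRA section comes from, the arithmetic for a single adapted weight matrix can be sketched as follows. The 4096 x 4096 shape is an assumption (typical of an attention projection in a 7B-class model), and `lora_param_counts` is an illustrative helper, not a PEFT function.

```python
from typing import Tuple

def lora_param_counts(d_out: int, d_in: int, r: int) -> Tuple[int, int]:
    """Return (full_params, lora_params) for one d_out x d_in weight matrix."""
    full = d_out * d_in
    # LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r).
    lora = r * (d_in + d_out)
    return full, lora

# Assumed shape: one 4096 x 4096 projection with rank r=16.
full, lora = lora_param_counts(4096, 4096, r=16)
print(f"full={full:,} lora={lora:,} ratio={lora / full:.2%}")
# prints: full=16,777,216 lora=131,072 ratio=0.78%
```

Because only `q_proj` and `v_proj` receive adapters while every other matrix stays frozen, the trainable share of the whole model is lower than this per-matrix figure.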
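### QLoRA

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'llama-70b',  # placeholder model ID
    quantization_config=bnb_config,
)
```

→ A 70B model takes roughly 35 GB for its 4-bit weights alone, so LoRA training fits on a single 48 GB or 80 GB GPU. A 40 GB A100 is likely too tight once activations and adapter state are added.

### Eval

```python
# Hellaswag, MMLU, TruthfulQA, ARC
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap the loaded transformers model for the harness (lm-evaluation-harness v0.4+).
results = lm_eval.simple_evaluate(
    model=HFLM(pretrained=model, tokenizer=tokenizer),
    tasks=['mmlu'],
)
```

→ Benchmark suite.

### Human eval

```
Annotators compare model A vs B.
- Helpfulness
- Honesty
- Harmlessness
→ "Win rate": the % of comparisons in which A beats B.
```

### Reward hacking

```
An RLHF model discovers "tricks" that raise reward.
- Makes every answer long (raters tend to assume long = better).
- Opens with "Sure!" / "Of course!".
- Refuses to answer, because a refusal is scored as safe.
→ The reward model is only a proxy. Calibration is hard.
```

### KL penalty

```
KL(π || π_ref) = how far the model has moved from SFT.
Divergence too large (β too small) = forgets SFT (loses what it did well).
Divergence too small (β too large) = barely updates.
β = 0.1 - 0.5 is typical.
```

### IPO / KTO / ORPO (DPO alternatives)

```
IPO (Identity PO): stronger regularization toward the reference; more stable.
KTO (Kahneman-Tversky): single responses labeled good / bad (no paired preferences).
ORPO: SFT + preference optimization combined (1 stage).
→ Variants of DPO. New ones appear every year.
```

### Closed vs open model

```
Closed (GPT-4, Claude): RLHF, details undisclosed.
Open (Llama, Mistral): SFT + DPO is common.
→ Open models ship the aligned variant as the "instruct" version.
Llama-3-Instruct = Llama-3 + SFT + DPO + ...
```

### Personal fine-tune

```
Use case:
- Domain-specific (legal, medical)
- Style (brand voice)
- Format (specific JSON)
Data:
- 100-1000 examples are often enough (LoRA).
- 10000+ gives strong results.
```

### Cost

```
LoRA SFT (7B, 1k examples): $5-50.
QLoRA (70B): $50-500.
Full fine-tune: $1000-100k.
DPO + LoRA: 2-3x SFT.
→ LoRA covers the vast majority of use cases.
```

### When NOT fine-tune?

```
✓ Generic task = prompt engineering.
✓ Few-shot is good enough, or data is scarce = stay with prompting.
✗ Very specific domain = fine-tuning is justified.
Rule: try prompt + RAG first. Fine-tune only if that falls short.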
```

### Tools

```
- Hugging Face TRL (DPO, PPO)
- Axolotl (config-based fine-tune)
- LLaMA-Factory
- Unsloth (fast LoRA)
- Managed fine-tuning APIs (OpenAI; Claude via Amazon Bedrock)
```

### Hugging Face TRL

```python
from trl import SFTTrainer, DPOTrainer

# Older TRL signatures; newer releases move dataset_text_field and beta
# into SFTConfig / DPOConfig.

# SFT
trainer = SFTTrainer(model=model, train_dataset=ds, dataset_text_field='text')
trainer.train()

# DPO
trainer = DPOTrainer(model=sft, ref_model=sft_frozen, train_dataset=pref_ds, beta=0.1)
trainer.train()
```

### Axolotl (config-based)

```yaml
# config.yml
base_model: meta-llama/Meta-Llama-3-8B
adapter: lora
lora_r: 16
datasets:
  - path: alpaca.jsonl
    type: alpaca
sequence_len: 2048
```

```bash
axolotl train config.yml
```

### Multi-modal alignment

```
Vision-language model (LLaVA, Qwen-VL):
- Train on image-text pairs
- DPO / RLHF with image inputs
→ Very complex. Open SoTA is still limited.
```

## 🤔 Decision criteria

| Situation | Recommendation |
|---|---|
| Generic task | Prompt + RAG |
| Specific domain | LoRA SFT |
| Fine-grained alignment | DPO (after SFT) |
| Classic RLHF | PPO (complex) |
| Cost-sensitive | LoRA + QLoRA |
| Modern simple | DPO (direct) |
| Large cluster | Full fine-tune |

## ❌ Anti-patterns

- **RLHF without SFT**: SFT is the baseline.
- **Expecting RLHF to fix everything**: try prompt + RAG first.
- **Reward hacking**: watch for hack signals such as inflated length, stock openers, and over-refusal.
- **Ignoring KL**: the model forgets SFT.
- **Too little data**: overfitting.
- **No eval**: you cannot tell whether anything improved.
- **Public model + private data leak**: privacy risk.

## 🤖 LLM usage hints

- DPO is simpler and more stable than RLHF.
- LoRA + QLoRA is the answer to cost.
- Constitutional AI / RLAIF is Anthropic's answer.
- Try prompt + RAG first → fine-tune only if that falls short.

## 🔗 Related documents

- [[AI_Fine_Tuning_vs_Prompting]]
- [[AI_LLM_Eval_Patterns]]
- [[AI_Safety_Patterns]]