---
id: wiki-2026-0508-dpo
title: DPO (Direct Preference Optimization)
category: 10_Wiki/Topics
status: verified
canonical_id: self
aliases: [DPO, ORPO, SimPO, IPO, KTO, preference learning, RLHF alternative, alignment]
duplicate_of: none
source_trust_level: A
confidence_score: 0.95
verification_status: applied
tags: [llm, alignment, dpo, rlhf, preference-optimization, ppo, fine-tuning, trl]
raw_sources: []
last_reinforced: 2026-05-10
github_commit: pending
tech_stack:
  language: Python
  framework: TRL / DeepSpeed / Axolotl
---

# DPO (Direct Preference Optimization)

## In one line

> **"No reward model: learn directly from preference pairs."** Rafailov et al. 2023. A simpler, more stable, and effective alternative to RLHF. Modern variants: ORPO, SimPO, KTO. The standard preference-tuning recipe for Llama-3, Tülu, and most open models.

## Key points

### DPO vs RLHF

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | None |
| Stages | 2 (RM → PPO) | 1 |
| Stability | Hard | Stable |
| Hyperparameters | Many | Few |
| Compute | High | Lower |
| Quality | Strong | Comparable |

### DPO loss

$$L_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$

- y_w = winner (preferred response).
- y_l = loser (rejected response).
- β = KL coefficient: how strongly the policy is held close to the reference model.

The loss is written for a single pair (x, y_w, y_l); in training it is averaged over the preference dataset.

### Derivation insight

- The KL-regularized RL objective that PPO optimizes has a closed-form optimal policy. Inverting it expresses the reward as a log-probability ratio, β log(π_θ / π_ref), up to a prompt-only term that cancels when two responses to the same prompt are compared.
- The reward model is therefore learned implicitly by the policy.

### Variants

#### ORPO (Odds Ratio PO)
- No reference model.
- SFT and preference optimization in a single stage.

#### SimPO (Simple PO)
- No reference model.
- Length-normalized log-probabilities.

#### IPO (Identity PO)
- Fixes DPO's tendency to over-fit when preferences are deterministic.

#### KTO (Kahneman-Tversky Optimization)
- Binary feedback (good / bad), no pairs needed.
- Inspired by prospect theory.

#### SLiC-HF
- Sequence-level contrastive objective.

#### CPO (Contrastive PO)
- Reference-free and length-aware.

### Data

- **HH-RLHF** (Anthropic): helpful + harmless.
- **UltraFeedback**.
- **Nectar**.
- **PKU-Beaver**.
- **Tülu**.

### Modern stack

- **TRL** (HuggingFace): DPOTrainer, ORPOTrainer.
- **Axolotl**: config-driven.
- **DeepSpeed**.
- **Unsloth**: fast LoRA.

### Applications

1. **LLM alignment**: helpful + harmless.
2. **Fine-tuning on preferences**: customer service tone.
3. **Code style**: a specific convention.
4. **Refusal calibration**: reducing over-refusal.

### Limitations

- The quality of the reference model is critical.
- Length bias: longer responses tend to win.
- Over-conservative behavior (mode-seeking).
- For tasks with a checkable answer, verifier-based training (RLVR) tends to work better.

## 💻 Patterns

### DPOTrainer (TRL)

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = 'mistralai/Mistral-7B-v0.1'
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference policy
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # batching needs a pad token

# Dataset format: {prompt, chosen, rejected}
dataset = load_dataset('trl-lib/ultrafeedback_binarized')

config = DPOConfig(
    output_dir='./dpo-mistral',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=5e-7,
    num_train_epochs=1,
    beta=0.1,  # KL coefficient
    max_length=2048,
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=dataset['train'],
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)
trainer.train()
```

### Manual DPO loss

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Each logp is log P(y | x), summed over the response tokens; shape (batch,)
    pi_logratio = policy_logp_chosen - policy_logp_rejected
    ref_logratio = ref_logp_chosen - ref_logp_rejected
    # Same quantity as the DPO loss formula, with the terms regrouped
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```

### ORPO (no reference)

```python
from trl import ORPOConfig, ORPOTrainer

config = ORPOConfig(
    output_dir='./orpo-mistral',
    learning_rate=8e-6,
    beta=0.1,  # odds ratio coefficient
)

# SFT and preference optimization in a single stage; no ref_model argument
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
```

### KTO (binary feedback)

```python
from datasets import load_dataset
from trl import KTOConfig, KTOTrainer

# Dataset format: {prompt, completion, label: True / False}
# trl-lib/kto-mix-14k is used here only as an example of an unpaired dataset
kto_dataset = load_dataset('trl-lib/kto-mix-14k', split='train')

config = KTOConfig(
    output_dir='./kto',
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

trainer = KTOTrainer(
    model=model,
    ref_model=ref_model,
    args=config,
    train_dataset=kto_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```

### LoRA + DPO (efficient)

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)

config = DPOConfig(...)  # same arguments as in the DPOTrainer example

# With peft_config set, ref_model can be omitted: the base weights
# (adapters disabled) serve as the reference policy.
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset['train'],
    tokenizer=tokenizer,
    peft_config=peft_config,  # LoRA
)
trainer.train()
```

### Preference data generation (synthetic)

```python
# model_a, model_b, and gpt4 are hypothetical client objects exposing
# generate(text) -> str; substitute whatever inference API is in use.
def generate_preference_pair(prompt, model_a, model_b, judge):
    response_a = model_a.generate(prompt)
    response_b = model_b.generate(prompt)
    chosen, rejected = judge(prompt, response_a, response_b)
    return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected}

# LLM-as-judge
def gpt4_judge(prompt, a, b):
    judgment = gpt4.generate(
        f"""Which response is better?

Prompt: {prompt}
A: {a}
B: {b}
Reply: A or B"""
    )
    # Check the leading letter only; a bare `'A' in judgment` would also
    # match any reply that merely contains the letter A.
    return (a, b) if judgment.strip().upper().startswith('A') else (b, a)
```

### Length bias mitigation

```python
import torch.nn.functional as F

# SimPO: length-normalized, reference-free
def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=0.5, gamma=1.0):
    pi_chosen = logp_chosen / len_chosen        # average log-prob per token
    pi_rejected = logp_rejected / len_rejected
    # gamma is the target reward margin between chosen and rejected
    return -F.logsigmoid(beta * (pi_chosen - pi_rejected) - gamma).mean()
```

### Eval (preference accuracy)

```python
def eval_preference_accuracy(model, eval_set):
    """Fraction of pairs where the chosen response scores the higher log-prob."""
    # compute_logp is a hypothetical helper returning the sequence log-probability
    correct = 0
    for ex in eval_set:
        logp_chosen = model.compute_logp(ex['chosen'])
        logp_rejected = model.compute_logp(ex['rejected'])
        if logp_chosen > logp_rejected:
            correct += 1
    return correct / len(eval_set)
```

## Decision criteria

| Situation | Method |
|---|---|
| Standard alignment | DPO |
| Single-stage | ORPO |
| Length-sensitive | SimPO |
| Binary feedback | KTO |
| Verifiable reward | RLHF (PPO) or RLVR |
| Limited compute | DPO + LoRA |
| Open dataset | UltraFeedback / HH-RLHF |
| Tone fine-tune | DPO + custom pairs |

**Default**: DPO + LoRA (efficient) + UltraFeedback.

## 🔗 Graph

- Parent: [[Fine-Tuning]] · [[Preference-Learning]]
- Variants: [[ORPO]] · [[SimPO]] · [[KTO]] · [[IPO]]
- Applications: [[Axolotl]] · [[Llama]]
- Adjacent: [[RLHF]] · [[AI_Safety_and_Alignment|Constitutional-AI]] · [[Best-of-N_Sampling]] · [[Credit Assignment Problem]] · [[Cross-Entropy Loss]]

## 🤖 LLM usage

**When to use**: LLM alignment. Customer-specific tone. An RLHF alternative. Fine-tuning at scale.

**When not to use**: Verifiable tasks (use RLVR / process rewards). Small datasets (SFT is enough).

## ❌ Anti-patterns

- **No reference model** (DPO): the policy over-fits the preference data.
- **β too high**: the policy stays pinned to the reference and underuses the preference signal.
- **β too low**: the policy drifts away from the reference.
- **Ignoring length bias**: long answers win.
- **Single-pair training**: noisy signal.
- **No SFT first**: quality drops.

## 🧪 Verification / duplicates

- Verified (Rafailov et al. 2023 DPO, ORPO 2024, KTO 2024).
- Trust level: A.
- Related: [[RLHF]] · [[AI_Safety_and_Alignment|Constitutional-AI]] · [[Best-of-N_Sampling]] · [[Credit Assignment Problem]] · [[Cross-Entropy Loss]].

## 🕓 Changelog

| Date | Change |
|---|---|
| 2026-05-08 | Phase 1 |
| 2026-05-10 | Manual cleanup: DPO formula + variants + TRL / ORPO / KTO / LoRA / SimPO code |
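
## Worked example: DPO loss on one pair

As a small numeric check of the DPO loss formula on this page, the sketch below evaluates it for a single preference pair in plain Python. The helper names (`log_sigmoid`, `dpo_pair_loss`) and all log-probability values are illustrative assumptions, not taken from any library or dataset.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Return (loss, margin) for one pair; inputs are sequence log-probs log P(y | x)."""
    reward_w = beta * (policy_logp_w - ref_logp_w)  # implicit reward of the winner
    reward_l = beta * (policy_logp_l - ref_logp_l)  # implicit reward of the loser
    margin = reward_w - reward_l
    return -log_sigmoid(margin), margin

# Hypothetical values. At initialization the policy equals the reference.
loss_start, margin_start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Later the policy has raised the winner's log-prob and lowered the loser's.
loss_after, margin_after = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)

print(f"start: margin={margin_start:.3f} loss={loss_start:.4f}")
print(f"after: margin={margin_after:.3f} loss={loss_after:.4f}")
```

When the policy equals the reference, the margin is zero and the loss is log 2. As the policy shifts probability toward the preferred response relative to the reference, the margin grows and the loss falls; β scales how quickly.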