Learn how to implement DPO to align LLMs without a reward model. Master the DPO training loop, compare it to RLHF, and optimize your model's preferences.
Previously in this course, we explored Alignment with RLHF: Training Reward Models and PPO, which detailed the complex, multi-stage process of training a separate reward model and using PPO to optimize the policy. While powerful, PPO-based RLHF is often unstable to train and computationally expensive. This lesson introduces Direct Preference Optimization (DPO), which treats alignment as a simple classification problem over preference pairs and bypasses the need for a separate reward model entirely.
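In Alignment with RLHF: Training Reward Models and PPO, you saw that RLHF requires training a separate reward model on preference data and then optimizing the policy against it with PPO.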
DPO simplifies this by mathematically deriving the optimal policy directly from the preference data. It shows that there is a closed-form solution for the optimal policy, allowing us to optimize the LLM using a standard supervised learning objective.
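A brief sketch of where this closed form comes from, following the standard DPO derivation. The symbols $r$ (a latent reward function), $Z(x)$ (a normalizing partition function), and $\beta$ (the strength of the KL penalty toward the reference model $\pi_{ref}$) are introduced here for illustration. The KL-regularized reward-maximization objective that RLHF targets has the optimal policy

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{ref}(y|x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$

which can be rearranged to express the reward in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$

Under a Bradley-Terry preference model for a preferred response $y_w$ and a dispreferred response $y_l$, $p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)$, only the difference of rewards matters. The intractable $\beta \log Z(x)$ term therefore cancels, and the preference likelihood depends only on policy and reference log-probabilities.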
DPO works by optimizing the model to increase the probability of the "chosen" response ($y_w$) while decreasing the probability of the "rejected" response ($y_l$), relative to a reference model ($\pi_{ref}$). The objective function is:
$$L_{DPO} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$
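Where $\pi_\theta$ is the policy being trained, $\pi_{ref}$ is the frozen reference model, $\sigma$ is the logistic sigmoid function, $\beta$ is a temperature that controls how far the policy may drift from the reference, and $\mathcal{D}$ is the dataset of prompts $x$ paired with chosen and rejected responses $y_w$ and $y_l$.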
By minimizing this loss, the model learns to shift probability mass toward the preferred outputs without ever explicitly calculating a scalar reward.
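To make the behavior of the loss concrete, here is a minimal numeric sketch for a single preference pair. The function name `dpo_pair_loss` and the log-probability values are made up for illustration; only the formula follows the objective above.

```python
import math

def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is equal to log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Illustrative (made-up) log-probabilities; the reference assigns -11.0 to both responses.
loss_at_init = dpo_pair_loss(-11.0, -11.0, -11.0, -11.0)   # policy identical to reference
loss_improved = dpo_pair_loss(-10.0, -12.0, -11.0, -11.0)  # chosen up, rejected down
loss_worse = dpo_pair_loss(-12.0, -10.0, -11.0, -11.0)     # chosen down, rejected up

print(loss_at_init, loss_improved, loss_worse)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2 \approx 0.693$. Shifting probability toward the chosen response lowers the loss, and shifting it toward the rejected response raises it.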
To implement DPO, we need a standard training loop where we calculate the log probabilities of both chosen and rejected sequences for both the active policy and the reference model.
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    policy_chosen_logps: Log-probs of chosen responses under current model
    policy_rejected_logps: Log-probs of rejected responses under current model
    ref_chosen_logps: Log-probs of chosen responses under reference model
    ref_rejected_logps: Log-probs of rejected responses under reference model
    """
    # Calculate the log-ratio for the policy and reference
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps

    # DPO objective
    logits = policy_logratios - ref_logratios
    loss = -F.logsigmoid(beta * logits).mean()
    return loss
```
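The loss takes one log-probability per sequence, which has to be computed from the model's per-token logits. One common way to do this is sketched below. The helper name `sequence_logps` and the `response_mask` argument are assumptions for illustration, not part of the lesson's code, and the random tensors stand in for real model outputs.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, response_mask):
    """Sum per-token log-probs over the response tokens of each sequence.

    logits:        (batch, seq_len, vocab) model outputs
    labels:        (batch, seq_len) token ids of prompt + response
    response_mask: (batch, seq_len) 1.0 on response tokens, 0.0 on prompt/padding
    """
    # The logits at position t predict the token at position t + 1.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:]
    mask = response_mask[:, 1:].to(logits.dtype)

    log_probs = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(log_probs, 2, targets.unsqueeze(-1)).squeeze(-1)

    # Only response tokens contribute; prompt and padding are masked out.
    return (token_logps * mask).sum(dim=-1)

# Stand-in tensors instead of a real model: batch of 2, 5 tokens, vocabulary of 7.
torch.manual_seed(0)
logits = torch.randn(2, 5, 7)
labels = torch.randint(0, 7, (2, 5))
mask = torch.tensor([[0.0, 0.0, 1.0, 1.0, 1.0],
                     [0.0, 1.0, 1.0, 0.0, 0.0]])

logps = sequence_logps(logits, labels, mask)
print(logps.shape)
```

Running this helper on the chosen and rejected sequences, under both the policy and the reference model, yields the four inputs the loss expects.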
In a production environment, you would integrate this loss into your training script, building on the approaches from Fine-tuning Methodologies Overview: Strategies for LLM Adaptation. Because DPO evaluates both the active policy and the reference model on every batch, the memory needed for model weights can roughly double. Using Quantized LoRA (QLoRA): Fine-tuning Massive Models on Consumer GPUs is highly recommended here: it keeps the reference model frozen (and quantized) while only the active policy's adapter weights are trained.
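The frozen-reference setup can be sketched with a toy stand-in for an LLM. Everything here is an illustrative assumption: the `nn.Linear` "model" maps a prompt feature vector to log-probabilities over three candidate responses, and `response_logps` is a hypothetical helper. Because the reference is frozen, its log-probabilities can be computed once under `torch.no_grad()` and reused, which is one way to reduce the memory cost described above.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for an LLM: prompt features -> logits over 3 candidate responses.
policy_model = nn.Linear(4, 3)

# The reference starts as an exact copy of the policy and is never updated.
ref_model = copy.deepcopy(policy_model)
ref_model.requires_grad_(False)
ref_model.eval()

optimizer = torch.optim.SGD(policy_model.parameters(), lr=0.5)
beta = 0.1

prompts = torch.randn(8, 4)
chosen = torch.zeros(8, dtype=torch.long)   # response 0 is preferred
rejected = torch.ones(8, dtype=torch.long)  # response 1 is dispreferred

def response_logps(model, prompts, responses):
    """Log-probability of each selected response under the model."""
    logps = F.log_softmax(model(prompts), dim=-1)
    return logps.gather(1, responses.unsqueeze(-1)).squeeze(-1)

# Reference log-probs need no gradients and can be cached for a fixed dataset.
with torch.no_grad():
    ref_chosen = response_logps(ref_model, prompts, chosen)
    ref_rejected = response_logps(ref_model, prompts, rejected)

losses = []
for step in range(20):
    pol_chosen = response_logps(policy_model, prompts, chosen)
    pol_rejected = response_logps(policy_model, prompts, rejected)

    logits = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    loss = -F.logsigmoid(beta * logits).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The first loss value is close to $\log 2$ because the policy starts identical to the reference, and it decreases as the policy shifts probability toward the chosen responses. With a real LLM, the same structure applies, but the log-probabilities come from summing token log-probs over each response.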
| Feature | RLHF (PPO) | DPO |
|---|---|---|
| Complexity | High (RM + Policy) | Low (Single Policy) |
| Stability | Low (Policy Gradients) | High (Supervised Loss) |
| Compute | High (Multiple models) | Moderate (Two forward passes) |
| Reward Model | Required | Not required |
1. Prepare a preference dataset of `{"prompt": "...", "chosen": "...", "rejected": "..."}` entries.
2. Load the `ref_model` (frozen) and the active `policy_model` (trainable).
3. Compute the chosen and rejected log-probabilities under both models and minimize `dpo_loss`.

Direct Preference Optimization provides a mathematically elegant, stable way to perform LLM alignment. By converting the preference learning problem into a simple supervised classification task, we remove the need for complex reward modeling and unstable reinforcement learning loops.
Up next: We will apply these alignment techniques to our running project in Project Milestone: Domain-Specific Fine-Tuning.
Direct Preference Optimization (DPO)