Mahamudul Hasan Rubel
Lesson 19 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 27, 2026 · 4 min read

Direct Preference Optimization (DPO) for LLM Alignment

Learn how to implement DPO to align LLMs without a reward model. Master the DPO training loop, compare it to RLHF, and optimize your model's preferences.

Tags: DPO, Alignment, Preference Learning, LLMs, Deep Learning, ai, machine-learning, python

Previously in this course, we explored Alignment with RLHF: Training Reward Models and PPO, which detailed the multi-stage process of training a separate reward model and then using PPO to optimize the policy against it. While powerful, PPO-based RLHF is often unstable and computationally expensive. This lesson introduces Direct Preference Optimization (DPO), which recasts alignment as a classification problem over preference pairs and removes the need for a separate reward model.

The Problem with RLHF

In Alignment with RLHF: Training Reward Models and PPO, you saw that RLHF requires:

  1. Training a separate Reward Model (RM) to predict human preferences.
  2. Using Reinforcement Learning (PPO) to optimize the policy against this RM.
  3. Managing the inherent instability of policy gradients and the "reward hacking" problem.

DPO simplifies this by observing that the KL-constrained reward-maximization objective used in RLHF has a closed-form optimal policy. Rearranging that solution expresses the reward in terms of the policy itself, so the preference data can be fit directly with a standard supervised learning objective, with no separate reward model and no RL loop.
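A brief sketch of where that closed form comes from, following the standard derivation. The notation is an assumption of this sketch: $r(x, y)$ is a latent reward, $\beta$ is the KL penalty strength, $\pi_{ref}$ is the frozen reference policy, and $Z(x)$ is a normalizing partition function. RLHF maximizes reward under a KL penalty:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)} \left[ r(x, y) \right] - \beta \, \mathrm{KL}\left( \pi(\cdot|x) \,\|\, \pi_{ref}(\cdot|x) \right)$$

The maximizer has the form:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{ref}(y|x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$

Solving for the reward gives:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$

Under a Bradley-Terry preference model, the probability that $y_w$ is preferred over $y_l$ depends only on the reward difference, so the intractable $\beta \log Z(x)$ term cancels:

$$p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right) = \sigma\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{ref}(y_l|x)} \right)$$

Maximizing the log-likelihood of the observed preferences under this model, with a trainable policy in place of $\pi^*$, yields the DPO loss.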

DPO from First Principles

DPO works by optimizing the model to increase the probability of the "chosen" response ($y_w$) while decreasing the probability of the "rejected" response ($y_l$), relative to a reference model ($\pi_{ref}$). The objective function is:

$$L_{DPO} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]$$

Where:

  • $\pi_\theta$ is the policy we are training.
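  • $\pi_{ref}$ is the frozen reference model (usually the SFT model that the policy is initialized from).
  • $\beta$ is a hyperparameter controlling the strength of the KL-divergence penalty: a larger $\beta$ keeps the policy closer to the reference, a smaller $\beta$ lets it move further away.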

By minimizing this loss, the model learns to shift probability mass toward the preferred outputs without ever explicitly calculating a scalar reward.
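Although no reward model is trained, the quantity $\beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}$ behaves like an implicit reward, and tracking it is a common way to monitor DPO training. The following is a minimal sketch under stated assumptions: the function name `implicit_rewards` and the toy log-probabilities are made up for illustration, and the inputs are taken to be per-sequence (summed) log-probabilities.

```python
import torch

def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logps - ref_logps)

# Toy per-sequence log-probabilities for a batch of 3 preference pairs
# (made-up numbers, for illustration only).
policy_chosen = torch.tensor([-10.0, -12.0, -9.0])
policy_rejected = torch.tensor([-11.0, -11.5, -14.0])
ref_chosen = torch.tensor([-10.5, -12.0, -10.0])
ref_rejected = torch.tensor([-10.5, -12.0, -13.0])

chosen_rewards = implicit_rewards(policy_chosen, ref_chosen)
rejected_rewards = implicit_rewards(policy_rejected, ref_rejected)

# A positive margin means the policy ranks the chosen response above the
# rejected one, relative to the reference model.
margins = chosen_rewards - rejected_rewards
accuracy = (margins > 0).float().mean()  # 2 of the 3 pairs have a positive margin
```

The DPO loss is the negative log-sigmoid of these margins, so a rising mean margin and a rising accuracy are the signals to look for during training.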

Worked Example: Implementing the DPO Loss

To implement DPO, we need a standard training loop where we calculate the log probabilities of both chosen and rejected sequences for both the active policy and the reference model.

PYTHON
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps, 
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Each argument is a tensor of shape (batch,) holding per-sequence
    log-probabilities, i.e. token log-probs summed over the response tokens.

    policy_chosen_logps: Log-probs of chosen responses under current model
    policy_rejected_logps: Log-probs of rejected responses under current model
    ref_chosen_logps: Log-probs of chosen responses under reference model
    ref_rejected_logps: Log-probs of rejected responses under reference model
    """
    # Calculate the log-ratio for the policy and reference
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    
    # DPO objective
    logits = policy_logratios - ref_logratios
    loss = -F.logsigmoid(beta * logits).mean()
    
    return loss
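
The loss above takes per-sequence log-probabilities as input, which have to be computed from the model's logits. One way this could be done is sketched below. The helper name `sequence_logps` and the `response_mask` argument are assumptions of this sketch, not part of any particular library; the mask is assumed to be 1.0 on response tokens and 0.0 on prompt and padding tokens.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, response_mask):
    """
    logits: (batch, seq_len, vocab) outputs of a causal LM
    labels: (batch, seq_len) token ids of prompt + response
    response_mask: (batch, seq_len), 1.0 on response tokens, 0.0 elsewhere
    Returns: (batch,) summed log-prob of the response tokens.
    """
    # Position t predicts token t + 1, so drop the last logit and the first label.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:]
    mask = response_mask[:, 1:].to(logits.dtype)

    # Pick out the log-prob assigned to each actual next token.
    token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=targets.unsqueeze(-1)
    ).squeeze(-1)

    # Sum over response tokens only; prompt and padding contribute nothing.
    return (token_logps * mask).sum(dim=-1)
```

Calling this helper four times per batch (policy and reference, chosen and rejected) produces the four arguments of dpo_loss.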

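In a production setting, you would integrate this loss into your fine-tuning script, using the strategies from Fine-tuning Methodologies Overview: Strategies for LLM Adaptation. Because DPO evaluates both the active policy and the reference model on every batch, a naive implementation holds two copies of the model, which can roughly double the memory needed for weights. Parameter-efficient methods such as Quantized LoRA (QLoRA): Fine-tuning Massive Models on Consumer GPUs help here: the quantized base weights stay frozen and only the adapters are trained, so the base model with adapters disabled can typically serve as the reference without loading a second copy.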

Comparison: DPO vs. RLHF

| Feature | RLHF (PPO) | DPO |
| --- | --- | --- |
| Complexity | High (RM + Policy) | Low (Single Policy) |
| Stability | Low (Policy Gradients) | High (Supervised Loss) |
| Compute | High (Multiple models) | Moderate (Two forward passes) |
| Reward Model | Required | Not required |

Hands-on Exercise

  1. Prepare Data: Create a dummy dataset with {"prompt": "...", "chosen": "...", "rejected": "..."} entries.
  2. Setup: Load your base (SFT) model twice: once as the ref_model (frozen, in eval mode) and once as the active policy_model (trainable).
  3. Training Loop: Modify your existing fine-tuning loop to perform two forward passes per batch (one for policy, one for reference) and compute the dpo_loss.
  4. Evaluation: Compare the log-probability ratios on a held-out evaluation set before and after training.
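The steps above can be sketched end to end on a toy problem. Everything here is a stand-in chosen so the example runs in seconds on a CPU: `TinyLM` is a hypothetical bigram-style model, not a real LLM, the "dataset" is random token ids, and for brevity the log-probabilities are summed over the whole sequence with no prompt masking.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: each token predicts the next one."""
    def __init__(self, vocab=16, dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.emb(ids))  # (batch, seq_len, vocab)

def seq_logps(model, ids):
    # Summed next-token log-probs over the whole sequence (no prompt masking).
    logps = F.log_softmax(model(ids)[:, :-1], dim=-1)
    return logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

def dpo_loss(pc, pr, rc, rr, beta=0.1):
    return -F.logsigmoid(beta * ((pc - pr) - (rc - rr))).mean()

torch.manual_seed(0)
policy_model = TinyLM()
ref_model = copy.deepcopy(policy_model).eval()  # frozen copy of the start point
for p in ref_model.parameters():
    p.requires_grad_(False)

# Dummy preference pairs: 4 chosen and 4 rejected sequences of 6 tokens each.
chosen = torch.randint(0, 16, (4, 6))
rejected = torch.randint(0, 16, (4, 6))
optimizer = torch.optim.AdamW(policy_model.parameters(), lr=1e-2)

losses = []
for step in range(20):
    with torch.no_grad():  # reference pass: no gradients needed
        rc, rr = seq_logps(ref_model, chosen), seq_logps(ref_model, rejected)
    pc, pr = seq_logps(policy_model, chosen), seq_logps(policy_model, rejected)
    loss = dpo_loss(pc, pr, rc, rr)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because the policy starts as an exact copy of the reference, the first loss value equals $\log 2$, and it should fall from there as the policy learns to prefer the chosen sequences. With a real model, the same structure applies, with the toy pieces swapped for your tokenizer, dataset, and a masked log-probability computation.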

Common Pitfalls

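  • KL Divergence Drift: If $\beta$ is too small, the penalty for moving away from the reference model is weak, so the policy can drift far from it. In the worst case the model degenerates, producing gibberish that still scores well under the implicit reward (DPO has no separate reward function to consult).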
  • Overfitting: DPO is prone to overfitting on the preference dataset. Ensure you have high-quality, diverse data; noisy preference labels will directly corrupt the policy.
  • Memory Management: Remember that you are running two models in memory. If you run out of VRAM, use Tensor and Pipeline Parallelism: Scaling Large Model Training to distribute the load across GPUs.
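One further option for the memory pitfall, offered as a sketch rather than a recommendation from this lesson: since the reference model is frozen, its log-probabilities never change, so they can be computed once before training and the reference model can then be dropped from GPU memory. The function name `precompute_ref_logps` and the `logp_fn` callback are hypothetical; `logp_fn(model, batch)` is assumed to return the chosen and rejected per-sequence log-probs for a batch.

```python
import torch

@torch.no_grad()
def precompute_ref_logps(logp_fn, ref_model, batches):
    """
    logp_fn(model, batch) -> (chosen_logps, rejected_logps), each of shape (batch,)
    batches: an iterable of batches, visited in a fixed order
    Returns a list with one (chosen, rejected) pair of CPU tensors per batch.
    """
    ref_model.eval()
    cache = []
    for batch in batches:
        chosen, rejected = logp_fn(ref_model, batch)
        # Move results to CPU so they do not occupy accelerator memory.
        cache.append((chosen.cpu(), rejected.cpu()))
    return cache
```

The cached values are only valid if training later visits the same batches in the same order, or if each entry is stored alongside its example, so shuffling needs to be handled with care.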

Recap

Direct Preference Optimization provides a mathematically elegant, stable way to perform LLM alignment. By converting the preference learning problem into a simple supervised classification task, we remove the need for complex reward modeling and unstable reinforcement learning loops.

Up next: We will apply these alignment techniques to our running project in Project Milestone: Domain-Specific Fine-Tuning.
