Mahamudul Hasan Rubel
Lesson 18 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 27, 2026 · 4 min read

Alignment with RLHF: Training Reward Models and PPO

Learn how to implement RLHF to align LLMs. We cover training reward models from first principles and the mechanics of PPO policy optimization.

Tags: RLHF, PPO, Alignment, Reward Modeling, Deep Learning, ai, machine-learning, python

Previously in this course, we explored Parameter-Efficient Fine-Tuning (LoRA) and Quantized LoRA (QLoRA) to adapt models to specific domains. While those methods excel at teaching models what to know, they don't inherently teach them how to behave according to human intent. That is where RLHF (Reinforcement Learning from Human Feedback) comes in.

In this lesson, we move beyond static supervised fine-tuning (SFT) to dynamic alignment. You’ll learn how to treat the model as an agent, define a reward signal, and optimize that agent using the Proximal Policy Optimization (PPO) algorithm.

The Architecture of Alignment

The standard RLHF pipeline has three stages:

  1. Supervised Fine-Tuning (SFT): The base model is trained on instruction-response pairs.
  2. Reward Modeling: A separate model learns to score responses based on human preferences.
  3. PPO Optimization: The SFT model is fine-tuned using the Reward Model to maximize rewards without drifting too far from the original model.

1. Training the Reward Model

A reward model is typically a transformer-based regressor. Instead of predicting the next token, it takes a prompt and a response, then outputs a scalar score. We train it on a dataset of preferences—pairs of responses (chosen vs. rejected) for the same prompt—using a pairwise ranking loss:

$$Loss = -\log(\sigma(r_{chosen} - r_{rejected}))$$

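Here $r_{chosen}$ and $r_{rejected}$ are the reward model's scalar scores for the two responses, and $\sigma$ is the sigmoid function. Minimizing this loss pushes the model to score the human-preferred response higher than the rejected one.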
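To make the loss concrete, here is a minimal sketch in plain Python (no PyTorch, so the arithmetic stays visible). The function name and the example scores are hypothetical; a real reward model would produce the scores from a prompt and a response.

```python
import math

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as softplus(-margin)."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)), branched so exp() never sees a large positive input
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores as (chosen, rejected) pairs
pairs = [(2.0, 0.0), (0.5, 0.5), (-1.0, 1.0)]
losses = [pairwise_ranking_loss(c, r) for c, r in pairs]
```

The loss is small when the chosen response already outscores the rejected one, equals log 2 when the two scores tie, and grows roughly linearly when the ranking is inverted.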

2. PPO: Policy Optimization

Once the reward model is frozen, we use PPO to update our policy (the LLM). PPO is an "actor-critic" method. In our context:

  • The Actor: The policy model (the LLM) generating text.
  • The Critic: A value model (often initialized from the reward model) that estimates the expected future reward.

PPO is preferred over vanilla policy gradients because it uses a "clipped" objective function, preventing the model from making massive, destructive updates to its weights during training.
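A small numeric sketch can show what the clipping does for a single sample. It uses plain Python scalars, and the ratio and advantage values are made up for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective (to be maximized): min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the objective stops growing once ratio exceeds 1 + eps
gain_small_step = clipped_surrogate(1.1, advantage=2.0)   # inside the clip range
gain_large_step = clipped_surrogate(1.5, advantage=2.0)   # capped at 1.2 * 2.0
# Negative advantage: pushing the ratio below 1 - eps earns nothing extra
loss_large_step = clipped_surrogate(0.5, advantage=-2.0)  # capped at 0.8 * -2.0
```

Once the ratio leaves the interval [1 - eps, 1 + eps] in the direction that would improve the objective, the value stays flat, so the gradient gives the policy no incentive to move further in that step.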

Implementing PPO: A Simplified Loop

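In a production setting, you would use a library such as TRL (Transformer Reinforcement Learning). The conceptual flow of a single PPO update is shown below. To keep it short, it assumes that the batch carries the log-probabilities recorded when the responses were sampled (old_logprobs), that model(input_ids) returns logits, and that a batch-mean baseline stands in for the critic:

PYTHON
import torch
import torch.nn.functional as F

def get_logprobs(model, queries, responses):
    # Summed log-probability of each response; assumes model(ids) returns
    # logits of shape (batch, seq_len, vocab)
    input_ids = torch.cat([queries, responses], dim=1)
    logits = model(input_ids)[:, :-1, :]  # position t predicts token t + 1
    token_logprobs = F.log_softmax(logits, dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, queries.shape[1] - 1:].sum(dim=1)  # response tokens only

def ppo_step(model, ref_model, reward_model, batch, eps=0.2, beta=0.1):
    # 1. Unpack a rollout sampled earlier; old_logprobs were recorded at sampling time
    queries, responses, old_logprobs = batch
    logprobs = get_logprobs(model, queries, responses)

    with torch.no_grad():
        # 2. Score with the frozen reward model, minus a KL penalty
        #    against the frozen reference (SFT) model
        ref_logprobs = get_logprobs(ref_model, queries, responses)
        rewards = reward_model(queries, responses) - beta * (old_logprobs - ref_logprobs)
        # Simplified advantage: a batch-mean baseline stands in for the critic
        advantages = rewards - rewards.mean()

    # 3. Calculate the ratio (pi_new / pi_old)
    ratio = torch.exp(logprobs - old_logprobs)

    # 4. Compute the clipped objective
    # This prevents the policy from changing too drastically
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(surr1, surr2).mean()

    return loss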
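
The advantages that feed the clipped objective are usually derived from the critic's value estimates. One common choice is Generalized Advantage Estimation (GAE). The sketch below is a plain-Python illustration under simplifying assumptions: a single finished episode, a value of 0 after the last step, and hypothetical reward and value numbers.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished episode.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate V(s_t). The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at step t
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# In RLHF the reward model's score typically arrives on the last token only
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]  # hypothetical critic outputs
adv = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
```

With lam=1.0 every token is credited with the full future return minus the critic's estimate. With lam=0.0 only the one-step TD error is used, which lowers variance at the cost of more bias.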

Alignment Dynamics

When training with RLHF, you'll encounter a phenomenon called "Reward Hacking." If the reward model is imperfect, the policy model will find shortcuts to maximize the score—such as being overly polite or repeating specific "trigger" words that the reward model associates with high scores.

To combat this, we include a KL divergence penalty. We measure the KL divergence between the output distribution of the current policy and that of the initial SFT model, scale it by a coefficient $\beta$, and subtract it from the reward. The further the policy drifts from the SFT model, the larger the penalty:

$$Reward_{final} = Reward_{model} - \beta \cdot KL(Policy_{curr} || Policy_{SFT})$$
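
As a rough illustration of the formula, the sketch below shapes one reward using per-token log-probabilities. It is plain Python, and the function name, scores, and log-probabilities are all hypothetical. It also assumes the tokens were sampled from the current policy, which is what makes the summed log-ratio a valid single-sample KL estimate.

```python
def kl_shaped_reward(reward_model_score, logprobs_curr, logprobs_ref, beta=0.1):
    """Return (Reward_final, KL estimate) for one sampled response.

    logprobs_curr and logprobs_ref hold the log-probability of each sampled
    token under the current policy and under the frozen SFT model.
    """
    kl_estimate = sum(c - r for c, r in zip(logprobs_curr, logprobs_ref))
    return reward_model_score - beta * kl_estimate, kl_estimate

# The current policy is much more confident in these tokens than the SFT model
logprobs_curr = [-0.2, -0.1, -0.3]
logprobs_ref = [-0.9, -1.2, -0.8]
final_reward, kl = kl_shaped_reward(1.5, logprobs_curr, logprobs_ref, beta=0.1)
```

Raising beta in this sketch lowers final_reward for the same response, which is how the penalty discourages the policy from moving far from the SFT model.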

Hands-on Exercise

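Your task is to implement the KL penalty calculation. Using PyTorch, estimate the KL divergence from the log-probabilities that the current policy and the reference model assign to the same sampled tokens:

  1. Define a log_probs_curr tensor and a log_probs_ref tensor of the same shape, holding the log-probability of each sampled token under the current policy and the reference model.
  2. Implement kl_div = (log_probs_curr - log_probs_ref).mean(). This is a sample-based estimate, so it is only meaningful when the tokens were sampled from the current policy. You can check it against torch.distributions.kl_divergence applied to two Categorical distributions built from the full logits.
  3. How does increasing $\beta$ affect the model's output diversity? Experiment by running the penalty with $\beta \in \{0.01, 0.1, 1.0\}$.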

Common Pitfalls

  • Reward Model Overfitting: If your reward model is trained on a small dataset, the policy model will quickly learn to exploit its blind spots. Always evaluate the reward model on a held-out set of preference pairs.
  • KL Penalty Scaling: Setting $\beta$ too low lets the policy drift into "model collapse," where the LLM stops generating coherent language and outputs repetitive, high-reward tokens. Setting it too high keeps the policy pinned to the SFT model, so it barely improves.
  • Memory Constraints: PPO requires holding four models in memory: the Actor, the Ref-Model, the Reward Model, and the Critic. Only the Actor and the Critic are trained; the other two stay frozen. Use Tensor and Pipeline Parallelism to manage this.
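
To get a feel for the memory pressure, here is a back-of-the-envelope estimate in plain Python. Every number in it is an assumption: all four models share one parameter count, weights are stored in 16-bit precision, the two trainable models use mixed-precision Adam, and activations and KV caches are ignored. The 7B size is a hypothetical example.

```python
def rlhf_memory_gb(n_params: float, weight_bytes: int = 2,
                   grad_bytes: int = 2, optim_bytes: int = 12) -> dict:
    """Rough memory for PPO-based RLHF, ignoring activations and KV caches.

    optim_bytes=12 assumes fp32 master weights plus two fp32 Adam moment
    buffers per parameter.
    """
    gb = 1e9
    # Ref-Model and Reward Model: frozen, weights only
    frozen = n_params * weight_bytes / gb
    # Actor and Critic: weights, gradients, and optimizer state
    trainable = n_params * (weight_bytes + grad_bytes + optim_bytes) / gb
    return {"frozen_each": frozen, "trainable_each": trainable,
            "total": 2 * frozen + 2 * trainable}

estimate = rlhf_memory_gb(7e9)  # hypothetical 7B-parameter models
```

Under these assumptions the two trainable models dominate the total, which is why sharding optimizer state or shrinking the critic tends to save more memory than optimizing the frozen models.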

Recap

Alignment via RLHF is the bridge between raw capability and human utility. By training a reward model to interpret preferences and using PPO to constrain policy updates, we can steer LLMs toward safer and more helpful behaviors. Remember that the reward model is the "source of truth"—if it's biased, your model's alignment will be biased too.

Up next: We will explore Direct Preference Optimization (DPO), a modern alternative to RLHF that simplifies the alignment process by removing the need for an explicit reward model and PPO.
