Mahamudul Hasan Rubel
Lesson 28 of 48 in the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 28, 2026 · 4 min read

Post-Training Quantization (PTQ): Optimizing Inference Speed

Master Post-Training Quantization (PTQ) to shrink your models and accelerate inference. Learn how to calibrate INT8/FP4 weights without costly retraining.

Tags: Quantization, PTQ, Inference, Optimization, Deep Learning, MLOps, ai, machine-learning, python

Earlier in this course, we covered Project Milestone: Domain-Specific Fine-Tuning, where we adapted our architecture to specific tasks. Now that we have a performant model, we must address two production bottlenecks: inference latency and memory footprint.

Post-Training Quantization (PTQ) is the process of converting a pre-trained model's weights (and, in static quantization, its activations) from higher-precision formats (like FP32 or BF16) to lower-precision formats (like INT8 or FP4) after training is complete. Unlike Quantized LoRA (QLoRA), which focuses on memory-efficient training, PTQ focuses on optimizing the model for deployment.

Understanding Quantization from First Principles

At its core, quantization is a mapping function $Q: \mathbb{R} \to \mathbb{Z}$. We want to map our continuous weight distribution into a discrete, smaller set of values. The standard linear quantization formula is:

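$$Q(x) = \text{clamp}\left(\text{round}\left(\frac{x}{S}\right) + Z,\; q_{\min},\; q_{\max}\right)$$

Where:

  • $S$ (Scale): A floating-point value that defines the step size.
  • $Z$ (Zero-point): An integer value that maps the floating-point zero to the quantized integer space.
  • $q_{\min}$, $q_{\max}$: The bounds of the target integer range, for example $-128$ and $127$ for signed INT8. Values that land outside this range are clamped to it.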
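To make the mapping concrete, here is a minimal pure-Python sketch of affine quantization and its approximate inverse. The helper names `quantize` and `dequantize`, the real range $[-1.0, 3.0]$, and the signed INT8 bounds are illustrative assumptions, not part of any library. The sketch also clamps the result to the integer range, which is how out-of-range values get saturated in practice.

```python
def quantize(x: float, scale: float, zero_point: int,
             q_min: int = -128, q_max: int = 127) -> int:
    """Affine quantization: round(x / S) + Z, clamped to the integer range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Approximate inverse: map the integer code back to a real value."""
    return scale * (q - zero_point)


# Hypothetical parameters: cover the real range [-1.0, 3.0] with signed INT8.
x_min, x_max = -1.0, 3.0
q_min, q_max = -128, 127
scale = (x_max - x_min) / (q_max - q_min)   # step size S
zero_point = round(q_min - x_min / scale)   # integer Z that represents 0.0

for x in [-1.0, 0.0, 0.7, 3.0, 10.0]:
    q = quantize(x, scale, zero_point)
    x_hat = dequantize(q, scale, zero_point)
    print(f"x={x:6.2f} -> q={q:4d} -> x_hat={x_hat:7.4f}")
```

Values inside the range come back with an error of at most half a step ($S/2$), while the out-of-range input 10.0 saturates at the largest integer code and cannot be recovered.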

When we perform PTQ, we aren't changing the model's logic; we are changing its representation. Moving from FP32 (4 bytes per parameter) to INT8 (1 byte per parameter) cuts weight storage by roughly 4x; the stored scales and zero-points add only a small overhead. This is critical for fitting larger models into GPU VRAM or enabling inference on edge hardware.
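The arithmetic is easy to check. The sketch below estimates weight storage for a hypothetical 7B-parameter model at several precisions; the parameter count and the helper `weight_memory_gb` are assumptions for illustration, and the estimate deliberately ignores activations, the KV cache, and the small overhead of storing scales and zero-points.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "INT8": 1.0, "FP4": 0.5}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GB (10^9 bytes) for the given format."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 7_000_000_000  # hypothetical 7B-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```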

The Role of Calibration Datasets

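The challenge with PTQ is that squeezing values into a smaller range loses information in two ways: rounding maps each value to the nearest representable level, and clipping saturates anything that falls outside the chosen range. If you pick the range carelessly, one or both errors grow large enough to hurt accuracy.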

Calibration is the process of passing a small, representative subset of your training data through the model to observe the distribution of activations. From these observed ranges, we choose a scale ($S$) and zero-point ($Z$) that keep the quantization error low, or equivalently keep the Signal-to-Quantization-Noise Ratio (SQNR) high.
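One way to picture calibration is as a small search over candidate clipping ranges, scored by SQNR. The NumPy sketch below is an illustration under stated assumptions: the activations are synthetic (a normal distribution plus three outliers), the candidate percentiles are arbitrary, and `fake_quantize` and `sqnr_db` are hypothetical helpers rather than library functions. Real toolkits use observers that follow the same idea with more refined statistics.

```python
import numpy as np


def fake_quantize(x, lo, hi, q_min=-128, q_max=127):
    """Quantize x to INT8 using the range [lo, hi], then map back to floats."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (q_max - q_min)
    zero_point = int(round(q_min - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return scale * (q - zero_point)


def sqnr_db(x, x_hat):
    """Signal-to-quantization-noise ratio in decibels (higher is better)."""
    noise = np.sum((x - x_hat) ** 2)
    return 10.0 * np.log10(np.sum(x ** 2) / noise)


# Stand-in for activations recorded during calibration: mostly small values
# plus a few large outliers (synthetic data, not from a real model).
rng = np.random.default_rng(0)
activations = np.concatenate([rng.normal(0.0, 1.0, 10_000), [40.0, -35.0, 50.0]])

# Candidate clipping ranges; percentile 100 is plain min/max calibration.
results = {}
for pct in [100.0, 99.99, 99.9, 99.0]:
    lo = np.percentile(activations, 100.0 - pct)
    hi = np.percentile(activations, pct)
    results[pct] = sqnr_db(activations, fake_quantize(activations, lo, hi))
    print(f"percentile {pct:6.2f}: range [{lo:7.2f}, {hi:7.2f}], SQNR {results[pct]:6.2f} dB")

best_pct = max(results, key=results.get)
print(f"Best clipping percentile on this sample: {best_pct}")
```

Which range wins depends on the data and the bit width: a tighter range gives finer steps for typical values but a larger clipping error on the outliers. SQNR over the whole tensor is also only a proxy, so the final check should still be the downstream metric.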

Worked Example: Quantizing a Linear Layer

In this example, we use torch.quantization to perform static quantization on a simple linear layer.

PYTHON
import torch
import torch.nn as nn

# 1. Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

# 2. Setup quantization configuration
model = SimpleModel()
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # 'fbgemm' targets x86 CPUs

# 3. Prepare and Calibrate
model_prepared = torch.quantization.prepare(model)

# Calibration: Pass representative data
with torch.no_grad():
    for _ in range(10):
        dummy_input = torch.randn(1, 10)
        model_prepared(dummy_input)

# 4. Convert to quantized model
model_int8 = torch.quantization.convert(model_prepared)

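# 5. Compare weight storage in bytes (FP32 uses 4 bytes per element, INT8 uses 1)
fp32_bytes = model.fc.weight.element_size() * model.fc.weight.nelement()
w_int8 = model_int8.fc.weight().int_repr()  # raw INT8 values of the quantized weight
int8_bytes = w_int8.element_size() * w_int8.nelement()
print(f"Original weight size: {fp32_bytes} bytes")
print(f"Quantized weight size: {int8_bytes} bytes")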

Evaluating the Accuracy vs. Speedup Trade-off

PTQ is not a "free lunch." You must always benchmark the degradation in your downstream metrics (e.g., perplexity, accuracy, or F1-score) against the inference speedup.

| Metric | FP32 (Baseline) | INT8 (Quantized) |
| --- | --- | --- |
| Memory Footprint | 100% | 25% |
| Throughput (tokens/s) | 1.0x | 1.5x - 2.5x |
| Model Accuracy | Baseline | Minimal drop (0.1-0.5%) |

If the accuracy drop is too high, you may need Quantization-Aware Training (QAT), which simulates quantization errors during a short fine-tuning phase, though that falls outside the scope of pure PTQ.
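Measuring the speedup side of the trade-off can be sketched with a small timing harness. Everything here is an assumption for illustration: `median_latency_ms` and `speedup` are hypothetical helpers, and the two workloads are stand-ins for calls such as `model(batch)` and `model_int8(batch)`.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Median wall-clock latency of fn() in milliseconds, after warm-up runs."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workloads (hypothetical). In practice, wrap your real models here.
def baseline_workload():
    return sum(i * i for i in range(20_000))


def candidate_workload():
    return sum(i * i for i in range(10_000))


fp32_ms = median_latency_ms(baseline_workload)
int8_ms = median_latency_ms(candidate_workload)
print(f"baseline: {fp32_ms:.3f} ms, candidate: {int8_ms:.3f} ms, "
      f"speedup: {speedup(fp32_ms, int8_ms):.2f}x")
```

The median is used because latency samples tend to have a long tail. When timing GPU inference, synchronize the device before reading the clock (for example with `torch.cuda.synchronize()`), otherwise you measure only the kernel launch.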

Hands-on Exercise

  1. Take the model you built in the Project Milestone: Custom Architecture Setup.
  2. Implement a calibration loop using 100 samples from your validation set.
  3. Measure the "Quantization Error" by calculating the Mean Squared Error (MSE) between the output of your FP32 model and the quantized model on the same input.
  4. If the MSE > 0.05, adjust your calibration strategy (e.g., use a more diverse set of data).
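Steps 3 and 4 can be sketched as follows. This is a framework-agnostic illustration using NumPy: `output_mse` is a hypothetical helper, and the two "models" are synthetic stand-ins (one linear layer and a copy with symmetrically quantized weights). With PyTorch models you would pass callables that run each model under `torch.no_grad()` and return the outputs as arrays.

```python
import numpy as np


def output_mse(reference_fn, quantized_fn, batches):
    """Mean squared error between two models' outputs over the same inputs."""
    squared_error, count = 0.0, 0
    for batch in batches:
        ref = np.asarray(reference_fn(batch), dtype=np.float64)
        out = np.asarray(quantized_fn(batch), dtype=np.float64)
        squared_error += float(np.sum((ref - out) ** 2))
        count += ref.size
    return squared_error / count


# Stand-in "models": a linear layer with FP32 weights and a copy whose weights
# went through symmetric INT8 quantization (synthetic, for illustration only).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.1, size=(10, 10))
scale = np.abs(weights).max() / 127.0
weights_q = np.clip(np.round(weights / scale), -127, 127) * scale


def fp32_model(x):
    return x @ weights.T


def int8_model(x):
    return x @ weights_q.T


# 25 batches of 4 inputs each, i.e. 100 validation samples.
validation_batches = [rng.normal(size=(4, 10)) for _ in range(25)]
mse = output_mse(fp32_model, int8_model, validation_batches)
print(f"Output MSE: {mse:.6f}")
print("Revisit calibration" if mse > 0.05 else "Within the 0.05 threshold")
```

The 0.05 threshold comes from the exercise; a sensible value for your own model depends on the scale of its outputs.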

Common Pitfalls

  • Ignoring Activation Outliers: If you see high accuracy loss, check whether your model has "activation outliers": a few very large values that either saturate the quantization range or force the scale ($S$) to be too large, effectively crushing the precision of the smaller, more important values.
  • Per-tensor vs. Per-channel: Using per-tensor quantization (one scale for the whole layer) is easier but less accurate than per-channel quantization (different scales for each output channel). Always default to per-channel for weights.
  • Calibration Data Bias: If your calibration dataset doesn't match the distribution of real-world production traffic, the quantization parameters will be sub-optimal, leading to significant performance degradation in production.
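The per-tensor versus per-channel pitfall is easy to demonstrate numerically. The NumPy sketch below uses a synthetic weight matrix whose rows (output channels) differ widely in magnitude; `quantize_symmetric` is a hypothetical helper that applies symmetric INT8 quantization and immediately dequantizes so the error can be measured.

```python
import numpy as np


def quantize_symmetric(w, axis=None, q_max=127):
    """Symmetric INT8 fake-quantization.

    axis=None uses one scale for the whole tensor; axis=1 uses one scale per
    row, which corresponds to one scale per output channel of a linear layer.
    """
    max_abs = np.max(np.abs(w), axis=axis, keepdims=axis is not None)
    scale = max_abs / q_max
    return np.clip(np.round(w / scale), -q_max, q_max) * scale


# Synthetic weight matrix whose output channels (rows) differ widely in magnitude.
rng = np.random.default_rng(0)
channel_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])
weights = rng.normal(size=(4, 64)) * channel_magnitudes

per_tensor_mse = np.mean((weights - quantize_symmetric(weights)) ** 2)
per_channel_mse = np.mean((weights - quantize_symmetric(weights, axis=1)) ** 2)

print(f"per-tensor  MSE: {per_tensor_mse:.3e}")
print(f"per-channel MSE: {per_channel_mse:.3e}")
```

With a single scale, the largest channel dictates the step size, so the small-magnitude channels collapse onto a handful of levels. Per-channel scales give each row its own step size, which is why the per-channel error comes out lower on data like this.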

Recap

PTQ is a powerful, low-overhead optimization technique. By using calibration data to set the scale and zero-point parameters, we can reduce model size by up to 4x (FP32 to INT8), often with only a small accuracy drop. Always validate your quantized model's accuracy on a representative hold-out set before deploying to production.

Up next: We will explore Model Pruning Techniques to further reduce model density and latency.
