Post-Training Quantization (PTQ): Optimizing Inference Speed

Master Post-Training Quantization (PTQ) to shrink your models and accelerate inference. Learn how to calibrate INT8/FP4 weights without costly retraining.

Previously in this course, we covered Project Milestone: Domain-Specific Fine-Tuning, where we adapted our architecture to specific tasks. Now that we have a performant model, we must address the production bottleneck: inference latency and memory footprint.

Post-Training Quantization (PTQ) is the process of converting a pre-trained model's weights (and, in static quantization, its activations) from higher-precision formats (like FP32 or BF16) to lower-precision formats (like INT8 or FP4) after training is complete. Unlike Quantized LoRA (QLoRA), which focuses on memory-efficient training, PTQ focuses on optimizing the model for deployment.

Understanding Quantization from First Principles

At its core, quantization is a mapping function $Q: \mathbb{R} \to \mathbb{Z}$. We want to map our continuous weight distribution into a discrete, smaller set of values. The standard linear quantization formula is:

$$Q(x) = \text{round}\left(\frac{x}{S} + Z\right)$$

Where:

$S$ (Scale): A floating-point value that defines the step size.
$Z$ (Zero-point): An integer value that maps the floating-point zero to the quantized integer space.
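
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `quantize` and `dequantize`, the signed 8-bit range used for clamping, and the scale and zero-point values are all assumptions chosen for the example. Note that Python's built-in `round` rounds ties to the nearest even integer.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Apply Q(x) = round(x / S + Z), then clamp to the integer range."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse mapping: x_hat = S * (q - Z)."""
    return scale * (q - zero_point)


# Hypothetical parameters, chosen only for illustration
scale, zero_point = 0.02, 0

for x in (0.0, 0.5, -1.0, 3.0):
    q = quantize(x, scale, zero_point)
    x_hat = dequantize(q, scale, zero_point)
    print(f"x={x:+.3f} -> q={q:4d} -> x_hat={x_hat:+.3f}")
```

Values inside the representable range come back with an error of at most half a step ($S/2$), while a value such as 3.0 falls outside the range for this scale and is clamped to the largest integer.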

When we perform PTQ, we aren't changing the model's logic; we are changing its representation. By moving from FP32 (4 bytes per parameter) to INT8 (1 byte per parameter), we immediately reduce the memory footprint by 4x. This is critical for fitting larger models into GPU VRAM or enabling inference on edge hardware.
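
The arithmetic behind the 4x figure can be checked with a short sketch. The 7-billion-parameter count is a hypothetical example, and the estimate covers weight storage only: it ignores the small overhead of storing scales and zero-points, as well as activations and any runtime buffers.

```python
# Bytes needed to store one parameter in each format
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "INT8": 1, "FP4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Weight storage in GiB for a given parameter count and format."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


num_params = 7_000_000_000  # hypothetical 7B-parameter model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```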

The Role of Calibration Datasets

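The challenge with PTQ is that mapping values onto a small integer grid introduces two kinds of error: rounding error for values inside the representable range, and clipping error for values that fall outside it. If the scale is chosen poorly, either one can discard information the model depends on.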

Calibration is the process of passing a small, representative subset of your training data through the model to observe the distribution of activations. From these observed ranges, we choose a scale ($S$) and zero-point ($Z$) that minimize the quantization error or, equivalently, maximize the Signal-to-Quantization-Noise Ratio (SQNR).
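
As a rough sketch of what calibration computes, the snippet below tracks the running minimum and maximum of observed values and derives $S$ and $Z$ from them, then measures SQNR in decibels. The `MinMaxObserver` class and `sqnr_db` function are written for this illustration and are not PyTorch's own observer implementation; the unsigned 8-bit range and the Gaussian stand-in for activations are assumptions.

```python
import math
import random


class MinMaxObserver:
    """Tracks the running min/max of observed values (a simple calibration strategy)."""

    def __init__(self):
        self.min_val = math.inf
        self.max_val = -math.inf

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def qparams(self, qmin=0, qmax=255):
        # Extend the range to include 0.0 so real zero is exactly representable
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
        zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
        return scale, zero_point


def sqnr_db(values, scale, zero_point, qmin=0, qmax=255):
    """Signal-to-quantization-noise ratio in dB after a quantize/dequantize round trip."""
    signal = noise = 0.0
    for x in values:
        q = max(qmin, min(qmax, round(x / scale + zero_point)))
        x_hat = scale * (q - zero_point)
        signal += x * x
        noise += (x - x_hat) ** 2
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


# Stand-in for activations collected from 10 calibration batches
random.seed(0)
batches = [[random.gauss(0.0, 1.0) for _ in range(256)] for _ in range(10)]

observer = MinMaxObserver()
for batch in batches:
    observer.observe(batch)

scale, zero_point = observer.qparams()
flat = [x for batch in batches for x in batch]
print(f"scale={scale:.5f}, zero_point={zero_point}, SQNR={sqnr_db(flat, scale, zero_point):.1f} dB")
```

A higher SQNR means the quantized values track the originals more closely. Min/max is only one possible strategy; it is sensitive to outliers because a single extreme value widens the range for everything else.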

Worked Example: Quantizing a Linear Layer

In this example, we use torch.quantization to perform static quantization on a simple linear layer.


PYTHON
import torch
import torch.nn as nn

# 1. Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

# 2. Setup quantization configuration
model = SimpleModel()
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# 3. Prepare and Calibrate
model_prepared = torch.quantization.prepare(model)

# Calibration: Pass representative data
with torch.no_grad():
    for _ in range(10):
        dummy_input = torch.randn(1, 10)
        model_prepared(dummy_input)

# 4. Convert to quantized model
model_int8 = torch.quantization.convert(model_prepared)

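# 5. Compare weight storage in bytes. prepare() works on a copy by default,
#    so `model` still holds the FP32 weights.
fp32_bytes = model.fc.weight.element_size() * model.fc.weight.nelement()
int8_weight = model_int8.fc.weight()  # quantized Linear exposes its weight through a method
int8_bytes = int8_weight.element_size() * int8_weight.nelement()
print(f"Original weight size: {fp32_bytes} bytes")
print(f"Quantized weight size: {int8_bytes} bytes")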

Evaluating the Accuracy vs. Speedup Trade-off

PTQ is not a "free lunch." You must always benchmark the degradation in your downstream metrics (e.g., perplexity, accuracy, or F1-score) against the inference speedup.

Metric	FP32 (Baseline)	INT8 (Quantized)
Weight memory footprint	100%	25%
Throughput (relative to FP32)	1.0x	1.5x - 2.5x (hardware- and kernel-dependent)
Model accuracy	Baseline	Typically a small drop (0.1-0.5%), model-dependent

If the accuracy drop is too high, you may need Quantization-Aware Training (QAT), which simulates quantization errors during a short fine-tuning phase, though that falls outside the scope of pure PTQ.
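
To put numbers on the trade-off discussed in this section, a small timing harness along the following lines can help. It is a sketch under stated assumptions: `benchmark` and `report` are hypothetical helper names, the callables passed in stand for your FP32 and quantized models, and the metric values are whatever your evaluation produces. On a GPU you would also need to synchronize the device before reading the clock.

```python
import statistics
import time


def benchmark(fn, *args, warmup=5, iters=50):
    """Return the median latency of fn(*args) in milliseconds."""
    for _ in range(warmup):  # warm-up runs are excluded from the measurement
        fn(*args)
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1e3)
    return statistics.median(timings)


def report(baseline_ms, quantized_ms, baseline_metric, quantized_metric):
    """Summarize speedup and metric degradation for a higher-is-better metric."""
    return {
        "speedup": baseline_ms / quantized_ms,
        "metric_drop": baseline_metric - quantized_metric,
    }


# Stand-in workload; replace with calls to your FP32 and quantized models
latency_ms = benchmark(lambda: sum(range(1000)))
print(f"median latency: {latency_ms:.4f} ms")
print(report(baseline_ms=10.0, quantized_ms=5.0, baseline_metric=0.90, quantized_metric=0.89))
```

The median is used here rather than the mean so that occasional slow iterations do not skew the result.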

Hands-on Exercise

Take the model you built in the Project Milestone: Custom Architecture Setup.
Implement a calibration loop using 100 samples from your validation set.
Measure the "Quantization Error" by calculating the Mean Squared Error (MSE) between the output of your FP32 model and the quantized model on the same input.
If the MSE > 0.05, adjust your calibration strategy (e.g., use a more diverse set of data).
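
As a starting point for the error measurement, here is a minimal sketch of the MSE check in plain Python. It assumes both outputs have already been flattened to lists of floats (for a PyTorch tensor, `tensor.flatten().tolist()` does this); the function names and the sample numbers are made up for illustration, and only the 0.05 threshold comes from the exercise.

```python
def mse(reference, candidate):
    """Mean squared error between two equal-length sequences of floats."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


MSE_THRESHOLD = 0.05  # threshold from the exercise


def needs_recalibration(fp32_outputs, int8_outputs, threshold=MSE_THRESHOLD):
    """True when the quantized outputs drift too far from the FP32 outputs."""
    return mse(fp32_outputs, int8_outputs) > threshold


# Made-up outputs for one input, for illustration only
fp32_out = [0.12, -0.53, 1.04, 0.00]
int8_out = [0.10, -0.50, 1.00, 0.02]

print(f"MSE: {mse(fp32_out, int8_out):.6f}")
print(f"Needs recalibration: {needs_recalibration(fp32_out, int8_out)}")
```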

Common Pitfalls

Ignoring Activation Outliers: A few very large activations can dominate the quantization range. If you see high accuracy loss, check whether your model has "activation outliers": large values that force the scale ($S$) to be too large, effectively crushing the precision of smaller, more important values.
Per-tensor vs. Per-channel: Using per-tensor quantization (one scale for the whole layer) is easier but less accurate than per-channel quantization (different scales for each output channel). Always default to per-channel for weights.
Calibration Data Bias: If your calibration dataset doesn't match the distribution of real-world production traffic, the quantization parameters will be sub-optimal, leading to significant performance degradation in production.
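
The first two pitfalls can be seen in a toy example. The sketch below compares per-tensor and per-channel symmetric quantization on a made-up weight matrix in which one output channel has much larger magnitudes than the other. The helper names and weight values are assumptions for illustration, and the quantization is simulated in floating point rather than run through a real INT8 kernel.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto qmax (symmetric, zero-point = 0)."""
    return max(abs(v) for v in values) / qmax or 1.0


def fake_quant(values, scale, qmax=127):
    """Quantize then dequantize, to expose the error introduced by a given scale."""
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weight matrix: one row per output channel
weights = [
    [0.01, -0.02, 0.015, -0.005],  # small-magnitude channel
    [4.0, -3.5, 2.0, -1.0],        # large-magnitude channel that dominates the range
]

# Per-tensor: a single scale shared by every channel
per_tensor_scale = symmetric_scale([w for row in weights for w in row])
per_tensor_err = [mse(row, fake_quant(row, per_tensor_scale)) for row in weights]

# Per-channel: each output channel gets its own scale
per_channel_err = [mse(row, fake_quant(row, symmetric_scale(row))) for row in weights]

for i, (pt, pc) in enumerate(zip(per_tensor_err, per_channel_err)):
    print(f"channel {i}: per-tensor MSE={pt:.3e}, per-channel MSE={pc:.3e}")
```

With a shared scale, the step size is set by the large channel, so most of the small channel's weights round to zero. Giving each channel its own scale removes that coupling, which is why per-channel is the usual default for weights.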

Recap

PTQ is a powerful, low-overhead optimization technique. By using calibration data to set the scale and zero-point parameters, we can reduce weight storage by about 4x (FP32 to INT8), often with only a small accuracy impact. Always validate your quantized model's accuracy on a representative hold-out set before deploying to production.

Up next: We will explore Model Pruning Techniques to further reduce model density and latency.
