Master Post-Training Quantization (PTQ) to shrink your models and accelerate inference. Learn how to calibrate INT8/FP4 weights without costly retraining.
Previously in this course, we covered Project Milestone: Domain-Specific Fine-Tuning, where we adapted our architecture to specific tasks. Now that we have a performant model, we must address the production bottleneck: inference latency and memory footprint.
Post-Training Quantization (PTQ) is the process of converting a pre-trained model's weights from higher-precision formats (like FP32 or BF16) to lower-precision formats (like INT8 or FP4) after training is complete. Unlike Quantized LoRA (QLoRA), which focuses on memory-efficient training, PTQ focuses on optimizing the model for deployment.
At its core, quantization is a mapping function $Q: \mathbb{R} \to \mathbb{Z}$. We want to map our continuous weight distribution into a discrete, smaller set of values. The standard linear quantization formula is:
$$Q(x) = \text{round}\left(\frac{x}{S} + Z\right)$$
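Where:
- $x$ is the real-valued input (a weight or an activation),
- $S$ is the scale, the real-valued step size between adjacent integer levels,
- $Z$ is the zero-point, the integer that represents the real value 0.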
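To make the mapping concrete, here is a minimal sketch in plain Python. It assumes a signed 8-bit target range of [-128, 127], derives $S$ and $Z$ from an observed min/max range, and adds a clamp step that the formula leaves implicit. The function names are illustrative and do not come from any library.

```python
def compute_qparams(x_min, x_max, q_min=-128, q_max=127):
    """Derive scale S and zero-point Z from an observed real-valued range."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:  # degenerate case: every observed value was 0
        scale = 1.0
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Q(x) = round(x / S + Z), then clamp to the integer range (assumed step)."""
    q = round(x / scale + zero_point)  # Python rounds exact halves to even
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse of Q: map an integer level back to a real value."""
    return scale * (q - zero_point)


# Example: weights observed in the range [-1.0, 3.0]
scale, zero_point = compute_qparams(-1.0, 3.0)
for w in [-1.0, -0.25, 0.0, 0.5, 3.0]:
    q = quantize(w, scale, zero_point)
    print(f"{w:+.4f} -> q={q:4d} -> {dequantize(q, scale, zero_point):+.4f}")
```

The gap between each input and its dequantized value is the rounding error that the rest of this lesson tries to keep small.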
When we perform PTQ, we aren't changing the model's logic; we are changing its representation. By moving from FP32 (4 bytes per parameter) to INT8 (1 byte per parameter), we immediately reduce the memory footprint by 4x. This is critical for fitting larger models into GPU VRAM or enabling inference on edge hardware.
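The footprint arithmetic can be sketched in a few lines. The 7-billion-parameter count below is a hypothetical example, and the estimate covers weight storage only: it ignores the per-tensor scales and zero-points, activations, and any runtime buffers.

```python
# Bits per parameter for the formats mentioned in this lesson.
BITS_PER_PARAM = {"FP32": 32, "BF16": 16, "INT8": 8, "FP4": 4}


def weight_footprint_bytes(num_params, fmt):
    """Bytes needed to store num_params weights in the given format."""
    return num_params * BITS_PER_PARAM[fmt] / 8


def compression_ratio(src_fmt, dst_fmt):
    """How many times smaller the weights become going from src to dst."""
    return BITS_PER_PARAM[src_fmt] / BITS_PER_PARAM[dst_fmt]


num_params = 7_000_000_000  # hypothetical model size, for illustration only
for fmt in BITS_PER_PARAM:
    gib = weight_footprint_bytes(num_params, fmt) / 2**30
    print(f"{fmt}: {gib:.1f} GiB of weights")

print(f"FP32 -> INT8: {compression_ratio('FP32', 'INT8'):.0f}x smaller")
```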
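The challenge with PTQ is that mapping values into a smaller range introduces error. Rounding snaps each value to the nearest representable level, and clipping saturates anything that falls outside the chosen range. If the range is chosen poorly, either outliers are clipped or the levels are spaced too coarsely, and in both cases you lose information.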
Calibration is the process of passing a small, representative subset of your training data through the model to observe the distribution of activations. By looking at these activations, we can choose a scale ($S$) and zero-point ($Z$) that maximize the Signal-to-Quantization-Noise Ratio (SQNR), which is equivalent to minimizing the quantization error.
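As a rough illustration of calibration, the sketch below tracks the running min/max of observed activations, derives a scale and zero-point from that range, and reports the resulting SQNR in decibels (higher means less quantization noise). It is plain Python under stated assumptions: `RunningRangeObserver` is a hypothetical helper, synthetic Gaussian samples stand in for real activations, and min/max is only one of several possible range-selection strategies.

```python
import math
import random


class RunningRangeObserver:
    """Hypothetical observer: records the min and max seen during calibration."""

    def __init__(self):
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def qparams(self, q_min=-128, q_max=127):
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)  # keep 0.0 representable
        scale = (hi - lo) / (q_max - q_min) or 1.0
        zero_point = max(q_min, min(q_max, round(q_min - lo / scale)))
        return scale, zero_point


def fake_quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Quantize then dequantize, so the error can be measured in real units."""
    q = max(q_min, min(q_max, round(x / scale + zero_point)))
    return scale * (q - zero_point)


def sqnr_db(signal, reconstructed):
    """Signal-to-quantization-noise ratio in dB."""
    signal_power = sum(x * x for x in signal)
    noise_power = sum((x - y) ** 2 for x, y in zip(signal, reconstructed))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)
# Synthetic "activations": 10 calibration batches of 256 values each.
calibration_batches = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(10)]

observer = RunningRangeObserver()
for batch in calibration_batches:
    observer.observe(batch)
scale, zero_point = observer.qparams()

# Evaluate on values that were not part of calibration.
held_out = [rng.gauss(0.0, 1.0) for _ in range(1000)]
reconstructed = [fake_quantize(x, scale, zero_point) for x in held_out]
print(f"scale={scale:.5f}, zero_point={zero_point}, "
      f"SQNR={sqnr_db(held_out, reconstructed):.1f} dB")
```

A single extreme outlier in the calibration data would stretch the observed range and coarsen every level, which is why the calibration set needs to be representative.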
In this example, we use torch.quantization to perform static quantization on a simple linear layer.
```python
import torch
import torch.nn as nn

# 1. Define a simple model with quant/dequant stubs around the float layer
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)    # float -> quantized
        x = self.fc(x)
        x = self.dequant(x)  # quantized -> float
        return x

# 2. Set up the quantization configuration ('fbgemm' targets x86 CPUs)
model = SimpleModel()
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# 3. Prepare (insert observers) and calibrate
model_prepared = torch.quantization.prepare(model)

# Calibration: random inputs stand in for representative data here
with torch.no_grad():
    for _ in range(10):
        dummy_input = torch.randn(1, 10)
        model_prepared(dummy_input)

# 4. Convert to a quantized model
model_int8 = torch.quantization.convert(model_prepared)

# Compare weight storage; quantized modules expose weight() as a method
fp32_weight = model.fc.weight
int8_weight = model_int8.fc.weight()
print(f"Original weight size: {fp32_weight.element_size() * fp32_weight.nelement()} bytes")
print(f"Quantized weight size: {int8_weight.element_size() * int8_weight.nelement()} bytes")
```
PTQ is not a "free lunch." You must always benchmark the degradation in your downstream metrics (e.g., perplexity, accuracy, or F1-score) against the inference speedup.
| Metric | FP32 (Baseline) | INT8 (Quantized) |
|---|---|---|
| Memory Footprint | 100% | 25% |
| Throughput (relative to FP32) | 1.0x | 1.5x - 2.5x |
| Model Accuracy | Baseline | Minimal drop (0.1-0.5%) |
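A comparison like the one in the table can be produced with a small harness. The sketch below is a minimal, framework-agnostic version: `baseline_predict` and `quantized_predict` are toy stand-ins (assumptions for illustration) for your real FP32 and INT8 models, and the wall-clock timings of such tiny functions are not meaningful in themselves.

```python
import time
from statistics import median


def accuracy(predict, inputs, labels):
    """Fraction of inputs for which predict(x) matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


def median_latency_s(predict, inputs, repeats=5):
    """Median wall-clock time for one full pass over the inputs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            predict(x)
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare(baseline, quantized, inputs, labels, repeats=5):
    """Report the accuracy drop and the speedup of quantized over baseline."""
    base_acc = accuracy(baseline, inputs, labels)
    quant_acc = accuracy(quantized, inputs, labels)
    base_lat = median_latency_s(baseline, inputs, repeats)
    quant_lat = median_latency_s(quantized, inputs, repeats)
    return {
        "baseline_accuracy": base_acc,
        "quantized_accuracy": quant_acc,
        "accuracy_drop": base_acc - quant_acc,
        "speedup": base_lat / quant_lat if quant_lat > 0 else float("inf"),
    }


# Toy stand-ins: the "quantized" model snaps inputs to a coarse 0.25 grid.
def baseline_predict(x):
    return x > 0.0


def quantized_predict(x):
    return round(x * 4) / 4 > 0.0


inputs = [-1.0, -0.3, 0.1, 0.4, 0.9, 1.5]
labels = [x > 0.0 for x in inputs]
report = compare(baseline_predict, quantized_predict, inputs, labels)
print(report)
```

In this toy run the coarse grid misclassifies the input closest to the decision boundary, which is the kind of degradation a hold-out evaluation is meant to catch.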
If the accuracy drop is too high, you may need Quantization-Aware Training (QAT), which simulates quantization errors during a short fine-tuning phase, though that falls outside the scope of pure PTQ.
PTQ is a powerful, low-overhead optimization technique. By using calibration data to set the scale and zero-point parameters, we can reduce weight storage by up to 4x (FP32 to INT8), often with only a small accuracy drop. Always validate your quantized model's accuracy on a representative hold-out set before deploying to production.
Up next: We will explore Model Pruning Techniques to further reduce model density and latency.