ONNX Runtime for Cross-Platform Inference: A Practical Guide

Master the export of PyTorch models to ONNX and accelerate your deployment pipeline using ONNX Runtime for high-performance, cross-platform inference.


Previously in this course, we explored TensorRT-LLM for High-Performance Serving: Engine Optimization to push NVIDIA hardware to its limits. While specialized runtimes are excellent for specific hardware, you often need a portable, lightweight solution that works across CPUs, mobile devices, and diverse cloud environments. This lesson introduces ONNX (Open Neural Network Exchange) and the ONNX Runtime (ORT), the industry standard for cross-platform model deployment.

Why ONNX?

PyTorch is fantastic for research and training, but its heavy dependency graph makes it suboptimal for production edge devices or lightweight services. ONNX acts as a common intermediate representation (IR). By serializing your computational graph into a static file, you decouple the model from the framework, allowing you to run it from C++, C#, Java, or Python using the highly optimized ONNX Runtime.

Exporting PyTorch to ONNX

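The export process translates your dynamic PyTorch graph into a static ONNX graph. This requires a dummy input tensor to trace the flow of data through your model layers. Because a trace records only the operations executed for that input, data-dependent control flow (such as an if statement on a tensor value) is frozen into the path the dummy input took.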


PYTHON
import torch
import torch.onnx

# Assume 'model' is your trained Transformer block from our course project
model.eval()  # disable dropout and other training-only behavior before tracing
dummy_input = torch.randn(1, 512)  # Matches your model input shape

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,        # store the trained weights in the file
    opset_version=14,
    do_constant_folding=True,  # precompute constant subexpressions at export
    input_names=['input_ids'],
    output_names=['logits'],
    # Mark the batch and sequence dimensions as variable-length
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'seq_len'}}
)

Key Parameters:

opset_version: Sets the ONNX operator set the exported graph targets. Higher versions support more recent operators (like those used in Transformers) but may not load in older runtimes, so check which opsets your target ORT version supports.
dynamic_axes: Essential for production. Without this, your model will only accept the exact shape of your dummy input. Defining this allows for flexible batch sizes and sequence lengths.
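
To make the effect of dynamic_axes concrete, here is a minimal sketch that checks a candidate input shape against the shape an exported model declares. It assumes the declared shape has the form ONNX Runtime reports through session.get_inputs()[i].shape, where fixed dimensions are integers and dynamic ones are strings such as 'batch_size' (or None when unnamed). The helper name shape_is_compatible is hypothetical and not part of any library.

```python
from typing import Sequence, Union

Dim = Union[int, str, None]


def shape_is_compatible(declared: Sequence[Dim], actual: Sequence[int]) -> bool:
    """Return True if `actual` fits `declared` (hypothetical helper).

    Integer entries in `declared` are fixed and must match exactly;
    string or None entries are dynamic and accept any size.
    """
    if len(declared) != len(actual):
        return False  # the rank never changes, even with dynamic axes
    return all(
        not isinstance(want, int) or want == got
        for want, got in zip(declared, actual)
    )


# Export without dynamic_axes: every dimension is frozen to the dummy input.
static_shape = [1, 512]
# Export with dynamic_axes as in the script above.
dynamic_shape = ["batch_size", "seq_len"]

print(shape_is_compatible(static_shape, (8, 128)))   # False
print(shape_is_compatible(dynamic_shape, (8, 128)))  # True
```

Running a check like this before session.run can turn an opaque runtime error into a clear message about which dimension was wrong.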

Optimizing the Graph

Once exported, the model is essentially a static file. Before deployment, we can use the onnxoptimizer or built-in ORT features to perform constant folding, node fusion, and dead-code elimination.

The most effective "optimization" is often performed at the runtime level. When you load a model with onnxruntime.InferenceSession, you can configure execution providers (EPs).


PYTHON
import numpy as np
import onnxruntime as ort

# Enable all graph-level optimizations (constant folding, node fusion, etc.)
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Use the CPU execution provider (or 'CUDAExecutionProvider' for GPUs)
session = ort.InferenceSession("model.onnx", sess_options=options, providers=['CPUExecutionProvider'])

# Running inference: ORT takes NumPy arrays keyed by input name
dummy_input = np.random.randn(1, 512).astype(np.float32)  # same shape and dtype as at export
inputs = {session.get_inputs()[0].name: dummy_input}
outputs = session.run(None, inputs)  # None returns every model output
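
Execution providers are tried in the order you list them, so a portable script usually wants a priority list that degrades gracefully to CPU. The sketch below shows one way to build that list. The helper name pick_providers is hypothetical. In a real script the available list would come from onnxruntime.get_available_providers(); a literal list is used here so the example runs without ONNX Runtime installed.

```python
from typing import List, Sequence


def pick_providers(preferred: Sequence[str], available: Sequence[str]) -> List[str]:
    """Keep the preferred providers that are available, in priority order.

    CPUExecutionProvider is appended as a last resort so the session can
    always fall back to CPU (hypothetical helper, not an ORT API).
    """
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Assumption: a CPU-only install. Replace this literal with
# onnxruntime.get_available_providers() in a real deployment.
available = ["CPUExecutionProvider"]
providers = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"], available)
print(providers)  # ['CPUExecutionProvider']
```

The resulting list can be passed as the providers argument of ort.InferenceSession, which keeps the same script usable on GPU servers and CPU-only edge devices.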

Hands-on Exercise

Take the Transformer model you built in our Project Milestone: Custom Architecture Setup.
Export the model to ONNX using the script above, ensuring you include dynamic_axes for sequence length.
Load the model into ONNX Runtime.
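Compare the inference latency of the raw PyTorch model vs. the ONNX model using timeit. Run a few warm-up calls first and wrap the PyTorch call in torch.no_grad() so the comparison is fair. You will often see a speedup on CPU-only environments, though the size of the gain depends on the model and hardware.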
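
For the latency comparison in the last step, a small timing harness may be more informative than a single timeit number. The sketch below uses time.perf_counter rather than timeit so that it can keep per-call samples and report a median and a 95th percentile. The function name benchmark and its default warmup and runs values are illustrative assumptions, and the workload is a stand-in so the block runs on its own.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> Dict[str, float]:
    """Time `fn` and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are discarded (lazy init, caches, thread pools)

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


# Stand-in workload so the sketch runs on its own; swap in your real calls.
stats = benchmark(lambda: sum(range(10_000)), warmup=3, runs=20)
print(stats)
```

To apply it to the exercise, something like benchmark(lambda: session.run(None, inputs)) for ONNX Runtime and benchmark(lambda: model(dummy_input)) inside a torch.no_grad() block for PyTorch should work, assuming session, inputs, model, and dummy_input are defined as in the scripts above.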

Common Pitfalls

Unsupported Operators: Some advanced custom layers (especially complex aten operations) may not map perfectly to ONNX ops. If this happens, you may need to implement a custom ONNX symbolic function or simplify the layer architecture.
Shape Mismatches: If your dynamic_axes are not defined, you will face hard failures when sending inputs that differ even slightly from your dummy tensor.
Precision Loss: ONNX supports FP16, but make sure your model weights and inputs are cast consistently before export. If you instead export in FP32 and quantize afterwards with ORT, static quantization needs a representative calibration dataset to limit accuracy degradation (dynamic quantization does not).
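
A quick way to catch precision problems is to compare the outputs of the reduced-precision model with those of the FP32 reference on the same inputs. The sketch below shows only the comparison step. The helper max_abs_error and the 1e-2 tolerance are illustrative assumptions, and the FP16 outputs are simulated by rounding synthetic FP32 values to float16, which mimics output rounding only and not a real FP16 inference run.

```python
import numpy as np


def max_abs_error(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference, computed in float64."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.max(np.abs(ref - cand)))


# Synthetic stand-in for FP32 logits; in practice use real model outputs.
rng = np.random.default_rng(0)
fp32_logits = rng.standard_normal((2, 8)).astype(np.float32)
# Rounding to float16 simulates the precision lost on the outputs.
fp16_logits = fp32_logits.astype(np.float16)

error = max_abs_error(fp32_logits, fp16_logits)
print(f"max abs error: {error:.6f}")
# Assumed tolerance; choose one that matches your accuracy requirements.
assert error < 1e-2, "outputs drifted more than the chosen tolerance"
```

With real models, the two arrays would come from running the same batch through the FP32 and the FP16 or quantized sessions, and the tolerance would be tuned against a task-level metric.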

Summary

ONNX is the bridge between your training experiments and a robust production system. By converting to ONNX, you gain the ability to deploy your models on hardware where installing the full PyTorch library is impossible or inefficient. Combined with the lessons on Creating an Inference Script: A Practical Guide for Production, you now have the tools to build lightweight, high-speed inference endpoints.

Up next: We will advance our running project by benchmarking latency and throughput to ensure we meet sub-100ms requirements in our Inference Optimization milestone.
