Mahamudul Hasan Rubel

Senior Software Engineer crafting high-performance web applications and SaaS platforms.

Lesson 33 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 28, 2026 · 3 min read

ONNX Runtime for Cross-Platform Inference: A Practical Guide

Master the export of PyTorch models to ONNX and accelerate your deployment pipeline using ONNX Runtime for high-performance, cross-platform inference.

Tags: ONNX, Inference, Deployment, Cross-Platform, Deep Learning, MLOps, ai, machine-learning, python

Previously in this course, we explored TensorRT-LLM for High-Performance Serving: Engine Optimization to push NVIDIA hardware to its limits. Specialized runtimes are excellent for specific hardware, but you often need a portable, lightweight solution that works across CPUs, mobile devices, and diverse cloud environments. This lesson introduces ONNX (Open Neural Network Exchange) and ONNX Runtime (ORT), a widely adopted stack for cross-platform model deployment.

Why ONNX?

PyTorch is fantastic for research and training, but its heavy dependency footprint makes it a poor fit for production edge devices and lightweight services. ONNX acts as a common intermediate representation (IR). By serializing your computational graph into a static file, you decouple the model from the framework, so you can run it from C++, C#, Java, or Python using the highly optimized ONNX Runtime.

Exporting PyTorch to ONNX

The export process translates your dynamic PyTorch graph into a static ONNX graph. This requires a dummy input tensor to trace the flow of data through your model layers.

PYTHON
import torch
import torch.onnx

# Assume 'model' is your trained Transformer block from our course project
model.eval()
# Matches your model input shape. If the model begins with an embedding layer,
# pass integer token IDs instead (for example, built with torch.randint).
dummy_input = torch.randn(1, 512)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=['input_ids'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'seq_len'}}
)

Key Parameters:

  • opset_version: Choose wisely. Higher versions support more recent operators (like those used in Transformers) but may limit compatibility with older runtimes.
  • dynamic_axes: Essential for production. Without this, your model will only accept the exact shape of your dummy input. Defining this allows for flexible batch sizes and sequence lengths.
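
To make the effect of dynamic_axes concrete, the sketch below mimics the kind of shape check a runtime applies to an incoming tensor. It assumes the declared input shape is a list in which fixed dimensions are integers and dynamic dimensions are strings, which is how ONNX Runtime reports `session.get_inputs()[0].shape`. The helper name `shape_is_compatible` is hypothetical and not part of any library.

```python
from typing import Sequence, Union

Dim = Union[int, str, None]


def shape_is_compatible(declared: Sequence[Dim], actual: Sequence[int]) -> bool:
    """Return True if `actual` fits the declared ONNX input shape.

    Integer entries are fixed sizes that must match exactly; string or None
    entries are dynamic axes that accept any size.
    """
    if len(declared) != len(actual):
        return False  # the rank must always match
    return all(
        not isinstance(d, int) or d == a
        for d, a in zip(declared, actual)
    )


# Exported without dynamic_axes: every dimension is frozen to the dummy input.
static_shape = [1, 512]
# Exported with dynamic_axes={'input_ids': {0: 'batch_size', 1: 'seq_len'}}.
dynamic_shape = ["batch_size", "seq_len"]

print(shape_is_compatible(static_shape, (1, 512)))   # True
print(shape_is_compatible(static_shape, (4, 128)))   # False
print(shape_is_compatible(dynamic_shape, (4, 128)))  # True
```

A check like this can run in front of an inference endpoint to reject badly shaped requests with a readable error before they reach the runtime.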

Optimizing the Graph

Once exported, the model is a static file. Before deployment, you can use the onnxoptimizer package or ORT's built-in graph optimizations to perform constant folding, node fusion, and dead-code elimination.

The most effective "optimization" is often performed at the runtime level. When you load a model with onnxruntime.InferenceSession, you can set the graph optimization level and choose execution providers (EPs), the backends that run the graph on specific hardware.

PYTHON
import onnxruntime as ort

# Configure the runtime to use CPU or specific hardware accelerators
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Use the CPU execution provider (or 'CUDAExecutionProvider' for GPUs)
session = ort.InferenceSession("model.onnx", sess_options=options, providers=['CPUExecutionProvider'])

# Running Inference
inputs = {session.get_inputs()[0].name: dummy_input.numpy()}
outputs = session.run(None, inputs)
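
The providers argument is a priority-ordered list, so the same script can prefer a GPU and still run on a CPU-only machine. A minimal sketch of that fallback logic follows. `choose_providers` and `PREFERRED` are hypothetical names, and the `available` list is assumed to come from `onnxruntime.get_available_providers()`; the example passes literal lists so that it runs without ONNX Runtime installed.

```python
from typing import List, Sequence

# Preference order: try the GPU first, then fall back to the CPU.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def choose_providers(available: Sequence[str],
                     preferred: Sequence[str] = PREFERRED) -> List[str]:
    """Keep the preferred providers that are available, in priority order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {list(preferred)} found in {list(available)}")
    return chosen


# A build that only offers the CPU provider.
print(choose_providers(["CPUExecutionProvider"]))
# A build that offers CUDA as well: the GPU is tried first.
print(choose_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

In a real script, pass `ort.get_available_providers()` as `available` and hand the result to the `providers` argument of `ort.InferenceSession`.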

Hands-on Exercise

  1. Take the Transformer model you built in our Project Milestone: Custom Architecture Setup.
  2. Export the model to ONNX using the script above, ensuring you include dynamic_axes for sequence length.
  3. Load the model into ONNX Runtime.
  4. Compare the inference latency of the raw PyTorch model vs. the ONNX model using timeit. You will likely see a significant speedup on CPU-only environments.
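
For step 4, a fair comparison needs warm-up runs and a robust statistic rather than a single timing. The sketch below uses `time.perf_counter` directly instead of timeit so that per-run samples are available for percentiles. `benchmark` is a hypothetical helper, and the workload at the bottom is only a placeholder for your own PyTorch and ONNX Runtime calls.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Time `fn` and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up: let caches, allocators, and lazy initialization settle

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[min(len(samples) - 1, int(0.95 * len(samples)))],
        "min_ms": samples[0],
    }


# Placeholder workload; replace with e.g. `lambda: session.run(None, inputs)`.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print({name: round(value, 3) for name, value in stats.items()})
```

When timing the PyTorch side, wrap the forward pass in `torch.no_grad()` and feed both runtimes the same input so the two numbers measure the same work.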

Common Pitfalls

  • Unsupported Operators: Some advanced custom layers (especially complex aten operations) may not map perfectly to ONNX ops. If this happens, you may need to implement a custom ONNX symbolic function or simplify the layer architecture.
  • Shape Mismatches: If your dynamic_axes are not defined, you will face hard failures when sending inputs whose shape differs even slightly from your dummy tensor.
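  • Precision Loss: While ONNX supports FP16, ensure your model weights are cast correctly before export. If you export a high-precision model and then quantize it via ORT, note that static quantization requires a calibration dataset to limit accuracy degradation, whereas dynamic quantization computes activation ranges at runtime and needs none.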
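
The size of FP16 rounding error is easy to underestimate. As a rough illustration that uses only the standard library and does not touch ONNX itself, the sketch below round-trips Python floats through IEEE 754 half precision with the `'e'` format of `struct`. `to_fp16` is a hypothetical helper name.

```python
import struct


def to_fp16(value: float) -> float:
    """Round-trip a float through IEEE 754 half precision (binary16)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


for original in (0.1, 1e-4, 2049.0):
    rounded = to_fp16(original)
    print(f"{original!r:>8} -> {rounded!r}  (abs error {abs(rounded - original):.3e})")

# Values beyond the largest finite half-precision number (65504) cannot be stored.
try:
    to_fp16(70000.0)
except OverflowError as exc:
    print("overflow:", exc)
```

Small fractions, large integers, and anything above the half-precision range are where an FP16 export tends to drift, so it is worth comparing FP16 and FP32 outputs on representative inputs before deploying.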

Summary

ONNX is the bridge between your training experiments and a robust production system. By converting to ONNX, you gain the ability to deploy your models on hardware where installing the full PyTorch library is impossible or inefficient. Combined with the lessons on Creating an Inference Script: A Practical Guide for Production, you now have the tools to build lightweight, high-speed inference endpoints.

Up next: We will advance our running project by benchmarking latency and throughput to ensure we meet sub-100ms requirements in our Inference Optimization milestone.
