How to Convert a PyTorch Model to TensorRT™ and Deploy it in 10 Minutes
This post explains how to convert a PyTorch model to an NVIDIA TensorRT™ model in about 10 minutes. The process is simple and requires no prior knowledge of TensorRT™.
Why Should You Convert to TensorRT™?
TensorRT™ is NVIDIA’s SDK for high-performance deep learning inference on NVIDIA GPUs. It is built on CUDA, NVIDIA’s parallel programming model. Converting a model to TensorRT™ can deliver around 4 to 5 times faster inference than the baseline model.
In this tutorial, converting a model from PyTorch to TensorRT™ involves the following general steps:
1. Build a PyTorch model using either of these two options:
- Train a model in PyTorch
- Get a pre-trained model from the PyTorch ModelZoo, another model repository, or Deci’s SuperGradients, an open-source PyTorch-based deep learning training library.
2. Convert the PyTorch model to ONNX.
3. Convert from ONNX to TensorRT™.
Steps 1 and 2 are general and can be accomplished with relative ease. When we get to Step 3, we’ll show you how to get through it easily using the Deci platform.
Conversion the Fast Way Using the Deci Platform
Deci developed an end-to-end platform that enables AI developers to build, optimize, and deploy blazing-fast deep learning models on any hardware. The Deci platform offers faster performance, better accuracy, shorter development times, powerful optimization features, a visual dashboard for benchmarking and comparing models, and easy deployment. You can sign up for free here.
Using the Deci Platform for Fast Conversion to TensorRT™
We’ll start by converting our PyTorch model to an ONNX model. This can be done in minutes using fewer than 10 lines of code.
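As a rough sketch of that export step, the conversion could look like the code below. It assumes torchvision’s ResNet-50 as the model and a 224×224 RGB input, neither of which is specified in this post. The helper names, the tensor names `input` and `output`, and the opset version are illustrative choices rather than requirements.

```python
def build_dynamic_axes(input_name="input", output_name="output"):
    """Mark dimension 0 (the batch dimension) as dynamic for input and output."""
    return {input_name: {0: "batch_size"}, output_name: {0: "batch_size"}}


def export_to_onnx(onnx_path="resnet50_dynamic.onnx"):
    """Export a PyTorch model to an ONNX file and return the file path."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torchvision

    # Assumption: a torchvision ResNet-50 with default (untrained) weights;
    # substitute your own trained or pre-trained model here.
    model = torchvision.models.resnet50()
    model.eval()  # export in inference mode (dropout off, batch-norm frozen)

    # Dummy input used only to trace the graph: batch of 1, 3 channels, 224x224.
    dummy_input = torch.randn(1, 3, 224, 224)

    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes=build_dynamic_axes(),
        opset_version=13,
    )
    return onnx_path


# Usage (requires torch and torchvision):
# export_to_onnx()
```

Declaring the batch dimension as dynamic lets the same ONNX file be used with different batch sizes later on.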
Once you have the ONNX model ready, save it under a descriptive file name, for example “resnet50_dynamic.onnx”, so it can be uploaded to the Deci platform.
Now it’s time to upload the model to the Deci platform.
Sign in to the platform, or sign up if you haven’t done so yet. Once you are logged in, go to the lab section and click “New Model”.
In the form displayed, fill in the model name, description, type of task (in our case, classification), the hardware on which the model is to be optimized, inference batch_size, framework (ONNX), and input dimension for the model. Finally, give the path to the model and click “Done” to upload the model.
The model is now uploaded onto the platform.
Once the model is uploaded, you can optimize it by selecting the model from the list and clicking “Optimize”. You should see a pop-up like the one shown here.
Make sure the correct model name is selected from the dropdown, choose the target hardware and batch_size, and click “Next”.
We’ll set the quantization level to 16-bit and click “Start Optimization”.
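To give a feel for what moving from 32-bit to 16-bit floating point means, the standalone sketch below round-trips a value through both formats using Python’s `struct` module. It assumes the 16-bit option corresponds to IEEE half precision, which this post does not state explicitly, and the sample weight value is arbitrary.

```python
import struct


def roundtrip(value, fmt):
    """Pack a Python float into the given struct format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.1234567  # arbitrary example value, not taken from a real model

fp32 = roundtrip(weight, "<f")  # IEEE 754 single precision (4 bytes)
fp16 = roundtrip(weight, "<e")  # IEEE 754 half precision (2 bytes)

print("bytes per value:", struct.calcsize("<f"), "vs", struct.calcsize("<e"))
print("fp32 rounding error:", abs(weight - fp32))
print("fp16 rounding error:", abs(weight - fp16))
```

Each value takes half the storage at 16 bits, at the cost of a larger rounding error, which is why it is worth re-checking accuracy after optimization.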
A progress bar indicates that it should take just a few minutes to optimize for the target hardware.
A new model appears in the list with a TRT8 tag, indicating that it is optimized for TensorRT™ 8, the latest version at the time of writing.
One excellent feature of the Deci platform is the option to compare both models using different metrics, such as latency, throughput, memory consumption, or model size.
The Deci platform also makes it easy to compare performance to the original baseline model. We can compare multiple versions of the same model using any of the available metrics.
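To make the relationship between these metrics concrete, here is a small sketch of how speedup and throughput follow from latency. The latency figures are made-up placeholders, not measurements from the Deci platform, and the function names are illustrative.

```python
def throughput_per_second(latency_ms, batch_size):
    """Samples processed per second for a given per-batch latency."""
    return batch_size / (latency_ms / 1000.0)


def speedup(baseline_latency_ms, optimized_latency_ms):
    """How many times faster the optimized model is than the baseline."""
    return baseline_latency_ms / optimized_latency_ms


# Placeholder numbers for illustration only.
baseline_ms, optimized_ms, batch_size = 20.0, 5.0, 1

print("speedup:", speedup(baseline_ms, optimized_ms))
print("baseline throughput:", throughput_per_second(baseline_ms, batch_size))
print("optimized throughput:", throughput_per_second(optimized_ms, batch_size))
```

At a fixed batch size, latency and throughput move together, so it helps to compare models at the batch size you plan to use in production.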
The table below summarizes the optimization results and shows that the optimized TensorRT™ model outperforms the baseline on every inference metric measured.
Now that the conversion and optimization are complete, you can easily deploy the model using additional capabilities available on the Deci platform.
To deploy the model, simply click “Deploy” at the top right corner.
There are two deployment options:
- Infery: Infery is Deci’s proprietary deep-learning run-time inference engine, which can turn a model into an efficient runtime server and enable you to run a model from a Python package.
- RTiC: Runtime Inference Container (RTiC) is Deci’s proprietary containerized deep-learning run-time inference engine, which turns a model into an efficient run-time server. It enables efficient inference and seamless deployment, at scale, on any hardware.
In this post, we will use the Infery inference engine to test our model.
Why Use Infery
- Infery is framework-agnostic. It provides one interface for all deep learning frameworks. Once you implement the inference logic, you are free to change the model ‘backend’ or ‘real framework’ without any development effort or changes to your code.
- Support Matrix. Infery supports all deep learning frameworks.
- Installing TensorRT™ is hard. The most frustrating part of converting a model to TensorRT™ is installing its dependencies, which often leaves you with a broken environment. Getting to a working installation can take a day or two. Infery’s installation makes this process as easy as possible, in some cases installing the required drivers for you across platforms, which reduces the installation and environment setup burden.
After selecting the Infery inference engine, you will see a pop-up like this.
Here you will find instructions on how to download the model and how to install the Infery library on the destination inference engine.
Here is a reference for all the prerequisites for installation of the Infery library.
Once you meet all the prerequisites, install Infery by following the instructions, then load the model and test it.
You can test it in any Python console. Just feed your model instance a NumPy array and take a look at the outputs. The outputs will be represented as a list of np.ndarray objects.
You can choose to receive the outputs as a list of torch.cuda.Tensor objects by specifying output_device='gpu'. This keeps the data on the GPU and avoids an unnecessary copy to the CPU.
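For a classification model such as the one in this walkthrough, the returned arrays still need to be turned into class predictions. The sketch below assumes the first output holds one row of class scores per input sample, which is typical for classifiers but not stated in this post. It uses plain Python lists so that it runs without the model, and the scores are invented.

```python
def top_class(scores):
    """Return the index of the highest score in one row of class scores."""
    return max(range(len(scores)), key=lambda i: scores[i])


def predictions_from_outputs(outputs):
    """Map the first model output (a batch of score rows) to class indices."""
    batch_scores = outputs[0]
    return [top_class(row) for row in batch_scores]


# Invented scores: one output holding a batch of two samples over four classes.
fake_outputs = [[[0.1, 2.5, -0.3, 0.9], [1.7, 0.2, 0.4, 1.1]]]
print(predictions_from_outputs(fake_outputs))
```

Because the helpers rely only on `len` and indexing, they should also accept rows of a two-dimensional NumPy array in place of the nested lists.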
Now you can benchmark the model using the benchmark function of Infery to see if all the metrics are as expected.
In this example, you can see that all the metrics are as expected from the Deci platform.
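If you want to sanity-check such numbers independently, a hand-rolled timing loop is enough for a rough estimate. The sketch below is not Infery’s benchmark function. It times any Python callable, and `fake_predict` is a hypothetical stand-in for a real model call.

```python
import statistics
import time


def benchmark(predict_fn, inputs, warmup=5, repetitions=50):
    """Time repeated calls to predict_fn and report latency statistics in ms."""
    for _ in range(warmup):  # warm-up runs are excluded from the timings
        predict_fn(inputs)

    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        predict_fn(inputs)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
        "runs": len(timings_ms),
    }


def fake_predict(inputs):
    """Stand-in for a model call; replace with your model's inference call."""
    return [sum(inputs)]


print(benchmark(fake_predict, list(range(1000))))
```

When timing GPU inference this way, make sure the call returns only after the results are actually available, otherwise the measured latency can be too optimistic.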
This article illustrates how you can speed up the conversion of a PyTorch model to a TensorRT™ model with hassle-free installation, and deploy it with a few simple lines of code, using the Deci platform and the Infery inference engine. It’s faster, optimized, and has no computational cost.