GPUs are often the default choice for running large deep learning models effectively. But is there a way to optimize CPU inference and achieve GPU-like performance on deep learning models?
This post looks at model efficiency and examines the gap between GPU and CPU inference for deep learning models. You’ll learn how to narrow this gap with techniques that improve runtime performance, and how to discover better neural architectures for image classification.
Neural Networks in Production are Compute Power Hungry
Since 2014, deep learning models have been getting more accurate. At the same time, however, they have grown larger and more compute hungry, which makes them slower. The graph below shows the evolution of ImageNet models since 2014, starting with AlexNet at the bottom left.
Source: ResearchGate
The y-axis shows Top-1 accuracy, while the x-axis shows the number of operations in GFLOPs. The size of each circle is the number of parameters, representing model complexity. This illustrates how, as models improve in accuracy, they also become bigger and hungrier for power. And this is just the tip of the iceberg.
OpenAI’s GPT-3, the famous language model, has 175 billion parameters, while GPT-4 is expected to have five hundred times more. One can only guess how big GPT-17 will be in the future.
But it is not all about accuracy. To make deep learning accessible, sustainable, and beneficial to more people, there is a need to find resource-efficient models that can work well in production.
GPU vs CPU Performance in Deep Learning Models
CPUs are everywhere and can serve as more cost-effective options for running AI-based solutions compared to GPUs. However, finding models that are both accurate and can run efficiently on CPUs can be a challenge.
Generally speaking, GPUs are 3X faster than CPUs. Here’s an example to serve as a reference for the rest of this post. The next graph shows the latency of an EfficientNet-B2 model on two different hardware platforms: a T4 GPU and an Intel Cascade Lake CPU. As you can see, a forward pass on the CPU is 3X slower than a forward pass on the GPU.
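If you want to reproduce this kind of comparison on your own hardware, here is a minimal sketch of a forward-pass latency benchmark using PyTorch and torchvision. It is not the exact setup behind the graph above; input size, run counts, and the use of untrained weights are illustrative assumptions.

```python
# Minimal sketch: measuring forward-pass latency of EfficientNet-B2
# on CPU vs. GPU with PyTorch. Numbers will vary with your hardware.
import time
import torch
from torchvision.models import efficientnet_b2

def measure_latency(model, device, runs=100, warmup=10):
    model = model.eval().to(device)
    x = torch.randn(1, 3, 260, 260, device=device)  # EfficientNet-B2's native input size
    with torch.no_grad():
        for _ in range(warmup):            # warm-up to stabilize caches and clocks
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()       # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000  # ms per forward pass

model = efficientnet_b2(weights=None)      # untrained weights are fine for latency
cpu_ms = measure_latency(model, "cpu")
print(f"CPU latency: {cpu_ms:.1f} ms")
if torch.cuda.is_available():
    gpu_ms = measure_latency(model, "cuda")
    print(f"GPU latency: {gpu_ms:.1f} ms (gap: {cpu_ms / gpu_ms:.1f}x)")
```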
The Different Components that Affect Runtime Performance
To minimize the CPU-GPU performance gap, we first need to understand what affects a model’s runtime performance. In addition to the network architecture and the target inference hardware, two other components are critical: model compression methods such as quantization and pruning, and runtime compilers such as OpenVino or TensorRT, which help optimize the fit between the software, the network, and the target hardware.
Hardware
Most of the time, the hardware is not a moving part, and is predetermined by the end-use application and business needs. But it is important to note that models may perform differently on different hardware. Look at the following graph.
We benchmarked two well-known detection models, SSD MobileNet and YOLOv5, on two different CPUs. On an Intel Core i7, YOLOv5 is faster in terms of throughput. On the other hand, on an Intel Core i5, the SSD is faster.
An important tip: if the inference hardware is already known at the beginning of the project, measure the runtime of your candidate architectures before you begin training. Doing so will save precious training time and resources, because a model’s runtime usually doesn’t change after training.
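Here is a rough sketch of what that pre-training check can look like: time a few untrained candidate architectures directly on the target CPU. The specific models and the thread count are placeholders; swap in your own candidates and the resources you expect in production.

```python
# Sketch: timing untrained candidate architectures on the target CPU
# before committing to training. Model choices are illustrative.
import time
import torch
from torchvision.models import mobilenet_v3_large, resnet50, efficientnet_b0

def cpu_latency_ms(model, runs=50):
    model = model.eval()
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        for _ in range(5):                 # warm-up passes
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000

torch.set_num_threads(4)                   # match the CPU threads you'll have in production
candidates = {
    "MobileNetV3-Large": mobilenet_v3_large(weights=None),
    "ResNet-50": resnet50(weights=None),
    "EfficientNet-B0": efficientnet_b0(weights=None),
}
for name, model in candidates.items():
    print(f"{name}: {cpu_latency_ms(model):.1f} ms / image")
```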
Compilation and Quantization
Compilation and quantization techniques can improve runtime, reduce memory footprint, and minimize model size, but they don’t work in a predictable manner across all models.
For instance, one might assume that compressing a model to half its size would make it twice as efficient. This is almost true if you only consider model size. The graph on the left shows the model size of EfficientNet-B2 compiled with OpenVino at different quantization levels; the model size drops by a large factor with each quantization step. However, quantization doesn’t have the same effect on latency. In the graph on the right, the latency of EfficientNet-B2 improves only marginally when quantized to FP16, and even to INT8.
In other words, there is a limit to what the hardware can do with quantized models. Still, compilation and quantization can help close the GPU-CPU performance gap for deep learning inference. As seen below, after compilation and quantization, the performance gap, measured here in latency, shrinks to a 2.8X difference.
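To make the compilation step concrete, here is a rough sketch of exporting EfficientNet-B2 to ONNX and compiling it for CPU with OpenVino, then timing the compiled model. The file name is a placeholder, the weights are untrained, and this is not the exact pipeline behind the 2.8X figure above; INT8 quantization would additionally require a calibration dataset and is omitted.

```python
# Rough sketch: export a PyTorch model to ONNX, compile it for CPU
# with OpenVino, and time the compiled model.
import time
import numpy as np
import torch
from torchvision.models import efficientnet_b2
from openvino.runtime import Core

# 1. Export the PyTorch model to ONNX (file name is a placeholder).
model = efficientnet_b2(weights=None).eval()
dummy = torch.randn(1, 3, 260, 260)
torch.onnx.export(model, dummy, "efficientnet_b2.onnx",
                  input_names=["input"], output_names=["logits"])

# 2. Read and compile the model for the CPU device.
core = Core()
compiled = core.compile_model(core.read_model("efficientnet_b2.onnx"),
                              device_name="CPU")
output = compiled.output(0)

# 3. Time the compiled model. (INT8 quantization would additionally need
#    a calibration set, e.g. via NNCF; that step is omitted here.)
x = np.random.randn(1, 3, 260, 260).astype(np.float32)
runs = 100
for _ in range(10):                        # warm-up
    compiled([x])[output]
start = time.perf_counter()
for _ in range(runs):
    compiled([x])[output]
print(f"OpenVino CPU latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```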
Many factors and parameters can have a dramatic impact on inference performance, and each can pull it in a different direction. It is therefore crucial to probe all of these parameters when you develop your model. You can do this manually, but there is a better way.
Neural Architecture Search for Better Model Selection
Instead of manually probing all the factors mentioned above, you can leverage Neural Architecture Search (NAS) for better model design. NAS is a class of algorithms that automatically generate neural networks under specific constraints on budget, latency, accuracy, and more. A common NAS approach uses reinforcement learning and is built around a controller that, at each step, decides on the optimal change to the candidate architecture. The MobileNet and EfficientNet models, for example, were found using similar approaches.
The problem is that this process is very expensive. The search behind MobileNet used 800 GPUs for 28 days straight, which amounts to roughly 22,000 GPU days; the search behind EfficientNet used twice as much, reaching about 40,000 GPU days.
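To make the idea of searching under constraints concrete, here is a toy sketch. Real NAS uses a learned controller or evolutionary search and trains or scores each candidate for accuracy, which is where the enormous GPU cost comes from; this toy version only samples random configurations and filters out those that miss a CPU latency budget before any training would happen. All names and numbers here are illustrative.

```python
# Toy illustration of constraint-aware architecture search: sample random
# configurations and keep only those that meet a CPU latency budget.
import random
import time
import torch
import torch.nn as nn

def build_candidate(depth, width):
    """Assemble a small conv net from a sampled (depth, width) configuration."""
    layers, channels = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(channels, width, 3, stride=2, padding=1),
                   nn.BatchNorm2d(width), nn.ReLU()]
        channels = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1000)]
    return nn.Sequential(*layers)

def cpu_latency_ms(model, runs=20):
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        model(x)                           # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1000

LATENCY_BUDGET_MS = 15.0                   # example production constraint
feasible = []
for _ in range(20):                        # sample 20 random architectures
    cfg = {"depth": random.randint(2, 8), "width": random.choice([32, 64, 128])}
    latency = cpu_latency_ms(build_candidate(**cfg).eval())
    if latency <= LATENCY_BUDGET_MS:
        feasible.append((cfg, latency))    # only these would go on to be trained and scored

print(f"{len(feasible)} of 20 sampled architectures meet the budget")
```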
Production-Aware Neural Architecture Search
To address the cost and time constraints of NAS, you can consider approaches or solutions based on production-aware NAS. Such a solution takes the baseline model, the data, and the inference environment as input, and then optimizes the architecture to ensure that it meets the inference requirements in production.
Still, there is the challenge of model training, which can take up to two weeks for a single model. With production-aware NAS, specifically Deci’s NAS-based solution called AutoNAC, we can both generate a search space containing millions of architectures and estimate what the accuracy of each model would be after training.
AutoNAC is 100 times faster than common NAS algorithms; its total compute cost is on the order of 3X to 4X that of a single model training. Additionally, it lets an organization optimize for any performance objective and is applicable to a wide range of domains and tasks.
A New Efficient Frontier for Image Classification Models on Intel’s Cascade Lake CPUs
Having discussed different ways to improve performance using compilation, quantization, and NAS, it’s time to connect the dots and look at an example of the results you can achieve.
In the following graph, we benchmarked a range of classification models on an Intel Cascade Lake CPU and the ImageNet dataset. Accuracy is on the y-axis, and latency in milliseconds is on the x-axis. All models were compiled and quantized with OpenVino; you can see state-of-the-art models such as the MobileNet family, RegNet, EfficientNet, and ResNet50.
Deci’s AutoNAC engine automatically generated a series of classification models called DeciNets (highlighted in dark blue in this chart) that outperform the well-known models in both accuracy and latency when running inference on Intel’s Cascade Lake CPU. For example, DeciNet-7 improves on EfficientNet-B3’s latency by 2.4X. The graph also shows EfficientNet-B2 and ResNet50 (Strikes Back) improved by 2X, and EfficientNet-B1 by 3X, all while preserving accuracy.
Can You Close the CPU-GPU Performance Gap for Deep Learning Models?
By leveraging Deci’s AutoNAC technology and hardware-specific optimization, the gap between a model’s inference performance on a GPU and on a CPU can be cut in half, without sacrificing the model’s accuracy.
2023 Update: Breakthrough Inference Performance on Intel’s 4th Gen Intel Xeon Scalable Processors
We used Deci’s AutoNAC technology again to generate custom hardware-aware model architectures for the Intel Sapphire Rapids CPU. Here are the results we got:
For computer vision, we achieved a 3.35X throughput increase, as well as a 1% accuracy boost, when compared to an INT8 version of a ResNet50 running on Intel Sapphire Rapids.
For NLP, we delivered a 3.5X acceleration compared to the INT8 version of the BERT model on Intel Sapphire Rapids, as well as a +0.1 increase in accuracy. All models were compiled and quantized to INT8 using Intel® Advanced Matrix Extensions (AMX) and the Intel Extension for PyTorch.
Wrap Up
To get the best performance out of your CPU, you need to take into account not only the various parameters that affect inference but also your production constraints. Hardware awareness early in the model development stage is critical for better model selection and successful performance optimization.
How Can You Boost Your Deep Learning Models’ Performance on CPU?
Here are two ways for deep learning practitioners to get started:
1. Automate the model compilation and quantization for Intel’s CPUs. You can optimize your model with the Deci platform.
2. Get a DeciNet model optimized for CPU and your desired performance requirements. Contact our team to learn more about it.