GTC Talk: How to Accelerate NLP Performance on GPU with Neural Architecture Search

NLP models tend to be very large and computationally demanding. To truly scale in production, models must run inference cost-efficiently and at low latency to support faster time to insight and a better user experience.

This talk covers various methods for accelerating inference, such as compilation and quantization with NVIDIA TensorRT, and shows how these can be combined with Neural Architecture Search (NAS) to achieve superior performance at reduced cloud cost. We share examples of the acceleration obtained during a joint collaboration between NVIDIA and Deci in preparation for an MLPerf submission on the A100 GPU.
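To make the quantization idea concrete, here is a minimal, illustrative sketch of symmetric INT8 weight quantization in plain Python. This is not TensorRT's implementation (TensorRT uses calibration data and per-tensor/per-channel scales under the hood); it only shows the core mapping from FP32 values to 8-bit integers and back, which is where the size and throughput savings come from.

```python
# Illustrative sketch of symmetric INT8 quantization (not TensorRT's actual
# algorithm; TensorRT derives scales from calibration data).

def quantize_int8(weights):
    """Map FP32 weights onto the symmetric INT8 range [-127, 127].

    Returns the quantized integer values and the scale needed to recover
    approximate FP32 values later.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 values and a scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.0]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
```

Each weight now occupies 1 byte instead of 4, at the cost of a small rounding error bounded by half the scale; frameworks like TensorRT pair this with INT8 tensor-core kernels to turn the smaller representation into real throughput gains.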

In this project, a significant performance boost was demonstrated: a 66% reduction in model size, a 3.5x throughput increase, and a 0.3-point increase in F1 score compared to the baseline BERT model. Session participants will walk away with a better understanding of the tool set they can apply to quickly reach their inference performance targets.

If you want to learn more, talk with one of our experts here.

