NLP models tend to be very large and demand substantial computing power. To truly scale in production, models must run inference cost-efficiently and with low latency, which supports faster time to insight and a better user experience.
This talk covers methods for accelerating inference, such as compilation and quantization with NVIDIA TensorRT, and shows how they can be combined with Neural Architecture Search (NAS) to achieve superior performance at a reduced cloud cost. We share examples of the acceleration obtained in a collaboration between NVIDIA and Deci in preparation for an MLPerf submission on the A100 GPU.
In this project, we demonstrated a significant performance boost over the baseline BERT model: a 66% reduction in model size, a 3.5x increase in throughput, and a 0.3 increase in F1 score. Session participants will walk away with a better understanding of the tool set they can apply to quickly reach their inference performance targets.