Video

GTC Talk: How to Accelerate NLP Performance on GPU with Neural Architecture Search

NLP models tend to be very large and require substantial compute. To truly scale in production, models must run inference cost-efficiently and with low latency to support faster time to insight and a better user experience.

This talk covers methods for accelerating inference performance, such as compilation and quantization with NVIDIA TensorRT, and shows how these can be combined with Neural Architecture Search (NAS) to achieve superior performance at a reduced cloud cost. We share examples of the acceleration obtained during a joint collaboration between NVIDIA and Deci in preparation for an MLPerf submission on the A100 GPU.
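For context, the snippet below is a minimal sketch of the compilation and quantization step using the TensorRT Python API: it parses an ONNX export of a BERT model and builds an FP16 engine. The file names, input tensor names, and shape ranges are illustrative assumptions, not the exact configuration used in the MLPerf work.

    import tensorrt as trt

    # Assumes the BERT model was already exported to ONNX (e.g. via torch.onnx.export)
    # as "bert.onnx" with inputs named "input_ids" and "attention_mask" (placeholder names).
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open("bert.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # reduced precision; INT8 would additionally need calibration

    # Dynamic batch size: (min, optimal, max) shapes per input -- illustrative ranges only.
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids", (1, 128), (8, 128), (32, 128))
    profile.set_shape("attention_mask", (1, 128), (8, 128), (32, 128))
    config.add_optimization_profile(profile)

    serialized_engine = builder.build_serialized_network(network, config)
    with open("bert_fp16.engine", "wb") as f:
        f.write(serialized_engine)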

In this project, a significant performance boost was demonstrated: a 66% reduction in model size, a 3.5x increase in throughput, and a 0.3-point gain in F1 score compared to the baseline BERT model. Session participants will walk away with a better understanding of the tool set they can apply to quickly reach their inference performance targets.
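Throughput comparisons like the one above are typically made by timing batched inference on the GPU. The sketch below measures the baseline Hugging Face BERT model in PyTorch; the checkpoint, batch size, and sequence length are chosen purely for illustration and are not the MLPerf settings.

    import time
    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # Illustrative baseline throughput measurement (not the MLPerf configuration).
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased").eval().cuda()

    batch = tokenizer(
        ["a sample question?"] * 32,
        ["a sample context."] * 32,
        padding="max_length", max_length=128, truncation=True,
        return_tensors="pt",
    ).to("cuda")

    with torch.inference_mode():
        for _ in range(10):          # warm-up iterations
            model(**batch)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(100):
            model(**batch)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    print(f"baseline throughput: {100 * 32 / elapsed:.1f} samples/s")

The same loop can then be run against the optimized engine (with its own runtime bindings) to compute the relative speedup.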

If you want to learn more, talk with one of our experts here.

