
Webinar: How to Speed Up NLP Inference Performance on NVIDIA GPUs

In production, inference performance matters tremendously as it directly translates to operational costs. So how can you improve it?

In a recent joint project, NVIDIA and Deci collaborated on accelerating a question-answering NLP model on NVIDIA A100 GPUs.

NLP practitioners frequently reach for tried-and-tested architectures such as BERT-Large. Najeeb and Adam illustrate how developers can substantially accelerate the inference performance of such a model (in this case, BERT-Large) while maintaining accuracy, so they can maximize GPU utilization during inference and reduce inference costs.
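
To give a rough sketch of the kind of workflow the webinar covers, the snippet below exports a SQuAD-finetuned BERT-Large question-answering checkpoint to ONNX so it can be handed to TensorRT. The checkpoint name, question/context text, sequence length, and output path are illustrative assumptions, not details taken from the webinar.

# Sketch: export a BERT-Large question-answering model to ONNX as the
# first step toward TensorRT acceleration. Checkpoint name, sequence
# length, and output path are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# torchscript=True makes the model return plain tuples, which keeps tracing simple.
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, torchscript=True).eval()

# Dummy question/context pair, padded to a fixed shape (batch=1, seq_len=384)
# so the engine can later be built without dynamic-shape profiles.
encoded = tokenizer(
    "Which GPU was used?",
    "The model was benchmarked on an NVIDIA A100 GPU.",
    padding="max_length",
    max_length=384,
    truncation=True,
    return_tensors="pt",
)

torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"], encoded["token_type_ids"]),
    "bert_large_qa.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["start_logits", "end_logits"],
    opset_version=13,
)

With the model in ONNX form, TensorRT can fuse layers and run them in reduced precision, which is where plugins and INT8 quantization come in.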

Main Learnings:

  • Best practices for NLP model selection and design
  • TensorRT Plugins and how to use them correctly
  • Post-Training Quantization vs. Quantization-Aware Training, and best practices for INT8 quantization (a short sketch follows this list)
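
To give a flavor of the quantization topic, here is a minimal post-training INT8 sketch using the TensorRT Python API, assuming TensorRT 8.x. The calibrator is a placeholder: in practice you implement trt.IInt8EntropyCalibrator2 to feed a few hundred representative question/context batches. None of the names below come from the webinar itself.

# Sketch: build an INT8 TensorRT engine from the exported ONNX file via
# post-training quantization (assumes TensorRT 8.x Python bindings).
# `calibrator` is a placeholder for a user-written trt.IInt8EntropyCalibrator2
# that feeds representative input batches.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_int8_engine(onnx_path, calibrator):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 fallback for layers that stay unquantized
    config.set_flag(trt.BuilderFlag.INT8)  # enable post-training INT8 quantization
    config.int8_calibrator = calibrator    # supplies activation ranges from sample data

    # Returns a serialized engine (trt.IHostMemory) ready to write to disk.
    return builder.build_serialized_network(network, config)

# Example usage (MyQACalibrator is hypothetical):
# engine = build_int8_engine("bert_large_qa.onnx", MyQACalibrator())
# with open("bert_large_qa_int8.plan", "wb") as f:
#     f.write(engine)

Quantization-aware training takes the opposite route: fake-quantization nodes are inserted during fine-tuning so the weights adapt to INT8 before export, which typically recovers more accuracy than calibration alone. The webinar compares the two approaches.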

Watch now. And if you want to learn more about accelerating the inference performance of your NLP use case, book a demo here.

