In production, inference performance matters tremendously as it directly translates to operational costs. So how can you improve it?
In a recent joint project, NVIDIA and Deci collaborated on accelerating a question-answering NLP model on NVIDIA A100 GPUs.
With NLP practitioners frequently reaching for tried-and-tested architectures such as BERT-Large, Najeeb and Adam illustrate how developers can substantially accelerate the inference performance of any model (in this case, BERT-Large) while maintaining accuracy, maximizing GPU utilization during inference, and reducing inference costs. Topics covered include:
- Best practices for NLP models selection and design
- TensorRT Plugins and how to use them correctly
- Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT), and best practices for INT8 quantization
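To give a feel for what INT8 quantization involves, here is a minimal, illustrative sketch of symmetric per-tensor quantization in plain Python. The helper names (`quantize_int8`, `dequantize`) are hypothetical; real toolchains such as TensorRT calibrate scales from representative data and typically use per-channel scales for weights.

```python
# Illustrative sketch of symmetric per-tensor INT8 quantization.
# Not the TensorRT implementation -- just the core idea: map floats
# to 8-bit integer codes via a single scale, then recover approximations.

def quantize_int8(values):
    """Quantize a list of floats to int8 codes with one symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0  # guard against all-zero input
    # Round to nearest integer code and clamp to the int8 range [-127, 127]
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.5, -1.2, 0.03, 2.4, -0.7]
codes, scale = quantize_int8(weights)
recovered = dequantize(codes, scale)
```

Post-training quantization applies a mapping like this after training, using calibration data to pick scales; quantization-aware training instead simulates the rounding during training so the model learns to tolerate it, which generally preserves more accuracy at INT8.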
Watch now. And if you want to learn more about accelerating the inference performance of your NLP use case, book a demo here.