In production, inference performance matters tremendously as it directly translates to operational costs. So how can you improve it?
In a recent joint project, NVIDIA and Deci collaborated on accelerating a question-answering NLP model on NVIDIA A100 GPUs.
With NLP practitioners frequently reaching for tried-and-tested architectures such as BERT-Large, Najeeb and Adam illustrate how developers can substantially accelerate the inference performance of any model (in this case, BERT-Large) while maintaining accuracy, maximizing GPU utilization during inference, and reducing inference costs. Topics covered include:
- Best practices for NLP models selection and design
- TensorRT Plugins and how to use them correctly
- Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT), and best practices for INT8 quantization
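To give a feel for what INT8 quantization involves, here is a minimal, illustrative sketch of symmetric per-tensor quantization in plain Python. The helper names (`quantize_int8`, `dequantize`) are hypothetical; real toolchains such as TensorRT calibrate scales from representative data and typically use per-channel scales for weights.

```python
# Illustrative sketch of symmetric per-tensor INT8 quantization.
# Not the TensorRT implementation -- just the core idea: map floats
# to 8-bit integer codes via a single scale, then recover approximations.

def quantize_int8(values):
    """Quantize a list of floats to int8 codes with one symmetric scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0  # guard against all-zero input
    # Round to nearest integer code and clamp to the int8 range [-127, 127]
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.5, -1.2, 0.03, 2.4, -0.7]
codes, scale = quantize_int8(weights)
recovered = dequantize(codes, scale)
```

Post-training quantization applies a mapping like this after training, using calibration data to pick scales; quantization-aware training instead simulates the rounding during training so the model learns to tolerate it, which generally preserves more accuracy at INT8.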
Watch now. And if you want to learn more about accelerating the inference performance of your NLP use case, book a demo here.