How to Achieve FP32 Accuracy with INT8 Quantization Inference Speed


Watch Deci’s experts, Ofer Baratz and Borys Tymchenko, PhD, in a hands-on technical session about INT8 quantization.

✅ Learn the different quantization techniques and best practices for accelerating speed without degrading your models’ accuracy.

✅ Check out code examples and tools that you can easily leverage to achieve your inference performance targets.

The Impact of Quantization

An INT8 or INT4 quantized model executes operations on tensors with reduced precision. This allows for a more compact model representation and the use of high-performance vectorized operations. INT8 quantization results in higher throughput, lower latency, and a smaller model size.
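The mechanics can be sketched in plain NumPy (an illustrative symmetric, per-tensor scheme, not any particular framework's implementation): each FP32 value is mapped to an 8-bit integer via a scale factor, shrinking the tensor 4x at the cost of a bounded rounding error.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map FP32 values into [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to approximate FP32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(x)

print(q.nbytes, x.nbytes)                                # 16 64 -- 4x smaller
print(np.abs(dequantize(q, scale) - x).max() <= scale)   # True -- bounded error
```

Real toolchains add per-channel scales, zero points for asymmetric ranges, and fused INT8 kernels, but the memory and rounding trade-off is the same.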

When Should You Consider INT8 Quantization?

  • If you have runtime performance issues like high latency or low throughput
  • If your memory consumption during inference is too high, or your model is too large to load
  • If your hardware is limited in terms of power consumption and therefore supports only INT8 precision

INT8 quantization addresses all these problems.

Common Quantization Pitfalls

Quantization, however, is far from straightforward: alongside its clear advantages come common pitfalls that must be avoided, including:

  • Architecture not amenable to quantization
    • Some architectures have layers and blocks that are not quantizable
    • Some neural networks aren’t amenable to post-training quantization
  • Model accuracy degradation
    • Potential accuracy drop of up to 50%
  • Calibration challenges (Calibration is a must for INT8 quantization)
    • Different distributions of weights per channel 
    • Different distributions of activations
    • Longtail distribution of activations 
    • No one-size-fits-all solution for calibration
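The long-tail calibration pitfall can be demonstrated with a small NumPy experiment (the synthetic activation distribution here is an assumption for illustration): naive "max" calibration lets a few rare outliers stretch the quantization range, wasting most of the 256 INT8 levels, while percentile calibration clips the tail and preserves resolution for typical values.

```python
import numpy as np

rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 1.0, 100_000)      # typical activation values
outliers = rng.normal(0.0, 30.0, 100)     # rare long-tail outliers
acts = np.concatenate([bulk, outliers])

def mse_after_int8(x: np.ndarray, clip_max: float) -> float:
    """Quantize x to INT8 with the given clipping range; return reconstruction MSE."""
    scale = clip_max / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.mean((q * scale - x) ** 2))

max_range = float(np.abs(acts).max())                  # naive "max" calibration
pct_range = float(np.percentile(np.abs(acts), 99.9))   # percentile calibration

# Percentile calibration sacrifices the outliers but gives the bulk of the
# distribution a much finer quantization step:
print(mse_after_int8(bulk, pct_range) < mse_after_int8(bulk, max_range))  # True
```

This is also why no single calibrator wins everywhere: the best clipping point depends on each tensor's distribution, which is exactly what per-channel and histogram-based calibration try to account for.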

How to Address the Pitfalls of INT8 Quantization

The focus of the webinar is on how to avoid these common quantization pitfalls through hybrid and selective quantization in both post-training quantization (PTQ) and quantization-aware training (QAT).

Unlike naive quantization, hybrid and selective quantization do not apply the same quantization methods to all model layers.

Hybrid Quantization

  • Uses different calibrators for weights and activations
  • Skips certain model layers entirely
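A minimal NumPy sketch of the first idea, using synthetic tensors as a stand-in for a real layer: weights, whose distribution is known and static, get simple per-channel max-based scales, while activations, which are only observed at calibration time and often long-tailed, get a percentile-based scale.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(16, 64))                 # (out_channels, in_features)
activations = np.abs(rng.standard_cauchy(10_000))   # long-tailed activation samples

# Weights: per-channel max calibration -- one scale per output channel.
w_scales = np.abs(weights).max(axis=1) / 127.0

# Activations: percentile calibration -- clip the long tail before scaling.
a_scale = float(np.percentile(activations, 99.9) / 127.0)

print(w_scales.shape)   # (16,) -- one scale per output channel
```

Mixing calibrators this way is the "hybrid" part: each tensor class gets the calibration strategy that fits its distribution.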

Selective Quantization

  • Applies different quantization methods to different layers
  • Replaces whole blocks with quantization-friendly ones
  • Quantizes residual connections in models
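A framework-agnostic sketch of the skip-list idea (the layer names and the choice of which layer to skip are hypothetical): walk the model's layers and quantize only those known, e.g. from a per-layer sensitivity analysis, to tolerate INT8, keeping the rest in FP32.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization of a weight tensor."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

# Hypothetical model: layer name -> FP32 weight tensor.
rng = np.random.default_rng(0)
model = {
    "stem.conv": rng.normal(size=(8, 8)),
    "block1.conv": rng.normal(size=(8, 8)),
    "head.linear": rng.normal(size=(8, 8)),
}

# Layers found to degrade accuracy when quantized stay in FP32.
skip = {"head.linear"}

quantized = {
    name: (w if name in skip else quantize_int8(w))
    for name, w in model.items()
}

for name in model:
    dtype = quantized[name].dtype if name in skip else quantized[name][0].dtype
    print(name, dtype)   # head.linear stays float; the rest become int8
```

Production tools apply the same selectivity at the graph level, also swapping whole blocks for quantization-friendly equivalents rather than just skipping layers.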

For an in-depth analysis of quantization methods and best practices, simply sign up above to gain access to the webinar.

Looking to accelerate deep learning model inference for your use case? Book a demo here.

Access Webinar Now

The snippet below loads an FP32 ResNet-50 image classifier from Hugging Face, a typical baseline model before quantization:

```python
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Load the FP32 baseline model and its preprocessing pipeline.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
```