Video

How to Achieve FP32 Accuracy with INT8 Quantization Inference Speed

Watch Deci’s experts, Ofer Baratz and Borys Tymchenko, PhD, in a hands-on technical session about INT8 quantization.

✅ Learn the different quantization techniques and best practices for accelerating speed without degrading your models’ accuracy.

✅ Check out code examples and tools that you can easily leverage to achieve your inference performance targets.

The Impact of Quantization

An INT8 or INT4 quantized model executes operations on tensors with reduced precision. This allows for a more compact model representation and the use of high-performance vectorized operations. INT8 quantization results in higher throughput, lower latency, and a smaller model size.
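As a rough illustration of what reduced precision means in practice, here is a minimal Python sketch (illustrative only, not taken from the webinar) that maps an FP32 tensor to INT8 with a single symmetric scale and then measures the round-trip error and the memory saving. Real toolchains typically use per-channel scales and calibrated ranges.

import numpy as np

# Illustrative symmetric, per-tensor INT8 quantization of an FP32 tensor.
x = np.random.randn(256, 256).astype(np.float32)

# One scale for the whole tensor: map the largest magnitude to 127.
scale = np.abs(x).max() / 127.0

# Quantize: rescale, round to the nearest integer, clip to the INT8 range.
q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)

# Dequantize to inspect the precision lost by the 8-bit representation.
x_hat = q.astype(np.float32) * scale
print("max abs error:", np.abs(x - x_hat).max())
print("FP32 bytes:", x.nbytes, "INT8 bytes:", q.nbytes)  # 4x smaller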

When Should You Consider INT8 Quantization?

  • If you have runtime performance issues like high latency or low throughput
  • If your memory consumption during inference is too high, or your model is too large to load
  • If your hardware is limited in terms of power consumption and can therefore only work with INT8 precision

INT8 quantization addresses all these problems.

Common Quantization Pitfalls

Quantization, however, is far from straightforward: alongside its obvious advantages come common pitfalls that must be avoided, including:

  • Architecture not amenable to quantization
    • Some architectures have layers and blocks that are not quantizable
    • Some neural networks aren’t amenable to post-training quantization
  • Model accuracy degradation
    • Potential accuracy drop of up to 50%
  • Calibration challenges (calibration is a must for INT8 quantization; see the sketch after this list)
    • Different distributions of weights per channel
    • Different distributions of activations
    • Long-tail distribution of activations
    • No one-size-fits-all solution for calibration
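The calibration pitfalls above are easy to reproduce. The hypothetical Python sketch below compares two common ways of choosing the activation range for a long-tailed distribution: taking the absolute maximum versus clipping at a high percentile. The simulated distribution and the 99.9th-percentile threshold are assumptions made purely for illustration.

import numpy as np

# Simulate long-tailed activations: mostly small values plus rare large outliers.
acts = np.concatenate([
    np.random.randn(100_000) * 0.5,   # bulk of the distribution
    np.random.randn(100) * 20.0,      # long tail / outliers
]).astype(np.float32)

def mean_quant_error(x, clip_val):
    # Symmetric INT8 quantization with the given clipping range.
    scale = clip_val / 127.0
    q = np.clip(np.round(x / scale), -128, 127)
    return np.abs(x - q * scale).mean()

# Max calibration: outliers stretch the range, so typical values get coarse steps.
err_max = mean_quant_error(acts, np.abs(acts).max())

# Percentile calibration: clip the tail, keep fine resolution for typical values.
err_pct = mean_quant_error(acts, np.percentile(np.abs(acts), 99.9))

print(f"mean abs error, max calibration:         {err_max:.5f}")
print(f"mean abs error, 99.9th-percentile calib: {err_pct:.5f}")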

How to Address the Pitfalls of INT8 Quantization

The focus of the webinar is on how to avoid these common quantization pitfalls through hybrid and selective quantization in both post-training quantization (PTQ) and quantization-aware training (QAT).

Unlike naive quantization, hybrid and selective quantization do not apply the same quantization methods to all model layers.

Hybrid Quantization

  • Uses different calibrators for weights and activations
  • Skips certain model layers entirely
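As a concrete sketch of these two ideas, the snippet below uses PyTorch's FX graph-mode quantization: activations and weights are calibrated with different observers, and one named layer ("fc") is skipped and left in FP32. This is a generic PyTorch example written for illustration, not the specific tooling demonstrated in the webinar; the model, observer choices, and layer name are assumptions.

import torch
from torch.ao.quantization import (
    QConfig, QConfigMapping, HistogramObserver, PerChannelMinMaxObserver,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
from torchvision.models import resnet18

model = resnet18(weights=None).eval()

# Different calibrators for activations and weights:
# histogram-based observer for activations, per-channel min/max for weights.
hybrid_qconfig = QConfig(
    activation=HistogramObserver.with_args(dtype=torch.quint8),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)

qconfig_mapping = (
    QConfigMapping()
    .set_global(hybrid_qconfig)
    .set_module_name("fc", None)  # skip the classifier head: it stays in FP32
)

example_inputs = (torch.randn(1, 3, 224, 224),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# Calibration pass: run representative data through the prepared model.
for _ in range(8):
    prepared(torch.randn(1, 3, 224, 224))

quantized = convert_fx(prepared)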

Selective Quantization

  • Applies different quantization methods to different layers
  • Replaces whole blocks with quantization-friendly ones
  • Quantizes residual connections in models
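In the same spirit, here is a hypothetical sketch of selective quantization with PyTorch FX: a block that quantizes poorly is swapped for a quantization-friendly variant, and different layers receive different quantization schemes. The toy model, module names, and scheme choices are all assumptions for illustration; in practice, replacing a block changes the network and usually calls for fine-tuning or QAT afterwards.

import torch
import torch.nn as nn
from torch.ao.quantization import (
    QConfig, QConfigMapping, HistogramObserver, MinMaxObserver, PerChannelMinMaxObserver,
)
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Toy model: a block with SiLU activations (often harder to quantize) plus a head.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU())
        self.head = nn.Sequential(nn.Conv2d(16, 8, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyNet().eval()

# Replace the whole block with a quantization-friendly variant (ReLU instead of SiLU).
model.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

# Different quantization methods for different layers.
per_channel = QConfig(
    activation=HistogramObserver.with_args(dtype=torch.quint8),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)
per_tensor = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric
    ),
)

qconfig_mapping = (
    QConfigMapping()
    .set_global(per_tensor)                    # default scheme for most layers
    .set_module_name("backbone", per_channel)  # finer-grained scheme where it matters
)

example_inputs = (torch.randn(1, 3, 32, 32),)
prepared = prepare_fx(model, qconfig_mapping, example_inputs)
prepared(*example_inputs)  # calibration pass with representative data
quantized = convert_fx(prepared)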

For an in-depth analysis of quantization methods and best practices, sign up above to access the webinar.

Looking to accelerate deep learning model inference for your use case? Book a demo here.

Access Webinar Now

from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Load the image preprocessor and the pre-trained FP32 ResNet-50 classifier
# from the Hugging Face Hub.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")