LLMs and other generative AI models such as Stable Diffusion are power-hungry. Their large size and the intensive computations involved in training and inference translate into huge computational demands. What’s more, generating a response at inference time requires multiple forward passes through the model, adding to the computational cost. The combination of extremely large models and variable inference costs means that generative AI applications carry a high operational cost. As your inference scales, so does your cloud bill. The ability to optimize and deploy generative models efficiently is therefore crucial.
Today, we introduce extended support for generative AI model optimization with Deci’s Infery library. You can now easily achieve accelerated inference speed and massive cloud cost reductions for models such as Stable Diffusion, T5, and BART, among many others.
Using the Infery library, you can leverage advanced compilation and quantization techniques and deploy a diverse range of generative models. In this blog post, we delve into the technical details of these processes and illustrate what they mean in terms of speed and inference cost.
Read on to learn how, with Deci, you can run your generative AI models on affordable and widely available GPUs by improving inference speed without compromising on accuracy.
When it comes to generative AI, traditional optimization techniques don’t cut it
In traditional compilation and quantization techniques, we aim to reduce computational complexity and storage requirements, predominantly for models employed in resource-constrained environments such as mobile devices or embedded systems. However, applying these techniques to generative AI models, which are characteristically large and complex, often fails to achieve the desired performance improvements.
The reason for this failure is twofold. First, these models typically feature a substantial degree of nonlinearity and high-dimensional parameter spaces that can make them resistant to straightforward compression techniques like quantization. Second, conventional compilation strategies focus on statically optimizing a computation graph for a particular hardware configuration, which may not be sufficient for dynamic generative models. As these models often involve iterative and stochastic processes with variable computational paths (for example, generating a sentence or image piece-by-piece), they can’t be easily squeezed into a static graph without sacrificing model expressivity or performance. Thus, while traditional compilation and quantization can work well for more straightforward models like those used in conventional image classification or regression tasks, they may not yield the expected performance benefits when applied to the more intricate and dynamic architectures associated with generative AI.
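To make the piece-by-piece generation described above concrete, here is a minimal sketch of an autoregressive loop. It is a toy: `toy_forward`, the `EOS` token id, and the stopping rule are hypothetical stand-ins, not part of any real model or of Infery. The point is only that the number of forward passes depends on the output, so it is not known when the graph is compiled.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_forward(tokens):
    """Stand-in for one full forward pass; returns the next token id.

    Made-up rule for illustration: count upward, then emit EOS after 5.
    A real model would run every layer of the network here.
    """
    last = tokens[-1]
    return EOS if last >= 5 else last + 1


def generate(prompt, max_new_tokens):
    """Generate token by token, counting how many forward passes were needed."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # one forward pass per new token
        forward_passes += 1
        tokens.append(next_token)
        if next_token == EOS:  # data-dependent exit: length varies per input
            break
    return tokens, forward_passes


tokens, passes = generate([1, 2], max_new_tokens=10)
short_tokens, short_passes = generate([5], max_new_tokens=10)
print(passes, short_passes)
```

Two prompts run through the same model cost a different number of passes, which is why a single fixed-shape graph is an awkward fit for generation.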
A new approach to generative model optimization
Using Deci’s Infery library, engineers can speed up inference for their generative AI models and reduce their cloud inference cost by 2-4x.
Infery’s latest version enables users to perform hybrid compilation and selective quantization. In this process, every sub-component and layer of the model is automatically profiled, paired with the optimal production-oriented framework and quantization level, and then seamlessly fused, all while taking the characteristics of the inference hardware into account.
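The idea of profiling each layer and pairing it with a framework and quantization level can be sketched roughly as follows. This is not Infery’s implementation or API: the `Candidate` type, the `pick_per_layer` helper, the layer names, and every latency and error number below are made up for illustration. In practice the numbers would come from profiling on the target hardware and from comparing outputs against the FP32 model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    backend: str       # hypothetical backend label, e.g. "torch" or "trt"
    precision: str     # "FP32", "FP16" or "INT8"
    latency_ms: float  # assumed to be measured on the target hardware
    error: float       # assumed deviation from the FP32 output


def pick_per_layer(profile, max_error):
    """For each layer, pick the fastest candidate within the error budget."""
    plan = {}
    for layer, candidates in profile.items():
        allowed = [c for c in candidates if c.error <= max_error]
        if not allowed:
            raise ValueError(f"no candidate for {layer} within the error budget")
        plan[layer] = min(allowed, key=lambda c: c.latency_ms)
    return plan


# Illustrative, made-up profiling results for two hypothetical layers.
profile = {
    "encoder.attention": [
        Candidate("torch", "FP32", 4.0, 0.0),
        Candidate("trt", "FP16", 2.0, 0.001),
        Candidate("trt", "INT8", 1.2, 0.002),
    ],
    "decoder.lm_head": [
        Candidate("torch", "FP32", 3.0, 0.0),
        Candidate("trt", "FP16", 1.8, 0.001),
        Candidate("trt", "INT8", 1.0, 0.05),  # too lossy for this layer
    ],
}

plan = pick_per_layer(profile, max_error=0.01)
for layer, choice in plan.items():
    print(layer, choice.backend, choice.precision)
```

In this toy profile the attention layer tolerates INT8 while the output head falls back to FP16, which is the sense in which quantization is selective rather than applied uniformly.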
Quantization allows for the compression of model parameters and reduces the memory footprint required for inference. This enables efficient deployment on cost-effective instances. Deci’s advanced quantization techniques allow developers to enjoy the speed-up of quantization while maintaining FP32 quality.
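As a rough illustration of why quantization shrinks the memory footprint, the sketch below applies plain symmetric per-tensor INT8 quantization to a handful of weights. This is a textbook scheme chosen for brevity, not Deci’s quantization method, and the weight values are arbitrary examples.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # arbitrary example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 4 * len(weights)  # 4 bytes per FP32 weight
int8_bytes = 1 * len(weights)  # 1 byte per INT8 weight (plus one shared scale)
print(quantized, fp32_bytes, int8_bytes)
```

Storage drops by 4x relative to FP32, at the cost of a bounded rounding error. Small weights such as 0.003 collapse to zero here, which hints at why some layers need a higher precision to preserve quality.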
Infery allows developers to maximize the acceleration potential of complex architectures while saving valuable time and effort. The ability to compress the model and its memory footprint also widens the range of hardware it can be deployed on and reduces the number of expensive cloud instances needed for inference.
How to accelerate LLMs and other Generative AI models with Infery:
In the following code snippet, you can see just how easy it is to speed up your generative models and gain a 2-4x performance improvement.
Start by importing your model and tokenizer (Hugging Face architectures are supported), set your optimization parameters, initiate optimization, and let Infery do the heavy lifting for you.
    from infery.ffm.models import EncDecModel, optimize_pretrained, torch, trt
    from transformers import AutoTokenizer

    # Optimization Parameters
    MODEL_SOURCE = "HF"
    MODEL_NAME = "google/flan-t5-xxl"
    TARGET_PATH = "my_xxl_model"
    INPUT_SEQ_LENGTH = 256
    FROM_FRAMEWORK = "torch"
    TO_FRAMEWORK = "trt"

    # Given a model source and name, optimize it and place the results in TARGET_PATH
    optimize_pretrained(
        MODEL_SOURCE,
        MODEL_NAME,
        FROM_FRAMEWORK,
        TO_FRAMEWORK,
        TARGET_PATH,
        quantization="INT8",
        input_sequence_length=(1, INPUT_SEQ_LENGTH),
    )

    # Generation params
    MAX_INPUT_LENGTH = 256
    MAX_TARGET_LENGTH = 256
    BEAM_SIZE = 4
    PROMPT = "Write me a recipe for meatless lasagne"

    # Tokenize
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    inputs = tokenizer(
        text=PROMPT,
        max_length=MAX_INPUT_LENGTH,
        return_tensors="pt",
    )

    # Load and generate
    optimized_model = EncDecModel(TARGET_PATH)
    output_sequences = optimized_model.generate(
        input_ids=inputs["input_ids"],
        max_length=MAX_TARGET_LENGTH,
        num_beams=BEAM_SIZE,
    )
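To check what an optimization like this buys you in practice, you would typically time the original and optimized models on the same input. The helper below is a generic sketch using only the Python standard library; `median_latency_ms` and `speedup` are hypothetical names, not part of Infery, and the two `sum(...)` workloads are stand-ins for real model calls.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=10):
    """Call fn repeatedly and return its median wall-clock latency in ms."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized call is than the baseline."""
    return baseline_ms / optimized_ms


# Stand-in workloads; replace with calls to the original and optimized models.
baseline = median_latency_ms(lambda: sum(range(200_000)))
optimized = median_latency_ms(lambda: sum(range(100_000)))
print(f"speed-up: {speedup(baseline, optimized):.2f}x")
```

For the snippet above, `fn` could wrap the call to `optimized_model.generate(...)`. On a GPU, work is often launched asynchronously, so the timed function should wait for the device to finish (for example with `torch.cuda.synchronize()` in PyTorch) before the timer stops.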
Infery’s approach of seamlessly combining different frameworks into the optimal mix for your model boosts runtime performance while saving valuable time and effort. Infery uses optimal kernels and tensor-level parallelization for each combination of model application and hardware in order to boost production inference performance. Infery’s latest version supports a diverse range of generative models, including LLMs, image generation, text-to-image synthesis, music composition, and more.
Faster inference means better business results
Leading companies with products powered by generative AI are already using Deci to speed up their models, ship differentiated products, and cut their cloud costs and carbon footprint. Here are some of their stories:
A customer offering an AI platform for text summarization was able to shrink their model size by 50% and cut latency by 3.92x. This optimization led to a striking 68% decrease in cloud costs.
Another customer, offering a video generation application powered by a GAN model for image generation, was struggling to reach their inference throughput target. By using Deci’s Infery, the team was able to reduce the model’s latency by 2.1x. Consequently, they could process their videos using fewer machines, leading to a substantial 40% reduction in their cloud costs.
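The link between a speed-up and the cloud bill in stories like these can be sketched with simple arithmetic. The functions below assume an idealized setup that the customers’ real deployments may not match: the speed-up applies to throughput per machine, and cost scales linearly with machine count. The fleet size of 10 is a made-up example, and both function names are hypothetical.

```python
import math


def machines_needed(baseline_machines, speedup):
    """Machines required for the same load once each one is `speedup` times faster."""
    return math.ceil(baseline_machines / speedup)


def ideal_cost_reduction(speedup):
    """Upper bound on the saving when cost is proportional to machine count."""
    return 1 - 1 / speedup


print(machines_needed(10, 2.1))              # hypothetical fleet of 10 machines
print(f"{ideal_cost_reduction(2.1):.1%}")    # ideal bound for a 2.1x speed-up
print(f"{ideal_cost_reduction(3.92):.1%}")   # ideal bound for a 3.92x speed-up
```

Under these assumptions the bound is 1 - 1/s for a speed-up of s. The reported savings of 40% and 68% sit below the corresponding bounds, as you would expect when some costs (storage, networking, minimum instance counts) do not shrink with latency.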
Efficient Deployment Simplified
With the extension of Infery’s support for generative AI models, AI developers across various domains can now unlock the full potential of their generative AI models.
Deci significantly reduces inference latency and resource consumption. This translates into faster and more efficient AI applications, improving user experiences and lowering compute costs.
Deci’s streamlined optimization process eliminates the complexities of deploying generative models, enabling faster time-to-market for AI-powered products and services.
Whether it’s generating realistic images, creating and editing textual content, synthesizing creative artwork, or composing unique melodies, using Deci, teams can gain a competitive edge by rapidly deploying innovative AI applications and capturing market opportunities ahead of their competitors.
Check out Deci’s Stable Diffusion acceleration demo, or speak with one of our experts for a personalized demo.