Optimize and run your models with Infery, Deci’s easy-to-use LLM inference SDK.
Achieve low latency and high throughput to improve user experience.
Maximize hardware utilization or migrate your workloads to more affordable cloud instances.
Streamline deployment. Run inference in 3 lines of code.
3-10x faster LLM inference
Up to 95% lower compute cost
Easy to use
Compatible with SOTA models
Speed up the prefill and decoding stages of generation with custom kernels optimized for grouped query attention. Adaptable to various decoder architectures.
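To make the mechanism concrete, here is a minimal PyTorch sketch of grouped query attention (illustrative only, not Infery's fused kernels; the shapes and names are ours): several query heads share each key/value head, so every decode step reads a much smaller KV cache.

import torch

def grouped_query_attention(q, k, v):
    # q: (batch, q_heads, seq, dim); k, v: (batch, kv_heads, seq, dim).
    # Each KV head serves q_heads // kv_heads query heads.
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)  # broadcast KV heads to query heads
    v = v.repeat_interleave(group, dim=1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

q = torch.randn(1, 8, 16, 64)            # 8 query heads
k = v = torch.randn(1, 2, 16, 64)        # 2 shared key/value heads
out = grouped_query_attention(q, k, v)   # -> (1, 8, 16, 64)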
Ensures the GPU is always decoding at maximal batch size and that every generated token is used. Sequences are dynamically grouped and swapped out upon completion for efficient response generation.
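The scheduling idea in miniature (a toy loop, not Infery's scheduler; step_fn is a hypothetical callback that decodes one token per active sequence and returns the finished ones):

from collections import deque

def serve(requests, step_fn, max_batch=8):
    # Continuous batching: refill the batch the moment a sequence
    # finishes, so the GPU never decodes below capacity.
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())   # swap queued sequences in
        for seq in step_fn(active):            # one decode step for the batch
            active.remove(seq)                 # retire completed sequences
            done.append(seq)
    return done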
Gain faster sequence-to-sequence prediction with an efficient search mechanism. It supports all common generation parameters and is highly tuned for the target inference hardware.
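Infery's own generation API is not shown on this page; as a stand-in, the same common knobs expressed through Hugging Face transformers (the model choice and parameter values are arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Low-latency LLM inference", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,  # length budget
    num_beams=4,        # beam search width
    do_sample=True,     # sample within each beam
    temperature=0.7,    # sampling sharpness
    top_p=0.9,          # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))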
Apply FP16 or INT8 quantization only to the layers that are quantization-friendly, enjoying the speed-up of quantization while maintaining FP32 quality.
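The per-layer idea can be tried with plain PyTorch dynamic quantization (an illustration of selective quantization, not Infery's quantizer): only the quantization-friendly nn.Linear layers drop to INT8, and everything else keeps its original precision.

import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Quantize only the Linear layers to INT8; the rest of the graph
# keeps its original precision.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)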
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Pull a stock model straight from the Hugging Face Hub.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
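With the model in hand, the Infery side stays just as short. A minimal sketch of the load-and-predict flow, assuming the pattern from Deci's published Infery examples (the parameter names model_path, framework_type, and inference_hardware, and the predict method, are taken from those examples; check the current docs before relying on them):

import infery
import numpy as np

# Load a model exported/optimized for Infery, then run inference.
model = infery.load(model_path="resnet50.onnx",
                    framework_type="onnx",
                    inference_hardware="gpu")
predictions = model.predict(np.random.rand(1, 3, 224, 224).astype(np.float32))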