Optimize and run your models with Infery, Deci’s easy-to-use LLM inference SDK.
Achieve low latency and high throughput to improve user experience.
Maximize hardware utilization or migrate your workloads to more affordable cloud instances.
Streamline deployment. Run inference in 3 lines of code.
3-10x faster LLM inference
Up to 95% lower compute cost
Easy to use
Compatible with SOTA models
Speed up the prefill and decoding stages of generation with custom kernels optimized for grouped query attention. Adaptable to various decoder architectures.
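To make the mechanism concrete, here is a minimal PyTorch sketch of grouped query attention (illustrative only, not Infery's fused kernels; the shapes and names are ours): several query heads share each key/value head, so every decode step reads a much smaller KV cache.

import torch

def grouped_query_attention(q, k, v):
    # q: (batch, q_heads, seq, dim); k, v: (batch, kv_heads, seq, dim).
    # Each KV head serves q_heads // kv_heads query heads.
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)  # broadcast KV heads to query heads
    v = v.repeat_interleave(group, dim=1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

q = torch.randn(1, 8, 16, 64)            # 8 query heads
k = v = torch.randn(1, 2, 16, 64)        # 2 shared key/value heads
out = grouped_query_attention(q, k, v)   # -> (1, 8, 16, 64)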
Ensures the GPU is always decoding at maximal batch size and that every generated token is used. Sequences are dynamically grouped and swapped out upon completion for efficient response generation.
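The scheduling idea in miniature (a toy loop, not Infery's scheduler; step_fn is a hypothetical callback that decodes one token per active sequence and returns the finished ones):

from collections import deque

def serve(requests, step_fn, max_batch=8):
    # Continuous batching: refill the batch the moment a sequence
    # finishes, so the GPU never decodes below capacity.
    waiting, active, done = deque(requests), [], []
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())   # swap queued sequences in
        for seq in step_fn(active):            # one decode step for the batch
            active.remove(seq)                 # retire completed sequences
            done.append(seq)
    return done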
Gain faster sequence-to-sequence prediction with an efficient search mechanism. It supports all common generation parameters and is highly tuned for the target inference hardware.
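Infery's own generation API is not shown on this page; as a stand-in, the same common knobs expressed through Hugging Face transformers (the model choice and parameter values are arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Low-latency LLM inference", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=32,  # length budget
    num_beams=4,        # beam search width
    do_sample=True,     # sample within each beam
    temperature=0.7,    # sampling sharpness
    top_p=0.9,          # nucleus sampling cutoff
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))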
Apply FP16 or INT8 quantization only to the layers that are quantization-friendly, enjoying the speed-up of quantization while maintaining FP32 quality.
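The per-layer idea can be tried with plain PyTorch dynamic quantization (an illustration of selective quantization, not Infery's quantizer): only the quantization-friendly nn.Linear layers drop to INT8, and everything else keeps its original precision.

import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Quantize only the Linear layers to INT8; the rest of the graph
# keeps its original precision.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)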
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Pull a stock model straight from the Hugging Face Hub.
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
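With the model in hand, the Infery side stays just as short. A minimal sketch of the load-and-predict flow, assuming the pattern from Deci's published Infery examples (the parameter names model_path, framework_type, and inference_hardware, and the predict method, are taken from those examples; check the current docs before relying on them):

import infery
import numpy as np

# Load a model exported/optimized for Infery, then run inference.
model = infery.load(model_path="resnet50.onnx",
                    framework_type="onnx",
                    inference_hardware="gpu")
predictions = model.predict(np.random.rand(1, 3, 224, 224).astype(np.float32))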