End to End Inference Acceleration

Deci provides you with unmatched end-to-end, accuracy-preserving inference runtime acceleration for your neural network models, in the cloud or at the edge. This optimization process is fully aware of the desired target hardware, be it a GPU, a CPU, or any ASIC accelerator.


Automated Neural Architecture Construction

Deci’s groundbreaking AutoNAC technology redesigns your deep learning model to squeeze the maximum utilization out of the hardware targeted for inference in production. Deci’s AutoNAC engine contains a neural architecture search (NAS) component that revises a given trained model to optimally speed up its runtime by as much as 10x, while preserving the model’s baseline accuracy.

Main Advantages

  • Inference Speedup: 2-10x
  • Accuracy Preserving
  • Full Awareness of Data and Hardware
  • Model Size & Memory Reduction


AutoNAC™ - How it Works


As input, the AutoNAC process receives the customer's baseline model, the data used to train that model, and access to the target inference hardware. AutoNAC then revises the baseline's backbone layers, which carry out most of the computation, and redesigns them into an optimal sub-network. This optimization is carried out by a highly efficient predictive search over a large space of candidate architectures. Throughout the search, AutoNAC probes the target hardware and directly optimizes runtime as measured on that specific device. The resulting fast architecture is then fine-tuned on the provided data to match the baseline's accuracy, at which point it is ready for deployment.
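
To make the flow concrete, below is a minimal, purely illustrative PyTorch sketch of hardware-in-the-loop architecture selection: build a handful of candidate backbones, time each one directly on the available device, and keep the fastest (in the real process, the winner would then be fine-tuned to recover the baseline accuracy). The candidate space and the helper functions are assumptions for demonstration only and are not Deci's AutoNAC implementation.

```python
# Minimal, illustrative sketch of hardware-in-the-loop architecture selection.
# This is NOT Deci's AutoNAC implementation; the candidate space and helper
# functions below are toy assumptions used only to demonstrate the idea.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

def measure_latency(model, input_shape=(1, 3, 128, 128), runs=20):
    """Median forward-pass latency in milliseconds, measured on `device`."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(5):                       # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            model(x)
            if device == "cuda":
                torch.cuda.synchronize()
            times.append((time.perf_counter() - start) * 1000)
    return sorted(times)[len(times) // 2]

def make_candidate(width, depth):
    """A toy convolutional 'backbone' parameterized by width and depth."""
    layers, in_ch = [], 3
    for _ in range(depth):
        layers += [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU()]
        in_ch = width
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, 1000)]
    return nn.Sequential(*layers)

# Time each candidate on the actual device and keep the fastest one; in the
# real flow the selected architecture would then be fine-tuned on the training
# data to recover the baseline accuracy before deployment.
candidates = {(w, d): make_candidate(w, d) for w in (32, 64, 128) for d in (2, 4)}
latencies = {cfg: measure_latency(m) for cfg, m in candidates.items()}
best = min(latencies, key=latencies.get)
print(f"fastest (width, depth) = {best}: {latencies[best]:.2f} ms on {device}")
```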

Read Our Technical White Paper to Learn More

AutoNAC White Paper: Accelerate Deep Neural Network Inference on Any Hardware while Preserving Accuracy



Full Stack Inference Optimization

Benefit from the Entire Inference Acceleration Stack

AutoNAC

The most ambitious algorithmic acceleration technique for aggressive speedups is neural architecture search (NAS). To apply a NAS optimization, one defines an architecture space and uses a clever search strategy to find an architecture in that space that satisfies the desired properties. NAS is responsible for monumental achievements in deep learning; for instance, MobileNet-V3, EfficientNet, and EfficientDet were found using NAS. However, aggressive NAS algorithms require huge computational resources and are challenging to apply in a scalable manner in production. Deci's AutoNAC brings into play a restricted NAS algorithm that revises a given baseline model. AutoNAC uses prior knowledge extracted from the baseline model and relies on a very fast and accurate search strategy, allowing it to operate at scale. AutoNAC considers and leverages all the components in the inference stack, including compilers, pruning, and quantization.
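
To illustrate why a fast search strategy matters, the toy sketch below (assumptions only; not AutoNAC's actual algorithm) enumerates a small parametric architecture space and shows how quickly it grows, then ranks candidates with a cheap proxy cost so that only a short list ever reaches expensive on-device evaluation.

```python
# Toy illustration of why NAS needs a fast search strategy: even modest
# per-stage choices explode combinatorially, so candidates are ranked with a
# cheap proxy instead of being trained or benchmarked one by one.
# The architecture space and proxy cost below are illustrative assumptions.
from itertools import product

depths = [2, 3, 4]             # number of convolutional stages
widths = [32, 64, 128, 256]    # channels per stage
kernels = [3, 5]               # kernel size per stage

def enumerate_space():
    """Yield candidate configs: one (width, kernel) choice per stage."""
    stage_options = list(product(widths, kernels))
    for depth in depths:
        yield from product(stage_options, repeat=depth)

space = list(enumerate_space())
print(f"candidate architectures: {len(space)}")   # thousands, from a few choices

def proxy_cost(config):
    """Cheap stand-in for a latency predictor: rough multiply-accumulate count."""
    cost, in_ch = 0, 3
    for width, kernel in config:
        cost += in_ch * width * kernel * kernel
        in_ch = width
    return cost

# Rank the entire space with the proxy and keep only a short list of the
# cheapest candidates for expensive on-device measurement and fine-tuning.
shortlist = sorted(space, key=proxy_cost)[:5]
for cfg in shortlist:
    print(cfg, proxy_cost(cfg))
```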

Model Compression

Deep neural networks (DNNs) can be compressed by weight pruning, which eliminates unnecessary weights. Today, it’s difficult to use pruning to significantly speed up inference time, but it can be used to effectively reduce DNN size.
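
As a concrete illustration (using PyTorch's built-in pruning utilities, not Deci's tooling), the snippet below zeroes out the smallest-magnitude weights of a linear layer; the resulting sparsity reduces effective model size but, as noted above, does not by itself accelerate dense inference on most hardware.

```python
# Illustrative magnitude pruning with PyTorch's built-in utilities.
# Sparse weights reduce effective model size; they do not automatically
# translate into faster dense inference on most hardware.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 50% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(layer, "weight")

sparsity = float((layer.weight == 0).float().mean())
print(f"weight sparsity: {sparsity:.0%}")   # ~50%
```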

Quantization refers to the process of reducing the numerical representation (bit-width) of weights and activations, and can be used to speed up runtime if it is supported by the underlying hardware. Because both quantization and aggressive pruning can compromise accuracy metrics, the use of these techniques is limited.
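
For illustration, the snippet below applies PyTorch's post-training dynamic quantization to the linear layers of a small model so that weights are stored as 8-bit integers; whether this translates into a real speedup depends on the target hardware and its kernel support.

```python
# Illustrative post-training dynamic quantization with PyTorch:
# weights of nn.Linear layers are stored as 8-bit integers, shrinking the
# model; the actual speedup depends on hardware/kernel support (CPU here).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, int8 weights under the hood
```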

Runtime

The essential runtime components include drivers and compilers. Drivers implement the neural network layers and primitives typically found in DL frameworks such as TensorFlow (Keras), PyTorch, MXNet, and Caffe. These drivers must be programmed and tailored for each specific target hardware device.

Deep neural networks (DNNs) are represented as directed acyclic graphs (DAGs) called computation graphs. A compiler optimizes the DNN graph and then generates optimized code for the target hardware. The main techniques used by compilers are vertical and horizontal operator fusion, caching tricks, and memory reuse across threads. Many compilers are available; among the most popular are TensorRT (TRT), OpenVINO, and TVM.
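
A common entry point to such compilers is a framework-agnostic graph format like ONNX. The illustrative snippet below (model and file name are placeholders) exports a small PyTorch model to an ONNX computation graph, which tools such as TensorRT, OpenVINO, or TVM can then fuse, optimize, and compile for the target device.

```python
# Export a PyTorch model to an ONNX computation graph (a DAG of operators).
# Downstream compilers such as TensorRT, OpenVINO, and TVM can consume this
# graph, apply operator fusion (e.g., Conv + BatchNorm + ReLU) and memory
# optimizations, and emit device-specific code.
# The model and file name here are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",              # graph file handed to the downstream compiler
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```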

Inference Hardware

Inference hardware devices for neural networks have many forms and characteristics. Among the important factors are parallelism, shared memory size, virtual memory efficacy, and power consumption. These factors crucially affect the runtime of a given neural network. No less important is the maturity of the supporting software stack and the existence of a community of developers. Many new specialized inference devices are expected to emerge in the near future.
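
As a small illustration, the snippet below probes a few of these characteristics programmatically using standard PyTorch device queries (attribute availability varies by platform; this is not an exhaustive hardware profile).

```python
# Probe a few hardware characteristics that influence inference runtime.
# Attribute availability varies by platform; this is illustrative only.
import os
import torch

print(f"CPU logical cores: {os.cpu_count()}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"  streaming multiprocessors: {props.multi_processor_count}")
    print(f"  total memory: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device visible; inference would run on CPU.")
```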


Apply the technology to your use case!

  • Accelerate your inference performance
  • Reach production faster
  • Maximize your hardware potential

More Resources

  • Press Release: Deci Named One of CB Insights' 100 Most Innovative Startups
  • Blog: Toward Defining AutoML for Deep Learning
  • Blog: How Deci and Intel Hit 11.8x Inference Acceleration at MLPerf

Maximize the Potential of Deep Learning