An Overview of State of the Art (SOTA) DNNs

Deep learning models coupled with the right image data can solve real-life problems that we come across every day, such as medical image analysis, video conferencing, and autonomous driving. However, addressing these problems efficiently with AI is difficult. In many cases, only the best model, also known as state of the art (SOTA), can meet the constraints of a given application. In this blog, we discuss the models used to handle this image data. We begin by defining what state-of-the-art (SOTA) deep neural networks (DNNs) mean in this context and then focus on the types of SOTA DNNs that exist for solving different problems. Finally, we introduce ways to maximize performance in production using DeciNets.

What are SOTA DNNs? How do we measure if a Deep Learning Model is a SOTA Model?

State-of-the-art (SOTA) DNNs are the best models you can use for a particular task. A DNN can be identified as SOTA based on its accuracy, speed, or any other metric of interest. However, in most computer vision areas, there is a trade-off between these metrics. One model can be very fast while its accuracy isn’t up to the mark. Another can reach high accuracy but miss the latency or throughput the application requires. This trade-off appears across tasks such as image classification and detection.

See below the accuracy versus latency trade-off for image classification DNNs on a GPU:

Fig: Neural Network performance trade-off on an NVIDIA T4 GPU

The metrics we usually use to compare and evaluate DNNs are accuracy, precision, recall, and F1-score for classification tasks, and IoU and mAP for object detection. A DNN is declared state-of-the-art based on a combination of these metrics and additional performance metrics of interest, such as FLOPs, latency, and throughput. The figure above demonstrates the accuracy-latency trade-off and shows how we define SOTA DNNs: the models on the frontier, where no other model is both more accurate and faster.
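
To make the classification metrics concrete, here is a minimal sketch in plain Python. The function names and the toy labels are our own illustration, not part of any benchmark suite, and a real evaluation would normally rely on an established metrics library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = object present, 0 = absent.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Accuracy alone can hide the kind of error a model makes, which is why precision and recall are reported next to it.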

Recommended SOTA DNNs per Computer Vision Task

The field of computer vision covers a range of image tasks, including image classification, object detection, and semantic segmentation. Computer vision models are used extensively in medical imaging, video conferencing, smart retail, agritech, and autonomous driving applications. Most of these models rely on Convolutional Neural Networks (CNNs) as feature extractors, which use 2-D convolutions to extract important feature vectors from input images. Some of the state-of-the-art CNN models for each task are as follows:


Image Classification SOTA DNNs

The core of deep learning today is mainly feature extraction, which ideally requires generating multi-level features through a deep neural network (multiple layers). Image classification is the task of assigning an image to one of a set of class types. Suppose we have an image of a boat, along with a set of labels such as human, vehicle, and plant. Mapping the input image of the boat to the label vehicle is called “image classification”.

Convolutional neural networks (CNNs) solve this problem by learning feature vectors of the input image through convolution operations combined with pooling and sampling strategies. A typical convolutional neural network is composed of modules, each built from layers such as convolution, pooling, and fully connected layers.
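
The two core operations can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (a single channel, no padding, stride 1, and a hand-picked kernel); in a real CNN the kernels are learned and the work is done by a deep learning framework.

```python
def conv2d(image, kernel):
    """Valid, stride-1 2-D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/cols that do not fill a window are dropped."""
    return [
        [
            max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Toy 5x5 image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a left-to-right increase in intensity.
edge_kernel = [[-1, 1], [-1, 1]]

features = conv2d(image, edge_kernel)  # 4x4 feature map, peaks at the edge
pooled = max_pool2d(features)          # 2x2 map after downsampling
print(features)
print(pooled)
```

The feature map is non-zero only where the edge sits, and pooling keeps that response while halving the spatial size.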

The general architecture of a classification CNN consists of three parts: (i) a stem, (ii) N stages, and (iii) a fully connected (FC) head. The stem usually consists of a few convolutions and max-pooling operations that downsample the original image while increasing the number of channels. Each stage consists of multiple blocks of the primary network, often linked by residual connections; together the stages form the main feature extractor of the architecture. This stem-and-stage layout can be seen in both ResNet and InceptionV3 architectures.

Finally, the fully connected (FC) head, which is usually a multi-layer network ending in a softmax or another activation function, outputs the probability of each class for the given input. The most important aspect of classification models is their reusability: these models are reused for other tasks such as detection and segmentation. This is often done by removing the classification head and using the backbone of the model as a feature extractor.
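
As a small illustration of the last step, the sketch below turns raw head outputs (logits) into class probabilities with softmax. The logit values are made up for the boat example above; only the softmax formula itself is standard.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["human", "vehicle", "plant"]
logits = [0.5, 2.0, -1.0]  # hypothetical classification-head outputs for the boat image

probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
print(dict(zip(labels, (round(p, 3) for p in probs))), "->", predicted)
```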

Here are the most popular and high-performing DNNs for classification:


CoAtNet

CoAtNets (Convolution and Self-Attention Networks) are a family of hybrid DNNs released in 2021 by Google Research. They are built on two main principles: (i) unifying depthwise convolution and self-attention from Transformers through simple relative attention; and (ii) stacking convolution layers and attention layers vertically to improve the overall generalization and efficiency of the network.

Fig: Overview of a CoAtNet architecture (Source: CoAtNet on GitHub)

The network was benchmarked on the ImageNet-1K and ImageNet-21K datasets and on JFT-300M (an internal Google dataset). The increased efficiency shows in the reduction in the number of parameters (#Params) and floating-point operations (#FLOPs) by 55.4% (377M to 168M params) and 29.9% (289.8B to 203.1B FLOPs) respectively. CoAtNet-6 and CoAtNet-7 achieve the best performance, with state-of-the-art Top-1 accuracy of 90.45% and 90.88% respectively on the ImageNet dataset, overtaking NFNet, EfficientNet, and Vision Transformers.

ResNet (Residual Network)

ResNet, a.k.a. Residual Network architecture, was developed with the main intent of designing very deep networks that did not suffer from the “vanishing gradient” problem that was prevalent in its predecessors. The ResNet family consists of models with different numbers of layers, such as 34, 50, 101, 152, and 1202.

Fig: Architecture of a ResNet family – ResNet, Res2Net, ResNeXt, ResNext

The structure of a ResNet consists of a feedforward network with residual connections: the input of a block is added back to the block’s output. The operations inside these residual blocks vary across the different residual architectures. Residual connections are widely used in the majority of modern architectures, along with Inception blocks in models such as Inception-v4, ResNeXt, etc.
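
A residual connection can be sketched as y = relu(F(x) + x). The toy block below uses two dense layers for F purely for illustration; actual ResNet blocks use convolutions and batch normalization, and all names here are our own.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """y = relu(F(x) + x), where F is two dense layers with a ReLU in between."""
    fx = linear(relu(linear(x, w1, b1)), w2, b2)
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With all-zero weights F(x) is 0, so the block simply passes x through the skip path.
y = residual_block(x, zero_w, zero_b, zero_w, zero_b)
print(y)
```

Because the skip path carries the input forward even when F contributes nothing, stacking many such blocks does not block the signal, which is the property that makes very deep networks trainable.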

ResNeXt is a variant of ResNet that uses a split-transform-merge strategy to generate feature vectors. It then combines them by merging the outputs from different paths together with residual connections to previous blocks. This helps the model easily adapt to the newer datasets/tasks while reducing the number of adjustable hyper-parameters as compared to Inception models.

ResNeXt achieves a Top-1 accuracy of 84.8% (32x16d) and 85.4% (32x48d) on the ImageNet dataset, although this comes at a high latency cost. ResNet models are a good option for image classification if your application processes images that are approximately 300×300 in size. But the throughput of these models at that size does not necessarily predict their performance on larger images of, say, 650×650, because larger inputs require more compute and considerably more memory.


EfficientNet

EfficientNet is a CNN architecture that belongs to the family of models found automatically using Neural Architecture Search (NAS). EfficientNet models use a compound coefficient to uniformly scale the dimensions of width, depth, and resolution. Unlike traditional approaches, which scale these factors arbitrarily, this policy follows a fixed compound scaling method.

Suppose the available computational resources increase by a factor of 2^N. Compound scaling then increases network depth by α^N, width by β^N, and image resolution by γ^N, where α, β, and γ are constants determined by a small grid search. The compound coefficient φ plays the role of N: it is the single value that scales all three dimensions in a uniform manner.
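
The scaling rule can be sketched numerically. The constants below (α = 1.2, β = 1.1, γ = 1.15) are the values reported in the EfficientNet paper; the function names and the base resolution of 224 are our own illustrative choices, and the released B1 to B7 configurations need not match these raw values exactly.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the paper


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scale depth, width and input resolution together with one coefficient phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


def flops_factor(phi):
    """FLOPs grow roughly with depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution {r:.0f}px, FLOPs x{flops_factor(phi):.2f}")
```

Since α·β²·γ² is close to 2, each increment of φ roughly doubles the FLOPs, which is what ties the coefficient to the 2^N resource budget.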

The models range from B0 to B7 in increasing order of model size. The paper shows that the authors used Neural Architecture Search to optimize both accuracy and FLOPs. The main block of EfficientNet is an MBConv (mobile inverted bottleneck convolution) to which squeeze-and-excitation optimization is added. The baseline B0 model achieves 77.3% accuracy on the ImageNet dataset using only 5.3M parameters and 0.39B FLOPs, a 90% decrease in the number of FLOPs compared to a ResNet-50 model. The largest model in the family, EfficientNet-B7, outperforms B0 through B6 on ImageNet with 84.4% Top-1 accuracy and 97.1% Top-5 accuracy, using approximately 66M parameters and 37B FLOPs.

Fig: Architecture of an EfficientNet-B0 model (Source: ResearchGate)


MobileNet

MobileNet is a small, low-latency, and low-power architecture that uses depthwise separable convolutions to construct lightweight deep CNNs. Depthwise separable convolutions significantly reduce the number of parameters compared to an architecture with regular convolutions, which makes the model efficient for mobile and embedded vision applications. These architectures were originally designed to maximize performance on edge devices and embedded applications.
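
The parameter saving is easy to verify by counting weights. The sketch below compares a regular convolution with a depthwise separable one; the layer sizes are arbitrary examples and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256  # example layer shape, chosen only for illustration
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

The ratio equals 1/c_out + 1/k², so for 3×3 kernels the separable layer needs roughly a ninth of the weights.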

Fig: A detailed architecture of a MobileNet model (Source: ResearchGate)

MobileNet architectures currently come in three versions, MobileNet v1 through v3, all released in the past few years. They have achieved state-of-the-art results among lightweight models on datasets such as ImageNet on Top-1 and Top-5 accuracy metrics, using just around 5M parameters.

MobileNet v2 is an updated version of v1 with higher efficiency and performance metrics. It is almost twice as fast as v1, with just around 300M MACs (multiply-accumulate operations) compared to 570M MACs in v1, and it uses only around 80% of v1’s parameters. This makes it much more efficient than its predecessor. Although the performance of MobileNet models is acceptable and deployable on edge devices, it is not comparable to that of ResNet and EfficientNet models in most use cases, because MobileNets score lower on accuracy and other evaluation metrics.

Object Detection SOTA DNNs

Object detection is one of the most important and useful branches of computer vision. It identifies and describes the contents of an image along with the location of each object using bounding boxes. There are many approaches to this task. Two of the common ones are: (i) single-shot detection (architectures such as RetinaNet, YOLOv3, etc.), and (ii) two-stage detection, which uses a Region Proposal Network to find candidate objects in an image and a second network to refine the generated proposals and make predictions (two-stage networks such as R-CNN, Faster R-CNN).

Fig: An example of an Object Detection model applied on a real-world scenario (Source: Silicon Icarius)

The modern approach to solving object detection problems is through single-shot detection models, due to their low latency and their results on various evaluation metrics, which are on par with two-stage detectors.

Here are the most common models for object detection:


YoloV5

YoloV5 is one of the fastest and most accurate models used for real-time object detection. Developed by Ultralytics, YoloV5 was released in June 2020. YoloV5 is actually a family of models consisting of four sizes: s, m, l, and x. Each offers a different trade-off between accuracy and performance.

A graph showing yolo performance

YoloV5 mAP: 55.6

YoloV5 compute cost: 17 GFLOPs

YoloV5 can be used with PyTorch and ONNX. Compared to YoloV3, YoloV5 requires 75% fewer operations to achieve the same results.

SSD (Single Shot MultiBox Detector)

Fig: The model architecture of SSD (Source: Yolo V 3 network from scratch in pytorch on YouTube)

The SSD series of object detection models are Single Shot MultiBox Detectors (SSD). These models require only a single shot to detect multiple objects in an image using multi-box. In addition, these models are fast while remaining accurate, and their multi-scale feature maps can be seen as a pyramid representation of the image at various scales.

The SSD300 and SSD500 reach 59 and 22 FPS with mAP of 74.3 and 76.9 on the VOC2007 dataset, whereas Faster R-CNN has a much lower FPS of 7 and mAP of 73.2. The advantages of SSDs are that (i) they eliminate the bounding box proposal networks used in R-CNN models, and (ii) they predict bounding boxes and object classes from convolutional layers that progressively decrease in size.

The base network, usually a VGG-16 without the fully connected layers, acts as a feature extractor. To detect objects at multiple scales, additional CNN layers that decrease in size progressively are added to the base network. The network uses a matching strategy that picks the predicted box with the highest overlap with ground truth, along with a MultiBox loss that contains both a classification and a localization term (L = L_cls + α·L_loc), differentiating it from the two-stage detectors currently used. This makes the SSD model state-of-the-art in terms of both frame rate (FPS) and mean average precision (mAP). Recent models such as RetinaNet are built on the same principle with different losses such as Focal Loss, further improving the benchmark results on datasets such as COCO.


RetinaNet

RetinaNet is another one-stage object detection model. It combines a Feature Pyramid Network (FPN) with Focal Loss (FL), which makes its architecture stand out from other single-stage models and achieve state-of-the-art status. Focal loss was mainly designed to assign higher weights to hard, easily misclassified samples, such as images with heavy background noise and texture. At the same time it down-weights the easy samples, which balances the contribution of the two groups.
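
The weighting behaviour of focal loss can be seen in a short sketch. We assume the binary form FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t) with the commonly cited defaults γ = 2 and α = 0.25; the function name and the sample probabilities are illustrative.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 1 (positive) or 0 (negative)
    """
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident and correct: contributes very little
hard = focal_loss(0.10, 1)  # badly misclassified: dominates the loss
print(f"easy={easy:.6f} hard={hard:.6f} ratio={hard / easy:.0f}x")
```

Setting gamma to 0 removes the modulating factor and leaves an α-weighted cross-entropy, which shows that the (1 - p_t)^γ term is what suppresses the easy samples.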

Fig: A RetinaNet Architecture (Source:

The FPN, on the other hand, is the magic in this network. It contains a sequence of pyramid levels, each with multiple convolution layers of the same size. From one level to the next, the spatial size is scaled down to ½ of the previous level. This helps improve the performance for detecting larger objects. The architecture also includes a classification subnet, which outputs a probability for each of K classes for every anchor box, and a regression subnet, which regresses each anchor box to the closest ground-truth box.

Semantic Segmentation SOTA Models

Fig: An example of Image Segmentation (Source: Cityscapes)

Semantic segmentation assigns a label to every pixel of an image. The predictions are made at pixel level and are based on the category each pixel belongs to. Segmentation tasks are mostly benchmarked against standard datasets such as PASCAL VOC, Cityscapes, etc.
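
Segmentation quality is usually summarized with mean Intersection over Union (mIoU). Below is a minimal sketch that computes it over two small label maps; the maps are toy data, and skipping classes that appear in neither map is one common convention rather than the only one.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes; `pred` and `target` are 2-D lists of class ids."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 3x3 label maps with three classes (0, 1, 2).
target = [[0, 0, 1],
          [0, 1, 1],
          [2, 2, 2]]
pred = [[0, 0, 1],
        [0, 0, 1],
        [2, 2, 1]]

miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.4f}")
```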

Here are the most common models for segmentation:


SegNet

SegNet models have an encoder-decoder network architecture followed by a pixel-wise classification layer that assigns a label to each pixel. The encoder mainly consists of the 13 convolutional layers of the original VGG-16 model from Oxford, excluding the fully connected layers. As part of the encoder, max pooling of size 2×2 is applied and the indices of the corresponding maximum values are stored.

Fig: The Architecture of a SegNet model (Source: ResearchGate)

The decoder, on the other hand, acts as an upsampling mechanism. Upsampling and convolutions are performed to restore the encoded image to its original size, followed by pixel-wise softmax classification. The model outperforms DeConvNet and U-Net on several benchmarks. It attains state-of-the-art global average accuracy (G), mean Intersection over Union (mIoU), and Boundary F1 measure on the CamVid Road Segmentation dataset, with G of 90.4% and mIoU of 60.10. Similarly, it also outperforms these models on the SUN RGB-D dataset for Indoor Scene Segmentation, achieving an mIoU of 31.84 and global average accuracy of 72.63%. SegNet has a lower memory requirement during both training and testing because of its model size.
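
The stored pooling indices are what the SegNet decoder reuses when it upsamples. The sketch below shows that idea on a single 4×4 feature map in plain Python; the helper names are ours, and a real implementation would operate on batched, multi-channel tensors in a framework.

```python
def max_pool_with_indices(fmap):
    """2x2, stride-2 max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(fmap) - 1, 2):
        pooled_row, index_row = [], []
        for j in range(0, len(fmap[0]) - 1, 2):
            window = [(fmap[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Place each pooled value back at its recorded position; all other cells stay 0."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [6, 2, 3, 1]]

pooled, indices = max_pool_with_indices(fmap)  # encoder side
restored = max_unpool(pooled, indices, 4, 4)   # decoder side
print(pooled)
print(restored)
```

The unpooled map is sparse, which is why the decoder follows it with convolutions that fill in the remaining positions.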

Mask R-CNN

Mask R-CNN is a deep neural network architecture designed to solve image and instance segmentation problems in computer vision. It can separate the different objects present in images and videos by determining a mask for each detected object, along with its bounding box and class. It is built on top of Faster R-CNN.

The Mask R-CNN architecture consists of two stages: first, proposal generation using a Region Proposal Network (RPN); second, class prediction for each candidate object. In the first stage, the model generates multiple proposals for regions where an object could be present. The proposals are generated from the input image using anchor boxes, and an IoU score is assigned to each candidate bounding box.

The overlap between multiple bounding boxes around the same object is resolved using Non-Maximum Suppression (NMS). Next, the second stage of the Mask R-CNN architecture predicts the class of each object and generates a “mask” marking the pixels that belong to it. The procedure of this stage is similar to that of the RPN, but the differentiating factor is that the model uses RoIAlign in place of anchor generation to locate relevant areas in the generated feature maps. Both stages are connected to a backbone, as in a detection model. The backbone is usually a convolutional architecture used for feature extraction.
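
NMS itself is a short greedy procedure. The sketch below is a plain-Python illustration with boxes given as (x1, y1, x2, y2); the threshold of 0.5 and the sample boxes are arbitrary, and production code would typically call a framework's built-in NMS operator.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus a separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.82, so only the higher-scoring one survives, while the distant third box is untouched.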

Fig: The overall architecture of Mask R-CNN (Source: ResearchGate)

Mask R-CNN often uses CNN models such as ResNet and ResNeXt with a depth of 50 or 101 layers as the backbone. Overall, the ResNet-FPN backbone for feature extraction produces excellent gains in both accuracy and speed. Training uses a multi-task loss on each sampled RoI, L = L_cls + L_box + L_mask, where L_cls is the classification loss, L_box is the bounding-box loss, and L_mask is the mask loss. Finally, the network has an attached “head” for bounding-box recognition and mask prediction, which is applied to each candidate RoI separately. The model outperformed the MNC and FCIS+OHEM models on the COCO dataset, achieving AP50 and AP75 scores of {58.0, 37.8} and {60.0, 39.4} with ResNet-101-FPN and ResNeXt-101-FPN backbones respectively.

Whether a model is state-of-the-art therefore depends on several factors and not solely on the evaluation metrics discussed for the models above.

We also see that across all the different computer vision tasks, there are some conditions for deploying a model into production and using it in real-life scenarios. It needs to be scalable, lightweight, and easy to use, with good performance across various tasks. This is the case for image classification and object detection tasks.

What’s Next for DNNs?

The need for increased performance on more challenging tasks is driving demand for more powerful DNNs, tools, and hardware. This has pushed deep learning models to generalize better on various tasks. It has also increased the size of these models, which use bulky architectures to achieve state-of-the-art performance. As illustrated in the chart below, DNNs are getting larger and larger, with a growing number of parameters.

Fig: Exponential growth of number of parameters in deep learning models

Although these DNN models do achieve excellent results, they tend to be very large, which increases latency and makes them difficult to deploy and scale.

Hence, it has become important to work on sustainable solutions that can deliver cost-effective inference during production and ensure that models can be implemented on any hardware including resource-constrained devices such as mobile phones, laptops, or other edge devices. 

Intro to DeciNets – A New Generation of NAS-based Efficient DNNs

Deci has tackled this problem by developing Automated Neural Architecture Construction (AutoNAC) – a very efficient NAS-based technology that automatically generates the best DNNs for any hardware, use case, and performance requirements. To date, the AutoNAC engine has generated many groundbreaking deep neural networks that outperform all well-known state-of-the-art models across various computer vision tasks and hardware types. These DNNs are called DeciNets.

A proprietary family of neural networks, DeciNets include models for image classification, object detection, and semantic segmentation, and deliver a better accuracy-latency tradeoff than any known open-source neural net on the market, including EfficientNets, MobileNets, and YoloV5, among others. Importantly, each of the DeciNets was discovered using roughly four times the computation required to train a single network. By comparison, EfficientNet was discovered using NAS technology that took two orders of magnitude more compute power.

DeciNets for Image Classification

The next two graphs show how DeciNets, optimized for NVIDIA T4 hardware (a popular cloud machine) and for the NVIDIA Jetson Xavier NX GPU (an edge device), redefined the efficient frontiers for image classification models.

Fig: Neural network performance tradeoff for image classification on NVIDIA T4 GPU. Accuracy vs. latency (ms) for DeciNet instances (blue) and various well-known deep learning classification models. All models were quantized to 8-bit precision using TensorRT quantization.

Fig: Neural network performance tradeoff for image classification on NVIDIA Jetson Xavier NX GPU. Accuracy vs. latency (ms) for DeciNet instances (blue) and various well-known deep learning classification models. Quantization levels were selected for each model to maximize accuracy-latency tradeoff. FP16 quantized models appear as triangles, while INT8 quantized models appear as dots. All models were quantized using TensorRT quantization.

DeciNets also outperformed well-known models in terms of both accuracy and latency when running inference on another hardware type, Intel’s Cascade Lake CPU. The following graph shows benchmarks of a range of classification models on this CPU and the ImageNet dataset, compiled and quantized with OpenVINO.

Fig: Neural network performance tradeoff for image classification on Intel’s Cascade Lake CPU. Each of these models was compiled and quantized with OpenVINO.

DeciNets for Object Detection

The graphs below show how DeciNets, optimized again for NVIDIA T4 hardware and for an edge device, the iPhone 12 Pro, redefined the efficient frontiers for object detection models. Better DNNs for an iPhone, in particular, result not only in better performance or higher throughput, but also in lower compute and battery usage.

Fig: Mean average precision vs. latency (ms) for DeciNet instances (blue) and various well-known deep learning object detection models. All models were quantized to 8-bit precision using TensorRT quantization.

Fig: Mean average precision vs. latency (ms) for DeciNet instances (blue) and various well-known deep learning object detection models.

DeciNets can be optimized for specific tasks and hardware types (cloud, edge, or mobile) and fine-tuned for any dataset using an open-source computer vision training library called SuperGradients.

With DeciNets, developers and enterprises can quickly achieve outstanding accuracy and runtime performance on any hardware and guarantee success in production. These custom DNNs can deliver cost-effectiveness on the cloud and enable applications on edge devices. If you are interested in learning more about how you can leverage DeciNets models for your project, talk with our experts.

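# Load the pretrained ResNet-50 image classifier and its matching preprocessor
# from the Hugging Face Hub (weights are downloaded on first use).
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")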