EfficientNet is a family of convolutional neural network (CNN) architectures trained on ImageNet and evaluated through transfer learning on the CIFAR-10, CIFAR-100, Birdsnap, Stanford Cars, Flowers, FGVC Aircraft, Oxford-IIIT Pets, and Food-101 datasets.

Introduced by Mingxing Tan and Quoc V. Le in the paper “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.”

Submitted Version: May 28, 2019
Latest Version: September 11, 2020
Parameters: 5.3M to 66M
Task: Image Classification


Model Highlights

  • Task: Image classification
  • Model type: Convolutional neural network
  • Framework: TensorFlow
  • Dataset: ImageNet (training); CIFAR-10, CIFAR-100, Birdsnap, Stanford Cars, Flowers, FGVC Aircraft, Oxford-IIIT Pets, and Food-101 (transfer learning)

Model Size and Parameters

The authors developed a baseline network using a multi-objective neural architecture search that optimizes both accuracy and FLOPs. FLOPs are optimized rather than latency because no specific hardware device is targeted. The resulting network is efficient, hence the name EfficientNet. The authors found that accuracy improves when network width, depth, or resolution is increased, but that the gain diminishes as models grow larger. Better accuracy and efficiency are achieved by balancing the network’s width, depth, and resolution during scaling rather than scaling any one of them alone.
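The balance between width, depth, and resolution can be illustrated with the compound scaling rule from the EfficientNet paper, in which a single coefficient φ scales all three dimensions together. The sketch below is illustrative only: the constants α = 1.2, β = 1.1, and γ = 1.15 are the values the paper reports for the baseline network, while the function names and the returned dictionary are assumptions made for this example.

```python
# Illustrative sketch of compound scaling; not the authors' implementation.
# Bases for depth, width, and resolution (values reported in the paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

def flops_factor(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    m = compound_scale(phi)
    return m["depth"] * m["width"] ** 2 * m["resolution"] ** 2

for phi in range(4):
    print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

Because α · β² · γ² is close to 2, each unit increase in φ roughly doubles the FLOPs in this sketch.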

EfficientNet models are trained on ImageNet using the RMSProp optimizer with decay 0.9 and momentum 0.9, batch norm momentum 0.99, weight decay 1e-5, and an initial learning rate of 0.256 that decays by a factor of 0.97 every 2.4 epochs. Training also uses SiLU (Swish-1) activation, AutoAugment, and stochastic depth with a survival probability of 0.8. The dropout ratio is 0.2 for the baseline EfficientNet-B0. To report the final validation accuracy, the authors first set aside 25K randomly selected images from the training set as a minival set, perform early stopping on this minival set, and then evaluate the early-stopped checkpoint on the original validation set.
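The learning-rate schedule described above can be sketched as a small function. This is a minimal sketch that assumes a staircase decay, where the rate drops in discrete steps every 2.4 epochs; the text does not say whether the decay is stepped or continuous, and the function name `learning_rate` is hypothetical.

```python
import math

INITIAL_LR = 0.256    # initial learning rate
DECAY_RATE = 0.97     # multiplicative decay factor
DECAY_EPOCHS = 2.4    # epochs between decay steps

def learning_rate(epoch):
    """Staircase exponential decay (the stepping is an assumption)."""
    steps = math.floor(epoch / DECAY_EPOCHS)
    return INITIAL_LR * DECAY_RATE ** steps

for epoch in (0, 1, 2.4, 24):
    print(epoch, learning_rate(epoch))
```

A continuous variant would drop the `math.floor` call and use the fractional number of decay periods directly.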

The following table lists the sizes of the different EfficientNet models in terms of number of parameters and FLOPs.

Model | # of Parameters | # of FLOPs

Expected Input

The expected input of an EfficientNet model is a float tensor of pixels with values in the [0, 255] range.
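As an illustration of this input format, the sketch below builds a batch of one image as a float tensor with values in [0, 255]. It uses NumPy with random data as a stand-in for a real image; the 224×224 size and the batch-height-width-channel layout are assumptions for the example and may differ by model variant and framework.

```python
import numpy as np

# Random uint8 pixels stand in for a decoded RGB image (assumed 224x224).
rng = np.random.default_rng(seed=0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Cast to float without rescaling, so values stay in the [0, 255] range,
# then add a leading batch dimension.
batch = image.astype(np.float32)[np.newaxis, ...]

print(batch.shape, batch.dtype, batch.min(), batch.max())
```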

Expected Output

The expected output of an EfficientNet model depends on the task. For image classification, the expected output is a probability distribution over the classes.
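A probability distribution over classes is typically obtained by applying a softmax to the model’s raw scores (logits). The sketch below assumes the model returns logits and uses made-up scores and hypothetical class labels; it is not tied to any particular EfficientNet implementation.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits and labels for a three-class problem.
labels = ["cat", "dog", "bird"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
# Index of the most probable class.
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], probs)
```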

History and Applications

Convolutional neural networks (ConvNets) are typically developed under a fixed resource budget and then scaled up for better accuracy when more resources become available. The most common method is to increase the ConvNet’s depth or width. Scaling by image resolution is a less common but increasingly popular approach. Prior work typically scaled only one of the three dimensions: depth, width, or image size. It is possible to scale two or three dimensions arbitrarily, but doing so requires tedious manual tuning and often yields sub-optimal accuracy and efficiency. The developers of EfficientNet set out to fix this.

EfficientNet’s baseline network was created using a neural architecture search that optimized for both accuracy and FLOPs.

Many models have already reached hardware memory limits, so improved efficiency is required for any further gains in accuracy. EfficientNet has been applied to real-world tasks such as:

  • Flower species detection
  • Food dish recognition
  • Bird species recognition
  • Automobile and aircraft recognition
  • Pet species recognition
Loading the Model

from transformers import AutoImageProcessor, AutoModelForImageClassification

# Checkpoint name is an assumption; substitute the EfficientNet variant you need.
checkpoint = "google/efficientnet-b0"

processor = AutoImageProcessor.from_pretrained(checkpoint)

model = AutoModelForImageClassification.from_pretrained(checkpoint)