EfficientNet is a convolutional neural network (CNN) architecture pre-trained on ImageNet; the authors also report transfer-learning results on the CIFAR-10, CIFAR-100, Birdsnap, Stanford Cars, Flowers, FGVC Aircraft, Oxford-IIIT Pets, and Food-101 datasets.
EfficientNet was introduced by Mingxing Tan and Quoc V. Le in the paper “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, first published May 28, 2019 and last revised September 11, 2020. The model family ranges from 5.3M to 66M parameters.
The authors developed a baseline network by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. FLOPS are optimized rather than latency because no specific hardware device is targeted. The resulting network is efficient, hence the name EfficientNet. While the authors found that accuracy improved with increasing network width, depth, or resolution, they also found that this benefit waned as model size grew. Better accuracy and efficiency can instead be achieved by scaling a ConvNet in a balanced way across all three dimensions: width, depth, and resolution.
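The balanced scaling described above is the paper's compound scaling rule: depth, width, and resolution are scaled together by a single coefficient φ. The sketch below is illustrative rather than the authors' code, using the constants α=1.2, β=1.1, γ=1.15 reported for EfficientNet-B0:

```python
# Sketch of EfficientNet's compound scaling rule (illustrative, not the
# authors' implementation). A single compound coefficient phi scales depth,
# width, and resolution together. The constants were found by grid search
# in the paper, chosen so that alpha * beta**2 * gamma**2 ~= 2, i.e. FLOPS
# roughly double each time phi increases by 1.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: float) -> dict:
    """Return the depth/width/resolution multipliers for a given phi."""
    return {
        "depth": ALPHA ** phi,        # more layers
        "width": BETA ** phi,         # more channels per layer
        "resolution": GAMMA ** phi,   # larger input images
    }

if __name__ == "__main__":
    for phi in range(4):
        s = compound_scale(phi)
        print(f"phi={phi}: depth x{s['depth']:.2f}, "
              f"width x{s['width']:.2f}, resolution x{s['resolution']:.2f}")
```

Scaling all three dimensions by powers of a shared φ is what distinguishes compound scaling from the single-dimension scaling used in prior work.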
EfficientNet models are trained on ImageNet with the RMSProp optimizer (decay 0.9, momentum 0.9), batch-norm momentum 0.99, weight decay 1e-5, and an initial learning rate of 0.256 that decays by a factor of 0.97 every 2.4 epochs. The models also use the SiLU (Swish-1) activation, AutoAugment, stochastic depth with a survival probability of 0.8, and a dropout ratio of 0.2. To report final validation accuracy, the authors first set aside 25K randomly selected images from the training set as a minival set and then perform early stopping on this minival.
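Two of the pieces above can be written out directly: the exponential learning-rate decay and the SiLU activation. This is a minimal sketch of those formulas, not the authors' training code:

```python
# Sketch of two training ingredients described above (illustrative only;
# the actual training used RMSProp on much larger infrastructure).
import math

INITIAL_LR = 0.256
DECAY_RATE = 0.97
DECAY_EPOCHS = 2.4

def learning_rate(epoch: float) -> float:
    """Initial LR of 0.256, decayed by 0.97 every 2.4 epochs."""
    return INITIAL_LR * DECAY_RATE ** (epoch / DECAY_EPOCHS)

def silu(x: float) -> float:
    """SiLU (Swish-1) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

if __name__ == "__main__":
    for e in (0, 2.4, 24, 240):
        print(f"epoch {e:>5}: lr = {learning_rate(e):.6f}")
```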
The following table lists the sizes of the different EfficientNet models in terms of number of parameters and FLOPs.
| Model | # of Parameters | # of FLOPs |
|---|---|---|
| EfficientNet-B0 | 5.3M | 0.39B |
| EfficientNet-B1 | 7.8M | 0.70B |
| EfficientNet-B2 | 9.2M | 1.0B |
| EfficientNet-B3 | 12M | 1.8B |
| EfficientNet-B4 | 19M | 4.2B |
| EfficientNet-B5 | 30M | 9.9B |
| EfficientNet-B6 | 43M | 19B |
| EfficientNet-B7 | 66M | 37B |
The expected input of an EfficientNet model is a float tensor of pixel values in the range [0, 255].
The expected output of an EfficientNet model depends on the task. For image classification, the expected output is a probability distribution over the classes.
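Concretely, the classification head emits one raw logit per class, and a softmax turns those logits into the probability distribution mentioned above. A minimal sketch (the logit values are made up for illustration):

```python
# Minimal sketch of turning class logits into a probability distribution,
# as an image-classification head does. Logit values here are hypothetical.
import math

def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1])  # three hypothetical class logits
    print(probs)       # largest probability for the first class
    print(sum(probs))  # sums to 1.0
```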
To achieve the best results, convolutional neural networks (ConvNets) are typically developed under a limited resource budget and then scaled up when more resources become available, since scaling up is the usual route to higher accuracy. The most common methods are to increase the ConvNet's depth or width; scaling based on image resolution is another approach that is gaining traction but is still less widely used. Prior work typically scales only one dimension at a time: depth, width, or image size. While it is possible to scale a network along two or three dimensions arbitrarily, doing so requires tedious manual tuning and often yields sub-optimal accuracy and efficiency. The developers of EfficientNet set out to fix this.
EfficientNet was created by utilizing a neural architecture search that optimized for both accuracy and FLOPS.
Several real-world applications of EfficientNet have already hit the hardware memory limit, so any further accuracy gains require improved efficiency. Pre-trained checkpoints can be loaded through the Hugging Face transformers library, for example:
```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load an ImageNet-pre-trained EfficientNet checkpoint from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained("google/efficientnet-b7")
model = AutoModelForImageClassification.from_pretrained("google/efficientnet-b7")
```