Description
The Vision Transformer (ViT) is an image classification model pretrained on ImageNet (ILSVRC-2012), ImageNet-21k, and JFT.
Publishers
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, in the paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale"
Submitted Version
October 22, 2020
Latest Version
June 21, 2021
Size
86M, 307M, and 632M parameters (ViT-Base, ViT-Large, and ViT-Huge)
ViT consists of a sequence of flattened 2D patches derived from an image, which a Transformer encoder processes. The encoder consists of alternating layers of multiheaded self-attention and MLP blocks, each with a constant latent vector size D.
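As a concrete illustration, here is a minimal PyTorch sketch of one such encoder block, assuming the pre-norm layout described in the paper and the ViT-Base configuration from the table below; the class name and the demo tensor are illustrative, not the reference implementation.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: LayerNorm -> multi-head self-attention -> residual,
    then LayerNorm -> MLP -> residual. The latent size D stays constant throughout."""
    def __init__(self, dim: int, num_heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-block with residual connection
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # MLP sub-block with residual connection
        x = x + self.mlp(self.norm2(x))
        return x

# ViT-Base configuration: D=768, 12 heads, MLP size 3072
block = EncoderBlock(dim=768, num_heads=12, mlp_dim=3072)
tokens = torch.randn(1, 197, 768)  # 196 patches + 1 [class] token
print(block(tokens).shape)  # torch.Size([1, 197, 768])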
The Vision Transformer has several variants, each with different sizes and configurations:
Model | Layers | Hidden Size D | MLP Size | Heads | Parameters
ViT-Base | 12 | 768 | 3072 | 12 | 86M
ViT-Large | 24 | 1024 | 4096 | 16 | 307M
ViT-Huge | 32 | 1280 | 5120 | 16 | 632M
Expected Input
The Vision Transformer (ViT) model takes a sequence of flattened 2D patches derived from an image as input.
The image x ∈ R^(H×W×C), with pixel values in the [0, 255] range, is reshaped into a sequence of flattened patches x_p ∈ R^(N×(P^2·C)).
Here, (H, W) is the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P^2 is the resulting number of patches, which also serves as the effective input sequence length for the Transformer.
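To make this reshaping concrete, here is a minimal sketch in PyTorch, assuming a 224×224 RGB image and 16×16 patches (both illustrative choices):

import torch

H = W = 224   # image resolution
C = 3         # channels
P = 16        # patch resolution
N = (H * W) // (P * P)  # number of patches = effective sequence length (196 here)

x = torch.rand(C, H, W)  # image tensor, values scaled from [0, 255] to [0, 1]

# Reshape (C, H, W) -> (N, P^2 * C): cut the image into a grid of P x P patches
# and flatten each patch into a single vector.
patches = (
    x.reshape(C, H // P, P, W // P, P)
     .permute(1, 3, 2, 4, 0)          # (H/P, W/P, P, P, C)
     .reshape(N, P * P * C)           # (N, P^2 * C)
)
print(patches.shape)  # torch.Size([196, 768])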
Each flattened patch is mapped to the latent dimension D with a trainable linear projection; the outputs of this projection are the patch embeddings.
We prepend a learnable [class] embedding, similar to the [class] token in the BERT model, to the sequence of patch embeddings.
We add position embeddings to the patch embeddings to retain the positional information of the patches. The position embeddings used are standard learnable 1D embeddings.
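The following sketch ties these three steps together (linear projection, prepending the [class] embedding, and adding position embeddings). The tensor shapes assume the 224×224 / 16×16 configuration above, and the zero initialization is an illustrative simplification:

import torch
import torch.nn as nn

N, P, C, D = 196, 16, 3, 768  # patches, patch size, channels, latent size

patches = torch.rand(1, N, P * P * C)               # flattened patches, batch of 1

projection = nn.Linear(P * P * C, D)                # trainable linear projection (patch embedding)
cls_token = nn.Parameter(torch.zeros(1, 1, D))      # learnable [class] embedding
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))  # learnable 1D position embeddings

patch_embeddings = projection(patches)                                       # (1, N, D)
tokens = torch.cat([cls_token.expand(1, -1, -1), patch_embeddings], dim=1)   # prepend [class]
tokens = tokens + pos_embed                                                  # add positional information
print(tokens.shape)  # torch.Size([1, 197, 768])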
It’s often beneficial to use a higher resolution during the fine-tuning stage than during pre-training. When we feed images of higher resolution to the model, we keep the patch size the same.
This results in a larger effective sequence length. To adjust for the change in resolution, we modify the pre-trained position embeddings using a process called 2D interpolation. This process adjusts the embeddings according to their location in the original image.
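A minimal sketch of this 2D interpolation is shown below, assuming a 224-pixel pre-training resolution, a 384-pixel fine-tuning resolution, and 16×16 patches (all illustrative); the bicubic interpolation mode is an assumption rather than a detail taken from this card.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """2D-interpolate pre-trained position embeddings of shape (1, 1 + N, D) to a new grid size.

    The [class] token embedding is kept as-is; the N patch position embeddings are laid out
    on their original 2D grid, resized, and flattened back.
    """
    cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(grid_pos.shape[1] ** 0.5)          # e.g. 14 for 224px images and 16px patches
    d = grid_pos.shape[-1]
    grid_pos = grid_pos.reshape(1, old_grid, old_grid, d).permute(0, 3, 1, 2)  # (1, D, g, g)
    grid_pos = F.interpolate(grid_pos, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, d)
    return torch.cat([cls_pos, grid_pos], dim=1)

# 224px pre-training (14x14 grid) -> 384px fine-tuning (24x24 grid), patch size 16
pos_embed = torch.randn(1, 1 + 14 * 14, 768)
print(resize_pos_embed(pos_embed, new_grid=24).shape)  # torch.Size([1, 577, 768])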
In both the pre-training and fine-tuning stages, we attach a classification head to the encoding of the [CLS] token. This classification head is a multi-layer perceptron (MLP) with one hidden layer during pre-training. During fine-tuning, the classification head is simplified to a single linear layer.
Expected Output
The model outputs a probability distribution over the target classes, indicating the predicted class of the input image. The class with the highest probability is chosen as the final prediction.
During both pre-training and fine-tuning, a classification head is attached to the [CLS] token encoding. The classification head is implemented as a multi-layer perceptron (MLP) with one hidden layer during pre-training and as a single linear layer during fine-tuning.
The image representation serves as the basis for the model’s predictions. The exact form of the output will depend on the specific task and the configuration of the classification head.
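To make the two head variants and the resulting prediction concrete, here is a minimal sketch; the hidden-layer size and tanh non-linearity of the pre-training head are assumptions, and the 1,000-class output is illustrative (matching ImageNet):

import torch
import torch.nn as nn

D, num_classes = 768, 1000

# Pre-training head: MLP with one hidden layer (size and non-linearity assumed here)
pretrain_head = nn.Sequential(nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes))

# Fine-tuning head: a single linear layer on the [CLS] token encoding
finetune_head = nn.Linear(D, num_classes)

cls_encoding = torch.randn(1, D)          # encoder output at the [CLS] token
logits = finetune_head(cls_encoding)      # (1, num_classes)
probs = logits.softmax(dim=-1)            # probability distribution over the target classes
prediction = probs.argmax(dim=-1)         # class with the highest probability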
ViT treats an image as a sequence of patches and processes it using a standard Transformer encoder, similar to how text is processed in Natural Language Processing (NLP). This approach, while simple, has proven surprisingly effective when combined with pre-training on large datasets.
The Vision Transformer has been shown to match or exceed state-of-the-art performance on many image classification datasets while being relatively cost-effective to pre-train. Thanks to its self-attention mechanism, the model can integrate information across the entire image even in its lowest layers, which is one of its key strengths.
However, the Vision Transformer is not limited to image classification tasks. It holds promise for other computer vision tasks, such as detection and segmentation. Furthermore, the model’s performance can be improved through further scaling and exploration of self-supervised pre-training methods. Initial experiments have shown improvement from self-supervised pre-training, but a significant gap exists between self-supervised and large-scale supervised pre-training.
The Vision Transformer’s ability to handle images as sequences of patches opens up new possibilities for its application in various fields. Its performance is expected to improve as research in this area continues.
The Vision Transformer (ViT) was trained and evaluated on multiple datasets to explore its scalability and performance across different tasks and conditions. The primary datasets used include:
ImageNet (ILSVRC-2012): This large-scale dataset has 1,000 classes and 1.3 million images. It is a subset of the larger ImageNet dataset and is commonly used for benchmarking image classification models.
ImageNet-21k: This is a superset of the ImageNet dataset, containing 21,000 classes and 14 million images. It provides a more challenging task due to the larger number of classes and images.
JFT-300M: This Google-internal dataset contains 18,000 classes and 303 million high-resolution images. It is one of the most extensive image datasets used for pre-training and is used to test the model's scalability.
Evaluation Metrics
There are two primary evaluation methods for ViT models: fine-tuning accuracy and few-shot accuracy.
Fine-tuning accuracy measures the performance of a model after it has been fine-tuned on a specific dataset.
Few-shot accuracy is obtained by solving a regularized least-squares regression problem that maps the representation of a subset of training images to target vectors. This formulation allows for the exact solution to be recovered in a closed form. Although the primary focus is fine-tuning performance, linear few-shot accuracies are sometimes used for fast on-the-fly evaluation where fine-tuning is too costly.
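A minimal NumPy sketch of this closed-form evaluation is shown below; the {-1, +1} target encoding follows the original paper, while the function name and regularization strength are illustrative choices.

import numpy as np

def few_shot_linear_eval(train_feats, train_labels, test_feats, test_labels,
                         num_classes, lam=1.0):
    """Closed-form regularized least-squares (ridge regression) few-shot evaluation.

    Maps frozen image representations to {-1, +1}^K target vectors and recovers
    the exact solution W = (X^T X + lam*I)^-1 X^T Y in closed form.
    """
    X = train_feats                                   # (n, D) frozen representations
    Y = -np.ones((len(train_labels), num_classes))    # targets in {-1, +1}^K
    Y[np.arange(len(train_labels)), train_labels] = 1.0

    D = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)  # exact closed-form solution

    preds = (test_feats @ W).argmax(axis=1)           # highest score wins
    return (preds == test_labels).mean()              # few-shot accuracy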
In terms of computational resources, the Vision Transformer models are evaluated based on the number of TPUv3-core-days taken to pre-train each of them. This metric represents the number of TPU v3 cores used for training multiplied by the training time in days.
In terms of performance, the ViT-L/16 model pre-trained on the JFT-300M dataset outperforms the BiT-L model (which is pre-trained on the same dataset) on all tasks, while requiring substantially fewer computational resources to train. The larger ViT-H/14 model further improves performance, especially on more challenging datasets like ImageNet, CIFAR-100, and the VTAB suite. Interestingly, this model still took substantially less compute to pre-train than the prior state-of-the-art.
For example, on the VTAB tasks, the ViT-H/14 model achieved an accuracy of 77.63%, outperforming the BiT-L model which achieved an accuracy of 76.29%. On the ImageNet dataset, the ViT-H/14 model achieved an accuracy of 88.55%, while the ViT-L/16 model achieved an accuracy of 87.76%.
The inference performance of the model depends on the hardware and the specific task. However, due to the large model size and the complexity of the Transformer architecture, ViT inference may be slower than that of traditional CNNs.
The Vision Transformer is a potent tool for various computer vision tasks. The next section shows how to load and fine-tune a production-ready, pre-trained ViT model that incorporates best practices and validated hyperparameters for achieving top-tier accuracy.
Define your dataset path and the directory where you wish to save your checkpoints, and you’re all set. Ensure that your dataset is set up according to the data directory specified in the recipe.
python -m super_gradients.examples.train_from_recipe_example.train_from_recipe architecture=vit_base dataset_interface.data_dir=<path_to_dataset> ckpt_root_dir=<checkpoint_directory>
Using SuperGradients, you can easily load a pre-trained ViT model onto your machine. Just initialize your Trainer and load your desired ViT model with pre-trained weights.
from super_gradients.training import models
from super_gradients.common.object_names import Models

model = models.get(Models.vit_base, pretrained_weights="ImageNet21K")
The models trained with SuperGradients are production-ready. They are compatible with deployment tools like TensorRT (NVIDIA) and OpenVINO (Intel) which makes it convenient for you to move them into production.
# Load model with pretrained weights
import torch
from super_gradients.training import models
from super_gradients.common.object_names import Models

model = models.get(Models.vit_base, pretrained_weights="ImageNet21K")

# Prepare model for conversion
# Input size is in format of [Batch x Channels x Width x Height]
model.eval()
model.prep_model_for_conversion(input_size=[1, 3, 224, 224])

# Create dummy_input
dummy_input = torch.randn(1, 3, 224, 224)

# Convert model to onnx
torch.onnx.export(model, dummy_input, "vit_base.onnx")
For more code examples, recipes, and advanced training techniques such as transfer learning, knowledge distillation, and more, refer to SuperGradients on GitHub.
License
Apache-2.0
We’d love your feedback on the information presented in this card. Please also share any unexpected results.
For a short meeting with the SuperGradients team, use this link and choose your preferred time.
# Alternatively, a pre-trained ViT model can be loaded from the Hugging Face Hub
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
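A minimal usage sketch for the checkpoint loaded above; the image file path is illustrative.

import torch
from PIL import Image

# Illustrative local file path; replace with your own image.
image = Image.open("example.jpg").convert("RGB")

inputs = extractor(images=image, return_tensors="pt")   # resize, normalize, and batch the image
with torch.no_grad():
    logits = model(**inputs).logits                      # (1, num_classes)

predicted_class = logits.argmax(dim=-1).item()           # class with the highest probability
print(model.config.id2label[predicted_class])            # human-readable label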