The Vision Transformer (ViT) is an image classification model pretrained on ImageNet (ILSVRC-2012), ImageNet-21k, and JFT.

ViT was introduced by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.”

Submitted Version
October 22, 2020

Latest Version
June 21, 2021

Parameters: 86M, 307M, and 632M

Task: Image Classification


Model Highlights

  • Task: Image Classification
  • Model type: Vision Transformer
  • Framework: PyTorch
  • Dataset: ImageNet & JFT

Model Size and Parameters

ViT splits an image into a sequence of flattened 2D patches, which a standard Transformer encoder then processes. The encoder consists of alternating layers of multi-head self-attention and MLP blocks, and it uses a constant latent vector size D through all of its layers.

The Vision Transformer has several variants, each with a different size and configuration:

Model       Layers   MLP Size   Heads   Parameters
ViT-Base    12       3072       12      86M
ViT-Large   24       4096       16      307M
ViT-Huge    32       5120       16      632M
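
As a rough sanity check on the parameter counts quoted for the variants, the sketch below tallies the weights of a ViT backbone from its configuration. The hidden sizes (768, 1024, 1280), the patch sizes (16, 16, 14), and the 224×224 input are assumptions taken from the paper's configuration table rather than from the text above, and `vit_param_count` is a hypothetical helper. The classification head is left out, so the totals land slightly below the published figures.

```python
def vit_param_count(layers, hidden, mlp, patch=16, image=224, channels=3):
    """Approximate parameter count of a ViT backbone (classification head excluded)."""
    n_patches = (image // patch) ** 2
    # Linear projection of flattened patches to the latent size, with bias.
    patch_embed = patch * patch * channels * hidden + hidden
    cls_token = hidden
    pos_embed = (n_patches + 1) * hidden
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> mlp -> hidden, each with a bias.
    mlp_block = hidden * mlp + mlp + mlp * hidden + hidden
    # Two LayerNorms per encoder layer, each with a scale and a shift vector.
    layer_norms = 2 * 2 * hidden
    final_norm = 2 * hidden
    per_layer = attention + mlp_block + layer_norms
    return patch_embed + cls_token + pos_embed + layers * per_layer + final_norm


base = vit_param_count(layers=12, hidden=768, mlp=3072, patch=16)
large = vit_param_count(layers=24, hidden=1024, mlp=4096, patch=16)
huge = vit_param_count(layers=32, hidden=1280, mlp=5120, patch=14)

for name, count in [("ViT-Base", base), ("ViT-Large", large), ("ViT-Huge", huge)]:
    print(f"{name}: {count / 1e6:.1f}M")
```

The number of attention heads does not appear in the count because the heads only split the latent size D into equal slices; they do not add weights.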

Expected Input

The Vision Transformer (ViT) model takes a sequence of flattened 2D patches derived from an image as input.

The image x ∈ R^(H×W×C), with pixel values in the [0, 255] range, is reshaped into a sequence of flattened 2D patches x_p ∈ R^(N×(P^2·C)).

Here, (H, W) represents the resolution of the original image, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P^2 is the resulting number of patches and also serves as the effective input sequence length for the Transformer.
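
The reshaping described above can be sketched with NumPy as follows. This is a minimal illustration, not the model's actual preprocessing code: `patchify` is a hypothetical helper, and the 224×224×3 image with P = 16 is an assumed example size.

```python
import numpy as np


def patchify(image, patch_size):
    """Reshape an (H, W, C) image into (N, P*P*C) flattened patches, N = H*W / P^2."""
    h, w, c = image.shape
    p = patch_size
    if h % p != 0 or w % p != 0:
        raise ValueError("image height and width must be divisible by the patch size")
    # (H, W, C) -> (H/P, P, W/P, P, C): split both spatial axes into blocks.
    blocks = image.reshape(h // p, p, w // p, p, c)
    # -> (H/P, W/P, P, P, C): bring the two block indices to the front.
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    # -> (N, P*P*C): one flattened row per patch, in row-major patch order.
    return blocks.reshape((h // p) * (w // p), p * p * c)


image = np.arange(224 * 224 * 3).reshape(224, 224, 3)
patches = patchify(image, 16)
print(patches.shape)  # (196, 768)
```

With H = W = 224 and P = 16, N = 224·224/16² = 196, which is the sequence length the Transformer sees before the extra classification token is added.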

Each patch is flattened and mapped to D dimensions with a trainable linear projection. The outputs of this projection are called the patch embeddings.

A learnable embedding is prepended to the sequence of patch embeddings, similar to the [CLS] token in the BERT model. The state of this token at the output of the Transformer encoder serves as the image representation.

We add position embeddings to the patch embeddings to retain the positional information of the patches. The position embeddings used are standard learnable 1D embeddings.
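
The three steps above (linear projection, prepending the classification token, adding position embeddings) can be sketched in NumPy. The sizes (N = 196, P²·C = 768, D = 768), the random initialization, and the zero-valued class token are assumptions chosen for illustration; in the real model all three tensors are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N, PATCH_DIM, D = 196, 768, 768  # assumed sizes: 14x14 patches of 16x16x3 pixels

patches = rng.normal(size=(N, PATCH_DIM))                       # flattened patches
projection = rng.normal(scale=0.02, size=(PATCH_DIM, D))        # trainable in practice
class_token = np.zeros((1, D))                                  # learnable in practice
position_embeddings = rng.normal(scale=0.02, size=(N + 1, D))   # learnable 1D embeddings

# 1. Map every flattened patch to the latent size D.
patch_embeddings = patches @ projection                          # (N, D)
# 2. Prepend the classification token, giving a sequence of length N + 1.
sequence = np.concatenate([class_token, patch_embeddings], axis=0)
# 3. Add one position embedding per sequence element.
encoder_input = sequence + position_embeddings

print(encoder_input.shape)  # (197, 768)
```

Note that the position embeddings cover N + 1 positions, because the classification token occupies a position of its own.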

It’s often beneficial to use a higher resolution during the fine-tuning stage than during pre-training. When we feed images of higher resolution to the model, we keep the patch size the same.

This results in a larger effective sequence length. To adjust for the change in resolution, we modify the pre-trained position embeddings using a process called 2D interpolation. This process adjusts the embeddings according to their location in the original image.
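
A minimal sketch of this resizing step, assuming a square patch grid, bilinear interpolation with aligned corners, and a classification-token embedding stored in the first row. `resize_position_embeddings` is a hypothetical helper; actual implementations may use a different interpolation mode. The example moves from a 224×224 input (14×14 patches) to a 384×384 input (24×24 patches) at patch size 16.

```python
import numpy as np


def resize_position_embeddings(pos_embed, old_grid, new_grid):
    """Bilinearly resize (1 + old_grid**2, D) position embeddings to a new grid size."""
    cls_pos, patch_pos = pos_embed[:1], pos_embed[1:]
    d = patch_pos.shape[1]
    # Lay the patch embeddings back out on their 2D grid.
    grid = patch_pos.reshape(old_grid, old_grid, d)

    # Coordinates in the old grid at which each new grid cell is sampled.
    coords = np.linspace(0.0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo

    # Interpolate along the row axis, then along the column axis.
    rows = grid[lo] * (1 - frac)[:, None, None] + grid[hi] * frac[:, None, None]
    out = rows[:, lo] * (1 - frac)[None, :, None] + rows[:, hi] * frac[None, :, None]

    # The classification-token embedding has no spatial location and is kept as is.
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, d)], axis=0)


rng = np.random.default_rng(0)
pretrained = rng.normal(size=(1 + 14 * 14, 8))   # small D for illustration
resized = resize_position_embeddings(pretrained, old_grid=14, new_grid=24)
print(resized.shape)  # (577, 8)
```

The sequence length grows from 197 to 577 while the patch size and all other weights stay unchanged.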

In both the pre-training and fine-tuning stages, we attach a classification head to the encoding of the [CLS] token. This classification head is a multi-layer perceptron (MLP) with one hidden layer during pre-training. During fine-tuning, the classification head is simplified to a single linear layer.
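
The two head variants can be sketched as below. The tanh activation, the hidden width, the class counts, and the zero initialization of the fine-tuning layer are assumptions for illustration, and both function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D, HIDDEN = 768, 768                 # assumed latent size and hidden width
K_PRETRAIN, K_FINETUNE = 1000, 10    # assumed numbers of target classes


def pretrain_head(cls_encoding, w1, b1, w2, b2):
    """MLP with one hidden layer, as used during pre-training."""
    hidden = np.tanh(cls_encoding @ w1 + b1)  # activation choice is an assumption
    return hidden @ w2 + b2


def finetune_head(cls_encoding, w, b):
    """Single linear layer, as used during fine-tuning."""
    return cls_encoding @ w + b


cls_encoding = rng.normal(size=(D,))  # stand-in for the encoder output at [CLS]

pretrain_logits = pretrain_head(
    cls_encoding,
    rng.normal(scale=0.02, size=(D, HIDDEN)), np.zeros(HIDDEN),
    rng.normal(scale=0.02, size=(HIDDEN, K_PRETRAIN)), np.zeros(K_PRETRAIN),
)
# Assumed zero initialization: the new head starts with all logits equal to zero.
finetune_logits = finetune_head(cls_encoding, np.zeros((D, K_FINETUNE)), np.zeros(K_FINETUNE))

print(pretrain_logits.shape, finetune_logits.shape)  # (1000,) (10,)
```

Swapping the head is what lets a pre-trained encoder be reused for a dataset with a different number of classes.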

Expected Output

The classification head produces one score (logit) per target class. Applying a softmax to these scores gives a probability distribution over the target classes, and the class with the highest probability is chosen as the final prediction.
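
Turning raw class scores into a prediction can be sketched as follows. The three scores are made-up values for illustration, and `softmax` is written out by hand rather than taken from a library.

```python
import numpy as np


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


logits = np.array([2.0, 0.5, -1.0])   # hypothetical scores for three classes
probabilities = softmax(logits)
predicted_class = int(np.argmax(probabilities))

print(predicted_class)  # 0
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same predicted class.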

As described under Expected Input, the predictions come from the classification head attached to the [CLS] token encoding: an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning.

The encoder output for the [CLS] token serves as the image representation on which the predictions are based. The exact form of the output depends on the task and on the configuration of the classification head, for example the number of target classes.

History and Applications

ViT treats an image as a sequence of patches and processes it with a standard Transformer encoder, much as text is processed in Natural Language Processing (NLP). This approach, while simple, has proven surprisingly effective when combined with pre-training on large datasets.

The Vision Transformer has been shown to match or exceed the state of the art on many image classification datasets while being relatively cheap to pre-train. One of its key strengths is that self-attention lets it integrate information across the entire image, even in the lowest layers.

However, the Vision Transformer is not limited to image classification tasks. It holds promise for other computer vision tasks, such as detection and segmentation. Furthermore, the model’s performance can be improved through further scaling and exploration of self-supervised pre-training methods. Initial experiments have shown improvement from self-supervised pre-training, but a significant gap exists between self-supervised and large-scale supervised pre-training.

The Vision Transformer’s ability to handle images as sequences of patches opens up new possibilities for its application in various fields. Its performance is expected to improve as research in this area continues.

Example Usage
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load a ViT checkpoint (ViT-Base, 16x16 patches, 224x224 input) from the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")