YOLOX is an object detection model pre-trained on the COCO 2017 dataset.
Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun (Megvii Technology), in the paper “YOLOX: Exceeding YOLO Series in 2021”
July 18, 2021
August 6, 2021
YOLOX uses a large backbone called Darknet-53. The backbone is built from 1×1 convolutional layers, 3×3 convolutional layers, and residual connections, which makes it a powerful feature extractor.
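As a rough illustration of how these pieces combine, the sketch below counts the convolution weights in one Darknet-53-style residual unit: a 1×1 convolution that halves the channel count, a 3×3 convolution that restores it, and a skip connection that adds the input back. The function name and the decision to ignore biases and batch-normalization parameters are assumptions made for illustration, not details taken from the YOLOX code.

```python
def residual_unit_conv_weights(channels):
    """Count conv weights in one Darknet-53-style residual unit (illustrative).

    Assumed layout: 1x1 conv (C -> C/2), then 3x3 conv (C/2 -> C), then the
    input is added back through the skip connection, which has no weights.
    Biases and batch-norm parameters are ignored.
    """
    half = channels // 2
    squeeze = 1 * 1 * channels * half   # kernel_h * kernel_w * in_ch * out_ch
    expand = 3 * 3 * half * channels    # the 3x3 conv dominates the cost
    return squeeze + expand


for c in (64, 128, 256):
    print(c, residual_unit_conv_weights(c))
```

Under these assumptions the 3×3 convolution accounts for nine tenths of the weights, which is why the 1×1 layer is used to shrink the channel count first.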
To aggregate the features produced by the Darknet backbone, the model uses a feature pyramid network built on a Path Aggregation Network (PANet). A feature pyramid represents the image at several spatial resolutions (widths and heights), so that objects of different sizes can be detected. The figure below illustrates the full workflow.
In earlier versions of YOLO (YOLOv3 through YOLOv5), the detection head was coupled, meaning a single head performed two different tasks: classification and regression. YOLOX uses a decoupled head with separate branches, one for classification and another for bounding box regression.
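The difference between the two head designs can be sketched at a single spatial location, where a 1×1 convolution reduces to a matrix-vector product. Everything below is a toy under stated assumptions: the sizes `IN_CH` and `HIDDEN`, the random weights, and all function names are hypothetical, and the activations, normalization, and 3×3 convolutions of a real head are left out. Attaching the IoU (objectness) output to the regression branch is also an assumption about the layout.

```python
import random

NUM_CLASSES = 80      # COCO has 80 object classes
IN_CH, HIDDEN = 8, 4  # hypothetical toy sizes


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a feature vector.

    At one spatial location, a 1x1 convolution is exactly this operation.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Coupled head: one projection produces box, objectness and class scores.
coupled_w = random_matrix(4 + 1 + NUM_CLASSES, IN_CH, rng)

# Decoupled head: a shared stem, then separate branches per task.
stem_w = random_matrix(HIDDEN, IN_CH, rng)
cls_branch_w = random_matrix(HIDDEN, HIDDEN, rng)
reg_branch_w = random_matrix(HIDDEN, HIDDEN, rng)
cls_w = random_matrix(NUM_CLASSES, HIDDEN, rng)
reg_w = random_matrix(4, HIDDEN, rng)
iou_w = random_matrix(1, HIDDEN, rng)


def coupled_head(x):
    out = linear(coupled_w, x)
    return out[:4], out[4:5], out[5:]      # reg, iou, cls sliced from one vector


def decoupled_head(x):
    stem = linear(stem_w, x)
    cls_feat = linear(cls_branch_w, stem)  # classification branch
    reg_feat = linear(reg_branch_w, stem)  # regression branch
    # Assumption: the IoU/objectness output shares the regression branch.
    return linear(reg_w, reg_feat), linear(iou_w, reg_feat), linear(cls_w, cls_feat)


feature = [rng.uniform(-1.0, 1.0) for _ in range(IN_CH)]
for head in (coupled_head, decoupled_head):
    reg, iou, cls = head(feature)
    print(head.__name__, len(reg), len(iou), len(cls))
```

Both heads return outputs of the same sizes; what changes is that the decoupled version gives classification and regression their own intermediate features instead of forcing one projection to serve both tasks.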
In the original paper, a series of experiments were conducted to compare different-size variants of the YOLOX architecture. The sizes of these variants, including their parameters and GFLOPs, are detailed below:
The YOLOX architecture consists of three parts: the backbone, the neck, and the head. The input of the model is an image, which is passed through a Convolutional Neural Network (CNN) backbone to extract features (embeddings). These features are then passed through the neck, whose job is to mix and combine the features formed in the CNN backbone to prepare them for the head. Finally, the head uses these feature maps to output the localization and classification scores.
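The neck's "mix and combine" step can be sketched in plain Python. The example below shows a single top-down merge, assuming a coarse feature map is upsampled 2× with nearest-neighbour interpolation and then concatenated channel-wise with the next finer map, which is how YOLO-style necks typically fuse scales. The helper names and the toy nested-list feature maps are hypothetical, and a real neck also applies convolutions before and after the merge.

```python
def upsample_nearest_2x(channel):
    """Nearest-neighbour 2x upsampling of one 2D channel (a list of rows)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(fine, coarse):
    """Fuse a coarse feature map into a finer one.

    Each feature map is a list of channels; each channel is a list of rows.
    """
    upsampled = [upsample_nearest_2x(ch) for ch in coarse]
    # After upsampling, the spatial sizes must match the finer map.
    if (len(upsampled[0]), len(upsampled[0][0])) != (len(fine[0]), len(fine[0][0])):
        raise ValueError("coarse map must be exactly half the size of the fine map")
    return fine + upsampled                        # channel-wise concatenation


coarse = [[[1, 2], [3, 4]]]                  # 1 channel, 2x2
fine = [[[0] * 4 for _ in range(4)]]         # 1 channel, 4x4
merged = top_down_merge(fine, coarse)
print(len(merged), "channels of size", len(merged[0]), "x", len(merged[0][0]))
```

The merged map keeps the fine map's resolution while carrying the coarse map's information in extra channels, which is the property the head relies on.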
The YOLOv3 backbone and YOLOX backbone are the same, but the models have different heads, as illustrated in the figure below. While YOLOX uses a decoupled head, YOLOv3 uses a coupled head.
YOLOX outputs three tensors, each holding a different kind of information, instead of one massive tensor with all the information. YOLOX outputs the following information.
Each “pixel” in the height and width of an output is a different bounding box prediction, so the total number of predictions is head_output_W * head_output_H summed over the 3 heads. The three feature maps produced by the feature pyramid network (PANet) are each fed into the head of YOLOX, so every kind of output exists three times instead of once. In total, YOLOX produces three each of the Cls, Reg, and IoU outputs, making nine output tensors.
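To make the counting concrete, the sketch below lists the output shapes for one input size. The 640×640 input, the strides of 8, 16, and 32, the 80 COCO classes, and the channel counts (4 for Reg, 1 for IoU) are assumptions chosen for illustration rather than values stated above, and the function name is hypothetical.

```python
def head_output_shapes(input_size, strides=(8, 16, 32), num_classes=80):
    """Return per-level (channels, height, width) shapes for Cls, Reg and IoU."""
    shapes = []
    for stride in strides:
        h = w = input_size // stride   # each level downsamples the input by its stride
        shapes.append({
            "stride": stride,
            "cls": (num_classes, h, w),  # one score per class
            "reg": (4, h, w),            # box coordinates
            "iou": (1, h, w),            # IoU / objectness score
        })
    return shapes


shapes = head_output_shapes(640)
num_tensors = sum(len(("cls", "reg", "iou")) for _ in shapes)
num_predictions = sum(s["reg"][1] * s["reg"][2] for s in shapes)

for s in shapes:
    print(s)
print("tensors:", num_tensors, "predictions:", num_predictions)
```

With these assumed values the three levels are 80×80, 40×40, and 20×20, giving 6400 + 1600 + 400 = 8400 box predictions spread across nine tensors.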
Introduced in 2015, YOLO is an object detection algorithm whose speed has become a standard for object detection in the field of computer vision. In speed, YOLO outperforms earlier object detection approaches, including sliding-window object detection, R-CNN, Fast R-CNN, and Faster R-CNN.
YOLOX achieves a better trade-off between speed and accuracy than its counterparts, including PP-YOLOv2, YOLOv3, YOLOv4, and YOLOv5. YOLOX improves on the YOLOv3 architecture, which is a widely used detector in the industry because of its broad compatibility.
Some real-world applications of YOLOX include:
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")