Description
YOLOX is an object detection model pre-trained on the COCO 2017 dataset.
Publishers
Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun of Megvii Technology, in the paper “YOLOX: Exceeding YOLO Series in 2021”
Submitted Version
July 18, 2021
Latest Version
August 6, 2021
Size
0.91M–99.1M parameters
YOLOX uses a large backbone called Darknet-53. The architecture of the backbone includes 1×1 convolutional layers, residual connections, and 3×3 convolutional layers, making YOLOX a powerful feature extractor.
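For intuition, here is a minimal sketch (not the library's exact implementation) of the Darknet-style residual block described above: a 1×1 convolution that reduces channels, a 3×3 convolution that restores them, and a skip connection around the pair.

```python
import torch
import torch.nn as nn

class DarknetResidualBlock(nn.Module):
    """Illustrative Darknet-style residual block: 1x1 conv -> 3x3 conv -> skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        hidden = channels // 2
        self.reduce = nn.Sequential(               # 1x1 conv halves the channel count
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.LeakyReLU(0.1),
        )
        self.expand = nn.Sequential(               # 3x3 conv restores the channel count
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.expand(self.reduce(x))     # residual (skip) connection

block = DarknetResidualBlock(64)
features = block(torch.randn(1, 64, 32, 32))       # output shape matches the input: [1, 64, 32, 32]
```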
To aggregate the feature information produced by the Darknet backbone, the model uses a feature pyramid network (PANet). A feature pyramid network combines feature maps extracted at different spatial resolutions (widths and heights), so objects of different sizes can be detected. The figure below illustrates the full workflow.
In previous versions of YOLO, i.e., YOLOv3–v5, the detection head was coupled, meaning a single head performs two different tasks: classification and regression. YOLOX uses a decoupled head, with one branch for classification and another for bounding box regression.
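The sketch below illustrates the idea of a decoupled head in PyTorch. It is simplified for clarity (the real YOLOX head stacks additional convolutions per FPN level), and the module and branch names are illustrative only.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Simplified decoupled head: separate branches for classification and box regression."""
    def __init__(self, in_channels: int, num_classes: int = 80):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.cls_branch = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # class scores
        self.reg_branch = nn.Conv2d(in_channels, 4, kernel_size=1)            # box offsets (x, y, w, h)
        self.obj_branch = nn.Conv2d(in_channels, 1, kernel_size=1)            # objectness / IoU score

    def forward(self, feature_map):
        x = self.stem(feature_map)
        return self.cls_branch(x), self.reg_branch(x), self.obj_branch(x)

head = DecoupledHead(in_channels=256)
cls_out, reg_out, obj_out = head(torch.randn(1, 256, 80, 80))  # one prediction per spatial cell
```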
In the original paper, a series of experiments were conducted to compare different-size variants of the YOLOX architecture. The sizes of these variants, including their parameters and GFLOPs, are detailed below:
Model | Parameters | GFLOPs |
--- | --- | --- |
YOLOX-Nano | 0.91M | 1.08 |
YOLOX-Tiny | 5.06M | 6.45 |
YOLOX-S | 9.0M | 26.8 |
YOLOX-M | 25.3M | 73.8 |
YOLOX-L | 54.2M | 155.6 |
YOLOX-X | 99.1M | 281.9 |
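If you want to verify the parameter counts in the table above yourself, a quick sketch like the following works (assuming SuperGradients is installed and that these are the registered model names in the library):

```python
from super_gradients.training import models

# Instantiate each variant with the 80 COCO classes and count its parameters
for name in ["yolox_n", "yolox_t", "yolox_s", "yolox_m", "yolox_l", "yolox_x"]:
    model = models.get(name, num_classes=80)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.2f}M parameters")
```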
Expected Input
The YOLOX architecture consists of three parts: the backbone, the neck, and the head. The input to the model is an image. The input is passed through a convolutional neural network (CNN) backbone to extract features (embeddings). These features are passed through the neck, whose job is to mix and combine the feature maps produced by the backbone and prepare them for the head. Finally, the head uses these feature maps to output localization and classification scores.
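As a minimal sketch of the expected input format (assuming SuperGradients is installed), you can pass a dummy image-shaped tensor through a pre-trained YOLOX-Tiny:

```python
import torch
from super_gradients.training import models

model = models.get("yolox_t", pretrained_weights="coco")
model.eval()

# YOLOX expects a batch of RGB images shaped [batch, channels, height, width];
# 640x640 is the standard COCO resolution
dummy_image = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    raw_outputs = model(dummy_image)  # raw head outputs; post-processing (e.g. NMS) yields final boxes
```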
The YOLOv3 backbone and YOLOX backbone are the same, but the models have different heads, as illustrated in the figure below. While YOLOX uses a decoupled head, YOLOv3 uses a coupled head.
Instead of one large tensor that holds all of the information, YOLOX outputs three tensors, each holding a different kind of information.
Each “pixel” in the height and width of an output map corresponds to a different bounding box prediction, so there are 3 (heads) × head_output_W × head_output_H predictions in total. The three outputs of the feature pyramid network (PANet) are fed into the head of YOLOX, so each level produces its own set of outputs rather than a single one. YOLOX therefore produces a Cls (classification), Reg (box regression), and IoU (objectness) output for each of the three levels, nine output tensors in total.
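As a concrete illustration of how many predictions this produces (assuming a 640×640 input and the standard FPN strides of 8, 16, and 32), each head predicts one box per spatial cell:

```python
input_size = 640
strides = [8, 16, 32]  # spatial downsampling factor of each of the three head levels

per_head = [(input_size // s) ** 2 for s in strides]  # [6400, 1600, 400] cells per head
total_predictions = sum(per_head)                     # 8400 candidate boxes before NMS
print(per_head, total_predictions)
```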
Introduced in 2015, YOLO is a state-of-the-art object detection algorithm whose speed set a new standard for object detection in computer vision. YOLO outperforms earlier object detection approaches, including sliding-window object detection, R-CNN, Fast R-CNN, and Faster R-CNN, particularly in inference speed.
YOLOX achieves a better trade-off between speed and accuracy than its counterparts, including PP-YOLOv2, YOLOv3, YOLOv4, and YOLOv5. YOLOX improves on the YOLOv3 architecture, which is a widely used detector in the industry because of its broad compatibility.
Some real-world applications of YOLOX include:
YOLOX was pre-trained on the COCO 2017 object detection dataset. The COCO (Common Objects in Context) dataset is an image recognition dataset for object detection, segmentation, and image captioning tasks. The COCO dataset comprises over 330,000 images, each annotated with 80 object categories. It is widely used to train and evaluate many state-of-the-art object detection and segmentation models. The dataset’s annotations are provided in JSON format for each single image.
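For example, the COCO annotations can be inspected with the standard pycocotools API (a sketch; the annotation file path is an assumption about where you downloaded the dataset):

```python
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # JSON annotation file for the 2017 validation split
print(len(coco.getImgIds()), "annotated images")
print(len(coco.getCatIds()), "object categories")   # 80 categories for the detection task
```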
Evaluation Metrics
Evaluation metrics are used to measure the quality of a model. When you build a model, it is crucial to measure how accurately it predicts the expected outcome. Different families of machine learning algorithms call for different evaluation metrics; for classification models, for example, we use classification metrics.
Evaluation metrics help you assess your model’s performance, monitor your ML system in production, and adjust the model to fit your business needs. It is important to use multiple evaluation metrics, because a model may score well on one metric yet poorly on another.
Mean Average Precision (mAP)
Mean Average Precision is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, and Mask R-CNN, among others. It is computed as the area under the precision–recall curve, over recall values from 0 to 1. A higher mean average precision indicates better accuracy.
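As a sketch of the underlying computation, the snippet below computes average precision for one class as the area under a (monotonically smoothed) precision–recall curve; mAP is then the mean of the per-class APs (COCO additionally averages over IoU thresholds from 0.5 to 0.95). The function name and inputs are illustrative, not part of any particular library.

```python
import numpy as np

def average_precision(recall, precision):
    # Pad the curve and enforce a monotonically decreasing precision envelope
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    # AP is the area under the precision-recall curve
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP = mean of average_precision over all classes (and, for COCO, over IoU thresholds)
```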
The original paper reported that the pre-trained models achieved the following mAP values on COCO.
Model | mAP (%) |
--- | --- |
YOLOX-Nano | 25.3 |
YOLOX-Tiny | 32.8 |
YOLOX-S | 39.6 |
YOLOX-M | 46.4 |
YOLOX-L | 50.0 |
YOLOX-X | 51.2 |
Using the training recipes for YOLOX-Nano and YOLOX-Tiny available in SuperGradients, Deci’s open-source computer vision library, you can reach higher mAP scores of 26.77% and 37.18%, respectively. If you’d like to use these recipes, refer to the instructions in the How to Use section below.
When selecting an architecture, there are several things you should carefully consider:
Clarifying these topics before you start training the model can save you a lot of time, effort, and money.
Below, see how to easily load and fine-tune a production-ready, pre-trained YOLOX model that incorporates best practices and validated hyperparameters for achieving best-in-class accuracy. For the sake of this example, we’ll use YOLOX-Tiny, but with SuperGradients, Deci’s open-source, all-in-one computer vision training library, you can also access additional pre-trained YOLOX models, including YOLOX-Nano, YOLOX-S, YOLOX-M, and YOLOX-L.
Define your dataset path and where you want your checkpoints to be saved, and you are good to go from your terminal.
First, ensure that the data is stored in the directory specified by dataset_params.data_dir, or add dataset_params.data_dir=<PATH-TO-DATASET> at the end of the command below. You can find instructions here.
Next, move to the project root (where you will find the README and the src folder).
Finally, run the command:
python -m super_gradients.train_from_recipe --config-name=coco2017_yolox architecture=yolox_t
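For example, with the dataset path override from the step above appended (and, if you also want to control where checkpoints are written, a ckpt_root_dir override, whose exact key name here is an assumption about the recipe's configuration), the command might look like:

```bash
python -m super_gradients.train_from_recipe --config-name=coco2017_yolox architecture=yolox_t dataset_params.data_dir=<PATH-TO-DATASET> ckpt_root_dir=<PATH-TO-CHECKPOINTS>
```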
Try a pre-trained YOLOX model on your machine. Import SuperGradients, initialize your Trainer, and load your desired YOLOX model and pre-trained weights.
# The pretrained_weights argument will load a pre-trained architecture on the provided dataset
from super_gradients.training import models

model = models.get("yolox_t", pretrained_weights="coco")

If you do not have SuperGradients installed yet, clone the repo and install it from source:

git clone https://github.com/Deci-AI/super-gradients.git
Production-ready means the models are compatible with deployment tools such as TensorRT (NVIDIA) and OpenVINO (Intel) and can easily be taken into production.
To export to ONNX, use the following:
# Load model with pretrained weights
import torch

from super_gradients.training import models
from super_gradients.common.object_names import Models

model = models.get(Models.YOLOX_T, pretrained_weights="coco")

# Prepare model for conversion
# Input size is in format of [Batch x Channels x Width x Height] where 640 is the standard COCO dataset dimension
model.eval()
model.prep_model_for_conversion(input_size=[1, 3, 640, 640])

# Create dummy_input
dummy_input = torch.randn(1, 3, 640, 640)

# Convert model to onnx
torch.onnx.export(model, dummy_input, "yolox_t.onnx")
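To sanity-check the exported file, you can load it back with ONNX Runtime (a sketch; assumes the onnxruntime package is installed):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolox_t.onnx")
input_name = session.get_inputs()[0].name

# Same [1, 3, 640, 640] input shape used for the export above
dummy = np.random.randn(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])  # raw network outputs; apply post-processing (NMS) for final boxes
```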
For more code examples, recipes, and advanced training techniques such as transfer learning, knowledge distillation, and more, refer to SuperGradients on GitHub.
Apache 2.0
Original paper license: Apache 2.0
Dataset used to train YOLOX: COCO 2017 dataset
SuperGradients documentation: https://deci-ai.github.io/super-gradients/welcome.html
We’d love your feedback on the information presented in this card. Please also share any unexpected results.
For a short meeting with the SuperGradients team, use this link and choose your preferred time.