Description
PP-LiteSeg is a semantic segmentation model pre-trained on Cityscapes and CamVid.
Publishers
Juncai Peng, Yi Liu, Shiyu Tang, Yuying Hao, Lutao Chu, Guowei Chen, Zewu Wu, Zeyu Chen, Zhiliang Yu, Yuning Du, Qingqing Dang, Baohua Lai, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma, in the paper, “PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model”
Submitted Version
April 6, 2022
Latest Version
N/A
Size
N/A
The authors introduced PP-LiteSeg, a real-time semantic segmentation model, to address the limitations of existing methods. The lightweight model uses an encoder-decoder architecture that incorporates three novel modules: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM).
PP-LiteSeg uses a lightweight network as the encoder; the team chose STDCNet to extract hierarchical features from the input image. STDCNet is composed of five stages, each with a stride of 2, so the final feature map is 1/32 the size of the original image. SPPM is then applied to capture long-range dependencies, and the FLD fuses the multi-level features. Finally, a segmented image is produced with a predicted label for each pixel.
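To make the data flow concrete, below is a minimal, illustrative PyTorch sketch of this encoder-decoder wiring. The encoder, SPPM, and UAFM modules here are hypothetical placeholders, not the authors' implementation; the sketch only mirrors the sequence of operations described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PPLiteSegSketch(nn.Module):
    # Illustrative skeleton only: the encoder, SPPM, and UAFM modules are placeholder callables.
    def __init__(self, encoder, sppm, uafm_blocks, seg_head):
        super().__init__()
        self.encoder = encoder                          # e.g., an STDCNet backbone returning 1/8, 1/16, 1/32 features
        self.sppm = sppm                                # Simple Pyramid Pooling Module applied to the 1/32 feature
        self.uafm_blocks = nn.ModuleList(uafm_blocks)   # Unified Attention Fusion Modules used inside the FLD
        self.seg_head = seg_head                        # e.g., a 1x1 conv mapping fused channels to num_classes

    def forward(self, x):
        feat8, feat16, feat32 = self.encoder(x)         # hierarchical encoder features
        y = self.sppm(feat32)                           # inject global context
        for uafm, skip in zip(self.uafm_blocks, (feat16, feat8)):
            # Upsample the decoder feature to the skip resolution, then fuse with attention weighting.
            y = F.interpolate(y, size=skip.shape[2:], mode="bilinear", align_corners=False)
            y = uafm(y, skip)
        logits = self.seg_head(y)                       # per-class scores at 1/8 resolution
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear", align_corners=False)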
The team used different training settings for each dataset. For Cityscapes, the training configuration includes a batch size of 16, a maximum of 160,000 iterations, an initial learning rate of 0.005, and a weight decay of 5e-4 in the optimizer. For CamVid, the training settings are a batch size of 24, a maximum of 1,000 iterations, an initial learning rate of 0.01, and a weight decay of 1e-4.
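As a rough illustration of the Cityscapes settings listed above, the snippet below builds a PyTorch optimizer with the stated learning rate and weight decay and decays it over the stated number of iterations. Only those numbers come from the text; the choice of SGD with momentum 0.9 and a poly-style schedule is an assumption for illustration.

import torch

# Hyperparameters quoted above for Cityscapes; the SGD/poly choices below are illustrative assumptions.
BATCH_SIZE = 16
MAX_ITERS = 160_000
BASE_LR = 0.005
WEIGHT_DECAY = 5e-4

model = torch.nn.Conv2d(3, 19, kernel_size=1)   # stand-in for a PP-LiteSeg instance
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR, momentum=0.9, weight_decay=WEIGHT_DECAY)

# Poly-style learning-rate decay: lr = base_lr * (1 - iter / max_iters) ** 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / MAX_ITERS) ** 0.9
)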
Expected Input
The input to PP-LiteSeg is an RGB image. The image resolution may vary based on the specific dataset; in the paper, the team used dataset-specific input resolutions for training and inference.
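A minimal example of preparing an image as an NCHW float tensor is shown below. The file path is hypothetical, and the ImageNet normalization statistics are an assumption; the exact preprocessing depends on the training recipe used.

import numpy as np
import torch
from PIL import Image

# Hypothetical example image; the resolution depends on the dataset and recipe.
image = Image.open("street_scene.png").convert("RGB")
x = torch.from_numpy(np.asarray(image)).float().permute(2, 0, 1) / 255.0   # HWC uint8 -> CHW float in [0, 1]

# ImageNet mean/std used here as an assumption; check the recipe for the exact values.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
x = ((x - mean) / std).unsqueeze(0)   # add batch dimension -> [1, 3, H, W]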
Expected Output
PP-LiteSeg produces an output of the same size as the input image, with each pixel assigned a label indicating the object or region it belongs to. The SPPM first produces a feature containing global context, which is fed into the FLD to incorporate multi-level features; the UAFMs output fused features at a downsample ratio of 1/8. The number of channels in this 1/8 feature is then reduced to match the number of classes, and the result is upsampled to the input resolution to yield per-pixel predictions.
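In code, turning the per-class scores into a label map of the input size amounts to upsampling and taking an argmax over the class dimension, as in this sketch (the tensor shapes are illustrative):

import torch
import torch.nn.functional as F

# logits: [batch, num_classes, H/8, W/8] scores from the segmentation head (illustrative shapes)
logits = torch.randn(1, 19, 128, 256)
input_size = (1024, 2048)   # original image height and width

# Upsample the low-resolution scores back to the input resolution, then pick the best class per pixel.
upsampled = F.interpolate(logits, size=input_size, mode="bilinear", align_corners=False)
label_map = upsampled.argmax(dim=1)   # [batch, H, W] tensor of per-pixel class indices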
Semantic segmentation plays a vital role in a wide range of applications, including but not limited to autonomous driving, robot sensing, and video surveillance. It enables machines to accurately identify and classify objects within an image at the pixel level.
Despite notable advancements in this field, many existing models face limitations when it comes to achieving real-time segmentation with optimal performance. Some models demand substantial computational resources, resulting in compromised inference speeds, which makes them unsuitable for real-time applications. Others are unable to achieve a desirable balance between speed and accuracy.
The proposed PP-LiteSeg design seeks to tackle these issues. The model has already seen real-world application in road segmentation.
The authors trained PP-LiteSeg on two datasets: Cityscapes and CamVid.
Cityscapes is a dataset specifically designed for urban segmentation tasks. It consists of a substantial collection of 5,000 finely annotated images, which are divided into training, validation, and testing sets with 2,975, 500, and 1,525 images, respectively. The images in the dataset have a resolution of 2048 × 1024, which is challenging for real-time semantic segmentation. The annotated images have a total of 30 classes, but the authors only used 19 classes for a fair comparison with other methods.
Meanwhile, CamVid is a dataset used for road scene segmentation. It comprises 701 images with high-quality pixel-level annotations, with 367 images allocated for training, 101 images for validation, and 233 images for testing. All images in the dataset have a resolution of 960 × 720. The annotated images encompass a total of 32 categories, but the team only used a subset of 11 categories for experimentation.
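Restricting training to a subset of the annotated classes is usually done by remapping the unused label IDs to an ignore index. The sketch below shows the general idea with a hypothetical ID table; the actual Cityscapes and CamVid mappings are defined by the datasets themselves.

import numpy as np

IGNORE_INDEX = 255   # common convention for pixels excluded from the loss

# Hypothetical subset: maps a few raw dataset label IDs to contiguous training IDs.
id_to_train_id = {7: 0, 8: 1, 11: 2, 12: 3}

def remap_labels(label_mask: np.ndarray) -> np.ndarray:
    # Start with everything ignored, then fill in the selected classes.
    out = np.full_like(label_mask, IGNORE_INDEX)
    for raw_id, train_id in id_to_train_id.items():
        out[label_mask == raw_id] = train_id
    return out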
Evaluation Metrics and Results
Evaluation metrics are used to measure the quality of a model and how accurately it predicts the expected outcome. Different families of machine learning algorithms call for different evaluation metrics. Metrics help assess a model's performance, monitor machine learning systems in production, and tune models to fit a given business need.
It is crucial to use multiple evaluation metrics, as a model may perform well on one metric but poorly on another.
Accuracy
mIoU, or mean Intersection over Union, is a common metric used to evaluate the performance of image segmentation models. It measures the overlap between the predicted segmentation and the ground truth segmentation, by calculating the ratio of the intersection between the two to their union for each class. The mIoU score is then calculated as the average of these ratios across all classes. A higher mIoU score indicates better segmentation accuracy.
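For reference, here is a minimal per-class IoU / mIoU computation from predicted and ground-truth label maps. It is a simplified sketch that omits the usual ignore-index handling.

import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    # Compute IoU per class, then average over classes present in the prediction or ground truth.
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(intersection / union)
    return float(np.mean(ious)) if ious else 0.0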
Throughput (FPS)
FPS, or Frames Per Second, is a measure of the performance of a video or computer vision system, indicating how many frames or images can be processed or displayed per second. In computer vision applications, such as object detection or semantic segmentation, the FPS rate indicates the speed of the system in analyzing and processing incoming frames or images. A higher FPS rate generally means that the system can process more data in a given time and is therefore more responsive and efficient.
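FPS is typically measured by timing repeated forward passes after a warm-up. A rough, device-agnostic sketch (the input size is illustrative):

import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 1024, 2048), warmup=10, iters=100):
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):            # warm-up iterations are excluded from timing
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()       # ensure queued GPU work is finished before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)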
The paper reported the following results on the Cityscapes test set, measured on an NVIDIA GTX 1080Ti: PP-LiteSeg-T achieves 72.0% mIoU at 273.6 FPS, while PP-LiteSeg-B achieves 77.5% mIoU at 102.6 FPS. In addition, on the CamVid test set, PP-LiteSeg-B bests other methods with 75.0% mIoU at 154.8 FPS.
| Model | Paper mIoU (Val) | Paper mIoU (Test) | SG mIoU |
|---|---|---|---|
| PP-LiteSeg-T1 | 73.1% | 72.0% | 74.92% |
| PP-LiteSeg-B1 | 75.3% | 73.9% | 76.48% |
| PP-LiteSeg-T2 | 76.0% | 74.9% | 77.56% |
| PP-LiteSeg-B2 | 78.2% | 77.5% | 78.52% |
The last column shows how the values compare against SuperGradients' (SG) results. SuperGradients is an open-source training library for easily training or fine-tuning SOTA computer vision models.
When selecting an architecture, there are several things you should carefully consider, such as your accuracy and latency requirements and the hardware you plan to deploy on.
Having clarity on these topics before you start training the model can save you a lot of time, effort, and money.
The graph compares state-of-the-art models such as PP-LiteSeg, FC-HarDNet, STDC, DDRNet, and PIDNet against the AutoNAC-generated DeciSeg models in terms of both accuracy and latency.
The performance metrics reported are accuracy (mIoU) and latency. All the models presented were trained on the Cityscapes dataset and compiled to FP16 with NVIDIA TensorRT on the NVIDIA Jetson Xavier NX device.
You can use the baseline model for semantic segmentation. Below, see how you can easily load and fine-tune production-ready, pre-trained PP-LiteSeg models that incorporate best practices and validated hyperparameters for achieving best-in-class accuracy.
Define your dataset path and where you want your checkpoints to be saved, and you are good to go from your terminal. Just make sure that you set up your dataset according to the data directory specified in the recipe.
python -m super_gradients.train_from_recipe --config-name=cityscapes_pplite_seg50 \
  architecture=pp_lite_b_seg \
  checkpoint_params.checkpoint_path= \
  dataset_params.train_dataset_params.root_dir= \
  dataset_params.val_dataset_params.root_dir= \
  ckpt_root_dir=
Try the pre-trained PP-LiteSeg B50 model on your machine. Import SuperGradients and load your desired PP-LiteSeg B50 model with pre-trained weights.
# The pretrained_weights argument will load a pre-trained architecture on the provided dataset
import super_gradients
from super_gradients.training import models

model = models.get("pp_lite_b_seg50", pretrained_weights="cityscapes")
Production-ready models mean they are compatible with deployment tools such as TensorRT (NVIDIA) and OpenVINO (Intel) and can be easily taken into production.
# Load model with pretrained weights
import torch
from super_gradients.training import models

model = models.get("pp_lite_b_seg50", pretrained_weights="cityscapes")

# Prepare model for conversion
# Input size is in format of [Batch x Channels x Height x Width]
model.eval()
model.prep_model_for_conversion(input_size=[1, 3, 1024, 2048])

# Create dummy input
dummy_input = torch.randn(1, 3, 1024, 2048)

# Convert model to ONNX
torch.onnx.export(model, dummy_input, "pp_lite_seg_b.onnx")
For more code examples, recipes, and advanced training techniques such as transfer learning, knowledge distillation, and more, refer to SuperGradients on GitHub.
Apache 2.0
We’d love your feedback on the information presented in this card. Please also share any unexpected results.
For a short meeting with the SuperGradients team, use this link and choose your preferred time.