PP-LiteSeg is a semantic segmentation model pre-trained on Cityscapes and CamVid.
Juncai Peng, Yi Liu, Shiyu Tang, Yuying Hao, Lutao Chu, Guowei Chen, Zewu Wu, Zeyu Chen, Zhiliang Yu, Yuning Du, Qingqing Dang, Baohua Lai, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma, in the paper, “PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model”
April 6, 2022
The authors introduced the PP-LiteSeg model for real-time semantic segmentation to address the limitations of existing methods. The lightweight model uses a modified encoder-decoder architecture that incorporates three novel modules: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM).
PP-LiteSeg uses a lightweight network as the encoder, with the team choosing STDCNet to extract hierarchical features from the input image. STDCNet is composed of five stages, each downsampling by a stride of 2, so the final feature map is 1/32 the size of the original image. The model then applies the SPPM to capture long-range dependencies, and the FLD fuses multi-level features. Finally, a segmented image is generated with a predicted label for each pixel.
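The pooling-and-fuse idea behind the SPPM can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the 1x1 and 3x3 convolutions that follow each pooling branch are omitted, the bin sizes (1, 2, 4) are typical pyramid-pooling choices rather than values quoted from the text, and nearest-neighbour upsampling stands in for whatever interpolation the authors used.

```python
import numpy as np

def avg_pool_to(feat, bins):
    """Adaptive average pooling of a (C, H, W) feature map down to (C, bins, bins)."""
    c, h, w = feat.shape
    out = np.zeros((c, bins, bins))
    hs, ws = h // bins, w // bins
    for i in range(bins):
        for j in range(bins):
            out[:, i, j] = feat[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws].mean(axis=(1, 2))
    return out

def upsample_nearest(feat, h, w):
    """Nearest-neighbour upsampling of (C, h0, w0) back to (C, h, w)."""
    c, h0, w0 = feat.shape
    rows = np.arange(h) * h0 // h
    cols = np.arange(w) * w0 // w
    return feat[:, rows][:, :, cols]

def sppm_sketch(feat, bin_sizes=(1, 2, 4)):
    """Pool the feature at several scales, upsample each branch back,
    and sum them, injecting global context into every location."""
    c, h, w = feat.shape
    out = np.zeros_like(feat)
    for b in bin_sizes:
        out += upsample_nearest(avg_pool_to(feat, b), h, w)
    return out
```

The key design point the sketch preserves is that the coarse pooled branches let every output position see a summary of the whole feature map, which is how the module handles long-range dependencies cheaply.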
The team used different training settings for each dataset. For Cityscapes, the training configuration includes a batch size of 16, a maximum of 160,000 iterations, an initial learning rate of 0.005, and a weight decay of 5e-4 in the optimizer. For CamVid, the settings are a batch size of 24, a maximum of 1,000 iterations, an initial learning rate of 0.01, and a weight decay of 1e-4.
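For reference, the hyperparameters above can be collected into plain Python dicts. The values are exactly those reported in the paper; the key names are our own choice and not part of any official API.

```python
# Training hyperparameters from the PP-LiteSeg paper, one entry per dataset.
# Key names are illustrative, not taken from the authors' code.
train_config = {
    "cityscapes": {
        "batch_size": 16,
        "max_iterations": 160_000,
        "initial_lr": 0.005,
        "weight_decay": 5e-4,
    },
    "camvid": {
        "batch_size": 24,
        "max_iterations": 1_000,
        "initial_lr": 0.01,
        "weight_decay": 1e-4,
    },
}
```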
The input to PP-LiteSeg is an image whose resolution depends on the dataset; the paper specifies separate input settings for Cityscapes and CamVid.
PP-LiteSeg produces an output of the same size as the input image, with each pixel assigned a label indicating the object or region it belongs to. The SPPM output, which carries global context information, is fed into the FLD, where UAFM modules fuse multi-level features; the fused feature has a downsampling ratio of 1/8 relative to the input. The number of channels in this 1/8 feature is then reduced to the number of classes, and the result is upsampled to the input resolution to produce the per-pixel labels.
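The final prediction step described above can be sketched as follows. This is a simplified stand-in, not the authors' code: the 1x1 projection to class channels is written as a channel-wise matrix product, and nearest-neighbour repetition replaces the interpolation used in practice. The `class_proj` weight matrix is a hypothetical placeholder for a learned layer.

```python
import numpy as np

def seg_head_sketch(feat, class_proj, scale=8):
    """Project a (C, H/8, W/8) feature to num_classes channels, upsample
    by `scale` back to input resolution, and take a per-pixel argmax.

    feat:       (C, h, w) fused feature at 1/8 input resolution
    class_proj: (num_classes, C) weights of a 1x1 projection (hypothetical)
    returns:    (h*scale, w*scale) integer label map
    """
    # 1x1 convolution is equivalent to a per-pixel linear projection over channels
    logits = np.einsum("kc,chw->khw", class_proj, feat)
    # nearest-neighbour upsampling back to the input resolution
    logits = logits.repeat(scale, axis=1).repeat(scale, axis=2)
    # each pixel gets the class with the highest score
    return logits.argmax(axis=0)
```

The channel reduction happens at 1/8 resolution deliberately: projecting to the (usually small) number of classes before upsampling keeps the most expensive full-resolution tensor cheap.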
Semantic segmentation plays a vital role in a wide range of applications, including but not limited to autonomous driving, robot sensing, and video surveillance. It enables machines to accurately identify and classify objects within an image at the pixel level.
Despite notable advancements in this field, many existing models face limitations when it comes to achieving real-time segmentation with optimal performance. Some models demand substantial computational resources, resulting in compromised inference speeds, which makes them unsuitable for real-time applications. Others are unable to achieve a desirable balance between speed and accuracy.
The proposed PP-LiteSeg design seeks to tackle these issues. The model has already seen real-world application in road segmentation.
The snippet below shows the general pattern for loading a pre-trained image model with the Hugging Face transformers library (ResNet-50 is used here as the example checkpoint):

```python
from transformers import AutoFeatureExtractor, AutoModelForImageClassification

# Load the preprocessing pipeline and model weights for the chosen checkpoint
extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModelForImageClassification.from_pretrained("microsoft/resnet-50")
```