PP-LiteSeg is a semantic segmentation model pre-trained on Cityscapes and CamVid.

The model was introduced by Juncai Peng, Yi Liu, Shiyu Tang, Yuying Hao, Lutao Chu, Guowei Chen, Zewu Wu, Zeyu Chen, Zhiliang Yu, Yuning Du, Qingqing Dang, Baohua Lai, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma in the paper “PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model.”

Submitted Version
April 6, 2022



Model Highlights

  • Task: Semantic Segmentation
  • Model type: Convolutional Neural Network
  • Framework: PaddlePaddle
  • Dataset: Cityscapes and CamVid

Model Size and Parameters

The authors introduced the PP-LiteSeg model for real-time semantic segmentation to address the limitations of existing methods. The lightweight model uses a modified encoder-decoder architecture that incorporates three novel modules: the Flexible and Lightweight Decoder (FLD), the Unified Attention Fusion Module (UAFM), and the Simple Pyramid Pooling Module (SPPM).

PP-LiteSeg uses a lightweight network as the encoder, with the team choosing STDCNet to extract hierarchical features from the input image. STDCNet is composed of 5 stages, each of which downsamples by a stride of 2, so the final feature map is 1/32 the size of the original image. The model then applies SPPM to aggregate global context and capture long-range dependencies, and FLD fuses the multi-level features during decoding. The final output is a segmented image with a predicted label for each pixel.
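The downsampling arithmetic above can be sketched in a few lines. This is an illustrative calculation of the feature-map sizes produced by a 5-stage, stride-2 encoder, not the actual STDCNet implementation:

```python
# Illustrative sketch (not the real STDCNet): each of the 5 encoder
# stages halves the spatial resolution, so the final feature map is
# 1/32 of the input image.

def stage_resolutions(height, width, num_stages=5, stride=2):
    """Return the (H, W) feature-map size after each encoder stage."""
    sizes = []
    for _ in range(num_stages):
        height, width = height // stride, width // stride
        sizes.append((height, width))
    return sizes

# For a 512 x 1024 Cityscapes crop:
print(stage_resolutions(512, 1024))
# -> [(256, 512), (128, 256), (64, 128), (32, 64), (16, 32)]
# The last stage, (16, 32), is 1/32 of the input in each dimension.
```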

The team used different training settings for each dataset. For Cityscapes, the training configuration includes a batch size of 16, a maximum of 160,000 iterations, an initial learning rate of 0.005, and a weight decay of 5e−4 in the optimizer. For CamVid, the training settings are a batch size of 24, a maximum of 1,000 iterations, an initial learning rate of 0.01, and a weight decay of 1e−4.
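For reference, the hyperparameters reported in the paper can be collected as plain dictionaries (the actual PaddleSeg training configs are YAML files; this is only a summary of the values above):

```python
# Training hyperparameters reported in the paper, gathered for reference.
TRAIN_CONFIGS = {
    "cityscapes": {
        "batch_size": 16,
        "max_iterations": 160_000,
        "initial_lr": 0.005,
        "weight_decay": 5e-4,
    },
    "camvid": {
        "batch_size": 24,
        "max_iterations": 1_000,
        "initial_lr": 0.01,
        "weight_decay": 1e-4,
    },
}
```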

Figure: How the PP-LiteSeg model accomplishes real-time semantic segmentation using the novel encoder-decoder structure with the SPPM, FLD, and UAFM modules.

Expected Input

The input to PP-LiteSeg is an image. The image resolution may vary based on the specific dataset. In the paper, the team used the following input settings for each dataset utilized:

  • For Cityscapes, the cropped resolution is 1024 × 512. The team also evaluated PP-LiteSeg-T and PP-LiteSeg-B at 512 × 1024 and 768 × 1536, respectively.
  • For CamVid, the cropped resolution is 960 × 720.


Expected Output

PP-LiteSeg produces an output image of the same size as the input image, with each pixel assigned a label indicating the object or region it belongs to. The SPPM produces a feature map that encodes global context, which is fed into the FLD. Within the decoder, the UAFMs fuse multi-level features, and the final fused feature has 1/8 the spatial resolution of the input. The channel count of this 1/8-resolution feature is then reduced to match the number of classes, and the prediction is upsampled back to the input resolution.
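The shape bookkeeping in the decoding steps above can be traced with a short sketch. This is an assumption-laden illustration of the paper's description, not the real implementation; the fused feature's channel count (64 here) is a placeholder:

```python
# Sketch of the decoder's output shapes (illustrative; the 64-channel
# fused feature is an assumed placeholder, not the paper's value).

def output_shape(input_h, input_w, num_classes=19):
    """Trace (C, H, W) through the final decoding steps."""
    fused = (64, input_h // 8, input_w // 8)    # fused feature at 1/8 scale
    logits = (num_classes, fused[1], fused[2])  # channels -> num classes
    upsampled = (num_classes, input_h, input_w) # x8 upsample to input size
    return fused, logits, upsampled

fused, logits, out = output_shape(512, 1024)
# out == (19, 512, 1024): one score map per Cityscapes class at input size
```

Taking the argmax over the class dimension of the final tensor yields the per-pixel label map at the original resolution.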

History and Applications

Semantic segmentation plays a vital role in a wide range of applications, including but not limited to autonomous driving, robot sensing, and video surveillance. It enables machines to accurately identify and classify objects within an image at the pixel level.

Despite notable advancements in this field, many existing models face limitations when it comes to achieving real-time segmentation with optimal performance. Some models demand substantial computational resources, resulting in compromised inference speeds, which makes them unsuitable for real-time applications. Others are unable to achieve a desirable balance between speed and accuracy.

The proposed PP-LiteSeg design seeks to tackle these issues. The model has already seen real-world application in road segmentation.
