Continuing the Momentum: SuperGradients Introduces Pose Estimation with DEKR

Introducing Pose Estimation with DEKR in SuperGradients

In May 2023, the Deci team unveiled YOLO-NAS, our open-source model, on SuperGradients – an open-source library designed to streamline the training of PyTorch-based models for many computer vision tasks.

Since then, the team has been overwhelmed and inspired by your response. SuperGradients has seen a 25x growth in downloads and received over 2,000 stars on GitHub, a testament to the power and utility of our library. The Deci team is not one to rest on its laurels, though. We’re committed to expanding and improving SuperGradients, ensuring it remains your go-to resource for computer vision tasks.

Today, we are excited to announce the latest enhancement – the addition of Pose Estimation, powered by a robust pretrained model, DEKR.



Pose estimation, the task of determining the position and orientation of objects (often people) in images or videos, has a wide range of applications, from sports analytics and video game development to healthcare and more. With the integration of pose estimation and DEKR, we’re widening the scope of SuperGradients, equipping you with the tools to tackle even more diverse and complex projects.

This blog post provides an introduction to pose estimation, explores its practical uses and then delves into the history, unique characteristics and capabilities of DEKR.

What is Pose Estimation?

Pose estimation is a special case of keypoint estimation, which is a task in computer vision that involves predicting the position and orientation of specific points in images or videos.

Pose estimation tracks fine-grained movements of objects or individuals. It accurately pinpoints specific keypoints, providing a more detailed picture than traditional object detection. Applications include human motion tracking for personal trainers, virtual coaches, and factory safety measures. It also enhances experiences in augmented reality, animation, gaming, and robotics.

Keypoints

Keypoints refer to specific points on an object or person that determine their orientation. In human pose estimation, keypoints usually correspond to joints like the elbows, knees, and fingertips. For objects, keypoints may be corners or unique features. Deep learning models such as Convolutional Neural Networks are used to identify and predict these keypoints. These models are trained on labeled datasets to recognize patterns and features that indicate the location of these keypoints in new images or videos. 
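To make the idea of labeled keypoints concrete, here is a minimal sketch of how one annotated person might be represented in code. It assumes the COCO keypoint convention (17 named body keypoints stored as a flat list of x, y, visibility triplets), which this post does not prescribe; `Keypoint` and `parse_keypoints` are hypothetical helper names used only for illustration.

```python
from typing import Dict, List, NamedTuple


class Keypoint(NamedTuple):
    x: float
    y: float
    visibility: int  # COCO convention: 0 = not labeled, 1 = labeled but occluded, 2 = visible


# The 17 COCO person keypoints, in annotation order.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def parse_keypoints(flat: List[float], names: List[str] = COCO_KEYPOINT_NAMES) -> Dict[str, Keypoint]:
    """Turn a flat [x1, y1, v1, x2, y2, v2, ...] list into named keypoints."""
    if len(flat) != 3 * len(names):
        raise ValueError(f"expected {3 * len(names)} values, got {len(flat)}")
    return {
        name: Keypoint(flat[3 * i], flat[3 * i + 1], int(flat[3 * i + 2]))
        for i, name in enumerate(names)
    }
```

A model trained on such annotations learns to output the same set of named points for new images.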

The accuracy of keypoint detection is crucial for successful pose estimation, and inaccurate predictions can significantly impact the performance of applications such as animation, gaming, and physical therapy.

Types of Pose Estimation

There are two main types of pose estimation: 2D and 3D pose estimation. In 2D pose estimation, the goal is to locate specific keypoints within a two-dimensional image or video frame by predicting their x and y coordinates. In 3D pose estimation, the goal is to determine the spatial coordinates of these keypoints in a three-dimensional space, providing a more comprehensive understanding of the pose including depth and orientation. 
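One way to see what 3D pose estimation adds over 2D is to project 3D keypoints onto the image plane. The sketch below assumes a simple pinhole camera with a hypothetical focal length and principal point (`focal_length`, `cx`, `cy`); real cameras also need distortion handling, which is left out here.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_to_2d(points_3d: List[Point3D], focal_length: float, cx: float, cy: float) -> List[Point2D]:
    """Project camera-space 3D keypoints (X, Y, Z) to pixel coordinates with a pinhole model."""
    projected = []
    for X, Y, Z in points_3d:
        if Z <= 0:
            raise ValueError("keypoint must be in front of the camera (Z > 0)")
        projected.append((focal_length * X / Z + cx, focal_length * Y / Z + cy))
    return projected


# Two wrists at different depths: the second is twice as far away and twice as far off-axis.
near_wrist = (0.2, 0.1, 2.0)
far_wrist = (0.4, 0.2, 4.0)
pixels = project_to_2d([near_wrist, far_wrist], focal_length=1000.0, cx=320.0, cy=240.0)
```

Both wrists land on the same pixel, so a 2D model cannot tell them apart, while a 3D model has to recover the missing depth.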

The choice between these two methods depends on the specific requirements of the task at hand, as each has unique applications and challenges.

The Pose Estimation Pipeline

To estimate the pose of an object or person, whether it’s 2D or 3D pose estimation, you need to follow a series of steps. 

  1. Preprocess image or video data: Resize the image or video frame, normalize the pixel values, and possibly augment the data to enhance model performance. 
  2. Detect the object or person in the image or video: Use object detection algorithms to identify the object or person whose pose will be estimated. 
  3. Predict the keypoints: Run the input data through a deep neural network to predict the location of the keypoints. 
  4. Post-processing: Refine the predicted keypoints for greater accuracy by smoothing over time or applying constraints based on known relationships between keypoints.
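As an illustration of the post-processing step, here is a minimal sketch of smoothing keypoints over time with an exponential moving average. This is one common choice rather than the method any particular library uses, and `smooth_poses` and `alpha` are hypothetical names.

```python
from typing import Iterable, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) pair per keypoint


def smooth_poses(frames: Iterable[Pose], alpha: float = 0.5) -> List[Pose]:
    """Exponentially smooth each keypoint over time.

    alpha close to 1 trusts the newest frame; alpha close to 0 smooths harder.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[Pose] = []
    for pose in frames:
        if not smoothed:
            smoothed.append(list(pose))  # the first frame has no history to blend with
            continue
        previous = smoothed[-1]
        smoothed.append([
            (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
            for (x, y), (px, py) in zip(pose, previous)
        ])
    return smoothed
```

Smoothing reduces frame-to-frame jitter at the cost of a small lag when the subject moves quickly, so `alpha` is worth tuning per application.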

Evaluation Metrics for Pose Estimation

So, you’ve trained a model…but how do you tell if you’ve got a good model? You’ll need to look at your model’s evaluation metrics to understand its performance better. Here are some standard metrics used for pose estimation[1]:

  1. Percentage of Correct Parts (PCP): This measure evaluates how well limbs are detected. A limb is correctly detected if the predicted joint locations are within half the limb length of the actual joint locations (PCP@0.5). However, this metric can disadvantage shorter limbs. To calculate the PCP, the number of correctly detected parts is divided by the total number of parts in the dataset.
  2. Percentage of Correct Key-points (PCK): This metric considers a detected joint as correct if the distance between the predicted and the actual joint is within a certain threshold. For example, PCKh@0.5 is when the threshold is 50% of the head bone link, and PCK@0.2 is when the distance between the predicted and actual joint is less than 20% of the torso diameter. PCK is used for both 2D and 3D pose estimation.
  3. Percentage of Detected Joints (PDJ): PDJ, like PCK, counts a joint as correct if the predicted joint’s distance from the actual joint is within a certain fraction of the torso diameter. This metric is commonly used for 2D pose estimation. Because the threshold depends on the torso rather than on each limb’s length, it avoids PCP’s bias against shorter limbs.
  4. Mean Per Joint Position Error (MPJPE): This metric measures the difference between each joint’s predicted and actual positions using the Euclidean distance. The average of these differences across all joints is then calculated. It is usually computed after adjusting the estimated and actual 3D poses to align with the root joint (typically the pelvis). Another version of this metric, Procrustes Analysis MPJPE (PA MPJPE), adjusts the estimated 3D pose to match the ground truth using the Procrustes method, a type of similarity transformation, before calculating the MPJPE.
  5. Object Keypoint Similarity (OKS): OKS is similar to the Intersection over Union (IoU) metric used in object detection but is adapted for keypoint detection. OKS measures the similarity between predicted and ground truth keypoints and is calculated based on Euclidean distance and a scale factor related to the object size. The resulting value ranges from 0 to 1, with 1 indicating a perfect match and values near 0 indicating predictions far from the ground truth. The Average Precision (AP) score can be calculated based on OKS values, providing a single score summarizing the model’s performance. This score is determined by averaging precision values at different OKS thresholds, usually ranging from 0.5 to 0.95.
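To make OKS less abstract, here is a minimal sketch of the calculation in the style of the COCO keypoint evaluation: each labeled keypoint contributes a Gaussian of its distance error, scaled by the object area and a per-keypoint constant. The function name and the `sigmas` values you pass in are assumptions for illustration; COCO publishes its own per-keypoint constants.

```python
import math
from typing import Sequence, Tuple


def object_keypoint_similarity(
    predicted: Sequence[Tuple[float, float]],
    ground_truth: Sequence[Tuple[float, float]],
    visibility: Sequence[int],
    sigmas: Sequence[float],
    area: float,
) -> float:
    """OKS: a Gaussian of the per-keypoint distance, averaged over labeled keypoints."""
    total, labeled = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(predicted, ground_truth, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not count
        squared_distance = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma  # per-keypoint falloff constant
        total += math.exp(-squared_distance / (2.0 * area * k ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0
```

Because the error is divided by the object area, the same pixel error costs more on a small, distant person than on a large, close one.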

Higher values are better for PCP, PCK, PDJ, and OKS-based AP, indicating a higher rate of correct detections. For MPJPE, lower values are better, meaning smaller errors between the predicted and actual joint positions.
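The difference in direction is easy to see in code. The following sketch computes a PCK-style score and a plain MPJPE for the same prediction; it is a simplified illustration that assumes a single person, takes the reference length (head or torso size) as an input, and skips the root-joint alignment usually applied before MPJPE.

```python
import math
from typing import Sequence


def pck(predicted: Sequence[Sequence[float]], ground_truth: Sequence[Sequence[float]],
        reference_length: float, threshold: float = 0.5) -> float:
    """Fraction of keypoints whose error is within threshold * reference_length."""
    correct = sum(
        math.dist(p, g) <= threshold * reference_length
        for p, g in zip(predicted, ground_truth)
    )
    return correct / len(ground_truth)


def mpjpe(predicted: Sequence[Sequence[float]], ground_truth: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance between predicted and actual joints (2D or 3D)."""
    return sum(math.dist(p, g) for p, g in zip(predicted, ground_truth)) / len(ground_truth)
```

A better model pushes `pck` toward 1.0 and `mpjpe` toward 0.0. Note that one badly misplaced joint barely moves PCK but can dominate MPJPE.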

These metrics comprehensively evaluate a pose estimation model’s performance, helping identify improvement areas.

Paradigms in Pose Estimation

Pose estimation models are typically built using one of two paradigms:

  1. Top-down methods: Identify the object first and then locate the keypoints within the identified figure. If you are working with scenes with few people or need to identify each person’s keypoints accurately, it is best to use this approach. It provides better accuracy as the pose estimation model focuses on one person at a time. The downside to these methods is they may require more computing power because the pose estimation process needs to be run separately for each person detected.
  2. Bottom-up methods: Each keypoint of the object is estimated first, then combined to form the complete pose. This approach requires a single pass of the pose estimation model, regardless of the number of people in the scene, making it more efficient. It’s a good fit for scenarios with dense crowds where individual identification might be less critical. The downside is that these methods can struggle with complex settings where people overlap, or limbs are occluded.

The DEKR model, which we’ve included in the SuperGradients model zoo, follows the bottom-up paradigm.
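The cost difference between the two paradigms comes down to control flow, which the sketch below tries to show. The detector, pose models, and grouping step are stand-in functions invented for this example (they only count how often the pose network runs) and do not correspond to any real model.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2)
Pose = List[Tuple[float, float]]  # one (x, y) per keypoint


def top_down(image, detect_people, estimate_single_pose) -> List[Pose]:
    """Detect people first, then run the pose model once per detected person."""
    return [estimate_single_pose(image, box) for box in detect_people(image)]


def bottom_up(image, estimate_all_keypoints, group_keypoints) -> List[Pose]:
    """Run the pose model once over the whole image, then group keypoints into people."""
    return group_keypoints(estimate_all_keypoints(image))


# Stand-in models that only count how often the pose network runs.
pose_model_runs = {"top_down": 0, "bottom_up": 0}


def fake_detector(image) -> List[Box]:
    return [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]  # three people


def fake_single_pose(image, box: Box) -> Pose:
    pose_model_runs["top_down"] += 1
    return [(float(box[0]), float(box[1]))]


def fake_all_keypoints(image) -> List[Tuple[float, float]]:
    pose_model_runs["bottom_up"] += 1
    return [(0.0, 0.0), (20.0, 0.0), (40.0, 0.0)]


def fake_grouping(keypoints) -> List[Pose]:
    return [[kp] for kp in keypoints]  # trivially one keypoint per person


top_down_poses = top_down(None, fake_detector, fake_single_pose)
bottom_up_poses = bottom_up(None, fake_all_keypoints, fake_grouping)
```

With three people in the scene, the top-down path runs the pose model three times and the bottom-up path runs it once, which is why bottom-up cost stays flat as crowds grow.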

Introducing DEKR 

TL;DR

DEKR stands out from previous pose estimation techniques by focusing on keypoint regions, using adaptive convolutions to learn from these regions, and employing a multi-branch structure for separate regression. These features allow it to achieve more accurate keypoint regression and superior performance in empirical tests. DEKR’s unique approach of learning disentangled representations for each keypoint region results in accurate pose estimation.

In 2021, a fresh player in the game of human pose estimation was born: Disentangled Keypoint Regression, or DEKR. DEKR is the brainchild of Zigang Geng and Ke Sun from the University of Science and Technology of China and Microsoft. They developed it while interning at Microsoft Research in Beijing. They were on a mission to amp up the current pose estimation methods. They set their sights on the dense keypoint regression framework, which was an underperformer compared to the keypoint detection and grouping framework.

DEKR’s main goal? To nail the estimation of human poses from images.

This isn’t just a fun exercise—it’s vital for applications like action recognition, human-computer interaction, smart photo editing, and even pedestrian tracking. The brains behind DEKR were keen on buffing up the bottom-up paradigm of pose estimation. This approach is slicker than the top-down method, which first spots the person and then estimates each person’s pose.

What makes DEKR a real game-changer for pose estimation is its unique way of learning representations. 

DEKR goes beyond the old-school methods. It uses adaptive convolutions and pixel-wise spatial transformers to light up pixels in keypoint regions and learn from them. This lets the system zero in on the parts of an image that matter when figuring out a pose.

DEKR doesn’t stop there. 

The DEKR method employs a multi-branch structure to perform separate regressions. 

Each branch has a specific task: learning a representation through dedicated adaptive convolutions and regressing a single keypoint. Adaptive convolution is a type of convolution operation that adjusts its parameters based on the input data, unlike standard convolutions that use fixed parameters. In the DEKR method, adaptive convolutions activate pixels in keypoint regions and learn their representations. 

This allows the system to concentrate on the most relevant areas of an image to determine a pose.

The result? 

A set of disentangled representations that can each focus on their specific keypoint regions. This makes the keypoint regression more precise regarding location, boosting the overall effectiveness of the pose estimation.
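The multi-branch idea can be sketched for a single pixel as follows. This is a deliberately simplified illustration, not DEKR's implementation: a plain linear map stands in for each branch's adaptive convolutions, and the formulation of each branch predicting a (dx, dy) offset from the pixel to its keypoint is our reading of dense keypoint regression rather than something spelled out in this post. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def regress_keypoints(
    pixel: Tuple[float, float],
    features: Sequence[float],
    branch_weights: Sequence[Tuple[Sequence[float], Sequence[float]]],
) -> List[Tuple[float, float]]:
    """Regress K keypoints for one pixel, one branch per keypoint.

    The feature vector is split into K equal chunks; branch k only sees chunk k
    and maps it to a (dx, dy) offset from the pixel to keypoint k.
    """
    num_branches = len(branch_weights)
    if num_branches == 0 or len(features) % num_branches:
        raise ValueError("feature length must be divisible by the number of branches")
    chunk = len(features) // num_branches
    keypoints = []
    for k, (wx, wy) in enumerate(branch_weights):
        branch_features = features[k * chunk:(k + 1) * chunk]  # this branch's private slice
        dx = sum(w * f for w, f in zip(wx, branch_features))
        dy = sum(w * f for w, f in zip(wy, branch_features))
        keypoints.append((pixel[0] + dx, pixel[1] + dy))
    return keypoints
```

Because each branch reads only its own slice of the features, changing the features behind one keypoint leaves the other keypoints' predictions untouched, which is the sense in which the representations are disentangled.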

When put to the test, DEKR really shines. 

It outperforms other keypoint detection and grouping methods in empirical tests, scoring top marks in bottom-up pose estimation on two benchmark datasets, COCO and CrowdPose. The secret to DEKR’s success is its unique approach: it disentangles the representations for different keypoints, letting each representation concentrate on the corresponding keypoint region.

To wrap up, DEKR offers a more efficient and accurate way to interpret human poses from images. With its unique approach to learning representations and its multi-branch structure for separate regression, it is a practical and effective solution for many applications.

Using DEKR with SuperGradients

Using Deci’s open-source SuperGradients library, you can easily train a DEKR model from scratch or fine-tune a pretrained model to fit your specific needs using an optimized recipe that incorporates best practices and validated hyper-parameters for superior performance.
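As a starting point, loading a pretrained DEKR model might look like the sketch below. It assumes SuperGradients is installed (`pip install super-gradients`) and that the model name `"dekr_w32_no_dc"` and weights key `"coco_pose"` match your installed version; treat both strings as assumptions and check the model zoo listing if the call fails. `load_pretrained_dekr` is a hypothetical wrapper name.

```python
def load_pretrained_dekr():
    """Load a COCO-pretrained DEKR model from the SuperGradients model zoo.

    The model name and weights key are assumptions to verify against the
    model zoo of your installed SuperGradients version.
    """
    # Imported lazily because super_gradients is a heavy dependency.
    from super_gradients.training import models

    return models.get("dekr_w32_no_dc", pretrained_weights="coco_pose")


# Usage (downloads the pretrained weights on first call):
#   model = load_pretrained_dekr()
```

From there, the model can be fine-tuned on your own keypoint dataset with the training recipe.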

Find out how to train or fine-tune a DEKR model in this notebook.

Conclusion

With the addition of pose estimation and the integration of the DEKR model, SuperGradients is now an even more comprehensive tool for practitioners looking to tackle a wider range of computer vision tasks.

Over the course of this blog, we’ve given an overview of pose estimation, its applications, evaluation metrics, and paradigms. We also took a closer look at DEKR, exploring its unique characteristics and the reasons why it outperforms previous models.

As we strive to continuously improve SuperGradients, our main goal is to make these advanced tools accessible and straightforward for all computer vision practitioners, enabling you to carry out your projects more efficiently and effectively.

If you’re interested in trying your hand at training a pose estimation model with SuperGradients, we invite you to check out this notebook. As always, the Deci team is here to assist and provide the tools you need to reach your goals.

We would greatly appreciate your feedback after you’ve experienced SuperGradients for pose estimation. Your insights can help shape the future of SuperGradients and can be invaluable in our quest to continually enhance our offering.

Happy training!
