YOLO-NAS Pose is a pose estimation model trained on the COCO2017 dataset.

Deci AI Team

Submitted Version
November 7, 2023


Pose Estimation



YOLO-NAS Pose is a pose estimation model trained on the COCO2017 dataset. Generated by Deci’s proprietary Neural Architecture Search (NAS) engine, AutoNAC, and trained with cutting-edge methodologies, it offers a better latency-accuracy balance than YOLOv8 Pose. Specifically, the medium-sized version, YOLO-NAS Pose M, outperforms the large YOLOv8 variant, with 38.85% lower latency on a 4th-generation Intel Xeon CPU and a 0.27 gain in AP@0.5 score.

Model Highlights

  • Task: Pose estimation
  • Model type: Deep Neural Network
  • Framework: PyTorch
  • Dataset: Trained on COCO2017 dataset

Model Architecture

In pose estimation, two primary methodologies have traditionally dominated: top-down methods and bottom-up methods. YOLO-NAS Pose follows neither. Instead, it performs two tasks simultaneously, detecting persons and estimating their poses in a single pass. This sidesteps the two-stage process inherent to many top-down methods, making its operation akin to bottom-up approaches. Yet, unlike typical bottom-up models such as DEKR, YOLO-NAS Pose uses streamlined postprocessing that relies on class NMS over the predicted person boxes. Together, these features yield a fast model that is well suited to deployment on TensorRT.
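To make the postprocessing step concrete, here is a minimal sketch of greedy non-maximum suppression over person boxes, where each surviving box keeps the pose predicted alongside it. With a single class (person), class NMS reduces to plain NMS. This is an illustration only, not Deci's implementation: the function names, the 0.5 IoU threshold, and the sample boxes are all assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), illustrative layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def person_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first.

    The threshold value is an assumption chosen for illustration.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: the first two boxes cover the same person.
boxes = [(10, 10, 110, 210), (12, 14, 108, 205), (300, 40, 380, 220)]
scores = [0.92, 0.85, 0.77]
poses = ["pose_a", "pose_b", "pose_c"]  # stand-ins for per-box keypoints

kept = person_nms(boxes, scores)
kept_poses = [poses[i] for i in kept]  # each kept box carries its pose along
print(kept, kept_poses)
```

Because the pose is attached to its box, no separate grouping of keypoints into persons is needed after suppression, which is the property the paragraph above contrasts with typical bottom-up pipelines.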

YOLO-NAS Pose’s architecture is based on the YOLO-NAS architecture used for object detection. Both architectures share a similar backbone and neck design, but what sets YOLO-NAS Pose apart is its head design, crafted for a multi-task objective: simultaneous single-class object detection (specifically, detecting a person) and pose estimation for that person. AutoNAC was employed to find the optimal head design, ensuring powerful representation while adhering to predefined runtime constraints.

YOLO-NAS Pose comes in four size variants, each tailored to a different computational budget and accuracy target:

Model        Parameters   AP@0.5   Latency (ms)                    Latency (ms)                  Latency (ms)
             (millions)            Intel Xeon gen 4th (OpenVino)   Jetson Xavier NX (TensorRT)
YOLO-NAS N   9.9          59.68    14                              15.99                         2.35
YOLO-NAS S   22.2         64.15    21.87                           21.01                         3.29
YOLO-NAS M   58.2         67.87    42.03                           38.40                         6.87
YOLO-NAS L   79.4         68.24    52.56                           49.34                         8.86
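
One way to use these figures is to pick the most accurate variant that fits a latency budget on a given device. The sketch below encodes the table values as reported above and selects by budget. The `Variant` type and the `best_under_budget` helper are hypothetical names introduced for illustration, and the third latency column is left out because its device is not labeled in the table.

```python
from typing import List, NamedTuple, Optional


class Variant(NamedTuple):
    name: str
    params_millions: float
    ap: float
    xeon_ms: float    # Intel Xeon gen 4th (OpenVino)
    jetson_ms: float  # Jetson Xavier NX (TensorRT)


# Values copied from the table above.
VARIANTS: List[Variant] = [
    Variant("YOLO-NAS N", 9.9, 59.68, 14.0, 15.99),
    Variant("YOLO-NAS S", 22.2, 64.15, 21.87, 21.01),
    Variant("YOLO-NAS M", 58.2, 67.87, 42.03, 38.40),
    Variant("YOLO-NAS L", 79.4, 68.24, 52.56, 49.34),
]


def best_under_budget(budget_ms: float,
                      device: str = "xeon_ms") -> Optional[Variant]:
    """Return the highest-AP variant whose latency fits the budget, if any."""
    candidates = [v for v in VARIANTS if getattr(v, device) <= budget_ms]
    return max(candidates, key=lambda v: v.ap) if candidates else None


choice = best_under_budget(25.0)  # 25 ms budget on the Xeon column
print(choice.name if choice else "no variant fits")
```

Measured latency depends on hardware, batch size, and runtime settings, so a selection like this is best treated as a starting point to confirm with benchmarks on the target device.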

Expected Input

YOLO-NAS Pose takes an image or video as an input.

Expected Output

YOLO-NAS Pose outputs a bounding box and a confidence score for each detected person, along with the predicted (X, Y) coordinates of every skeleton keypoint and a per-keypoint confidence score that indicates how confident the model is that the keypoint is visible.
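
The output described above can be pictured as a small data structure. The sketch below is a hypothetical container, not the API of any particular library: the class names, field names, and the 0.5 visibility threshold are assumptions. The COCO keypoint format defines 17 keypoints per person; the example uses three for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Keypoint:
    x: float
    y: float
    confidence: float  # how confident the model is that the keypoint is visible


@dataclass
class PersonPose:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
    score: float                            # person detection confidence
    keypoints: List[Keypoint]


def visible_keypoints(pose: PersonPose,
                      min_confidence: float = 0.5) -> List[Tuple[int, Keypoint]]:
    """Return (index, keypoint) pairs whose confidence meets the threshold."""
    return [(i, kp) for i, kp in enumerate(pose.keypoints)
            if kp.confidence >= min_confidence]


# Hypothetical prediction for one person.
pose = PersonPose(
    box=(10.0, 10.0, 110.0, 210.0),
    score=0.92,
    keypoints=[
        Keypoint(60.0, 30.0, 0.95),
        Keypoint(40.0, 80.0, 0.20),  # likely occluded
        Keypoint(80.0, 80.0, 0.88),
    ],
)

visible_indices = [i for i, _ in visible_keypoints(pose)]
print(visible_indices)
```

Filtering on keypoint confidence before drawing a skeleton or computing joint angles avoids using coordinates for joints the model considers hidden.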

History and Applications

Pose estimation is integral to computer vision and serves a wide spectrum of applications: monitoring patient movement in healthcare, analyzing athlete performance in sports, building seamless human-computer interfaces, and enhancing robotic systems. Sectors such as entertainment and security also depend on swift, accurate posture detection.

Earlier in 2023, Deci introduced YOLO-NAS, an object detection foundation model that gained widespread recognition. Building on YOLO-NAS, the team then released its pose estimation sibling, YOLO-NAS Pose.

Some real-world applications of YOLO-NAS Pose include:

  • Human pose estimation (action recognition in video analysis, fitness and healthcare monitoring)
  • 3D object tracking (autonomous vehicles for pedestrian and vehicle tracking, robotics for tracking objects in 3D space)
  • Sports analysis (analyzing and improving sports techniques, motion capture for animation and video games)
  • Retail and marketing (customer behavior analysis in stores, virtual fitting rooms for online shopping)