TLDR: This research introduces Joint Angle-based Refinement (JAR), a novel method to improve the accuracy and stability of marker-free human pose estimation (HPE). JAR addresses issues like keypoint recognition errors and trajectory jitters by modeling human poses using joint angles, approximating their temporal variations with Fourier series to create high-quality training data, and employing a BiGRU-Attention network for post-processing. The method significantly outperforms state-of-the-art refinement networks, especially in complex activities, and can also correct inconsistencies in existing video datasets, making HPE more reliable for kinematic analysis.
Human pose estimation (HPE) is a powerful technology used in many fields, from human-computer interaction to sports analysis and healthcare. It helps determine how a human body is configured from images and videos. However, current HPE methods often struggle with two main issues: occasional errors in recognizing key body points (like elbows or knees) and random fluctuations in the paths these key points trace over time. These problems can significantly affect the accuracy of motion analysis, especially when calculating things like speed or acceleration.
Existing deep learning models designed to refine HPE outputs are often limited because they rely on training datasets where key points are manually marked. This manual annotation can introduce inconsistencies, especially in videos showing continuous human motion, leading to less reliable results.
Introducing Joint Angle-based Refinement (JAR)
A new method called Joint Angle-based Refinement (JAR) has been proposed to overcome these challenges. JAR focuses on modeling human poses using joint angles, which are more robust to changes in camera perspective or distance. This approach helps create a more consistent and accurate description of human movement.
How JAR Works: Key Techniques
The JAR method incorporates several key techniques:
First, it uses a **joint angle-based model** of human pose. Instead of just tracking keypoint coordinates, it derives angles between body segments. This makes the model more stable and less affected by how the video is shot. For instance, it uses the ‘nose’ as a stable reference point and calculates angles from there.
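The paper does not spell out its exact angle formula, but the core idea of describing a pose by inter-segment angles rather than raw coordinates can be sketched in a few lines. Assuming 2D keypoints, the angle at a joint can be computed from the two segments that meet there:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (radians) between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    ang = math.atan2(v1[1], v1[0]) - math.atan2(v2[1], v2[0])
    return ang % (2 * math.pi)  # normalize to [0, 2*pi)

# Example: elbow angle from shoulder, elbow, and wrist keypoints
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
angle = joint_angle(shoulder, elbow, wrist)  # a right angle here
```

Because such angles are ratios of relative positions, they stay the same when the camera zooms or the subject moves closer, which is exactly the perspective robustness the method relies on.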
Second, to get reliable ‘ground truth’ data for training, JAR approximates the temporal variation of joint angles using **high-order Fourier series**. This mathematical technique helps describe the periodic nature of human joint movements, like those seen in running. By fitting parameters from existing datasets, it ensures that the generated training data is spatiotemporally consistent and continuous, mimicking natural human motion.
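To make the Fourier idea concrete, here is a minimal sketch of evaluating a truncated Fourier series as a joint-angle trajectory. The order, period, and coefficient values below are illustrative assumptions, not the paper's fitted parameters:

```python
import math

def fourier_angle(t, a0, coeffs, period):
    """Evaluate theta(t) = a0 + sum_k [a_k*cos(k*w*t) + b_k*sin(k*w*t)]."""
    omega = 2 * math.pi / period
    theta = a0
    for k, (a_k, b_k) in enumerate(coeffs, start=1):
        theta += a_k * math.cos(k * omega * t) + b_k * math.sin(k * omega * t)
    return theta

period = 1.0  # assumed duration of one movement cycle, e.g. a stride
coeffs = [(0.4, 0.1), (0.15, -0.05), (0.02, 0.01)]  # made-up 3rd-order terms
trajectory = [fourier_angle(i * period / 100, 1.2, coeffs, period)
              for i in range(100)]  # smooth, periodic angle sequence
```

By construction the series repeats exactly every period, so trajectories generated this way are continuous and temporally consistent, which is what makes them usable as synthetic ground truth.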
Third, a **bidirectional recurrent network with an attention mechanism (BiGRU-Attention)** is designed as a post-processing module. This network is trained with the high-quality dataset generated by the Fourier series approximation. Its role is to refine the initial pose estimations from well-established models like HRNet, correcting wrongly recognized joints and smoothing their trajectories over time.
Training and Performance
The training dataset for JAR is generated by adjusting Fourier series parameters to simulate individual differences in motion, segmenting these variations using a sliding window, and then adding synthetic noise and outliers. This robust training process helps the model tolerate anomalies in real-world data.
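The windowing-and-corruption step described above can be sketched as follows. The window size, noise level, and outlier magnitude are hypothetical choices for illustration, not values taken from the paper:

```python
import random

def make_training_samples(angle_seq, window, stride,
                          noise_std=0.02, outlier_prob=0.05):
    """Cut a clean angle sequence into windows and corrupt a copy of each
    window, yielding (noisy input, clean target) training pairs."""
    samples = []
    for start in range(0, len(angle_seq) - window + 1, stride):
        clean = angle_seq[start:start + window]
        noisy = []
        for theta in clean:
            theta += random.gauss(0.0, noise_std)          # trajectory jitter
            if random.random() < outlier_prob:             # occasional gross error
                theta += random.choice((-1.0, 1.0)) * 0.5  # simulated misrecognition
            noisy.append(theta)
        samples.append((noisy, clean))
    return samples

random.seed(0)
clean_seq = [0.01 * i for i in range(200)]  # stand-in for a Fourier-generated sequence
pairs = make_training_samples(clean_seq, window=32, stride=16)
```

Training on pairs like these teaches the smoothing network to map corrupted windows back to the clean underlying motion, which is what lets it tolerate jitter and outliers in real detections.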
JAR operates in four stages: initial pose estimation (e.g., by HRNet); transformation of the detected keypoints into joint angles; smoothing of these joint angle sequences with the BiGRU-Attention model; and finally, reconstruction of the refined keypoint positions from the smoothed angles.
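The last two stages can be sketched end to end. In this toy version a moving average stands in for the trained BiGRU-Attention smoother, and keypoints are rebuilt from a reference point, an assumed segment length, and the smoothed angle; both simplifications are illustrative, not the paper's implementation:

```python
import math

def smooth(seq, k=2):
    """Placeholder temporal smoother (moving average) standing in for
    the learned BiGRU-Attention module."""
    out = []
    for i in range(len(seq)):
        lo, hi = max(0, i - k), min(len(seq), i + k + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def reconstruct(ref, length, angle):
    """Recover a keypoint from a reference point, segment length, and angle."""
    return (ref[0] + length * math.cos(angle),
            ref[1] + length * math.sin(angle))

angles = [0.10, 0.12, 0.90, 0.14, 0.16]  # frame 2 is a misrecognized outlier
smoothed = smooth(angles)                # outlier pulled toward its neighbors
keypoints = [reconstruct((0.0, 0.0), 1.0, a) for a in smoothed]
```

The real system replaces the moving average with the trained network, but the data flow is the same: angles in, smoothed angles out, coordinates recovered at the end.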
Experimental results show that JAR significantly outperforms state-of-the-art HPE refinement networks like SmoothNet, especially in challenging scenarios such as figure skating and breaking. For example, in sprint and standing triple jump cases, JAR achieved outlier correction rates of 95.61% and 100% respectively, substantially higher than SmoothNet’s performance. JAR also produces much smoother and more physiologically consistent velocity curves, which is crucial for accurate kinematic analysis.
The research also evaluated various sequence-to-sequence models for the smoothing task, confirming that BiGRU-Attention offers a balanced performance, robustness, and computational efficiency, making it the optimal choice for JAR.
Broader Impact
Beyond refining real-time pose estimations, JAR can also be used to rectify existing video datasets, minimizing inconsistencies caused by manual annotations. This capability can significantly enhance the reliability of training datasets for future HPE models, leading to overall improvements in human motion analysis technology.
For more technical details, you can refer to the full research paper: Joint angle model based learning to refine kinematic human pose estimation.