TLDR: Step2Motion is a novel deep learning method that reconstructs detailed human locomotion using only data from smart insoles equipped with pressure sensors and IMUs. It overcomes limitations of traditional motion capture systems by providing accurate full-body motion reconstruction for diverse activities, from walking to dancing, in real-world environments. The system utilizes a diffusion model for poses and a separate Transformer for root displacement, employing a unique multi-head cross-attention mechanism to effectively integrate multi-modal sensor data.
Human locomotion is shaped by the forces exchanged between our feet and the ground, and those forces provide vital clues for understanding and recreating how we move. Traditional motion capture systems, such as optical camera setups or full-body suits, come with limitations: high cost, complex setup, line-of-sight requirements, or restricted movement. These limitations make them poorly suited to capturing natural movement in everyday, unconstrained environments, especially outdoors.
To address this gap, researchers have introduced Step2Motion, which they describe as the first method to reconstruct human locomotion using only data from multi-modal smart insoles. The insoles are worn discreetly inside shoes and combine pressure sensors with Inertial Measurement Units (IMUs), offering a practical and unrestrictive solution for motion capture.
How Step2Motion Works
The Step2Motion system leverages two primary types of data from the insoles: pressure and inertial measurements. Each insole contains 16 pressure sensors distributed across the foot, measuring the force applied to different areas. Additionally, an IMU in each insole captures linear acceleration and angular rates, providing information about the foot’s movement and orientation. The system also records the total ground reaction force and the center of pressure (CoP) for each foot.
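To make the per-insole data layout concrete, here is a minimal Python sketch of one time step of readings. Several things in it are assumptions for illustration, not details from the paper: the class name `InsoleFrame`, the 4x4 sensor grid `GRID`, the use of three-axis IMU values, the shortcut of summing pressure values as a stand-in for total force, and the pressure-weighted centroid used for the CoP (a textbook definition; the actual insoles may report these values directly).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_PRESSURE_SENSORS = 16  # per insole, as stated in the article


def center_of_pressure(pressure: Sequence[float],
                       positions: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Pressure-weighted mean of the sensor positions (textbook CoP definition)."""
    total = sum(pressure)
    if total <= 0:  # foot in the air: no meaningful CoP, return the origin
        return (0.0, 0.0)
    x = sum(p * pos[0] for p, pos in zip(pressure, positions)) / total
    y = sum(p * pos[1] for p, pos in zip(pressure, positions)) / total
    return (x, y)


@dataclass
class InsoleFrame:
    """One time step of readings from a single insole (hypothetical layout)."""
    pressure: List[float]                      # 16 per-sensor readings
    acceleration: Tuple[float, float, float]   # IMU linear acceleration
    angular_rate: Tuple[float, float, float]   # IMU angular rates
    total_force: float                         # total ground reaction force
    cop: Tuple[float, float]                   # center of pressure

    def to_vector(self) -> List[float]:
        """Flatten all channels into one feature vector."""
        return [*self.pressure, *self.acceleration, *self.angular_rate,
                self.total_force, *self.cop]


# Hypothetical sensor coordinates: a regular 4x4 grid, NOT the real insole layout.
GRID = [(float(i % 4), float(i // 4)) for i in range(NUM_PRESSURE_SENSORS)]

pressure = [1.0] * NUM_PRESSURE_SENSORS        # evenly loaded foot
frame = InsoleFrame(
    pressure=pressure,
    acceleration=(0.0, 0.0, 9.81),
    angular_rate=(0.0, 0.0, 0.0),
    total_force=sum(pressure),                 # simplification: ignores sensor area
    cop=center_of_pressure(pressure, GRID),
)
```

Under these assumptions each insole contributes 16 + 3 + 3 + 1 + 2 = 25 values per time step, and an evenly loaded foot puts the CoP at the center of the grid.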
At its core, Step2Motion employs a deep learning architecture that combines a diffusion model for reconstructing detailed body poses and a separate Transformer network for predicting the overall root motion (displacement) of the body. The diffusion model is particularly effective at synthesizing high-quality, temporally consistent poses by progressively refining a noisy input. To make sense of the multi-modal insole data, the system uses a specialized ‘multi-head cross-attention’ mechanism. This allows the network to selectively focus on different sensor modalities – such as pressure from the toes or heel, or IMU data – depending on the specific body part being reconstructed and the type of movement.
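The idea of attention heads weighting sensor modalities differently can be illustrated with a stripped-down sketch of multi-head cross-attention. This is not the Step2Motion network: it omits the learned query, key, and value projections a real model would have, and the token vectors and the names `query`, `pressure_token`, and `imu_token` are made up for illustration. It only shows the mechanism, in which each head scores a body-part query against one token per modality and mixes the values by those scores.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_head(query, keys, values) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


def multi_head_cross_attention(query, keys, values, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.

    Learned projections are deliberately left out to keep the sketch short.
    """
    d = len(query)
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    h = d // num_heads
    output, all_weights = [], []
    for i in range(num_heads):
        sl = slice(i * h, (i + 1) * h)
        out, w = attention_head(query[sl],
                                [k[sl] for k in keys],
                                [v[sl] for v in values])
        output.extend(out)
        all_weights.append(w)
    return output, all_weights


# Made-up tokens: one per sensor modality, plus a query for one body part.
pressure_token = [1.0, 0.0, 0.0, 0.0]
imu_token = [0.0, 0.0, 0.0, 1.0]
query = [1.0, 0.0, 0.0, 1.0]

tokens = [pressure_token, imu_token]
output, weights = multi_head_cross_attention(query, tokens, tokens, num_heads=2)
# weights[0] favors the pressure token, weights[1] favors the IMU token.
```

With these toy vectors the first head puts more weight on the pressure token and the second on the IMU token, which mirrors in miniature how separate heads can specialize in different modalities.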
The Transformer that predicts root displacement relies primarily on IMU data rather than pressure data. The researchers found that using only IMU data for displacement prediction helped prevent the model from overfitting to specific pressure patterns, which led to better generalization to unseen movements. The displacement predictor is also trained with a cumulative sum loss, which penalizes errors that accumulate over time and so encourages more accurate long-term tracking.
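A rough sketch of how a cumulative sum term changes the training signal is shown below. The exact loss in the paper is not given in this article, so the form used here (per-step mean squared error plus a weighted mean squared error on the running sums), the function name `displacement_loss`, and the `cumsum_weight` parameter are assumptions. Displacements are one-dimensional to keep the example short.

```python
from itertools import accumulate
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def displacement_loss(pred: Sequence[float],
                      target: Sequence[float],
                      cumsum_weight: float = 1.0) -> float:
    """Per-step error plus an error term on the accumulated trajectory.

    Assumed form for illustration; not taken from the paper.
    """
    step_loss = mse(pred, target)
    # Running sums turn per-frame displacements into positions along the path.
    path_loss = mse(list(accumulate(pred)), list(accumulate(target)))
    return step_loss + cumsum_weight * path_loss


target = [1.0, 1.0, 1.0, 1.0]        # true per-frame displacement
biased = [1.1, 1.1, 1.1, 1.1]        # always overshoots: errors pile up
jittery = [1.1, 0.9, 1.1, 0.9]       # same per-step error, but it cancels out

loss_biased = displacement_loss(biased, target)
loss_jittery = displacement_loss(jittery, target)
```

Both predictions have essentially the same per-step error, yet the consistently biased one receives the larger loss because its position error keeps growing. That is the kind of long-term drift a cumulative term is meant to discourage.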
Versatility and Performance
Step2Motion has been rigorously evaluated across a wide range of experiments, demonstrating its versatility for diverse locomotion styles. It can accurately reconstruct simple movements like walking and jogging, as well as more complex actions such as moving sideways, walking on tiptoes, slightly crouching, dancing, and even jumping. The system has been tested on both a publicly available dataset and a newly recorded dataset specifically designed for motion diversity.
Compared to traditional deep learning architectures like MLPs and standard Transformers, Step2Motion consistently shows superior performance in terms of pose accuracy and temporal consistency. The multi-head cross-attention mechanism proved crucial, allowing the model to adapt its focus based on the activity – prioritizing pressure data during stationary actions like squatting and IMU data during dynamic movements like walking.
In real-world ‘in-the-wild’ capture scenarios, Step2Motion demonstrated remarkable accuracy in tracking root displacement. For instance, in an experiment where a user jogged 60 meters, the system achieved a final drift of only about 0.75 meters (1.25% of the total distance), significantly outperforming other baseline methods.
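The drift percentage quoted above is simply the final position error divided by the distance travelled. The short sketch below reproduces that arithmetic; the function names and the definition of final drift as the straight-line distance between the predicted and true end positions are assumptions, since the article does not spell out how drift was measured.

```python
import math
from typing import Tuple


def final_drift(pred_end: Tuple[float, float], true_end: Tuple[float, float]) -> float:
    """Straight-line distance between predicted and true end positions (assumed definition)."""
    return math.dist(pred_end, true_end)


def drift_percent(drift_m: float, distance_m: float) -> float:
    """Drift as a percentage of the total distance covered."""
    return 100.0 * drift_m / distance_m


# Figures from the article: about 0.75 m of drift over a 60 m jog.
reported = drift_percent(0.75, 60.0)   # about 1.25 (percent)
```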
The ability to combine both pressure and IMU data is particularly vital for reconstructing complex motions like dancing. While simpler movements might be partially reconstructed with one modality, dance requires the rich, complementary information from both to achieve meaningful and accurate pose reconstruction.
Future Directions
While Step2Motion marks a significant advancement, the researchers acknowledge certain limitations. These include the inherent drift associated with IMU sensors and the reduced accuracy for body parts far from the feet, such as the head and arms. Future work could explore integrating other sensor modalities, generating synthetic insole readings from existing motion data, or using motion priors trained on larger databases to further enhance accuracy and robustness.
Step2Motion represents a pivotal first step towards general locomotion reconstruction using only insole sensors. This technology holds immense potential for applications in sports analysis, rehabilitation, virtual reality, and entertainment, making high-quality motion capture more accessible and versatile in various environments.


