TLDR: HumanCM is a new framework for 3D human motion prediction that uses consistency models to achieve high-quality, one-step generation. It significantly reduces inference time (up to two orders of magnitude faster) compared to traditional multi-step diffusion models, while maintaining comparable or superior accuracy on benchmarks like Human3.6M and HumanEva-I, making real-time applications feasible.
Predicting how humans will move in the near future is a critical task for many advanced technologies, from robots interacting with people to self-driving cars navigating complex environments and immersive virtual worlds. This field, known as Human Motion Prediction (HMP), aims to forecast future 3D human poses based on observed motion sequences.
In recent years, deep generative models have made significant strides in making these predictions more realistic and diverse. Among these, diffusion-based approaches have shown remarkable success in generating natural and continuous motion trajectories. However, these methods come with a significant drawback: they require many iterative steps—sometimes tens or even hundreds—to refine their predictions. This process is computationally intensive and slow, making them unsuitable for applications where real-time responsiveness is crucial, such as interactive agents or augmented/virtual reality systems.
Addressing this challenge, researchers Haojie Liu and Suixiang Gao from the University of Chinese Academy of Sciences have introduced a groundbreaking framework called HumanCM. This innovative system is designed for one-step human motion prediction, drastically cutting down the time and computational resources needed.
HumanCM is built upon the concept of Consistency Models (CM), a relatively new paradigm in generative modeling. Unlike diffusion models that rely on a multi-step denoising process, consistency models learn a direct, self-consistent mapping between a noisy motion state and its clean, predicted future state. This allows HumanCM to generate high-quality motion predictions in a single forward pass, eliminating the iterative refinement bottleneck.
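To make the one-step idea concrete, here is a minimal sketch of consistency-model sampling. Everything here is an assumption for illustration, not HumanCM's actual code: the toy `net` stands in for the real Transformer denoiser, and the noise-range constants (`SIGMA_MIN`, `SIGMA_MAX`, `SIGMA_DATA`) and skip/output coefficients follow the common consistency-model parameterization, which enforces that the model reduces to the identity at the minimum noise level.

```python
import numpy as np

# Noise range and data scale: common consistency-model defaults (assumed).
SIGMA_MIN, SIGMA_MAX = 0.002, 80.0
SIGMA_DATA = 0.5

def net(x, sigma, context):
    # Stand-in for the Transformer denoiser conditioned on the observed
    # motion `context`; a trivial map so the example stays runnable.
    return np.tanh(0.1 * x + context.mean())

def c_skip(sigma):
    # Skip coefficient: equals 1 at sigma == SIGMA_MIN, so the
    # consistency function becomes the identity on clean inputs.
    return SIGMA_DATA**2 / ((sigma - SIGMA_MIN)**2 + SIGMA_DATA**2)

def c_out(sigma):
    # Output coefficient: vanishes at sigma == SIGMA_MIN.
    return SIGMA_DATA * (sigma - SIGMA_MIN) / np.sqrt(SIGMA_DATA**2 + sigma**2)

def consistency_fn(x, sigma, context):
    # f_theta(x, sigma) = c_skip(sigma) * x + c_out(sigma) * F_theta(x, sigma):
    # maps a noisy motion state at any noise level to a clean prediction.
    return c_skip(sigma) * x + c_out(sigma) * net(x, sigma, context)

def one_step_predict(context, future_shape, rng):
    # One-step generation: draw pure noise at the maximum noise level and
    # map it directly to a clean future motion in a single forward pass.
    x_T = rng.standard_normal(future_shape) * SIGMA_MAX
    return consistency_fn(x_T, SIGMA_MAX, context)
```

Because the whole sampler is one call to `consistency_fn`, there is no iterative denoising loop to amortize, which is exactly where the speedup over multi-step diffusion comes from.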
The framework employs a Transformer-based architecture, which is excellent at understanding long-range dependencies, both across different body joints (spatial) and over time (temporal). To further enhance its capabilities, HumanCM integrates temporal embeddings, helping it to maintain motion coherence and structural integrity throughout the prediction. Additionally, the training process is stabilized and semantic fidelity is enforced through a reconstruction-guided objective, ensuring that the generated motions are not only consistent but also realistic and true to the underlying data.
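The paper does not spell out its embedding scheme in this summary, but temporal embeddings in Transformer architectures are typically the standard sinusoidal position encoding over the frame axis. The sketch below shows that common form; HumanCM's actual embedding may differ.

```python
import numpy as np

def temporal_embedding(num_frames, dim):
    # Standard sinusoidal position encoding over time (Transformer-style):
    # each frame index is encoded by sines/cosines at geometric frequencies,
    # giving the model an explicit notion of temporal order.
    positions = np.arange(num_frames)[:, None]                      # (T, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    emb = np.zeros((num_frames, dim))
    emb[:, 0::2] = np.sin(positions * freqs)   # even channels: sine
    emb[:, 1::2] = np.cos(positions * freqs)   # odd channels: cosine
    return emb
```

These embeddings are added to the per-frame pose tokens before the attention layers, so spatial attention across joints and temporal attention across frames both see where each token sits in the sequence.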
The impact of HumanCM’s efficiency is substantial. While existing diffusion-based models like MotionDiff, HumanMAC, and TransFusion typically require 10 to 100 sampling steps, HumanCM achieves its predictions in just one step. This translates to a dramatic reduction in generation time, making it over two orders of magnitude faster than its diffusion-based counterparts, as reported in the paper. For instance, HumanCM can generate motion in approximately 0.66 seconds, compared to over 30 seconds for some other models.
Despite this significant acceleration, HumanCM does not compromise on accuracy. Extensive experiments conducted on widely used benchmarks, Human3.6M and HumanEva-I, demonstrate that HumanCM achieves comparable or even superior accuracy to state-of-the-art diffusion models. It shows excellent performance in metrics like Average Displacement Error (ADE) and Final Displacement Error (FDE), which measure prediction accuracy and long-term trajectory coherence.
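ADE and FDE have simple definitions worth stating: ADE averages the per-joint L2 error over all predicted frames, while FDE measures it only at the final frame. Conventions vary across papers (stochastic HMP methods often report the minimum over multiple sampled futures); the sketch below shows the basic per-joint form, not necessarily the exact variant used in the HumanCM paper.

```python
import numpy as np

def ade(pred, gt):
    # Average Displacement Error: mean L2 distance between predicted and
    # ground-truth joint positions over every frame and joint.
    # pred, gt: arrays of shape (T, J, 3) -- frames, joints, xyz.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def fde(pred, gt):
    # Final Displacement Error: mean per-joint L2 distance at the last
    # predicted frame only, capturing long-horizon trajectory drift.
    return np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()
```

A low ADE indicates the whole predicted trajectory tracks the ground truth, while a low FDE indicates the motion still ends up in the right place after the full prediction horizon.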
The development of HumanCM marks a significant advancement in the field of human motion prediction. By distilling the complex diffusion process into a lightweight, one-step generator, it paves the way for real-time human motion forecasting in various latency-sensitive applications. This research highlights the immense potential of consistency models as a powerful and efficient alternative to traditional diffusion frameworks for spatiotemporal generation tasks.
Also Read:
- Efficient One-Step Generation with Di-Bregman Diffusion Distillation
- SoftMimic: Enabling Humanoid Robots to Interact Gently and Safely
For more technical details, you can refer to the full research paper: HumanCM: One Step Human Motion Prediction.