TLDR: X-UniMotion is a novel AI system that animates still human images using motion from a video, even if the subjects are different. It achieves high fidelity and preserves identity by encoding whole-body motion (face, body, hands) into “identity-agnostic” latent tokens, moving beyond traditional explicit pose methods. This allows for detailed, expressive animations and opens possibilities for video outpainting.
A new research paper introduces X-UniMotion, an innovative approach to animating human images with remarkable expressiveness and fidelity. This technology tackles the complex challenge of transferring motion from one person’s video to a still image of another, ensuring that the animated character retains its original identity while performing intricate movements.
Traditional methods for animating human images often rely on explicit skeletal poses, which can struggle with capturing subtle details like facial expressions or hand gestures. These methods also frequently entangle identity-specific traits with the motion, leading to issues where the animated character might lose its original appearance. X-UniMotion addresses these limitations by moving away from explicit pose inputs.
The Core Idea: Identity-Agnostic Motion Latents
At the heart of X-UniMotion is an “implicit latent representation” for whole-body human motion. Instead of using visible skeletal points, the system encodes motion directly from a single image into a compact set of four “disentangled latent tokens”: one for facial expression, one for body pose, and one for each hand. The key here is that these motion tokens are designed to be “identity-agnostic,” meaning they capture the movement itself without being tied to the specific appearance or structure of the person performing the motion.
This approach allows for high-fidelity, detailed motion transfer across different individuals, even when they have distinct body shapes, poses, or spatial arrangements. The system can capture everything from subtle facial twitches and intricate finger movements to complex body articulations, all while being robust to challenges like occlusions or varying lighting conditions.
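To make the four-token idea concrete, here is a minimal sketch of how such a representation could be organized. It is not the paper's code: the class names (`MotionTokens`, `GenerationRequest`), the `reenact` helper, the two-dimensional vectors, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class MotionTokens:
    """Four motion latents for one driving frame (sizes here are hypothetical)."""
    body: Vector        # global whole-body motion descriptor
    face: Vector        # localized facial-expression descriptor
    left_hand: Vector   # localized descriptor for the left hand
    right_hand: Vector  # localized descriptor for the right hand

    def as_sequence(self) -> List[Vector]:
        """Return the tokens in a fixed order, as a generator might consume them."""
        return [self.body, self.face, self.left_hand, self.right_hand]


@dataclass(frozen=True)
class GenerationRequest:
    """Pairs an identity source (the still image) with a motion source."""
    reference_image_id: str
    motion: MotionTokens


def reenact(reference_image_id: str, driving: List[MotionTokens]) -> List[GenerationRequest]:
    """Build one request per driving frame; identity comes only from the reference."""
    return [GenerationRequest(reference_image_id, tokens) for tokens in driving]


# Made-up token values standing in for the output of a motion encoder.
driving_clip = [
    MotionTokens(body=[0.1, 0.2], face=[0.0, 0.5], left_hand=[0.3, 0.3], right_hand=[0.9, 0.1]),
    MotionTokens(body=[0.2, 0.2], face=[0.1, 0.4], left_hand=[0.3, 0.4], right_hand=[0.8, 0.2]),
]

# The same motion tokens can drive two different reference images unchanged.
requests_a = reenact("person_a.png", driving_clip)
requests_b = reenact("person_b.png", driving_clip)
```

The point of the sketch is the separation of concerns: because the tokens carry no identity information, swapping the reference image changes who appears in the output but leaves the motion input untouched.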
How X-UniMotion Works
The technology operates within a self-supervised, end-to-end training framework. It jointly learns a motion encoder and a video generative model based on a Diffusion-Transformer (DiT) architecture. Here’s a simplified breakdown:
- Encoding Motion: A motion encoder takes a driving video frame and distills its motion into a low-dimensional latent descriptor. This descriptor is global, focusing purely on motion without leaking identity details.
- Disentangling Identity: To keep the motion identity-agnostic, the system relies on two kinds of augmentation. It applies 2D augmentations (like color changes and spatial distortions) to the driving images. More distinctively, it also uses synthetic 3D renderings of different characters performing the same poses, but with varied body proportions. This helps the model learn to separate motion from identity-specific features like face shape or body size.
- Localized Details: While a global motion descriptor handles the overall movement, X-UniMotion introduces additional localized descriptors specifically for the face and each hand. This allows the system to capture fine-grained details like individual finger movements or nuanced facial expressions, which are often missed by other methods.
- Guided Learning: To further enhance the quality and semantic understanding of the motion tokens, the system uses “dual decoders.” These auxiliary components provide explicit guidance during training, helping the model accurately represent joint positions and hand normal maps, which are crucial for realistic depth and articulation.
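The 2D augmentation step from the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the article mentions “color changes and spatial distortions” without giving details, so the per-channel gain, the nearest-neighbor anisotropic rescale, the parameter ranges, and the function names (`color_jitter`, `anisotropic_rescale`, `augment_driving_frame`) are all hypothetical stand-ins.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of RGB pixels


def color_jitter(image: Image, rng: random.Random, strength: float = 0.3) -> Image:
    """Scale each RGB channel by a random gain, clamping to the 0..255 range."""
    gains = [1.0 + rng.uniform(-strength, strength) for _ in range(3)]
    return [
        [tuple(min(255, max(0, round(c * g))) for c, g in zip(px, gains)) for px in row]
        for row in image
    ]


def anisotropic_rescale(image: Image, scale_x: float, scale_y: float) -> Image:
    """Nearest-neighbor resize with independent x/y factors (changes apparent proportions)."""
    src_h, src_w = len(image), len(image[0])
    dst_h = max(1, round(src_h * scale_y))
    dst_w = max(1, round(src_w * scale_x))
    return [
        [
            image[min(src_h - 1, int(y / scale_y))][min(src_w - 1, int(x / scale_x))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


def augment_driving_frame(image: Image, seed: int) -> Image:
    """Alter appearance and proportions of a driving frame; the pose itself is unchanged."""
    rng = random.Random(seed)
    jittered = color_jitter(image, rng)
    return anisotropic_rescale(jittered, rng.uniform(0.8, 1.2), rng.uniform(0.8, 1.2))


# A tiny synthetic 8x6 frame stands in for a real driving image.
frame = [[(10 * x, 20 * y, 128) for x in range(8)] for y in range(6)]
augmented = augment_driving_frame(frame, seed=0)
```

The intuition is that the encoder sees a frame whose colors and proportions no longer match the original person, so identity cues become unreliable and only the motion remains a consistent signal.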
Performance and Applications
Extensive experiments show that X-UniMotion outperforms existing state-of-the-art methods in terms of motion accuracy, identity preservation, and overall visual quality. It excels in challenging “cross-identity reenactment” scenarios, where the reference image and driving video feature vastly different individuals. Unlike methods that rely on 2D skeletons, X-UniMotion handles complex poses, depth ambiguities (like crossing limbs), and fine-grained expressions with superior results.
Beyond animating still images, the unified motion representation developed by X-UniMotion also opens doors for other applications, such as video outpainting, where the system can predict and generate continuous motion sequences to extend a video.
While X-UniMotion currently focuses on single-person human or anthropomorphic character animation, the researchers envision future work extending it to multi-person scenarios, human-object interactions, and even adapting it for animating non-human subjects like animals. This research marks a significant step forward in creating highly expressive and identity-preserving digital human animations. You can read the full research paper at arXiv.org.


