TLDR: MemoryTalker is a new AI model for creating realistic, personalized 3D facial animations from just audio input. It uses a two-stage training process: first, it learns general facial movements, and then it refines these movements with audio-guided stylization to capture individual speaking styles. Unlike previous methods, it doesn’t need extra information like speaker IDs or 3D facial meshes during animation, making it highly practical for applications like VR and gaming.
Creating realistic 3D facial animations from speech has long been a complex challenge in computer graphics and artificial intelligence. While existing methods have made strides, they often struggle with capturing the unique speaking style of an individual or require additional information, such as class labels for speakers or extra 3D facial meshes, during the animation process. This limits their practical use in real-world applications like virtual reality (VR) telepresence or character animation for films and games.
To address these limitations, researchers have introduced a framework called MemoryTalker. The model synthesizes realistic and accurate 3D facial motion sequences from audio input alone, reflecting a speaker’s unique speaking style without requiring speaker labels or extra 3D facial meshes at animation time. This makes it considerably more practical for real-world applications.
MemoryTalker employs a clever two-stage training strategy. The first stage, termed ‘Memorizing,’ focuses on storing and retrieving general facial motions. During this phase, the model learns common facial movements associated with specific sounds or words, regardless of who is speaking. For instance, when different people say the word “who,” their lips generally form similar initial and final shapes. MemoryTalker captures these consistent movements by using text representations derived from an Automatic Speech Recognition (ASR) model to access a ‘motion memory’ that stores these general facial features.
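To make the idea of a ‘motion memory’ concrete, here is a minimal sketch of a key-value memory lookup, assuming the text representations act as queries that are matched against learned keys with softmax attention. The function name `retrieve_motion`, the array shapes, and the scaled dot-product scoring are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_motion(text_feats, keys, values):
    """Hypothetical memory read.

    text_feats: (T, d) per-frame text features (e.g. from an ASR model)
    keys:       (S, d) one learned key per memory slot
    values:     (S, m) one stored general-motion feature per slot
    Returns the retrieved motion features (T, m) and the attention weights (T, S).
    """
    scores = text_feats @ keys.T / np.sqrt(keys.shape[1])  # (T, S) similarity
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    return weights @ values, weights                       # weighted mix of slots
```

In this sketch, frames whose text features resemble a given key pull mostly from that slot, so the same sound retrieves similar motion regardless of who is speaking.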
The second stage, ‘Animating,’ is where the personalization magic happens. Here, the model refines the learned general motion memory to synthesize personalized facial animations. This is achieved by guiding the model with audio-driven speaking style features. MemoryTalker learns to distinguish subtle characteristics in a speaker’s voice, such as volume, pitch, and speaking speed, which influence facial movements like the amplitude of mouth opening or the extent of pouting. By incorporating these distinct style features, the model can generate animations that truly match an individual’s unique way of speaking.
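The article does not spell out how the style features modify the memory, so the following is only an illustrative sketch of one plausible mechanism: an utterance-level style vector pooled from audio features produces a per-slot gain that amplifies or dampens the stored motion. The name `stylize_memory`, the mean pooling, and the sigmoid gating are all assumptions made for this example.

```python
import numpy as np

def stylize_memory(values, audio_feats, w_style):
    """Hypothetical style-conditioned rescaling of memory values.

    values:      (S, m) general motion features stored in memory
    audio_feats: (T, a) per-frame audio features (volume, pitch, etc.)
    w_style:     (S, a) assumed learned projection from style to slot gains
    Returns stylized memory values of shape (S, m).
    """
    style = audio_feats.mean(axis=0)                  # (a,) utterance-level style
    gate = 1.0 / (1.0 + np.exp(-(w_style @ style)))   # (S,) sigmoid, in (0, 1)
    scale = 2.0 * gate                                # (S,) in (0, 2); 1.0 = unchanged
    return values * scale[:, None]
```

Under this assumption, a speaker with an energetic voice could push some gains above 1, exaggerating mouth opening, while a quieter speaker could push them below 1.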
A key advantage of MemoryTalker is its ability to operate without requiring any prior knowledge, such as speaker identity classes or additional 3D facial mesh sequences, during the inference (animation) phase. This is a significant improvement over many previous methods that either couldn’t handle unseen speakers or demanded resource-intensive extra data inputs, making them less practical for real-time applications.
Quantitative evaluations show MemoryTalker outperforming state-of-the-art methods on Face Vertex Error (FVE), Lip Vertex Error (LVE), and Lip Dynamic Time Warping (LDTW). Lower FVE and LVE indicate smaller prediction errors, while lower LDTW indicates lip motion that tracks the ground truth more closely over time. The model is also computationally efficient, with faster inference and fewer parameters than many competitors, which matters for real-time applications.
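For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed on vertex sequences of shape (frames, vertices, 3). The definitions here follow general usage in the literature (LVE as the per-frame maximum lip-vertex error averaged over frames, FVE as the mean error over all vertices, LDTW as a dynamic time warping cost over lip vertices); the paper’s exact formulations may differ, and `lip_idx` is a hypothetical list of lip vertex indices.

```python
import numpy as np

def lip_vertex_error(pred, gt, lip_idx):
    # Per-vertex L2 distance on the lip region: shape (T, L).
    d = np.linalg.norm(pred[:, lip_idx] - gt[:, lip_idx], axis=-1)
    # Worst lip vertex in each frame, averaged over frames.
    return float(d.max(axis=1).mean())

def face_vertex_error(pred, gt):
    # Mean L2 distance over all vertices and frames (assumed definition).
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def lip_dtw(pred, gt, lip_idx):
    # Flatten lip vertices so each frame is a single vector.
    a = pred[:, lip_idx].reshape(len(pred), -1)
    b = gt[:, lip_idx].reshape(len(gt), -1)
    # Pairwise frame-to-frame distances: shape (Ta, Tb).
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Classic DTW accumulation with match, insertion, and deletion moves.
    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[-1, -1])
```

Because DTW allows frames to be matched out of lockstep, it rewards lip motion with the right shape over time even when the timing is slightly shifted, which plain per-frame errors would penalize.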
Qualitative results visually confirm these improvements, showing that MemoryTalker generates more accurate mouth shapes and better captures nuanced movements like pouting, closely matching ground-truth references. User studies also yielded favorable results, with participants consistently preferring MemoryTalker’s animations for their lip-sync accuracy, realism, and ability to reflect individual speaking styles.
In essence, MemoryTalker advances speech-driven 3D facial animation by bridging the gap between speech and 3D facial motion with a two-stage memory network. It offers a practical, high-performing way to generate personalized animations from audio alone, which could enable more immersive and realistic experiences in VR and the metaverse. You can find more details about this research at the project page.


