TLDR: EXPOTION is a new generative AI model that creates expressive and temporally accurate music by using human facial expressions, upper-body motion, and text prompts as controls. It employs parameter-efficient fine-tuning on a pre-trained text-to-music model and introduces a temporal smoothing strategy for precise video-music synchronization. Experiments show it significantly improves music quality, creativity, and alignment compared to existing methods, and it comes with a new 7-hour synchronized video-music dataset.
EXPOTION is a new generative AI model that produces expressive, precisely synchronized music guided by human facial expressions, upper-body motion, and text prompts. Developed by researchers at the Mohamed bin Zayed University of Artificial Intelligence, it takes a multimodal approach to music generation.
Current text-to-music models, while impressive, often fall short in providing the fine-grained temporal control and expressivity needed for real-world applications. EXPOTION addresses this by integrating visual cues, much like a conductor guides an orchestra, to produce music that accurately mirrors the emotional and expressive nuances of human gestures and facial movements.
How EXPOTION Works
At its core, EXPOTION utilizes a technique called Parameter-Efficient Fine-Tuning (PEFT) on a pre-trained text-to-music generation model. This allows the model to adapt to multimodal controls using a relatively small dataset, making the training process efficient. To ensure seamless synchronization between video and music, the researchers introduced a unique temporal smoothing strategy, aligning multiple modalities with high precision.
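This summary does not spell out how the temporal smoothing works, so the following is only a minimal sketch of one plausible reading: per-frame visual features are linearly resampled onto the music model's token timeline and then passed through a centered moving average. The function names `resample_linear` and `moving_average`, the window size, and the toy frame rates are all illustrative assumptions, not the authors' method.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time step


def resample_linear(frames: Frames, src_fps: float, dst_fps: float) -> Frames:
    """Linearly interpolate feature vectors from a src_fps timeline onto dst_fps."""
    if not frames:
        return []
    duration = len(frames) / src_fps
    n_out = max(1, round(duration * dst_fps))
    out = []
    for i in range(n_out):
        pos = i * src_fps / dst_fps            # fractional index into the source frames
        lo = min(int(pos), len(frames) - 1)
        hi = min(lo + 1, len(frames) - 1)      # clamp at the last frame
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def moving_average(frames: Frames, window: int) -> Frames:
    """Centered moving average over time; the window is truncated at the edges."""
    half = window // 2
    out = []
    for t in range(len(frames)):
        span = frames[max(0, t - half): min(len(frames), t + half + 1)]
        out.append([sum(col) / len(span) for col in zip(*span)])
    return out


# Toy input: two frames of a 1-D feature at 1 fps, resampled to 2 steps per second.
video_feats = [[0.0], [1.0]]
aligned = resample_linear(video_feats, src_fps=1, dst_fps=2)
smoothed = moving_average(aligned, window=3)
```

In a real pipeline the source rate would be the video frame rate and the target rate would be the token rate of the music decoder, so that each generated audio step has exactly one conditioning vector.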
The model incorporates visual features through a joint embedding encoder, which processes both facial expressions and upper-body movements. Facial expression features are extracted using MARLIN, a self-supervised learning framework, while motion features can be derived from either Synchformer or RAFT optical flow. These visual inputs are then temporally aligned and projected into a lower-dimensional space before being fused with positional embeddings. A condition adaptor then integrates these learned visual embeddings into the MusicGen decoder, allowing the model to generate music that is not only high-quality but also deeply reflective of the visual input.
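The fusion step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: it assumes the two feature streams are already temporally aligned, that each is projected by a single linear map, that the projections are concatenated per time step, and that the positional embeddings are the standard sinusoidal kind. The names `linear`, `sinusoidal_positions`, and `fuse`, the feature sizes, and the random weights are hypothetical, and the condition adaptor that feeds the result into the MusicGen decoder is not modeled here.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix) -> Matrix:
    """Project each row of x (T x d_in) with weights w (d_in x d_out)."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)] for row in x]


def sinusoidal_positions(steps: int, dim: int) -> Matrix:
    """Sinusoidal positional embeddings: sin on even indices, cos on odd indices."""
    return [[math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
             else math.cos(t / 10000 ** ((i - 1) / dim))
             for i in range(dim)]
            for t in range(steps)]


def fuse(face: Matrix, motion: Matrix, w_face: Matrix, w_motion: Matrix) -> Matrix:
    """Project both streams, concatenate them per time step, and add positions."""
    if len(face) != len(motion):
        raise ValueError("face and motion streams must be temporally aligned")
    f, m = linear(face, w_face), linear(motion, w_motion)
    joint = [f_row + m_row for f_row, m_row in zip(f, m)]
    pos = sinusoidal_positions(len(joint), len(joint[0]))
    return [[a + b for a, b in zip(row, p)] for row, p in zip(joint, pos)]


# Toy sizes (assumed): 4 time steps, 8-D face and 6-D motion features, each projected to 2-D.
rng = random.Random(0)
face = [[rng.random() for _ in range(8)] for _ in range(4)]
motion = [[rng.random() for _ in range(6)] for _ in range(4)]
w_face = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(8)]
w_motion = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(6)]
visual_embedding = fuse(face, motion, w_face, w_motion)  # shape: 4 steps x 4 dims
```

Projecting each stream before fusing keeps the conditioning signal small relative to the frozen base model, which fits the parameter-efficient setup the article describes.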
A New Dataset for Expressive Music
Recognizing the scarcity of suitable paired video-audio data, the team curated a novel dataset consisting of 7 hours of synchronized video recordings. Volunteers were asked to record their facial expressions and upper-body movements while listening to 30-second instrumental audio clips across various genres. This unique dataset provides significant potential for future research in interactive and multimodal music generation.
Performance and Impact
Experiments demonstrate that EXPOTION significantly enhances the overall quality of generated music. It excels in musicality, creativity, beat-tempo consistency, temporal alignment with video, and text adherence. The model consistently outperforms both proposed baselines and existing state-of-the-art video-to-music generation models like VidMuse and Video2Music.
Objective evaluations showed that configurations incorporating motion information, especially those using Synchformer features with generated text prompts, yielded the best music quality. Subjective evaluations, involving participants with varying musical backgrounds, confirmed that EXPOTION’s models produced music perceived as more musical and creative than the baseline, highlighting the complementary strengths of the visual and textual modalities.
Ablation studies further revealed that Synchformer motion features generally lead to more realistic and diverse music. Surprisingly, generic text prompts sometimes resulted in better tempo accuracy than detailed generated captions, suggesting that too much textual context may occasionally interfere with purely visual motion cues.
In conclusion, EXPOTION represents a significant step forward in controllable and expressive music generation. By using human body movements and facial expressions as control signals, it gives artists a more intuitive and interactive approach to music creation. More details are available in the EXPOTION research paper.


