EXPOTION: Guiding Music Generation with Human Expressions

TLDR: EXPOTION is a new generative AI model that creates expressive and temporally accurate music by using human facial expressions, upper-body motion, and text prompts as controls. It employs parameter-efficient fine-tuning on a pre-trained text-to-music model and introduces a temporal smoothing strategy for precise video-music synchronization. Experiments show it significantly improves music quality, creativity, and alignment compared to existing methods, and it comes with a new 7-hour synchronized video-music dataset.

EXPOTION is a new generative AI model that produces expressive, precisely synchronized music guided by human facial expressions, upper-body motion, and text prompts. Developed by researchers at the Mohamed bin Zayed University of Artificial Intelligence, it takes a multimodal approach to music generation.

Current text-to-music models, while impressive, often fall short in providing the fine-grained temporal control and expressivity needed for real-world applications. EXPOTION addresses this by integrating visual cues, much like a conductor guides an orchestra, to produce music that accurately mirrors the emotional and expressive nuances of human gestures and facial movements.

How EXPOTION Works

At its core, EXPOTION utilizes a technique called Parameter-Efficient Fine-Tuning (PEFT) on a pre-trained text-to-music generation model. This allows the model to adapt to multimodal controls using a relatively small dataset, making the training process efficient. To ensure seamless synchronization between video and music, the researchers introduced a unique temporal smoothing strategy, aligning multiple modalities with high precision.
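The article does not spell out how the temporal smoothing strategy works. As a rough illustration only, the sketch below smooths a sequence of per-frame feature vectors with a centered moving average, one common way to reduce frame-to-frame jitter before features are aligned with audio. The function name `smooth_features` and the `window` parameter are hypothetical and are not taken from the EXPOTION paper.

```python
from typing import List

Vector = List[float]


def smooth_features(frames: List[Vector], window: int = 3) -> List[Vector]:
    """Centered moving average over time, applied to each feature dimension.

    Assumption: this is a stand-in for the paper's smoothing strategy,
    whose details the article does not give.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(frames)
    smoothed = []
    for t in range(n):
        # Shrink the window at the sequence edges instead of padding.
        span = frames[max(0, t - half):min(n, t + half + 1)]
        dim = len(frames[t])
        smoothed.append([sum(f[d] for f in span) / len(span) for d in range(dim)])
    return smoothed


if __name__ == "__main__":
    # Three frames with a single feature each.
    print(smooth_features([[0.0], [3.0], [6.0]], window=3))
```

A larger window gives steadier control signals at the cost of blurring fast gestures, so the window size would need to be tuned against the frame rate of the video.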

The model incorporates visual features through a joint embedding encoder, which processes both facial expressions and upper-body movements. Facial expression features are extracted using MARLIN, a self-supervised learning framework, while motion features can be derived from either Synchformer or RAFT optical flow. These visual inputs are then temporally aligned and projected into a lower-dimensional space before being fused with positional embeddings. A condition adaptor then integrates these learned visual embeddings into the MusicGen decoder, allowing the model to generate music that is not only high-quality but also deeply reflective of the visual input.
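The align, project, and fuse steps described above can be sketched in plain Python. This is a simplified illustration under several assumptions that the article does not confirm: face and motion features are concatenated per time step, alignment is done by linear interpolation to a target length, the projection is a single linear map, and positional embeddings are sinusoidal and added element-wise. All function names here are hypothetical; in the real system the features would come from MARLIN and Synchformer or RAFT, and the output would feed the condition adaptor.

```python
import math
from typing import List

Vector = List[float]


def resample(frames: List[Vector], target_len: int) -> List[Vector]:
    """Linearly interpolate a feature sequence to target_len time steps."""
    n = len(frames)
    if n == 0 or target_len < 1:
        raise ValueError("need at least one frame and target_len >= 1")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)
        left = int(math.floor(pos))
        right = min(left + 1, n - 1)
        w = pos - left
        out.append([(1 - w) * a + w * b for a, b in zip(frames[left], frames[right])])
    return out


def project(frames: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Apply a linear map; weights has out_dim rows of in_dim values each."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weights] for f in frames]


def positional_embedding(length: int, dim: int) -> List[Vector]:
    """Sinusoidal positional embeddings (an assumed choice)."""
    table = []
    for t in range(length):
        row = []
        for d in range(dim):
            angle = t / (10000 ** (2 * (d // 2) / dim))
            row.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def encode_visual(face: List[Vector], motion: List[Vector],
                  target_len: int, weights: List[Vector]) -> List[Vector]:
    """Align both streams, concatenate, project, and add positions."""
    face_r = resample(face, target_len)
    motion_r = resample(motion, target_len)
    joint = [f + m for f, m in zip(face_r, motion_r)]  # list concatenation
    projected = project(joint, weights)
    positions = positional_embedding(target_len, len(weights))
    return [[p + e for p, e in zip(pr, er)] for pr, er in zip(projected, positions)]


if __name__ == "__main__":
    face = [[1.0], [1.0]]            # toy 1-D face features, 2 frames
    motion = [[2.0], [2.0]]          # toy 1-D motion features, 2 frames
    weights = [[1.0, 1.0], [1.0, -1.0]]
    print(encode_visual(face, motion, target_len=2, weights=weights))
```

In this toy call the two streams already have the same length; in practice `target_len` would be set to the number of time steps the music decoder expects, so that each visual embedding lines up with one step of generated audio.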

A New Dataset for Expressive Music

Recognizing the scarcity of suitable paired video-audio data, the team curated a novel dataset consisting of 7 hours of synchronized video recordings. Volunteers were asked to record their facial expressions and upper-body movements while listening to 30-second instrumental audio clips across various genres. This unique dataset provides significant potential for future research in interactive and multimodal music generation.

Performance and Impact

Experiments demonstrate that EXPOTION significantly enhances the overall quality of generated music. It excels in musicality, creativity, beat-tempo consistency, temporal alignment with video, and text adherence. The model consistently outperforms both proposed baselines and existing state-of-the-art video-to-music generation models like VidMuse and Video2Music.

Objective evaluations showed that configurations incorporating motion information, especially those using Synchformer features with generated text prompts, yielded the best music quality. Subjective evaluations, involving participants with varying musical backgrounds, confirmed that EXPOTION’s models produced music perceived as more musical and creative than the baseline, highlighting the complementary strengths of visual and textual modalities.

Ablation studies further revealed that Synchformer motion features generally lead to more realistic and diverse music. Surprisingly, generic text prompts sometimes resulted in better tempo accuracy than detailed generated captions, suggesting that too much textual context might occasionally interfere with purely visual motion cues.

In conclusion, EXPOTION represents a significant step forward in controllable and expressive music generation. By using human body movements and facial expressions as control signals, it gives artists a more intuitive and interactive approach to music creation. More details are available in the EXPOTION research paper.
