TLDR: Emo-FiLM is a new framework for emotional text-to-speech that enables fine-grained, word-level control over emotions, moving beyond traditional sentence-level approaches. It uses emotion2vec to annotate word-level emotions and a Feature-wise Linear Modulation (FiLM) layer to dynamically inject these emotions into speech synthesis. Experiments show Emo-FiLM significantly improves emotion similarity and dynamic matching, making synthesized speech more natural and expressive.
Emotional text-to-speech (E-TTS) systems are becoming increasingly important for creating more natural and trustworthy interactions between humans and computers. These systems are used in various applications, from voice assistants to virtual characters, aiming to make digital voices more engaging and immersive.
Traditionally, most E-TTS methods control emotion at a sentence level. This means an entire sentence is spoken with a single, overarching emotion, such as “happy,” “sad,” or “angry.” While effective for expressing a global mood, these approaches struggle to capture the subtle, dynamic shifts in emotion that often occur within a single sentence in natural human speech. For instance, a sentence might start with a surprised tone and then transition to joy, a complexity that global emotion controls cannot easily convey.
Introducing Emo-FiLM: Fine-Grained Emotion Control
To overcome this limitation, researchers have introduced Emo-FiLM, a novel framework designed for fine-grained, word-level emotional speech synthesis. Emo-FiLM moves beyond global emotion control by enabling dynamic modulation of emotion at the individual word level, leading to more expressive and natural-sounding speech.
The Emo-FiLM framework operates in two main stages: Fine-grained Emotion Annotation and Emotion-modulated Generation.
How Emo-FiLM Works
First, for Fine-grained Emotion Annotation, the system uses the emotion2vec model to extract emotion features from speech at the frame level. These frame-level features are then aligned with the individual words in the text. This alignment lets Emo-FiLM produce word-level emotion annotations that include both discrete emotion categories (such as happy or sad) and continuous intensity levels (how strong the emotion is). To support the evaluation of this fine-grained control, the researchers also constructed a new dataset, the Fine-grained Emotion Dynamics Dataset (FEDD), which contains detailed annotations of emotional transitions.
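To make the frame-to-word step concrete, here is a minimal sketch of one way to turn frame-level features into word-level ones. It is an illustration under stated assumptions, not the paper's implementation: it assumes word time spans are already available (for example from a forced aligner), and it uses simple mean pooling, which the article does not specify. The function name `pool_frames_to_words`, the frame rate, and the toy feature values are all hypothetical.

```python
from typing import List, Tuple


def pool_frames_to_words(
    frame_feats: List[List[float]],
    word_spans: List[Tuple[str, float, float]],
    frame_rate_hz: float,
) -> List[Tuple[str, List[float]]]:
    """Average frame-level feature vectors over each word's time span.

    Assumption: mean pooling stands in for whatever alignment and
    aggregation the real system uses. frame_feats must be non-empty.
    """
    pooled = []
    n_frames = len(frame_feats)
    for word, start_s, end_s in word_spans:
        # Convert the word's start and end times to frame indices,
        # clamping so that every word covers at least one frame.
        first = min(int(start_s * frame_rate_hz), n_frames - 1)
        last = min(max(int(end_s * frame_rate_hz), first + 1), n_frames)
        segment = frame_feats[first:last]
        dim = len(segment[0])
        mean = [sum(f[d] for f in segment) / len(segment) for d in range(dim)]
        pooled.append((word, mean))
    return pooled


# Toy input: 4 frames of 2-dimensional features at a made-up 4 frames/second.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spans = [("wow", 0.0, 0.5), ("great", 0.5, 1.0)]
words = pool_frames_to_words(frames, spans, frame_rate_hz=4.0)
print(words)  # [('wow', [1.0, 0.0]), ('great', [0.0, 1.0])]
```

In this toy case each word ends up with its own feature vector, which is the kind of per-word signal that a downstream classifier or intensity estimator could then label.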
Second, for Emotion-modulated Generation, Emo-FiLM integrates an Emotion Feature-wise Linear Modulation (E-FiLM) module into a pre-trained Large Language Model (LLM)-based Text-to-Speech framework. This module takes the word-level emotion signals and transforms them into scaling and shifting parameters. These parameters then modulate the text embeddings, injecting the desired emotional nuances into the speech generation process at the word level. Because the modulation is applied per word rather than once per sentence, prosody and emotion can change within a sentence, which global control methods cannot achieve.
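The general FiLM idea described above can be sketched in a few lines: an emotion vector is mapped to a per-dimension scale and shift, and the text embedding is transformed as scale * embedding + shift. The sketch below is a generic FiLM layer in plain Python, not the E-FiLM module from the paper. The helper names (`linear`, `film`), the two-dimensional sizes, and the hand-picked weights are assumptions for illustration; a real model would learn these weights and operate on much larger tensors.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain affine map: weight @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(text_emb: Vector, emo: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise linear modulation of one word's text embedding.

    gamma (scale) and beta (shift) are both predicted from the word's
    emotion vector, then applied element-wise to the embedding.
    """
    gamma = linear(emo, w_gamma, b_gamma)
    beta = linear(emo, w_beta, b_beta)
    return [g * h + b for g, h, b in zip(gamma, text_emb, beta)]


# Hypothetical, hand-picked weights. The gamma bias of 1.0 means a zero
# emotion vector leaves the embedding unchanged.
W_GAMMA = [[0.5, 0.0], [0.0, 0.5]]
B_GAMMA = [1.0, 1.0]
W_BETA = [[0.0, 0.0], [1.0, 0.0]]
B_BETA = [0.0, 0.0]

embedding = [1.0, 2.0]
modulated = film(embedding, [1.0, 0.0], W_GAMMA, B_GAMMA, W_BETA, B_BETA)
unchanged = film(embedding, [0.0, 0.0], W_GAMMA, B_GAMMA, W_BETA, B_BETA)
print(modulated)  # [1.5, 3.0]
print(unchanged)  # [1.0, 2.0]
```

Applying `film` once per word, each with its own emotion vector, is what allows the modulation to vary across a sentence instead of being fixed for the whole utterance.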
Key Advantages and Performance
Experiments on both global emotion synthesis (using the ESD dataset) and fine-grained dynamic emotion synthesis (using the new FEDD dataset) showed that Emo-FiLM outperforms existing approaches. On the global task, Emo-FiLM achieved higher emotion similarity while maintaining high intelligibility. More importantly, on the dynamic task, it achieved substantial improvements in emotion dynamic matching and received higher subjective ratings for both emotion similarity and naturalness. These results indicate that Emo-FiLM can capture and generate complex emotional transitions within speech.
An ablation study further validated the importance of each component within Emo-FiLM, showing that fine-grained word-level data, the auxiliary emotion loss, and the FiLM layer itself are all critical for its superior performance. Visualizations of speech characteristics, such as pitch contours, also showed that Emo-FiLM generates F0 contours that closely match ground truth, reproducing both overall prosody and subtle local fluctuations corresponding to emotional shifts.
In conclusion, Emo-FiLM represents a significant step forward in emotional speech synthesis by enabling precise, word-level control over emotional expression. This advancement promises to make human-computer interactions more natural, expressive, and trustworthy. You can read the full research paper for more details: Beyond Global Emotion: Fine-Grained Emotional Speech Synthesis with Dynamic Word-Level Modulation.


