TLDR: This research introduces a novel multi-stage reinforcement learning framework to significantly improve speech summarization capabilities in Multi-modal Large Language Models (MLLMs). The framework involves Supervised Finetuning (SFT) on large synthetic datasets, On-policy Knowledge Distillation (KD) from powerful text-based LLMs to bridge the modality gap, and Direct Preference Optimization (DPO) to reduce hallucinations and align with human preferences. The resulting model achieves substantial performance gains, outperforming larger MLLMs like GPT-4o-audio and narrowing the gap with state-of-the-art text-based LLMs, even demonstrating strong cross-lingual generalization despite English-only training.
Speech summarization, the task of generating concise, coherent text summaries directly from spoken input, is becoming increasingly vital in a world dominated by audio and audiovisual content. Imagine quickly grasping the key points of a long meeting, lecture, or podcast without listening to the entire recording. This capability significantly boosts accessibility, productivity, and information retrieval.
Traditionally, speech summarization relied on a two-step approach: first, converting speech to text using Automatic Speech Recognition (ASR), and then summarizing the text. However, this method often introduces errors from the ASR stage and struggles to capture important nuances like speaker emphasis or tone. More recently, end-to-end methods have emerged, aiming to generate summaries directly from speech, but these often lack strong instruction-following abilities and generalization.
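To make the cascade's weakness concrete, here is a minimal sketch of that two-step baseline built from off-the-shelf components. The specific models (openai-whisper and a BART summarizer) are illustrative choices for this sketch, not systems evaluated in the paper:

```python
# Minimal cascade baseline: transcribe with an off-the-shelf ASR model,
# then summarize the transcript with a text-only summarization model.
import whisper
from transformers import pipeline

asr = whisper.load_model("base")  # speech -> text
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def cascade_summarize(audio_path: str) -> str:
    # Any recognition error enters the pipeline here...
    transcript = asr.transcribe(audio_path)["text"]
    # ...and propagates into the summary; prosody and emphasis are already lost.
    return summarizer(transcript, max_length=128, min_length=32)[0]["summary_text"]

print(cascade_summarize("meeting.wav"))
```

Every word the ASR stage gets wrong is frozen into the transcript before the summarizer ever sees it, and acoustic cues such as emphasis or tone are discarded entirely; long recordings also need chunking before the text model can process them.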
The rise of Multi-modal Large Language Models (MLLMs) offers a promising path forward. These models extend the power of traditional Large Language Models (LLMs) to handle various input types, including audio. This is particularly beneficial for speech summarization, as acoustic signals carry not just words, but also paralinguistic information like emotion and prosody, which can lead to more accurate and contextually rich summaries. While commercial MLLMs like GPT-4o-Audio show great potential, their closed-source nature and large size limit widespread deployment. Open-source MLLMs, despite their advancements, still face a significant performance gap compared to their text-based counterparts, a challenge referred to as the “modality gap.”
A new research paper, authored by Shaoshi Ling, Gang Liu, Guoli Ye, and Jinyu Li from Microsoft CoreAI, USA, introduces a novel approach to tackle these limitations. Their work, titled “Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning”, presents a multi-stage reinforcement learning (RL) training framework designed to significantly enhance speech summarization capabilities in MLLMs. You can read the full paper here.
A Three-Stage Training Framework
The researchers propose a three-stage training process to improve MLLMs for speech summarization:
1. Supervised Finetuning (SFT) on Synthetic Data: The first stage builds a strong foundation for instruction following and summary generation. The researchers created a large, diverse synthetic dataset tailored specifically for summarization. Significantly larger and richer than previous efforts, it teaches the model to understand and respond to a wide range of summarization requests, from short abstracts to structured outputs like bullet points or email-style narratives (a minimal SFT training step is sketched after this list).
2. On-policy Knowledge Distillation (KD): This stage is crucial for bridging the performance gap between audio-conditioned MLLMs and powerful text-based LLMs. Knowledge distillation typically transfers knowledge from a large “teacher” model to a smaller “student” model, but direct distillation is problematic here because text and audio models generate output differently. The researchers therefore adopted an “on-policy” distillation strategy: the student MLLM generates its own summaries from audio input, and the text-based teacher (such as GPT-4o in text-only mode) scores those generated sequences, providing the supervision signal the student trains against (see the distillation sketch after this list). Because the student learns from its own mistakes, training is more stable, generalization improves, and the teacher's linguistic competence transfers effectively to the audio domain.
3. Direct Preference Optimization (DPO): While SFT and KD greatly improve summarization, they can still leave undesirable behaviors such as repetitive phrasing or factual inaccuracies (hallucinations). The final stage addresses these with Direct Preference Optimization, which learns from pairwise preference data: GPT-4.1 compares two summaries generated by the student model and picks the better one (the corresponding loss is sketched after this list). Learning from these human-like quality judgments helps the model reduce hallucinations and improves the overall consistency and reliability of its summaries.
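For readers who want to map the three stages onto code, the sketches below use PyTorch with a hypothetical audio-conditioned model interface; the method names, batch fields, and hyperparameters are assumptions for illustration, not the paper's implementation. The SFT stage reduces to standard next-token cross-entropy on (audio, instruction, summary) triples from the synthetic dataset:

```python
import torch
import torch.nn.functional as F

def sft_step(mllm, batch, optimizer):
    """One SFT update. `mllm` is a hypothetical audio-conditioned model whose
    forward pass returns logits aligned with the summary token positions;
    the batch fields are illustrative, not the paper's data schema."""
    logits = mllm(audio=batch["audio"],            # raw speech input
                  prompt_ids=batch["prompt_ids"],  # e.g. "Summarize as bullets"
                  decoder_ids=batch["summary_ids"]).logits
    # Shifted next-token cross-entropy over the reference summary only;
    # padded positions are excluded via the ignore index.
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2),
                           batch["summary_ids"][:, 1:],
                           ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```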
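The on-policy KD stage differs from vanilla distillation in one key way: the training targets are computed on sequences the student itself generated from audio, while the frozen text-only teacher scores those same tokens from the reference transcript. A sketch, again with hypothetical model interfaces (the reverse-KL divergence here is a common choice for on-policy distillation, not a detail confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def on_policy_kd_step(student, teacher, batch, optimizer):
    # 1) The student rolls out its own summary directly from the audio.
    with torch.no_grad():
        rollout = student.generate(audio=batch["audio"],
                                   prompt_ids=batch["prompt_ids"])
    # 2) The frozen text-only teacher scores the *same* tokens, conditioned
    #    on the reference transcript rather than the audio.
    with torch.no_grad():
        t_logp = F.log_softmax(
            teacher(input_ids=batch["transcript_ids"],
                    decoder_ids=rollout).logits, dim=-1)
    # 3) The student re-scores its own rollout, this time with gradients.
    s_logp = F.log_softmax(
        student(audio=batch["audio"],
                prompt_ids=batch["prompt_ids"],
                decoder_ids=rollout).logits, dim=-1)
    # 4) KL(student || teacher) on student-generated tokens: because the
    #    sequences come from the student's own distribution, it learns to
    #    correct the errors it actually makes at inference time.
    loss = F.kl_div(t_logp, s_logp, log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```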
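Finally, the DPO stage needs only per-sequence log-probabilities of the preferred and dispreferred summaries under the current policy and a frozen reference checkpoint. The loss below is the standard DPO objective (Rafailov et al., 2023); beta=0.1 is a common default, not the paper's reported setting:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed per-sequence log-probs. `chosen` is
    the student summary the GPT-4.1 judge preferred, `rejected` the one it
    did not; `ref_*` come from the frozen pre-DPO checkpoint."""
    chosen_margin = pi_chosen_logp - ref_chosen_logp
    rejected_margin = pi_rejected_logp - ref_rejected_logp
    # Push the implicit reward of the preferred summary above the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Because the reference model anchors both margins, the policy is rewarded for preferring the judged-better summary without drifting far from its post-KD behavior, which is what curbs repetition and hallucination without degrading fluency.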
Impressive Results and Generalization
The results of this multi-stage framework are compelling. The final model achieves a substantial 28% relative improvement over strong baselines. Remarkably, it outperforms much larger state-of-the-art MLLMs, including GPT-4o-audio, and significantly narrows the performance gap with leading text-based LLMs. An interesting finding is the model’s strong generalization across different languages, even though it was trained exclusively on English data. On the multilingual Floras benchmark, the model maintained performance close to GPT-4o-Audio, demonstrating robust zero-shot cross-lingual transfer.
This research highlights that with a carefully designed training approach—combining high-quality synthetic data, effective knowledge transfer, and preference alignment—smaller, open-source models can achieve performance comparable to, or even surpass, much larger commercial systems in speech summarization. Future work aims to incorporate speaker-aware summarization and leverage time-aligned information for even greater temporal coherence.