TLDR: This research introduces a novel multi-stage reinforcement learning framework to significantly improve speech summarization capabilities in Multi-modal Large Language Models (MLLMs). The framework involves Supervised Finetuning (SFT) on large synthetic datasets, On-policy Knowledge Distillation (KD) from powerful text-based LLMs to bridge the modality gap, and Direct Preference Optimization (DPO) to reduce hallucinations and align with human preferences. The resulting model achieves substantial performance gains, outperforming larger MLLMs like GPT-4o-audio and narrowing the gap with state-of-the-art text-based LLMs, even demonstrating strong cross-lingual generalization despite English-only training.
Speech summarization, the task of generating concise, coherent text summaries directly from spoken input, is becoming increasingly vital in a world dominated by audio and audiovisual content. Imagine quickly grasping the key points of a long meeting, lecture, or podcast without listening to the entire recording. This capability significantly boosts accessibility, productivity, and information retrieval.
Traditionally, speech summarization relied on a two-step approach: first, converting speech to text using Automatic Speech Recognition (ASR), and then summarizing the text. However, this method often introduces errors from the ASR stage and struggles to capture important nuances like speaker emphasis or tone. More recently, end-to-end methods have emerged, aiming to generate summaries directly from speech, but these often lack strong instruction-following abilities and generalization.
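To make the cascade's weakness concrete, here is a minimal sketch of that two-step baseline built from off-the-shelf components. The specific models (openai-whisper and a BART summarizer) are illustrative choices for this sketch, not systems evaluated in the paper:

```python
# Minimal cascade baseline: transcribe with an off-the-shelf ASR model,
# then summarize the transcript with a text-only summarization model.
import whisper
from transformers import pipeline

asr = whisper.load_model("base")  # speech -> text
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def cascade_summarize(audio_path: str) -> str:
    # Any recognition error enters the pipeline here...
    transcript = asr.transcribe(audio_path)["text"]
    # ...and propagates into the summary; prosody and emphasis are already lost.
    return summarizer(transcript, max_length=128, min_length=32)[0]["summary_text"]

print(cascade_summarize("meeting.wav"))
```

Every word the ASR stage gets wrong is frozen into the transcript before the summarizer ever sees it, and acoustic cues such as emphasis or tone are discarded entirely; long recordings also need chunking before the text model can process them.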
The rise of Multi-modal Large Language Models (MLLMs) offers a promising path forward. These models extend the power of traditional Large Language Models (LLMs) to handle various input types, including audio. This is particularly beneficial for speech summarization, as acoustic signals carry not just words, but also paralinguistic information like emotion and prosody, which can lead to more accurate and contextually rich summaries. While commercial MLLMs like GPT-4o-Audio show great potential, their closed-source nature and large size limit widespread deployment. Open-source MLLMs, despite their advancements, still face a significant performance gap compared to their text-based counterparts, a challenge referred to as the “modality gap.”
A new research paper, authored by Shaoshi Ling, Gang Liu, Guoli Ye, and Jinyu Li from Microsoft CoreAI, USA, introduces a novel approach to tackle these limitations. Their work, titled “Advancing Speech Summarization in Multi-modal LLMs with Reinforcement Learning”, presents a multi-stage reinforcement learning (RL) training framework designed to significantly enhance speech summarization capabilities in MLLMs. You can read the full paper here.
A Three-Stage Training Framework
The researchers propose a three-stage training process to improve MLLMs for speech summarization:
1. Supervised Finetuning (SFT) on Synthetic Data: The first stage builds a strong foundation for instruction following and summary generation. The researchers created a large, diverse synthetic dataset tailored specifically for summarization. Significantly larger and richer than previous efforts, it teaches the model to understand and respond to a wide range of summarization requests, from short abstracts to structured outputs like bullet points or email-style narratives (a minimal SFT training step is sketched after this list).
2. On-policy Knowledge Distillation (KD): This stage is crucial for bridging the performance gap between audio-conditioned MLLMs and powerful text-based LLMs. Knowledge distillation typically transfers knowledge from a large “teacher” model to a smaller “student” model, but direct distillation is problematic here because text and audio models generate output differently. The researchers therefore adopted an “on-policy” distillation strategy: the student MLLM generates its own summaries from audio input, and the text-based teacher (such as GPT-4o in text-only mode) scores those generated sequences, providing the supervision signal the student trains against (see the distillation sketch after this list). Because the student learns from its own mistakes, training is more stable, generalization improves, and the teacher's linguistic competence transfers effectively to the audio domain.
3. Direct Preference Optimization (DPO): While SFT and KD greatly improve summarization, they can still leave undesirable behaviors such as repetitive phrasing or factual inaccuracies (hallucinations). The final stage addresses these with Direct Preference Optimization, which learns from pairwise preference data: GPT-4.1 compares two summaries generated by the student model and picks the better one (the corresponding loss is sketched after this list). Learning from these human-like quality judgments helps the model reduce hallucinations and improves the overall consistency and reliability of its summaries.
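For readers who want to map the three stages onto code, the sketches below use PyTorch with a hypothetical audio-conditioned model interface; the method names, batch fields, and hyperparameters are assumptions for illustration, not the paper's implementation. The SFT stage reduces to standard next-token cross-entropy on (audio, instruction, summary) triples from the synthetic dataset:

```python
import torch
import torch.nn.functional as F

def sft_step(mllm, batch, optimizer):
    """One SFT update. `mllm` is a hypothetical audio-conditioned model whose
    forward pass returns logits aligned with the summary token positions;
    the batch fields are illustrative, not the paper's data schema."""
    logits = mllm(audio=batch["audio"],            # raw speech input
                  prompt_ids=batch["prompt_ids"],  # e.g. "Summarize as bullets"
                  decoder_ids=batch["summary_ids"]).logits
    # Shifted next-token cross-entropy over the reference summary only;
    # padded positions are excluded via the ignore index.
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2),
                           batch["summary_ids"][:, 1:],
                           ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```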
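The on-policy KD stage differs from vanilla distillation in one key way: the training targets are computed on sequences the student itself generated from audio, while the frozen text-only teacher scores those same tokens from the reference transcript. A sketch, again with hypothetical model interfaces (the reverse-KL divergence here is a common choice for on-policy distillation, not a detail confirmed by the paper):

```python
import torch
import torch.nn.functional as F

def on_policy_kd_step(student, teacher, batch, optimizer):
    # 1) The student rolls out its own summary directly from the audio.
    with torch.no_grad():
        rollout = student.generate(audio=batch["audio"],
                                   prompt_ids=batch["prompt_ids"])
    # 2) The frozen text-only teacher scores the *same* tokens, conditioned
    #    on the reference transcript rather than the audio.
    with torch.no_grad():
        t_logp = F.log_softmax(
            teacher(input_ids=batch["transcript_ids"],
                    decoder_ids=rollout).logits, dim=-1)
    # 3) The student re-scores its own rollout, this time with gradients.
    s_logp = F.log_softmax(
        student(audio=batch["audio"],
                prompt_ids=batch["prompt_ids"],
                decoder_ids=rollout).logits, dim=-1)
    # 4) KL(student || teacher) on student-generated tokens: because the
    #    sequences come from the student's own distribution, it learns to
    #    correct the errors it actually makes at inference time.
    loss = F.kl_div(t_logp, s_logp, log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```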
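Finally, the DPO stage needs only per-sequence log-probabilities of the preferred and dispreferred summaries under the current policy and a frozen reference checkpoint. The loss below is the standard DPO objective (Rafailov et al., 2023); beta=0.1 is a common default, not the paper's reported setting:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective on summed per-sequence log-probs. `chosen` is
    the student summary the GPT-4.1 judge preferred, `rejected` the one it
    did not; `ref_*` come from the frozen pre-DPO checkpoint."""
    chosen_margin = pi_chosen_logp - ref_chosen_logp
    rejected_margin = pi_rejected_logp - ref_rejected_logp
    # Push the implicit reward of the preferred summary above the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Because the reference model anchors both margins, the policy is rewarded for preferring the judged-better summary without drifting far from its post-KD behavior, which is what curbs repetition and hallucination without degrading fluency.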
Impressive Results and Generalization
The results of this multi-stage framework are compelling. The final model achieves a substantial 28% relative improvement over strong baselines. Remarkably, it outperforms much larger state-of-the-art MLLMs, including GPT-4o-audio, and significantly narrows the performance gap with leading text-based LLMs. An interesting finding is the model’s strong generalization across different languages, even though it was trained exclusively on English data. On the multilingual Floras benchmark, the model maintained performance close to GPT-4o-Audio, demonstrating robust zero-shot cross-lingual transfer.
This research highlights that with a carefully designed training approach—combining high-quality synthetic data, effective knowledge transfer, and preference alignment—smaller, open-source models can achieve performance comparable to, or even surpass, much larger commercial systems in speech summarization. Future work aims to incorporate speaker-aware summarization and leverage time-aligned information for even greater temporal coherence.