Improving How AI Understands Spoken Language Through GRPO

TLDR: A new research paper introduces a Group Relative Policy Optimization (GRPO)-based method to train Speech-Aware Large Language Models (SALLMs) for open-format speech understanding tasks like Spoken Question Answering (SQA) and Automatic Speech Translation (AST). By using BLEU as a reward signal, this approach empirically outperforms standard supervised fine-tuning (SFT) across multiple metrics and scales to larger models. The research also explores mixed-policy GRPO and identifies BLEU as the most effective reward function for these tasks.

In the rapidly evolving field of artificial intelligence, Speech-Aware Large Language Models (SALLMs) are becoming increasingly vital for understanding spoken language. These models are designed to take both speech and text as input and generate text outputs, making them highly effective for tasks such as Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), and Spoken Question Answering (SQA).

A recent research paper from IBM Research introduces a novel approach to significantly enhance the capabilities of SALLMs, particularly in handling open-ended speech understanding tasks. The method, based on Group Relative Policy Optimization (GRPO), uses reinforcement learning to train these models more effectively than traditional supervised approaches.

Understanding the GRPO Approach

Reinforcement Learning (RL) has proven to be a powerful tool for improving the reasoning abilities of various AI models. Inspired by these advancements, the researchers aimed to apply RL to SALLMs. While previous attempts often relied on binary rewards or unsupervised methods, which sometimes yielded suboptimal results, this new work proposes reinforcement learning with verifiable rewards (RLVR).

The core of their method is the GRPO algorithm, an on-policy reinforcement learning algorithm. Unlike methods such as PPO, which train a separate value (critic) model to estimate advantages, GRPO learns from its own generated data: it samples multiple responses from the model for a given prompt, calculates a reward for each, and normalizes those rewards within the group to optimize the model’s policy, increasing the likelihood of generating high-reward responses.
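
To make the group-relative idea concrete, here is a minimal sketch of the advantage computation at the heart of GRPO, written in PyTorch. The function name and the specific reward values are illustrative assumptions, not details from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response.
    Each response is scored against its siblings: rewards are
    normalized by the group mean and standard deviation, so
    above-average responses get positive advantages (reinforced)
    and below-average ones get negative advantages (suppressed).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative example: four responses sampled for one spoken prompt,
# each already scored with a scalar reward (e.g., BLEU in [0, 1]).
rewards = torch.tensor([0.42, 0.10, 0.55, 0.31])
print(grpo_advantages(rewards))
```

In full training, these per-response advantages then weight the policy-gradient update applied to the tokens of each corresponding response.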

BLEU as a Reward Signal

A key innovation in this research is the use of the BLEU metric as a reward signal. BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of machine-translated text by comparing it against high-quality reference translations. The researchers found BLEU well suited to generative, open-ended tasks where multiple valid answers may exist, such as SQA and AST. Scoring each generated response against a ground-truth reference with BLEU gives the model a graded training signal, teaching it to produce more accurate and relevant responses.
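
As a rough illustration of how such a reward could be computed, the sketch below scores a single hypothesis against its reference using the sacrebleu library; the helper name and the 0-1 scaling are our own assumptions rather than the paper’s exact setup.

```python
from sacrebleu.metrics import BLEU

# effective_order=True is recommended for sentence-level BLEU, since
# short hypotheses may have no higher-order n-gram matches at all.
bleu = BLEU(effective_order=True)

def bleu_reward(hypothesis: str, reference: str) -> float:
    """Score one generated response against its ground-truth reference.

    sacrebleu reports BLEU on a 0-100 scale; dividing by 100 yields a
    reward in [0, 1], a convenient range for policy optimization.
    """
    return bleu.sentence_score(hypothesis, [reference]).score / 100.0

# Example: reward for a candidate German translation.
print(bleu_reward("Das ist ein Test.", "Das ist ein Test."))  # 1.0
```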

Empirical Success in Open-Ended Tasks

The team rigorously evaluated their GRPO-based approach on two critical speech understanding tasks: Spoken Question Answering (SQA) using the LibriSQA dataset and Automatic Speech Translation (AST) from English to German using the CoVoST2 dataset. They compared their method against standard Supervised Fine-Tuning (SFT) and baseline models.

The results were compelling. For SQA, the GRPO approach significantly outperformed both the baseline and SFT models, achieving substantial improvements in metrics like BLEU, BERTScore, ROUGE, and METEOR. For instance, the Granite Speech 2B model showed a 61.8% BLEU improvement over the base model and 9.8% over SFT. Similarly, for AST, GRPO delivered stronger results, with the Granite Speech 2B model showing an 8.2% BLEU improvement over the base and 3.2% over SFT. Notably, for the larger Granite Speech 8B model, GRPO continued to provide gains even where SFT showed degraded performance.

Exploring Mixed-Policy GRPO

The researchers also delved into Mixed-Policy GRPO (MP-GRPO), an extension that incorporates both on-policy (model-generated) and off-policy (pre-existing, high-quality) samples into the training group. In their experiments, they explored adding the ground-truth reference as an off-policy sample. While MP-GRPO showed promise for AST, further improving BLEU scores, its performance degraded for SQA. This difference might be attributed to the base model’s prior training, suggesting that the effectiveness of off-policy samples can vary depending on the task and the model’s initial knowledge.
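
A minimal sketch of how such a mixed group could be assembled is shown below, again using sacrebleu for the reward. The function name and example strings are hypothetical, and the paper may construct its training groups differently.

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)

def build_mixed_group(on_policy_samples: list[str], reference: str) -> list[str]:
    """Append the ground-truth reference as an off-policy group member."""
    return on_policy_samples + [reference]

# Example for the AST setting: two model samples plus the reference.
reference = "Das ist ein Test."
group = build_mixed_group(["Das ist ein Test.", "Dies ist Test."], reference)

# The reference is scored like any other response; its (typically high)
# BLEU reward raises the group baseline used for advantage normalization.
rewards = [bleu.sentence_score(r, [reference]).score / 100.0 for r in group]
```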

Conclusion and Future Directions

This work demonstrates a highly effective and relatively simple method for training Speech-Aware Large Language Models using GRPO with BLEU as a reward function. The approach significantly improves SALLMs’ performance on open-ended SQA and AST tasks, outperforming standard supervised fine-tuning and showing scalability to larger models. The researchers hope this work will inspire further exploration into on-policy, off-policy, and mixed-policy algorithms for various speech understanding challenges. You can read the full research paper here: Advancing Speech Understanding in Speech-Aware Language Models with GRPO.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India’s Generative AI scene, from policy updates to academic breakthroughs. She’s particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
