Improving How AI Understands Spoken Language Through GRPO

TLDR: A new research paper introduces a Group Relative Policy Optimization (GRPO)-based method to train Speech-Aware Large Language Models (SALLMs) for open-format speech understanding tasks like Spoken Question Answering (SQA) and Automatic Speech Translation (AST). By using BLEU as a reward signal, this approach empirically outperforms standard supervised fine-tuning (SFT) across multiple metrics and scales to larger models. The research also explores mixed-policy GRPO and identifies BLEU as the most effective reward function for these tasks.

In the rapidly evolving field of artificial intelligence, Speech-Aware Large Language Models (SALLMs) are becoming increasingly vital for understanding spoken language. These models are designed to take both speech and text as input and generate text outputs, making them highly effective for tasks such as Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), and Spoken Question Answering (SQA).

A recent research paper from IBM Research introduces a novel approach to significantly enhance the capabilities of SALLMs, particularly in handling open-ended speech understanding tasks. The method, based on Group Relative Policy Optimization (GRPO), uses reinforcement learning to train these models more effectively than traditional supervised approaches.

Understanding the GRPO Approach

Reinforcement Learning (RL) has proven to be a powerful tool for improving the reasoning abilities of various AI models. Inspired by these advancements, the researchers aimed to apply RL to SALLMs. While previous attempts often relied on binary rewards or unsupervised methods, which sometimes yielded suboptimal results, this new work proposes reinforcement learning with verifiable rewards (RLVR).

The core of their method is the GRPO algorithm, an on-policy reinforcement learning algorithm. Unlike methods such as PPO, which train a separate value (critic) model to estimate advantages, GRPO learns from its own generated data: it samples multiple responses from the model for a given prompt, calculates a reward for each, and normalizes those rewards within the group to optimize the model’s policy, increasing the likelihood of generating high-reward responses.
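
To make the group-relative idea concrete, here is a minimal sketch of the advantage computation at the heart of GRPO, written in PyTorch. The function name and the specific reward values are illustrative assumptions, not details from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Compute group-relative advantages for one prompt.

    rewards: shape (G,), one scalar reward per sampled response.
    Each response is scored against its siblings: rewards are
    normalized by the group mean and standard deviation, so
    above-average responses get positive advantages (reinforced)
    and below-average ones get negative advantages (suppressed).
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Illustrative example: four responses sampled for one spoken prompt,
# each already scored with a scalar reward (e.g., BLEU in [0, 1]).
rewards = torch.tensor([0.42, 0.10, 0.55, 0.31])
print(grpo_advantages(rewards))
```

In full training, these per-response advantages then weight the policy-gradient update applied to the tokens of each corresponding response.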

BLEU as a Reward Signal

A key innovation in this research is the use of the BLEU metric as a reward signal. BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of machine-translated text by comparing it against high-quality reference translations. The researchers found BLEU well suited to generative, open-ended tasks where multiple valid answers may exist, such as SQA and AST. Scoring each generated response against a ground-truth reference with BLEU gives the model a graded training signal, teaching it to produce more accurate and relevant responses.
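
As a rough illustration of how such a reward could be computed, the sketch below scores a single hypothesis against its reference using the sacrebleu library; the helper name and the 0-1 scaling are our own assumptions rather than the paper’s exact setup.

```python
from sacrebleu.metrics import BLEU

# effective_order=True is recommended for sentence-level BLEU, since
# short hypotheses may have no higher-order n-gram matches at all.
bleu = BLEU(effective_order=True)

def bleu_reward(hypothesis: str, reference: str) -> float:
    """Score one generated response against its ground-truth reference.

    sacrebleu reports BLEU on a 0-100 scale; dividing by 100 yields a
    reward in [0, 1], a convenient range for policy optimization.
    """
    return bleu.sentence_score(hypothesis, [reference]).score / 100.0

# Example: reward for a candidate German translation.
print(bleu_reward("Das ist ein Test.", "Das ist ein Test."))  # 1.0
```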

Empirical Success in Open-Ended Tasks

The team rigorously evaluated their GRPO-based approach on two critical speech understanding tasks: Spoken Question Answering (SQA) using the LibriSQA dataset and Automatic Speech Translation (AST) from English to German using the CoVoST2 dataset. They compared their method against standard Supervised Fine-Tuning (SFT) and baseline models.

The results were compelling. For SQA, the GRPO approach significantly outperformed both the baseline and SFT models, achieving substantial improvements in metrics like BLEU, BERTScore, ROUGE, and METEOR. For instance, the Granite Speech 2B model showed a 61.8% BLEU improvement over the base model and 9.8% over SFT. Similarly, for AST, GRPO delivered stronger results, with the Granite Speech 2B model showing an 8.2% BLEU improvement over the base and 3.2% over SFT. Notably, for the larger Granite Speech 8B model, GRPO continued to provide gains even where SFT showed degraded performance.

Exploring Mixed-Policy GRPO

The researchers also delved into Mixed-Policy GRPO (MP-GRPO), an extension that incorporates both on-policy (model-generated) and off-policy (pre-existing, high-quality) samples into the training group. In their experiments, they explored adding the ground-truth reference as an off-policy sample. While MP-GRPO showed promise for AST, further improving BLEU scores, its performance degraded for SQA. This difference might be attributed to the base model’s prior training, suggesting that the effectiveness of off-policy samples can vary depending on the task and the model’s initial knowledge.
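
A minimal sketch of how such a mixed group could be assembled is shown below, again using sacrebleu for the reward. The function name and example strings are hypothetical, and the paper may construct its training groups differently.

```python
from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)

def build_mixed_group(on_policy_samples: list[str], reference: str) -> list[str]:
    """Append the ground-truth reference as an off-policy group member."""
    return on_policy_samples + [reference]

# Example for the AST setting: two model samples plus the reference.
reference = "Das ist ein Test."
group = build_mixed_group(["Das ist ein Test.", "Dies ist Test."], reference)

# The reference is scored like any other response; its (typically high)
# BLEU reward raises the group baseline used for advantage normalization.
rewards = [bleu.sentence_score(r, [reference]).score / 100.0 for r in group]
```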

Conclusion and Future Directions

This work demonstrates a highly effective and relatively simple method for training Speech-Aware Large Language Models using GRPO with BLEU as a reward function. The approach significantly improves SALLMs’ performance on open-ended SQA and AST tasks, outperforming standard supervised fine-tuning and showing scalability to larger models. The researchers hope this work will inspire further exploration into on-policy, off-policy, and mixed-policy algorithms for various speech understanding challenges. You can read the full research paper here: Advancing Speech Understanding in Speech-Aware Language Models with GRPO.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India’s Generative AI scene, from policy updates to academic breakthroughs. She’s particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach out to her at: [email protected]
