TLDR: Omni-CLST is a new framework for Audio Question Answering (AQA) that uses an error-aware curriculum to organize training data by difficulty and a guided selective Chain-of-Thought (CoT) mechanism to focus reasoning on challenging cases. This approach, integrated with GRPO training, allows models to learn more efficiently from existing high-quality datasets. It achieves competitive accuracy on MMAU-mini and sets a new state-of-the-art on MMAR, demonstrating robust multimodal audio-language understanding without needing new, costly datasets.
In the rapidly evolving field of artificial intelligence, Large Audio-Language Models (LALMs) are pushing the boundaries of how machines understand and interact with audio. A particularly challenging task within this domain is Audio Question Answering (AQA), where models must accurately answer natural language questions based on audio inputs, often by selecting from given options.
Recent work has tried to improve AQA performance by integrating deeper reasoning, often through Chain-of-Thought (CoT) processes. Approaches like Audio-CoT and Audio-Reasoner have shown promise, but they face three recurring hurdles. Creating new datasets is costly and time-consuming. Reinforcement learning algorithms like GRPO (Group Relative Policy Optimization) are computationally intensive and converge slowly when applied to full datasets. And it is hard to use CoT annotations effectively without introducing redundant reasoning steps.
Addressing these critical challenges, researchers have introduced Omni-CLST, an innovative framework designed to enhance Audio Question Answering. Omni-CLST leverages an error-aware curriculum learning approach combined with a guided selective Chain-of-Thought mechanism. This framework is built upon the Qwen2.5-Omni model and aims to efficiently exploit existing high-quality datasets.
How Omni-CLST Works
The Omni-CLST framework operates in two main stages: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).
In the SFT stage, the pretrained model is first run over the training set. For questions it answers correctly, the Chain-of-Thought (CoT) annotation is dropped from the training target; for questions it answers incorrectly, the CoT is retained. This initial pass also helps categorize samples by their difficulty relative to the model’s current understanding.
Following SFT, an error-aware curriculum organizes the training samples into progressive difficulty levels. Samples correctly answered by the pretrained model are deemed ‘easy’. Those initially answered incorrectly but corrected after SFT are classified as ‘medium’. Finally, samples that remain incorrectly answered even after SFT are labeled ‘hard’. This structured organization allows the model to gradually focus on more challenging cases, with a larger proportion of medium and hard examples used in GRPO training to maximize learning efficiency.
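The three-way split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Sample` record, its field names, and the `build_curriculum` helper are hypothetical, and the sketch assumes the correctness of the pretrained and SFT models on each sample has already been recorded.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record; field names are illustrative, not from the paper.
    sample_id: str
    correct_before_sft: bool  # pretrained model answered correctly
    correct_after_sft: bool   # model after SFT answered correctly


def difficulty(sample: Sample) -> str:
    """Assign a difficulty level following the rule described in the text."""
    if sample.correct_before_sft:
        return "easy"    # already solved by the pretrained model
    if sample.correct_after_sft:
        return "medium"  # wrong at first, corrected after SFT
    return "hard"        # still wrong after SFT


def build_curriculum(samples: list[Sample]) -> dict[str, list[str]]:
    """Group sample ids into easy / medium / hard buckets."""
    buckets: dict[str, list[str]] = {"easy": [], "medium": [], "hard": []}
    for sample in samples:
        buckets[difficulty(sample)].append(sample.sample_id)
    return buckets


samples = [
    Sample("a", True, True),
    Sample("b", False, True),
    Sample("c", False, False),
]
curriculum = build_curriculum(samples)
```

How the buckets are then mixed for GRPO (the article only says medium and hard examples make up a larger proportion) is left out here, since the exact ratios are not given.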
Guided thought dropout is the mechanism that makes the CoT selective. During SFT, the CoT is removed from a sample if the pretrained model answers it correctly and retained if the model answers it incorrectly, which explicitly teaches the model to associate reasoning with more difficult problems. In the subsequent GRPO stage, the mechanism is refined so that the model skips the CoT process for easier samples while preserving reasoning for harder ones. The result is more efficient use of CoT annotations and a better reasoning trajectory.
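As a rough illustration of the SFT side of guided thought dropout, the training target for each sample could be built as follows. The `<think>` and `<answer>` tags and the `build_sft_target` function are assumptions made for this sketch; the article does not specify the output format the authors use.

```python
def build_sft_target(answer: str, cot: str, pretrained_correct: bool) -> str:
    """Build a training target, dropping the CoT for samples the
    pretrained model already answers correctly.

    The tag format is hypothetical and chosen only for illustration.
    """
    if pretrained_correct:
        # Easy for the current model: train on the answer alone.
        return f"<answer>{answer}</answer>"
    # Answered incorrectly: keep the reasoning so it is tied to hard cases.
    return f"<think>{cot}</think><answer>{answer}</answer>"


easy_target = build_sft_target("B", "The clip contains rain.", True)
hard_target = build_sft_target("B", "The clip contains rain.", False)
```

With targets built this way, the model sees reasoning traces only on the questions it initially got wrong, which is the association the mechanism is meant to create.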
Performance and Impact
Experiments conducted on two prominent AQA benchmarks, MMAU-mini and MMAR, demonstrate the effectiveness of Omni-CLST. The framework achieved a competitive accuracy of 73.80% on MMAU-mini and established a new state-of-the-art accuracy of 64.30% on MMAR. These results highlight its robustness and generalization capabilities in multimodal audio-language understanding, all without the need for constructing additional QA datasets.
A significant finding was that a substantial portion of samples (67.6% on MMAR and 68.7% on MMAU-mini) could be solved without invoking the CoT process. By adaptively deciding when to engage in step-by-step reasoning, Omni-CLST not only achieves higher accuracy but also avoids unnecessary reasoning overhead. For instance, it significantly reduced the average number of tokens generated compared to models that use CoT for all questions, demonstrating improved efficiency.
Ablation studies further confirmed the individual contributions of each component: the SFT phase, the error-aware curriculum learning paradigm, and the guided thought dropout strategy all played crucial roles in boosting performance. The research paper detailing the framework is titled "Omni-CLST: Error-Aware Curriculum Learning with Guided Selective Chain-of-Thought for Audio Question Answering".
In summary, Omni-CLST offers an effective method to maximize the utility of high-quality datasets through its error-aware curriculum and guided selective Chain-of-Thought, enabling more efficient exploitation of informative reasoning signals and advancing the capabilities of audio question answering systems.


