TL;DR: This paper introduces a novel approach to Automated Speaking Assessment (ASA) using Multimodal Large Language Models (MLLMs) to overcome the limitations of traditional text- or audio-only systems. It proposes Speech-First Multimodal Training (SFMT), a two-stage curriculum learning strategy that first builds a strong acoustic processing foundation before integrating textual information. Experiments show that MLLMs significantly improve holistic assessment, and that SFMT specifically enhances delivery assessment accuracy by an absolute 4%, demonstrating superior performance and generalization across datasets.
Automated Speaking Assessment (ASA) systems are vital tools in language learning, providing objective and consistent evaluation for second-language (L2) speakers. Traditional ASA methods, however, are constrained by the single modality they process. Text-based systems, while strong at grammar and content, miss crucial acoustic details such as pronunciation and fluency because they rely on imperfect speech-to-text transcriptions. Conversely, audio-based systems excel at capturing these acoustic nuances but struggle to grasp the semantic context and linguistic content of a speaker's response.
This fundamental challenge has motivated researchers to explore more comprehensive solutions. The emergence of Multimodal Large Language Models (MLLMs) presents a promising new direction. MLLMs are advanced AI models capable of simultaneously processing and integrating information from multiple sources, such as audio and text, within a single unified framework. Pioneering efforts in this field, like GPT-4o and Phi-4-multimodal, have showcased remarkable abilities in handling diverse inputs, opening new avenues for complex real-world applications like ASA.
A Unified Approach to Speaking Assessment
A recent research paper, “Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning,” presents the first systematic study of MLLMs for comprehensive ASA. The authors, Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, and Berlin Chen from National Taiwan Normal University, demonstrate that MLLMs can significantly outperform traditional methods across various assessment aspects, including content and language use. However, they identified that evaluating the “delivery” aspect—which includes pronunciation accuracy, fluency, and prosody—still posed unique challenges, requiring specialized training strategies.
To address this, the researchers propose a novel training strategy called Speech-First Multimodal Training (SFMT). This approach leverages a curriculum learning principle, which means the model learns in a structured progression from simpler to more complex tasks. SFMT aims to build a more robust foundation for processing speech before integrating information from other modalities like text. This ensures that the model develops strong capabilities in analyzing fine-grained acoustic patterns, which are essential for accurate delivery assessment.
How Speech-First Multimodal Training Works
SFMT operates in two distinct stages. In the first stage, called “Acoustic Foundation,” the MLLM is trained exclusively on audio inputs. During this phase, the model focuses on learning to extract and understand acoustic features, such as intonation, rhythm, and pronunciation details. This stage is crucial because raw audio signals contain the complete spectrum of speech information, unlike text transcripts which are a “lossy transformation” that discards these paralinguistic features. The researchers found that the audio modality actually demonstrates superior learning efficiency for MLLM-based graders, especially for delivery assessment.
Once a robust acoustic foundation is established, the second stage, “Cross-Modal Integration,” begins. In this stage, the model is introduced to both audio and text inputs simultaneously. Building upon its strong acoustic understanding, the model then learns to effectively combine information from both modalities. This progressive approach helps overcome the common issue where models might preferentially optimize towards text-based features due to their structured nature, potentially underutilizing critical acoustic information.
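The two-stage progression described above can be sketched as a simple data-scheduling routine. This is an illustrative sketch only, not the paper's released code: the function and field names (`sfmt_schedule`, `"audio"`, `"transcript"`) are hypothetical, and real SFMT training would update model weights rather than just order the batches.

```python
# Hypothetical sketch of the SFMT two-stage curriculum: the grader first sees
# audio alone, then audio paired with text. All names here are illustrative.

def sfmt_schedule(examples, stage1_epochs=2, stage2_epochs=2):
    """Return a list of (stage_name, model_inputs) pairs in curriculum order."""
    batches = []
    # Stage 1: "Acoustic Foundation" -- only raw audio is presented, so the
    # model must learn pronunciation, rhythm, and prosody cues directly.
    for _ in range(stage1_epochs):
        for ex in examples:
            batches.append(("acoustic_foundation", {"audio": ex["audio"]}))
    # Stage 2: "Cross-Modal Integration" -- a transcript is added alongside
    # the audio, building text understanding on the acoustic foundation.
    for _ in range(stage2_epochs):
        for ex in examples:
            batches.append(("cross_modal_integration",
                            {"audio": ex["audio"], "text": ex["transcript"]}))
    return batches

examples = [{"audio": "resp_001.wav", "transcript": "I think that ..."}]
plan = sfmt_schedule(examples, stage1_epochs=1, stage2_epochs=1)
# Every acoustic-foundation batch precedes every cross-modal batch.
```

The key design point the schedule captures is ordering: no mixed audio-plus-text batch appears until the audio-only phase is complete, which is what prevents the model from leaning on text features before its acoustic representations mature.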
Impressive Results and Future Directions
Experiments conducted on benchmark datasets, including the TEEMI corpus and the Speak & Improve Corpus, showcased the effectiveness of this MLLM-based system. The holistic assessment performance saw a notable improvement, with the Pearson Correlation Coefficient (PCC) value increasing from 0.783 to 0.846. More specifically, SFMT excelled in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches. This highlights SFMT’s success in enhancing the model’s ability to perform fine-grained acoustic discrimination.
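For readers unfamiliar with the metric, the Pearson Correlation Coefficient (PCC) measures how linearly aligned the model's scores are with human ratings, where 1.0 is perfect agreement. The toy numbers below are made up for illustration and are not the paper's data.

```python
# Toy illustration of the Pearson Correlation Coefficient (PCC) used to
# compare automated scores with human ratings. Data here is invented.
import math

def pearson(xs, ys):
    """Pearson r: covariance of xs and ys divided by the product of their
    standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [3.0, 4.0, 2.0, 5.0, 3.5]   # hypothetical human holistic scores
model = [2.8, 4.2, 2.1, 4.9, 3.4]   # hypothetical MLLM grader scores
r = pearson(human, model)            # close to 1.0: strong agreement
```

In these terms, the reported jump from a PCC of 0.783 to 0.846 means the MLLM grader's scores track human ratings substantially more closely than the baseline's.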
The research also confirmed the model's robust generalization, with strong performance on previously unseen tasks and across different L2 populations and assessment contexts. These findings suggest that MLLM-based models, particularly when trained with strategies like SFMT, can serve as a transformative backbone for ASA, enabling more accurate, comprehensive, and generalizable evaluation systems.
Future work in this area will explore multi-task learning frameworks for assessing multiple aspects simultaneously and integrating comprehensive feedback generation into ASA systems. The ultimate goal is to create intelligent, adaptive language learning environments that can provide personalized, real-time guidance to L2 learners.


