TLDR: A new study shows that the Whisper ASR model can be used beyond transcription, for L2 English oral assessment. A lightweight classifier trained on acoustic and linguistic features extracted from Whisper’s hidden layers, combined with auxiliary image-prompt and text-prompt scores, outperforms existing state-of-the-art baselines on a spoken language assessment task. The research shows that Whisper intrinsically encodes proficiency patterns and semantic aspects of speech, highlighting its untapped potential for language evaluation.
A new study examines the Whisper automatic speech recognition (ASR) model and finds that it is useful for more than transcription: it can also support the assessment of non-native English speakers. Researchers from National Taiwan Normal University explored how Whisper’s internal “hidden representations” can be harnessed for L2 (second language) spoken language assessment (SLA).
Traditionally, ASR models like Whisper are used to convert speech into text. Previous research in SLA often analyzed these transcriptions for errors or fluency. However, this new approach takes a different route, treating Whisper not just as a transcriber but as a sophisticated feature extractor. By looking into the model’s intermediate and final outputs, the researchers can pull out rich acoustic (sound-related) and linguistic (language-related) information that is crucial for evaluating spoken language proficiency.
Unlocking Whisper’s Deeper Insights
The core of this innovative method involves extracting features from both Whisper’s encoder and decoder components. The encoder processes the raw audio, capturing acoustic nuances, while the decoder handles the language aspects. A significant challenge with Whisper is its 30-second input limit. To overcome this for longer spoken responses typical in language assessments, the team developed a “chunking” strategy. This involves splitting longer audio into overlapping 30-second segments, allowing the model to process the entire utterance comprehensively.
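The chunking idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `chunk_audio`, the 16 kHz sample rate, and the 5-second overlap are assumptions, since the article does not state how much the segments overlap.

```python
def chunk_audio(samples, sample_rate=16000, window_s=30.0, overlap_s=5.0):
    """Split a 1-D sequence of audio samples into overlapping windows.

    window_s matches Whisper's 30-second input limit; overlap_s is an
    assumed value, not one taken from the study.
    """
    window = int(window_s * sample_rate)
    hop = window - int(overlap_s * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the window")

    chunks = []
    start = 0
    while True:
        # The final chunk may be shorter than the window; a Whisper
        # feature extractor would normally pad it to 30 seconds.
        chunks.append(samples[start:start + window])
        if start + window >= len(samples):
            break
        start += hop
    return chunks
```

With these assumed settings, a 70-second response yields three segments starting at 0, 25, and 50 seconds, so every part of the utterance is covered at least once.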
For acoustic features, the model extracts embeddings from the encoder’s last hidden states, which are then aggregated hierarchically to form a single representation for the entire utterance. Similarly, for linguistic features, a technique called “pseudo-teacher forcing” is employed. Instead of computationally expensive autoregressive decoding, the model is provided with transcription tokens (which can come from any ASR system or ground truth) along with prefix tokens, allowing for efficient extraction of linguistic embeddings from the decoder.
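To make the hierarchical aggregation concrete, here is a small sketch that pools frame embeddings within each chunk and then pools the resulting chunk vectors into one utterance vector. Mean pooling at both levels is an assumption, as are the names `mean_pool` and `aggregate_utterance`; the article does not specify which pooling operation the study uses.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty list")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def aggregate_utterance(chunk_frames):
    """Reduce per-chunk frame embeddings to one utterance-level vector.

    chunk_frames: list of chunks, each a list of frame embeddings
    (for example, rows of the encoder's last hidden state).
    """
    # Level 1: frames -> one vector per chunk.
    chunk_vectors = [mean_pool(frames) for frames in chunk_frames]
    # Level 2: chunk vectors -> one vector for the whole utterance.
    return mean_pool(chunk_vectors)
```

The same two-level pooling could in principle be applied to the decoder embeddings obtained through pseudo-teacher forcing, although that is also an assumption here.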
Integrating Multimodal Cues for Enhanced Assessment
Beyond the intrinsic features from Whisper, the research introduces auxiliary information to further refine the assessment. For tasks like picture descriptions, where learners respond to visual prompts, two additional scores are incorporated:
- Semantic Textual Similarity (STS) score: This measures how semantically coherent the learner’s response is with a given text prompt, using a pre-trained SBERT model.
- Image-Text Contrastive (ITC) score: This evaluates the relevance between the learner’s spoken response (converted to text) and the visual prompt (image), utilizing the BLIP2 vision-language model.
These auxiliary features help capture aspects of language competence such as content relevance and prompt coherence, which are vital for a holistic assessment.
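Both auxiliary scores compare two embeddings: response text versus prompt text for STS, and response text versus prompt image for ITC. The sketch below assumes each score reduces to a cosine similarity between two vectors, which the article does not confirm; in practice the vectors would come from SBERT or BLIP2, and the helper name `cosine_similarity` is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat an all-zero embedding as unrelated to anything.
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1 would indicate that the response stays close to the prompt, while a score near 0 would suggest an off-topic answer.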
A Lightweight Classifier and Impressive Results
Once these diverse features (acoustic, linguistic, STS, ITC) are extracted, they are combined and fed into a lightweight classifier. This classifier is then trained to predict proficiency scores. The beauty of this approach is that Whisper itself remains “frozen” – meaning its core parameters are not retrained – acting purely as a powerful feature extractor. Only the small classifier on top needs training.
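The feature fusion and scoring step might look like the following sketch. The article does not describe the classifier's architecture, so the single linear layer with softmax is an assumption, and `build_feature_vector` and `predict_level` are hypothetical names; in a real system the weights and biases would be learned from scored responses while Whisper stays frozen.

```python
import math


def build_feature_vector(acoustic, linguistic, sts_score, itc_score):
    """Concatenate utterance embeddings with the two auxiliary scores."""
    return list(acoustic) + list(linguistic) + [sts_score, itc_score]


def predict_level(features, weights, biases):
    """Score each proficiency level with a linear layer plus softmax.

    weights: one row of coefficients per proficiency level.
    biases:  one bias per proficiency level.
    Returns (index of the most likely level, list of probabilities).
    """
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    # Subtract the maximum logit for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs
```

Because only `weights` and `biases` are trainable in this setup, the number of learned parameters stays small compared with fine-tuning Whisper itself.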
The method was tested on the GEPT picture-description dataset, which includes authentic spoken responses for intermediate-level English assessment. The results were highly promising, with the proposed method outperforming existing state-of-the-art baselines, including other multimodal approaches. The analysis also showed that Whisper’s acoustic embeddings inherently align with proficiency scores, and its linguistic embeddings capture both topic-based structure and score-related gradients, even without specific fine-tuning for assessment tasks.
This study highlights Whisper’s intrinsic ability to encode both ordinal proficiency patterns and semantic aspects of speech, positioning it as a robust foundation for spoken language assessment and other complex spoken language understanding tasks. The full research paper provides more technical details.


