TLDR: Researchers developed Natural Language-based Assessment (NLA), a method that uses an open-source LLM (Qwen 2.5 72B) to evaluate L2 oral proficiency. In a zero-shot setting, the model applies the same “can-do descriptors” human examiners use to speech transcriptions, producing interpretable analytic scores. NLA outperforms a BERT-based grader and matches a speech LLM trained on easier-to-collect read-aloud data, showing strong potential for accessible and detailed automated language assessment, especially where spontaneous training data is scarce.
Assessing how well someone speaks a second language (L2) has traditionally relied on human examiners. These experts use detailed guidelines, often called “can-do descriptors,” to evaluate various aspects of a learner’s proficiency. However, human assessment can be costly, time-consuming, and sometimes inconsistent. This has led to a growing interest in automated systems for language assessment.
Recent research from the ALTA Institute at Cambridge University introduces a novel approach called Natural Language-based Assessment (NLA). This method leverages the power of large language models (LLMs) to interpret and apply these human-intended descriptors directly to learner speech. The goal is to see if LLMs can assess language proficiency in a way that mirrors human judgment, but with the benefits of automation.
The core idea behind NLA is to use LLMs in a “zero-shot” setting. This means the LLM isn’t specifically trained on a vast dataset of graded speech examples for the assessment task. Instead, it’s given the raw text transcriptions of a learner’s spoken responses along with the same natural language descriptors that human examiners would use. The LLM then interprets these descriptors and applies them to the text, much like a human would.
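As a rough sketch of what such a zero-shot prompt might look like (the paper's exact prompt wording is not reproduced here; the descriptor below is a hypothetical CEFR-style statement used only for illustration):

```python
# Minimal sketch of a zero-shot NLA-style prompt: pair a learner's
# transcription with a natural-language descriptor and ask for a CEFR level.
# The descriptor text and prompt wording are illustrative, not the paper's.

def build_prompt(transcription: str, aspect: str, descriptors: dict[str, str]) -> str:
    """Combine a can-do descriptor with an ASR transcription into one prompt."""
    return (
        "You are an examiner assessing a learner's spoken English.\n"
        f"Aspect to assess: {aspect}\n"
        f"Descriptor (what a proficient speaker can do): {descriptors[aspect]}\n\n"
        f"Learner's response (ASR transcription):\n{transcription}\n\n"
        "Assign a CEFR level from A1, A2, B1, B2, C1, C2 for this aspect. "
        "Answer with the level only."
    )

# Hypothetical descriptor, loosely modelled on public CEFR can-do statements.
descriptors = {
    "grammatical accuracy": (
        "Maintains consistent grammatical control of complex language, "
        "with errors that are rare and difficult to spot."
    ),
}

print(build_prompt("I am agree that city life have many advantage ...",
                   "grammatical accuracy", descriptors))
```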
For their experiments, the researchers used an open-source LLM called Qwen 2.5 72B. They fed it transcriptions generated by OpenAI’s Whisper model and asked it to evaluate ten different proficiency aspects, such as grammatical accuracy, fluency, vocabulary range, and coherence. Each aspect was rated on a scale from A1 to C2, aligning with the Common European Framework of Reference for Languages (CEFR).
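To make the overall recipe concrete, here is a minimal end-to-end sketch under stated assumptions: it uses the open-source whisper package for transcription and Qwen2.5-72B-Instruct via Hugging Face transformers for scoring, shows only four of the ten aspects named in the article, and the prompt wording and decoding settings are illustrative rather than the paper's.

```python
# Sketch of an NLA-style pipeline: ASR transcription, then zero-shot CEFR
# scoring of each analytic aspect. Requires: pip install openai-whisper
# transformers accelerate. A smaller Qwen2.5 checkpoint can stand in for a demo.
import re
import whisper
from transformers import AutoModelForCausalLM, AutoTokenizer

CEFR = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}
# Illustrative subset of the ten analytic aspects mentioned in the article.
ASPECTS = ["grammatical accuracy", "fluency", "vocabulary range", "coherence"]

MODEL_ID = "Qwen/Qwen2.5-72B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
llm = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def transcribe(path: str) -> str:
    """ASR step: Whisper turns the learner's spoken response into text."""
    return whisper.load_model("small").transcribe(path)["text"]

def assess(transcription: str, aspect: str) -> int | None:
    """Ask the LLM for a CEFR level on one aspect; return it as 1 (A1) to 6 (C2)."""
    messages = [{"role": "user", "content": (
        f"Rate the following L2 English response for {aspect} on the CEFR "
        "scale (A1, A2, B1, B2, C1, C2). Reply with the level only.\n\n"
        f"Response: {transcription}")}]
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(**inputs, max_new_tokens=8, do_sample=False)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r"\b(A1|A2|B1|B2|C1|C2)\b", reply)
    return CEFR[match.group(1)] if match else None

text = transcribe("learner_response.wav")
profile = {aspect: assess(text, aspect) for aspect in ASPECTS}
print(profile)  # analytic profile, e.g. {"grammatical accuracy": 4, ...}
```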
One of the significant advantages of NLA is its interpretability. Unlike many existing automated systems that provide only a single, overall score, NLA can break down the assessment into these individual analytic components. This means learners can receive detailed feedback on their strengths and weaknesses, which is crucial for effective language learning. For instance, the study found that different parts of a language exam might emphasize different skills; fluency was more important in short responses, while grammatical accuracy and thematic development were key in longer, opinion-based tasks.
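As a toy illustration of how such an analytic profile could be surfaced to a learner (the scores and aspect names below are invented for the example, not taken from the paper):

```python
# Turn a per-aspect CEFR profile into simple learner-facing feedback by
# flagging the strongest aspect and the main focus area. Scores are invented.
profile = {  # 1 = A1 ... 6 = C2
    "grammatical accuracy": 3,
    "fluency": 5,
    "vocabulary range": 4,
    "coherence": 4,
}
levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
strongest = max(profile, key=profile.get)
weakest = min(profile, key=profile.get)
print(f"Strongest aspect: {strongest} ({levels[profile[strongest] - 1]})")
print(f"Focus area: {weakest} ({levels[profile[weakest] - 1]})")
```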
The results of the study were promising. The NLA approach, relying solely on textual information, performed competitively. While it didn’t surpass state-of-the-art speech LLMs that were extensively fine-tuned on spontaneous spoken data (which requires significant computational resources), it consistently outperformed a BERT-based model specifically trained for this assessment task. Furthermore, NLA matched the performance of a speech LLM trained on easier-to-collect “read-aloud” data, highlighting its effectiveness in situations where spontaneous speech data for training is scarce.
This research suggests that NLA offers a powerful, cost-effective, and interpretable way to assess L2 oral proficiency. Its zero-shot nature and reliance on widely applicable CEFR descriptors also mean it could be easily adapted for assessing different types of speech, including conversational language, and potentially even other languages. This work paves the way for more accessible and insightful automated language assessment tools. You can read the full research paper for more details at Natural Language-based Assessment of L2 Oral Proficiency using LLMs.


