
Transparent AI for Interpreting Assessment in Higher Education

TLDR: This research introduces an innovative AI framework for assessing college students’ English-Chinese interpreting skills. By integrating feature engineering, data augmentation using Variational Autoencoders (VAEs), and Explainable AI (XAI) techniques like SHAP, the framework moves beyond ‘black box’ predictions to offer transparent, multi-dimensional evaluations of fidelity, fluency, and target language quality. The study demonstrates improved model performance with augmented data and identifies specific linguistic features that influence scores, providing detailed, actionable feedback for both students and educators to enhance learning and teaching practices.

Interpreting, or oral translation, is a vital linguistic skill that offers significant educational benefits, fostering advanced linguistic, communication, cognitive, and emotional abilities. It enhances active listening, oral proficiency, vocabulary acquisition, and cross-cultural communication, while also strengthening higher-order cognitive functions and anxiety management. Given its multifaceted advantages, interpreting is increasingly recognized as both a valuable teaching tool and the “fifth skill” alongside listening, speaking, reading, and writing.

The complex nature of interpreting necessitates continuous structured practice, rigorous assessment, and diagnostic feedback. However, traditional human-based assessment is often cognitively demanding for raters, requiring them to simultaneously consult source texts, interpreted outputs, and detailed rating scales. This process increases the risk of scoring bias and inconsistency.

The limitations of human evaluation have spurred interest in automated assessment. Yet, existing automated methods face challenges. Research has disproportionately focused on fidelity (information completeness) and fluency, with less attention paid to language use quality. Furthermore, prior studies often relied on conventional statistical methods that assume linearity, which may not hold true for complex, real-world data. The advent of machine learning (ML) and large language models (LLMs) offers new opportunities, but their application is hindered by severe data imbalance, where most datasets are skewed towards average performance, lacking samples of very high or very low quality. Another significant limitation is the inherent opacity of many automated scoring systems, often referred to as “black box” models, which provide only final scores without explaining their decision-making processes. This lack of transparency severely limits their diagnostic and educational utility.

To address these challenges, the researchers propose a novel approach that combines feature engineering, data augmentation, and explainable AI (XAI) techniques to evaluate interpreting performance across three key dimensions: fidelity, fluency, and target language quality. The framework prioritizes explainability by using only construct-relevant, transparent features and conducting SHAP (SHapley Additive exPlanations) analysis.

A New Approach to Assessment

The study introduces a multi-dimensional modeling framework. First, a new dataset of 117 English-Chinese consecutive interpreting samples from university students was compiled. To overcome the issues of small sample size and imbalanced score distribution, Variational Autoencoders (VAEs) were employed for data augmentation, generating realistic, synthetic feature vectors. This process expanded the dataset to 500 samples, achieving a more uniform distribution of interpretation scores.
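
The paper's exact VAE configuration is not spelled out in this summary, so the following is only a minimal sketch of how a VAE can synthesize tabular feature vectors; the class name, layer widths, and latent dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal VAE over fixed-length interpreting feature vectors."""
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.fc_mu = nn.Linear(32, latent_dim)
        self.fc_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# After training, synthetic samples come from decoding random latents:
# z = torch.randn(n_new, 8); synthetic = model.decoder(z)
```

Sampling from the latent prior lets the decoder generate plausible feature vectors in underrepresented score bands, which is how the 117 real samples could be expanded toward a more uniform 500.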

From these samples, a broad set of features was extracted for each scoring dimension (a sketch of two of these feature groups follows the list):

  • InfoCom (Information Completeness/Fidelity): Measured using established machine translation quality assessment metrics like BLEURT, CometKiwi, BERTScore, chrF, and xCOMET.
  • FluDel (Fluency): Included 14 temporal features categorized into speed fluency (e.g., speech rate, articulation rate) and breakdown fluency (e.g., number of unfilled pauses, mean length of filled pauses).
  • TLQual (Target Language Quality): Evaluated through 25 features related to syntactic complexity and grammatical accuracy, including Chinese-specific phraseological diversity measures and grammatical error annotations from GPT-4o.
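
As an illustration of this feature engineering, here is a hedged sketch of how two of the fidelity metrics (chrF via sacrebleu, BERTScore via bert-score) and a few temporal fluency features might be computed. The function names, the 0.25 s pause threshold, and the word-level timestamp format are assumptions, and the full study uses many more signals (BLEURT, CometKiwi, xCOMET, GPT-4o error annotations):

```python
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

def infocom_features(hypothesis: str, reference: str) -> dict:
    """Two of the MT quality metrics used for the fidelity dimension."""
    chrf = CHRF().sentence_score(hypothesis, [reference]).score
    _, _, f1 = bert_score([hypothesis], [reference], lang="zh")  # Chinese target
    return {"chrF": chrf, "BERTScore_F1": f1.item()}

def fludel_features(word_times, total_sec, pause_thresh=0.25):
    """Simple temporal fluency features from (word, start, end) timestamps."""
    gaps = [s2 - e1 for (_, _, e1), (_, s2, _) in zip(word_times, word_times[1:])]
    pauses = [g for g in gaps if g >= pause_thresh]
    speaking_time = sum(end - start for _, start, end in word_times)
    return {
        "speech_rate": len(word_times) / total_sec,            # words/sec, overall
        "articulation_rate": len(word_times) / speaking_time,  # words/sec, pauses excluded
        "n_unfilled_pauses": len(pauses),
        "mean_pause_len": sum(pauses) / len(pauses) if pauses else 0.0,
    }
```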

Three types of machine learning models—XGBoost, Random Forest (RF), and Multi-Layer Perceptron (MLP)—were trained and validated. The results showed that models trained on the augmented dataset achieved significantly higher performance. Specifically, RF performed best for InfoCom prediction, while XGBoost excelled in predicting FluDel and TLQual scores, demonstrating substantial improvement over models trained on raw data.
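
A plausible sketch of this model comparison is shown below, assuming a feature matrix X and human scores y for one dimension; the hyperparameters and the R² cross-validation protocol are illustrative choices, not the paper's exact setup:

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# X: feature matrix for one dimension (e.g. FluDel); y: human rater scores.
models = {
    "XGBoost": xgb.XGBRegressor(n_estimators=300, max_depth=4),
    "RandomForest": RandomForestRegressor(n_estimators=300, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {r2.mean():.3f}")
```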

Understanding the AI’s Decisions with Explainable AI

A crucial aspect of this research is the application of SHAP analysis to interpret model behavior at both global (overall model) and local (individual predictions) levels. This provides insights into which features most influence the predicted scores.
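
In practice, a global SHAP analysis for a tree ensemble like the XGBoost fluency model takes only a few lines; `fitted_xgb_model`, `X`, and `feature_names` below are assumed to come from the training step sketched earlier:

```python
import shap

# TreeExplainer computes exact SHAP values for tree ensembles (XGBoost, RF).
explainer = shap.TreeExplainer(fitted_xgb_model)
shap_values = explainer.shap_values(X)

# Global view: mean |SHAP value| per feature ranks overall influence,
# e.g. surfacing NFP as the dominant negative driver of FluDel scores.
shap.summary_plot(shap_values, X, feature_names=feature_names)
```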

  • For InfoCom: Neural-based metrics like BLEURT and CometKiwi were identified as the strongest positive predictors, meaning higher scores in these metrics correlated with better fidelity.
  • For FluDel: Pause-related features, particularly the number of filled pauses (NFP), had the most pronounced negative impact. This suggests that frequent or long pauses significantly reduce perceived fluency. Speed fluency features generally had a small positive effect.
  • For TLQual: Word selection errors (NWSE) had a significant negative impact. Chinese-specific phraseological diversity metrics, such as CN_RATIO, were highly influential positive factors, indicating that diverse and sophisticated use of these structures leads to higher scores. Interestingly, longer but less syntactically dense sentences were perceived as higher quality in the Chinese interpreting context.

The study emphasizes the critical role of local explanations in automated interpreting assessment. For educators, these explanations offer actionable insights into specific strengths and weaknesses of individual student performances, enabling tailored feedback and instructional strategies. For students, local explanations empower them to take ownership of their learning by focusing on precise areas needing attention. For example, if a student’s fluency score is negatively impacted by filled pauses, instructors can implement targeted exercises like shadowing practices or drills to minimize hesitation.
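
Continuing the SHAP sketch above, a local explanation for a single student might look like this; the sample index and top-5 cutoff are arbitrary choices for illustration:

```python
# Local view: decompose one student's predicted score into per-feature pushes.
i = 0  # hypothetical index of the student sample to explain
shap.force_plot(explainer.expected_value, shap_values[i], X.iloc[i])

# Text-only alternative: rank the features moving this score up or down;
# a large negative contribution on NFP would motivate anti-hesitation drills.
contribs = sorted(zip(feature_names, shap_values[i]),
                  key=lambda t: abs(t[1]), reverse=True)
for feat, val in contribs[:5]:
    print(f"{feat}: {val:+.2f}")
```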

This framework represents a promising direction in translating AI-driven insights into pedagogical tools that deliver actionable feedback to trainees, effectively bridging the gap between automated assessment and student learning. For more details, you can refer to the full research paper: From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
