
Transparent AI for Interpreting Assessment in Higher Education

TLDR: This research introduces an innovative AI framework for assessing college students’ English-Chinese interpreting skills. By integrating feature engineering, data augmentation using Variational Autoencoders (VAEs), and Explainable AI (XAI) techniques like SHAP, the framework moves beyond ‘black box’ predictions to offer transparent, multi-dimensional evaluations of fidelity, fluency, and target language quality. The study demonstrates improved model performance with augmented data and identifies specific linguistic features that influence scores, providing detailed, actionable feedback for both students and educators to enhance learning and teaching practices.

Interpreting, or oral translation, is a vital linguistic skill that offers significant educational benefits, fostering advanced linguistic, communication, cognitive, and emotional abilities. It enhances active listening, oral proficiency, vocabulary acquisition, and cross-cultural communication, while also strengthening higher-order cognitive functions and anxiety management. Given its multifaceted advantages, interpreting is increasingly recognized as both a valuable teaching tool and the “fifth skill” alongside listening, speaking, reading, and writing.

The complex nature of interpreting necessitates continuous structured practice, rigorous assessment, and diagnostic feedback. However, traditional human-based assessment is often cognitively demanding for raters, requiring them to simultaneously consult source texts, interpreted outputs, and detailed rating scales. This process increases the risk of scoring bias and inconsistency.

The limitations of human evaluation have spurred interest in automated assessment. Yet, existing automated methods face challenges. Research has disproportionately focused on fidelity (information completeness) and fluency, with less attention paid to language use quality. Furthermore, prior studies often relied on conventional statistical methods that assume linearity, which may not hold true for complex, real-world data. The advent of machine learning (ML) and large language models (LLMs) offers new opportunities, but their application is hindered by severe data imbalance, where most datasets are skewed towards average performance, lacking samples of very high or very low quality. Another significant limitation is the inherent opacity of many automated scoring systems, often referred to as “black box” models, which provide only final scores without explaining their decision-making processes. This lack of transparency severely limits their diagnostic and educational utility.

To address these challenges, the researchers propose a novel approach that combines feature engineering, data augmentation, and explainable AI (XAI) techniques to evaluate interpreting performance across three key dimensions: fidelity, fluency, and target language quality. The framework prioritizes explainability by using only construct-relevant, transparent features and conducting SHAP (SHapley Additive exPlanations) analysis.

A New Approach to Assessment

The study introduces a multi-dimensional modeling framework. First, a new dataset of 117 English-Chinese consecutive interpreting samples from university students was compiled. To overcome the issues of small sample size and imbalanced score distribution, Variational Autoencoders (VAEs) were employed for data augmentation, generating realistic, synthetic feature vectors. This process expanded the dataset to 500 samples, achieving a more uniform distribution of interpretation scores.
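
The paper's exact VAE configuration is not spelled out in this summary, so the following is only a minimal sketch of how a VAE can synthesize tabular feature vectors; the class name, layer widths, and latent dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal VAE over fixed-length interpreting feature vectors."""
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
        self.fc_mu = nn.Linear(32, latent_dim)
        self.fc_logvar = nn.Linear(32, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    recon_err = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_err + kl

# After training, synthetic samples come from decoding random latents:
# z = torch.randn(n_new, 8); synthetic = model.decoder(z)
```

Sampling from the latent prior lets the decoder generate plausible feature vectors in underrepresented score bands, which is how the 117 real samples could be expanded toward a more uniform 500.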

From these samples, a broad set of features was extracted for each scoring dimension (a sketch of two of these feature groups follows the list):

  • InfoCom (Information Completeness/Fidelity): Measured using established machine translation quality assessment metrics like BLEURT, CometKiwi, BERTScore, chrF, and xCOMET.
  • FluDel (Fluency): Included 14 temporal features categorized into speed fluency (e.g., speech rate, articulation rate) and breakdown fluency (e.g., number of unfilled pauses, mean length of filled pauses).
  • TLQual (Target Language Quality): Evaluated through 25 features related to syntactic complexity and grammatical accuracy, including Chinese-specific phraseological diversity measures and grammatical error annotations from GPT-4o.
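
As an illustration of this feature engineering, here is a hedged sketch of how two of the fidelity metrics (chrF via sacrebleu, BERTScore via bert-score) and a few temporal fluency features might be computed. The function names, the 0.25 s pause threshold, and the word-level timestamp format are assumptions, and the full study uses many more signals (BLEURT, CometKiwi, xCOMET, GPT-4o error annotations):

```python
from sacrebleu.metrics import CHRF
from bert_score import score as bert_score

def infocom_features(hypothesis: str, reference: str) -> dict:
    """Two of the MT quality metrics used for the fidelity dimension."""
    chrf = CHRF().sentence_score(hypothesis, [reference]).score
    _, _, f1 = bert_score([hypothesis], [reference], lang="zh")  # Chinese target
    return {"chrF": chrf, "BERTScore_F1": f1.item()}

def fludel_features(word_times, total_sec, pause_thresh=0.25):
    """Simple temporal fluency features from (word, start, end) timestamps."""
    gaps = [s2 - e1 for (_, _, e1), (_, s2, _) in zip(word_times, word_times[1:])]
    pauses = [g for g in gaps if g >= pause_thresh]
    speaking_time = sum(end - start for _, start, end in word_times)
    return {
        "speech_rate": len(word_times) / total_sec,            # words/sec, overall
        "articulation_rate": len(word_times) / speaking_time,  # words/sec, pauses excluded
        "n_unfilled_pauses": len(pauses),
        "mean_pause_len": sum(pauses) / len(pauses) if pauses else 0.0,
    }
```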

Three types of machine learning models—XGBoost, Random Forest (RF), and Multi-Layer Perceptron (MLP)—were trained and validated. The results showed that models trained on the augmented dataset achieved significantly higher performance. Specifically, RF performed best for InfoCom prediction, while XGBoost excelled in predicting FluDel and TLQual scores, demonstrating substantial improvement over models trained on raw data.
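
A plausible sketch of this model comparison is shown below, assuming a feature matrix X and human scores y for one dimension; the hyperparameters and the R² cross-validation protocol are illustrative choices, not the paper's exact setup:

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# X: feature matrix for one dimension (e.g. FluDel); y: human rater scores.
models = {
    "XGBoost": xgb.XGBRegressor(n_estimators=300, max_depth=4),
    "RandomForest": RandomForestRegressor(n_estimators=300, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {r2.mean():.3f}")
```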

Understanding the AI’s Decisions with Explainable AI

A crucial aspect of this research is the application of SHAP analysis to interpret model behavior at both global (overall model) and local (individual predictions) levels. This provides insights into which features most influence the predicted scores.
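
In practice, a global SHAP analysis for a tree ensemble like the XGBoost fluency model takes only a few lines; `fitted_xgb_model`, `X`, and `feature_names` below are assumed to come from the training step sketched earlier:

```python
import shap

# TreeExplainer computes exact SHAP values for tree ensembles (XGBoost, RF).
explainer = shap.TreeExplainer(fitted_xgb_model)
shap_values = explainer.shap_values(X)

# Global view: mean |SHAP value| per feature ranks overall influence,
# e.g. surfacing NFP as the dominant negative driver of FluDel scores.
shap.summary_plot(shap_values, X, feature_names=feature_names)
```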

  • For InfoCom: Neural-based metrics like BLEURT and CometKiwi were identified as the strongest positive predictors, meaning higher scores in these metrics correlated with better fidelity.
  • For FluDel: Pause-related features, particularly the number of filled pauses (NFP), had the most pronounced negative impact. This suggests that frequent or long pauses significantly reduce perceived fluency. Speed fluency features generally had a small positive effect.
  • For TLQual: Word selection errors (NWSE) had a significant negative impact. Chinese-specific phraseological diversity metrics, such as CN_RATIO, were highly influential positive factors, indicating that diverse and sophisticated use of these structures leads to higher scores. Interestingly, longer but less syntactically dense sentences were perceived as higher quality in the Chinese interpreting context.

The study emphasizes the critical role of local explanations in automated interpreting assessment. For educators, these explanations offer actionable insights into specific strengths and weaknesses of individual student performances, enabling tailored feedback and instructional strategies. For students, local explanations empower them to take ownership of their learning by focusing on precise areas needing attention. For example, if a student’s fluency score is negatively impacted by filled pauses, instructors can implement targeted exercises like shadowing practices or drills to minimize hesitation.
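
Continuing the SHAP sketch above, a local explanation for a single student might look like this; the sample index and top-5 cutoff are arbitrary choices for illustration:

```python
# Local view: decompose one student's predicted score into per-feature pushes.
i = 0  # hypothetical index of the student sample to explain
shap.force_plot(explainer.expected_value, shap_values[i], X.iloc[i])

# Text-only alternative: rank the features moving this score up or down;
# a large negative contribution on NFP would motivate anti-hesitation drills.
contribs = sorted(zip(feature_names, shap_values[i]),
                  key=lambda t: abs(t[1]), reverse=True)
for feat, val in contribs[:5]:
    print(f"{feat}: {val:+.2f}")
```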

This framework represents a promising direction in translating AI-driven insights into pedagogical tools that deliver actionable feedback to trainees, effectively bridging the gap between automated assessment and student learning. For more details, you can refer to the full research paper: From Black Box to Transparency: Enhancing Automated Interpreting Assessment with Explainable AI in College Classrooms.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
