TLDR: This research investigates whether AI models trained on existing datasets can be effectively used for new, unexplored datasets, specifically in automated short answer grading. By comparing the STSB, Mohler, and a new Python-focused SPRAG dataset using various similarity metrics and statistical tests, the study found that models from the Mohler dataset show promising transferability to the SPRAG dataset, potentially reducing the need for extensive new training.
In the rapidly evolving field of Natural Language Processing (NLP), the development of models often requires extensive fine-tuning and optimization for specific datasets. This process can be resource-intensive and time-consuming. A recent study delves into a crucial question: can knowledge embedded within state-of-the-art (SOTA) models, trained on established datasets, be effectively transferred to new, unexplored domains?
The research, titled “Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading,” explores this challenge within the context of Automated Short Answer Grading (ASAG). ASAG involves automatically assessing student responses by comparing them to reference answers. While SOTA models use similarity metrics for score predictions, the availability of diverse datasets for this task remains limited.
To address this, the study selected two well-known benchmark datasets, STSB (Semantic Textual Similarity Benchmark) and Mohler, and compared them with a newly introduced dataset called SPRAG. STSB is a widely recognized benchmark for evaluating semantic similarity, while Mohler is popular within the ASAG research community. SPRAG is new and focuses specifically on Python programming, so its answers contain language keywords and symbols such as ‘def’, ‘del’, and ‘*’.
A key difference between these datasets lies in their content and structure. While STSB and Mohler primarily contain sentences in natural English, SPRAG includes programming-specific terminology. The distribution of similarity scores also varies; Mohler shows a significant imbalance with most examples falling into a high similarity label, suggesting a potential for overfitting. STSB, in contrast, exhibits a more balanced distribution, indicating greater generalizability. SPRAG also has a substantial portion of records in the highest similarity label, but the distribution for other classes is more balanced.
The researchers compared the datasets using both contextual and non-contextual similarity metrics. Non-contextual methods, such as Jaccard Similarity, TF-IDF Cosine Similarity, and Word Mover's Distance (WMD), rely on lexical or linguistic properties without considering surrounding context. Contextual methods, including Universal Sentence Encoder (USE), SBERT Cross-Encoder (SBERT CE), SBERT Bi-Encoder (SBERT BiE), and SimCSE (both supervised and unsupervised variants), capture semantic meaning by considering the surrounding context of words and phrases.
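To make the non-contextual metrics concrete, here is a minimal Python sketch of Jaccard similarity and TF-IDF cosine similarity between a reference answer and a student answer. The whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions made for illustration; the study's actual preprocessing and implementation are not described in this article.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real pipelines normalise more.
    return text.lower().split()


def jaccard_similarity(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


reference = "a function is defined with the def keyword"
student = "the def keyword defines a function"
ref_vec, stu_vec = tfidf_vectors([reference, student])
print(jaccard_similarity(reference, student))
print(cosine_similarity(ref_vec, stu_vec))
```

Both scores depend only on which tokens the two answers share, which is why such metrics cannot tell that "defines" and "is defined" express the same idea.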
In addition to similarity metrics, robust statistical techniques were applied, including the paired t-test and Cohen’s d for effect size quantification. These tools helped to rigorously analyze observed similarities and determine the practical significance of differences between datasets.
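As a rough illustration of these two statistics, the sketch below computes a paired t statistic and Cohen's d using only the Python standard library. The pooled-standard-deviation form of Cohen's d is an assumption, since the article does not say which variant the study used, and the sample scores are made up for the example.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Hypothetical similarity scores from two metrics on the same four answer pairs.
scores_a = [2.0, 4.0, 6.0, 8.0]
scores_b = [1.0, 2.0, 3.0, 4.0]
print(paired_t_statistic(scores_a, scores_b))
print(cohens_d(scores_a, scores_b))
```

The t statistic indicates whether a difference is likely to be real, while Cohen's d expresses how large it is in standard-deviation units. Turning the t statistic into a p-value requires the t distribution; if SciPy is available, `scipy.stats.ttest_rel` returns both values.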
The experimental results revealed interesting insights. Non-contextual similarity metrics performed favorably for the Mohler dataset but were less effective for STSB and SPRAG. Contextual metrics, while consistent for Mohler and STSB, faced challenges with the intricate sentence structures found in the SPRAG dataset. Crucially, the analysis showed a minor resemblance between STSB and Mohler, but a more pronounced similarity between Mohler and SPRAG. This was supported by the observation that approximately 20% of the most frequently occurring words in Mohler and SPRAG coincided, a pattern not seen between STSB and the other datasets.
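The word-overlap observation can be approximated with a simple frequency count. The sketch below is one possible way to measure it, not the study's procedure: the cutoff `k`, the whitespace tokenization, and the absence of stop-word filtering are all assumptions, and the two tiny corpora are invented.

```python
from collections import Counter


def top_words(texts, k):
    """Return the set of the k most frequent lowercase tokens in a corpus."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def top_word_overlap(texts_a, texts_b, k):
    """Fraction of the top-k words that two corpora have in common."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(top_words(texts_a, k) & top_words(texts_b, k)) / k


# Invented stand-ins for two datasets.
corpus_a = ["def def def x x y"]
corpus_b = ["def def z z w"]
print(top_word_overlap(corpus_a, corpus_b, k=2))
```

A higher fraction means the two corpora lean on the same vocabulary, which is one informal signal that a model tuned on one may carry over to the other.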
Furthermore, Cohen’s d values, which measure the magnitude of a difference, indicated a larger effect size between SPRAG and Mohler than between SPRAG and STSB. The study interprets this as a stronger practical similarity between the Mohler and SPRAG datasets.
The findings underscore the concept of transferability across NLP models. The study concludes that the newly introduced SPRAG dataset is semantically and statistically closer to the Mohler dataset than to STSB. This implies that SOTA models developed for the Mohler dataset could be applied to and evaluated on SPRAG with promising outcomes, potentially reducing the demand for resource-intensive, dataset-specific training.


