TLDR: PEMUTA is a new AI framework that uses large language models (LLMs) to assess undergraduate theses with detailed, multi-granular feedback. Unlike traditional holistic scoring, PEMUTA evaluates theses across six dimensions (Structure, Logic, Originality, Writing, Proficiency, Rigor) based on Vygotsky’s theory and Bloom’s Taxonomy. It employs hierarchical prompting, few-shot learning, and role-play prompting to align with expert judgments, providing more interpretable and pedagogically relevant evaluations.
Undergraduate theses are a cornerstone of academic assessment, serving as a comprehensive measure of a student’s cumulative academic development. However, the traditional methods of evaluating these lengthy and complex documents, whether manual or automated, often fall short. Manual assessment is time-consuming and labor-intensive, while existing automated systems, typically powered by large language models (LLMs), tend to offer only a single, holistic score. This broad evaluation often overlooks the intricate details across various criteria, limiting the depth of feedback students receive and failing to align with established pedagogical objectives.
To address this gap, researchers propose PEMUTA (Pedagogically-Enriched Multi-Granular Undergraduate Thesis Assessment), a framework that aims to activate the domain-specific knowledge within LLMs so that they produce a more detailed, criterion-level evaluation of undergraduate theses.
A Foundation in Educational Theory
PEMUTA is built upon two foundational pedagogical theories widely used in manual thesis evaluation: Vygotsky’s sociocultural theory and Bloom’s Taxonomy. Vygotsky’s theory emphasizes the developmental and learning potential aspects of a thesis, focusing on how students progress towards independent academic competence. Bloom’s Taxonomy, on the other hand, provides a structured hierarchy of cognitive skills, from remembering to creating, which is crucial for instructional design and assessment.
By integrating insights from both theories, PEMUTA defines six pedagogically grounded dimensions for assessment, collectively abbreviated as SLOWPR:
- Structure: Evaluates the organization, coherence, and logical flow of the thesis chapters.
- Logic: Assesses the clarity and consistency of arguments, ensuring alignment between research questions, methodology, evidence, and conclusions.
- Originality: Examines the novelty and insightfulness of the thesis, including new perspectives or solutions.
- Writing: Focuses on linguistic clarity, grammatical accuracy, academic tone, and adherence to disciplinary writing conventions.
- Proficiency: Measures the student’s mastery of disciplinary knowledge, including their understanding, application, and analysis of concepts and methods.
- Rigor: Evaluates adherence to academic conventions, citation accuracy, source reliability, and ethical compliance.
How PEMUTA Works
The framework employs a hierarchical prompting strategy that guides the LLM through a two-stage evaluation process. In the first stage, the model performs dimension-specific assessments for each of the six SLOWPR criteria, generating individual scores and justifications. This decomposition helps reduce interference between different criteria and allows for more targeted activation of relevant knowledge within the LLM. Once these fine-grained assessments are complete, the model proceeds to the second stage, synthesizing them into a coherent holistic evaluation, which includes an overall score and practical suggestions for improvement.
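The two-stage flow described above can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the prompt wording, the function name `assess_thesis`, and the `ask_llm` callable (a stand-in for whatever LLM client is used) are all assumptions, and the sketch issues one call per dimension even though the article does not say whether PEMUTA uses separate calls or a single structured prompt.

```python
from typing import Callable, Dict

# The six SLOWPR dimensions named in the article.
DIMENSIONS = ["Structure", "Logic", "Originality", "Writing", "Proficiency", "Rigor"]


def assess_thesis(thesis_text: str, ask_llm: Callable[[str], str]) -> Dict[str, object]:
    """Hypothetical two-stage flow: dimension-level first, then holistic."""
    # Stage 1: assess each dimension in isolation to limit interference
    # between criteria. The prompt wording here is an assumption.
    dimension_reports: Dict[str, str] = {}
    for dim in DIMENSIONS:
        prompt = (
            f"Assess the following thesis on the dimension '{dim}' only. "
            "Give a score from 0 to 10 and a short justification.\n\n"
            f"{thesis_text}"
        )
        dimension_reports[dim] = ask_llm(prompt)

    # Stage 2: synthesize the fine-grained reports into a holistic judgment.
    summary = "\n".join(f"{dim}: {report}" for dim, report in dimension_reports.items())
    holistic_prompt = (
        "Using the dimension-level assessments below, give an overall score "
        "from 0 to 10 and practical suggestions for improvement.\n\n" + summary
    )
    return {"dimensions": dimension_reports, "holistic": ask_llm(holistic_prompt)}
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider; in practice `ask_llm` would wrap a real API client.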
To further enhance alignment with expert judgments without requiring extensive fine-tuning, PEMUTA incorporates two in-context learning techniques:
- Few-shot prompting: The model is provided with a few formatted examples of multi-granular thesis evaluations, helping it internalize the desired structure and format of rubric-aligned assessments.
- Role-play prompting: The LLM is instructed to assume the persona of an experienced university professor or thesis committee member. This role conditioning encourages the model to adopt a formal academic tone, use discipline-appropriate vocabulary, and apply expert evaluative reasoning.
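As a rough illustration of how the two techniques above might be combined in a single prompt, the sketch below prepends a role instruction and a handful of worked examples to the evaluation request. The persona text, the example layout, and the names `ROLE_INSTRUCTION` and `build_prompt` are hypothetical; the actual PEMUTA prompts are not reproduced in this article.

```python
from typing import List, Tuple

# Assumed persona text; the paper's exact wording is not given here.
ROLE_INSTRUCTION = (
    "You are an experienced university professor serving on an "
    "undergraduate thesis committee."
)


def build_prompt(thesis_text: str, dimension: str,
                 examples: List[Tuple[str, str]]) -> str:
    """Assemble role-play text, few-shot exemplars, and the target thesis."""
    parts = [ROLE_INSTRUCTION]
    # Few-shot exemplars: (thesis excerpt, formatted evaluation) pairs.
    for i, (excerpt, evaluation) in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nThesis excerpt:\n{excerpt}\nEvaluation:\n{evaluation}"
        )
    # The actual request comes last so the model continues the pattern.
    parts.append(
        f"Now evaluate the following thesis on '{dimension}'. "
        "Return a score from 0 to 10 and a brief justification.\n"
        f"Thesis:\n{thesis_text}"
    )
    return "\n\n".join(parts)
```

Because both techniques operate purely on the prompt, they can be toggled independently, which is what makes component-wise ablation straightforward.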
The MUTA Dataset and Experimental Validation
To support this multi-granular assessment task, a new dataset called MUTA (Multi-granular Undergraduate Thesis Assessment) was curated. It comprises 60 authentic undergraduate theses from Computer Science students, each manually annotated with 0-10 scale ratings across the SLOWPR dimensions and a holistic score. The theses, originally in PDF format, undergo a meticulous pre-processing pipeline to convert them into a clean, semantically consistent, and logically structured plain-text representation suitable for LLM processing.
Extensive experiments demonstrate that PEMUTA consistently outperforms standard holistic prompting strategies across various state-of-the-art LLMs. It achieves significantly lower Mean Absolute Error (MAE) and Mean Squared Error (MSE), and higher Pearson Correlation Coefficient (PCC) with expert ratings, indicating a stronger agreement with human evaluations. Ablation studies further confirm that each component—hierarchical prompting, few-shot exemplars, and role-play instructions—contributes meaningfully and synergistically to the framework’s enhanced performance.
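For readers unfamiliar with the three agreement metrics mentioned above, the following sketch computes them from their standard definitions. The helper names are illustrative, and the scores in the usage note are made up purely to show the calculation; they are not results from the paper.

```python
import math
from typing import Sequence


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean Absolute Error: average size of the scoring gap."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)


def mse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean Squared Error: penalizes large gaps more heavily."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def pcc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson Correlation Coefficient between model and expert scores."""
    n = len(gold)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    norm_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    norm_g = math.sqrt(sum((g - mean_g) ** 2 for g in gold))
    return cov / (norm_p * norm_g)
```

With hypothetical model scores `[7, 8, 6, 9]` and expert scores `[8, 8, 5, 9]`, both `mae` and `mse` return 0.5 and `pcc` returns roughly 0.894. Lower MAE and MSE and a PCC closer to 1 indicate closer agreement with the expert ratings.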
Looking Ahead
PEMUTA represents a significant advancement in automated undergraduate thesis assessment, offering a scalable, interpretable, and pedagogically grounded solution. By providing detailed, criterion-specific feedback alongside a holistic judgment, it empowers students with actionable insights for improvement and alleviates the workload of educators. Future work aims to extend PEMUTA into multimodal assessment frameworks, incorporating code artifacts, presentation recordings, and other learning outputs for an even more comprehensive evaluation of students’ competencies.


