TLDR: A study investigates the use of large language models (LLMs) like GPT-5, Claude 3.5 Haiku, and Gemini 2.5 Flash for automated program repair of uncompilable student code. The goal is to make student submissions compilable while preserving their original structural and logical intent, enabling richer student modeling. The research found that all three LLMs effectively produced compilable repairs, with GPT-5 generally making the most minimal edits and best preserving student logic. The study highlights the potential of LLMs to recover valuable learning data from previously discarded uncompilable code, though challenges remain in ensuring pedagogical alignment.
A new study explores how large language models (LLMs) can help recover uncompilable student code, a common issue in introductory computer science (CS1) courses. Student programming submissions are frequently uncompilable due to syntax errors, and because traditional analysis pipelines typically discard such submissions, they cannot be used for student modeling or for understanding learning progress.
The research, titled *Automated Program Repair of Uncompilable Student Code*, investigates automated program repair (APR) as a method to make these programs compilable again while preserving the student’s original structural intent. This preservation is crucial for accurately modeling student learning and providing relevant feedback.
The Challenge of Uncompilable Code
Intelligent tutoring systems and student modeling approaches rely heavily on student submissions to track learning. However, a significant portion of novice programs fail to compile because of syntax errors. These uncompilable submissions are usually excluded from analysis because they lack an evaluable score or performance measure. This exclusion means potentially valuable insights into students’ intermediate reasoning and knowledge states are lost, leading to incomplete models of their learning.
Automated Program Repair with LLMs
While Automated Program Repair (APR) has been used in educational contexts to fix erroneous code and provide feedback, many APR methods can significantly alter the student’s original approach. For student modeling, such semantic rewrites risk distorting the very evidence of learning that researchers aim to track.
This study focuses on *syntax-only repair*, which aims to fix small, surface-level mistakes like missing semicolons, unmatched braces, or undeclared variables with minimal edits. The goal is to make the code executable while retaining the student’s original structure and logic, thus preserving their learning trajectory.
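The idea of a syntax-only repair can be illustrated in a few lines. The study works with Java, but Python’s built-in `compile()` makes the same idea easy to demonstrate in a self-contained sketch: the surface error (here, a missing colon) is fixed without touching the student’s logic.

```python
# Illustrative only: the study repairs Java submissions, but Python's
# built-in compile() lets us demonstrate the concept directly. A
# syntax-only repair fixes the surface error without changing the logic.

broken = "def mean(xs)\n    return sum(xs) / len(xs)\n"    # missing ':'
repaired = "def mean(xs):\n    return sum(xs) / len(xs)\n"  # one-character fix

def compiles(src: str) -> bool:
    """Return True if the source compiles without a SyntaxError."""
    try:
        compile(src, "<student>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(broken))    # False: the submission is uncompilable
print(compiles(repaired))  # True: one inserted character restores it
```

The repaired version differs from the original by a single character, yet the student’s computation is untouched, which is exactly the property the study wants to preserve.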
Methodology and Evaluation
The researchers used a publicly available dataset from the CodeWorkout platform, an online programming learning environment, consisting of anonymized Java submissions from a CS1 course. From this dataset, 100 uncompilable Java submissions were randomly selected for analysis.
Three prominent large language models were assessed as repair agents: GPT-5 (OpenAI), Claude 3.5 Haiku (Anthropic), and Gemini 2.5 Flash (Google). Each model was tested under two prompting conditions: a low-context condition (only student code and repair instructions) and a high-context condition (including compiler messages, problem statements, and few-shot examples).
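The difference between the two conditions can be sketched as a prompt-construction function. The exact prompt wording used in the study is not reproduced here; everything below (function name, instruction text) is an illustrative assumption.

```python
# Hypothetical sketch of the two prompting conditions. The actual prompt
# text used in the study is an assumption, not a quotation.

def build_prompt(code, compiler_msg=None, problem=None, examples=None):
    """Low-context: code plus repair instructions only. High-context adds
    compiler output, the problem statement, and few-shot examples."""
    parts = [
        "Repair the following Java code so that it compiles.",
        "Make only minimal, syntax-level edits; do not change the logic.",
    ]
    if problem:
        parts.append(f"Problem statement:\n{problem}")
    if compiler_msg:
        parts.append(f"Compiler output:\n{compiler_msg}")
    if examples:
        parts.extend(f"Example repair:\n{ex}" for ex in examples)
    parts.append(f"Student code:\n{code}")
    return "\n\n".join(parts)

low = build_prompt("public class A { int x = 1 }")
high = build_prompt(
    "public class A { int x = 1 }",
    compiler_msg="A.java:1: error: ';' expected",
    problem="Declare and initialize an int field.",
)
print("Compiler output" in low)   # False
print("Compiler output" in high)  # True
```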
Repairs were evaluated based on three criteria:
- Compilation Success: Whether the repaired code compiled without syntax errors.
- Edit Distance: How much the repair changed the original code, measured by normalized Levenshtein distance (smaller values indicate more minimal edits).
- Human Evaluation: Experts annotated repairs for Structural Preservation (maintaining original control flow) and Logical Preservation (only syntactic edits, no changes to logical or semantic structure).
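The edit-distance metric above can be sketched in a few lines. This is a minimal implementation assuming normalization by the length of the longer string; the paper’s exact normalization convention may differ.

```python
# Minimal sketch of the edit-distance metric. Normalizing by the length
# of the longer string is an assumption; the study may normalize differently.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def normalized_distance(original: str, repaired: str) -> float:
    longest = max(len(original), len(repaired))
    return levenshtein(original, repaired) / longest if longest else 0.0

# A one-character repair (adding a semicolon) yields a very small distance.
student = "int x = 1"
fixed = "int x = 1;"
print(levenshtein(student, fixed))  # 1
```

Under this metric, a repair that only inserts a missing semicolon scores near zero, while a wholesale rewrite of the student’s logic scores close to one.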
Key Findings
The study yielded several important results:
- High Compilation Success: All three LLMs demonstrated high and consistent compilation success rates. GPT-5 achieved 98.5%, Claude 3.5 96%, and Gemini 2.5 95.5%. Notably, the prompting condition (low or high context) did not significantly affect compilability, indicating that LLMs can effectively produce compilable repairs even with minimal context.
- Edit Distance Differences: There were significant differences in edit distance among models. GPT-5 produced the smallest average edits (11.4), followed by Gemini 2.5 (13.8) and Claude 3.5 (24.4). This suggests GPT-5 made the most minimal changes to achieve compilability.
- Preservation of Structure and Logic: Human evaluation revealed significant differences in how well models preserved students’ original structure and logic. For Structural Preservation, Gemini 2.5 performed slightly better (97.9%) than GPT-5 (96.4%), with Claude 3.5 trailing (88.5%). For Logical Preservation, GPT-5 showed the highest proportion of logic-preserving repairs (86.5%), followed by Gemini 2.5 (83.9%) and Claude 3.5 (67.8%). Again, prompting condition had no significant effect on these measures.
Implications for Education
The findings suggest that large language models can reliably perform syntax-only repairs on student code snippets, often adhering to explicit instructions. However, models occasionally made stylistic or structural edits beyond minimal correction, highlighting an ongoing challenge in aligning LLM behavior with pedagogical goals.
This work opens avenues for richer and more comprehensive analyses of learners’ coding processes. By recovering uncompilable submissions, educators and researchers can gain a more complete understanding of students’ intermediate reasoning, misconceptions, and problem-solving progress over time. Future research will focus on more rigorous evaluation frameworks, integrating repaired submissions into intelligent tutoring systems, and fine-tuning LLMs for better pedagogical alignment.


