TLDR: Large Language Models (LLMs) excel at solving math problems but struggle to identify specific errors in student solutions. This research introduces “corrected student solutions,” which are intermediate versions of student work that fix errors while maintaining the student’s original approach. Experiments show that providing these corrected solutions significantly improves LLMs’ ability to accurately locate the first error step, even more so than providing standard reference solutions. The study highlights that strong problem-solving skills in LLMs do not automatically translate to effective error detection, emphasizing the need for specialized meta-reasoning capabilities.
Large Language Models (LLMs) have shown impressive capabilities in solving complex math word problems, with some models achieving near-perfect accuracy on challenging benchmarks. However, a recent study reveals a significant hurdle: these advanced AI models struggle with a crucial meta-reasoning task – identifying errors in student solutions, especially pinpointing the exact first mistake.
This challenge is particularly relevant for developing intelligent tutoring systems. Imagine an AI tutor that can solve any math problem but can’t tell a student precisely where they went wrong. The ability to accurately locate and categorize errors is vital for providing effective, personalized feedback to learners.
The Problem with Traditional Approaches
Previous research has shown that even when LLMs are given the original math problem and an incorrect student solution, their accuracy in identifying the first error step remains low. Intuitively, one might think that providing a ‘gold standard’ or reference solution would help. However, this study found that while providing a reference solution does improve performance, most LLMs still struggle to pinpoint the exact error step.
The core issue lies in the alignment between the student’s solution and the reference solution. Students often take different approaches, use different intermediate variables, or break a problem into a different number of steps than a canonical reference solution does. This poor step alignment between differing approaches makes it difficult for LLMs to compare the two solutions and identify the precise point of error.
Introducing Corrected Student Solutions
To address this, the researchers propose an innovative approach: generating an ‘intermediate corrected student solution.’ This isn’t just another reference solution; it’s a version of the student’s original solution that has been corrected to be mathematically sound, but crucially, it retains the student’s original method and style. By aligning more closely with the student’s reasoning, this corrected version acts as a more effective benchmark for error detection.
The process involves an LLM acting as a ‘teacher,’ taking the gold solution and the student’s erroneous solution, and then generating a corrected version of the student’s work. This disentangles the LLM’s problem-solving ability from its error detection ability, allowing it to focus on comparing and correcting rather than solving from scratch.
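The correction step described above can be sketched as a prompt to a chat model. The prompt wording below is an illustrative assumption, not the paper's actual prompt, and `call_llm` would be whatever chat-completion client you use; only the prompt construction is concrete here.

```python
# Sketch of the "teacher" correction step: given the gold solution and
# the student's erroneous solution, ask an LLM to produce a corrected
# version that preserves the student's own method and step structure.
# The exact prompt used in the paper is not shown here; this wording is
# a hypothetical reconstruction.

def build_correction_prompt(problem, gold_solution, student_solution):
    """Assemble a teacher prompt asking the model to correct the
    student's work while keeping their original approach."""
    return (
        "You are a math teacher.\n\n"
        f"Problem:\n{problem}\n\n"
        f"Reference (gold) solution:\n{gold_solution}\n\n"
        f"Student's (erroneous) solution:\n{student_solution}\n\n"
        "Rewrite the student's solution so that it is mathematically "
        "correct, but keep the student's original method, variable "
        "names, and step structure. Return only the corrected steps."
    )

# Usage with a hypothetical client:
#   corrected = call_llm(build_correction_prompt(problem, gold, student))
# The corrected solution is then provided as the reference when asking
# the model to localize the first error in the student's original work.
```

Because the corrected solution mirrors the student's own step structure, the downstream error-localization prompt reduces to a near step-by-step comparison rather than a comparison against an unrelated derivation.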
Key Findings and Insights
Experiments were conducted on two datasets, VtG and PRM800K, using a diverse set of LLMs including Llama models, GPT-4o, Qwen2.5-72B-Math, and LearnLM-1.5-Pro. The results were compelling:
- Providing the corrected student solution significantly boosted error localization performance across most models and datasets, outperforming scenarios where only the gold solution was provided.
- Interestingly, an LLM’s high problem-solving ability does not guarantee effective error detection. For example, Qwen2.5-72B-Math, while excellent at solving problems, showed the poorest error localization, often failing to rectify the first error and instead making inaccurate deductions later to match the final answer.
- A feature importance analysis revealed that ‘semantic recall’ – how well the reference solution aligns with the student’s work up to the first error – was the most critical factor for successful error localization. The relative position and type of error also played significant roles.
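A metric in the spirit of this "semantic recall" feature can be sketched as follows. This is an illustrative reconstruction, not the paper's definition: `difflib.SequenceMatcher` stands in for a real semantic similarity model, and the step-splitting and averaging choices are assumptions.

```python
# Illustrative sketch of a semantic-recall-style feature: for each
# student step before the first error, find the best-matching reference
# step and average those match scores. A real implementation would use
# embedding similarity; difflib is a stdlib stand-in.
from difflib import SequenceMatcher


def semantic_recall(reference_steps, student_steps, first_error_idx):
    """Average best-match similarity of the student's pre-error steps
    against the reference solution's steps (higher = better aligned)."""
    prefix = student_steps[:first_error_idx]
    if not prefix or not reference_steps:
        return 0.0
    scores = []
    for step in prefix:
        best = max(
            SequenceMatcher(None, step, ref).ratio()
            for ref in reference_steps
        )
        scores.append(best)
    return sum(scores) / len(scores)
```

Under this reading, a corrected student solution scores highly almost by construction, since its pre-error steps coincide with the student's, which is consistent with the finding that it aids localization more than a gold solution does.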
- The type of error influenced how far off predictions were. Errors stemming from a misunderstanding of the question were often predicted much later than they occurred, while errors involving missing or extra variables were sometimes predicted slightly earlier.
In conclusion, this research highlights that while LLMs are powerful problem solvers, their meta-reasoning capabilities, particularly in error localization, require targeted improvement. The introduction of corrected student solutions offers a promising path forward, demonstrating that better alignment between student work and a reference can significantly enhance an AI’s ability to pinpoint mistakes. This work paves the way for more effective AI-powered educational tools that can provide precise and helpful feedback to students.


