TLDR: MathSE is a new framework that significantly improves the ability of multimodal large language models (MLLMs) to solve complex math problems. Unlike previous methods that rely on static datasets, MathSE uses an iterative cycle of inference, reflection, and reward-based feedback, guided by a specialized Outcome Reward Model (ORM), to continuously refine the model’s reasoning and achieve state-of-the-art performance on several benchmarks.
Multimodal large language models (MLLMs) have shown impressive abilities in understanding both images and text, making them well suited to tasks that combine vision and language. However, these models often struggle with more complex challenges, especially mathematical problem-solving. Traditional methods have tried to improve them by fine-tuning on specialized math datasets. The problem is that these datasets are usually generated by ‘teacher’ models and capture only fixed ways of thinking. This limits the student models’ ability to adapt to new or more complicated questions and deprives them of the deep, iterative learning needed for robust generalization.
To address these limitations, researchers have introduced MathSE, a Mathematical Self-Evolving framework designed for MLLMs. Unlike older methods that fine-tune a model just once, MathSE continuously improves the model through repeated cycles of making inferences, reflecting on those inferences, and receiving feedback based on rewards. This process helps the model learn and adapt more effectively.
How MathSE Works
MathSE operates through three main stages, inspired by how humans learn:
First, there’s **Knowledge Distillation**. This stage begins by fine-tuning a base MLLM using a high-quality dataset derived from advanced models like GPT-4o. This initial training helps the model grasp fundamental mathematical reasoning skills.
Next is the **Iterative Self-Evolving** stage. After the initial fine-tuning, the model generates reasoning paths for the remaining math problems. A crucial component here is the specialized **Outcome Reward Model (ORM)**. Instead of just saying if the final answer is right or wrong, the ORM evaluates the entire reasoning process. If it finds an error, it pinpoints the exact faulty step and provides a detailed analysis of why the mistake occurred. Correct reasoning paths are then used to further fine-tune the model, creating a continuous learning loop where the model learns from its previous attempts.
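The ORM’s role can be pictured as a function that takes a step-by-step reasoning path and returns either approval or the index of the first faulty step together with an analysis. The sketch below is purely illustrative: the names and the toy arithmetic checker are assumptions standing in for the paper’s learned reward model, not its actual implementation.

```python
# Illustrative sketch of ORM-style feedback (hypothetical names; a real
# ORM is a learned model). Here a toy checker verifies each arithmetic
# step so the *shape* of the feedback is concrete.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OrmFeedback:
    correct: bool
    faulty_step: Optional[int] = None  # index of the first wrong step, if any
    analysis: Optional[str] = None     # explanation of why that step is wrong


def evaluate_path(steps: List[str]) -> OrmFeedback:
    """Check each 'expr = value' step; flag the first one that doesn't hold."""
    for i, step in enumerate(steps):
        lhs, rhs = step.split("=")
        # Stand-in for the learned reward model's per-step judgment:
        if eval(lhs) != float(rhs):
            return OrmFeedback(
                correct=False,
                faulty_step=i,
                analysis=f"Step {i}: {lhs.strip()} does not equal {rhs.strip()}",
            )
    return OrmFeedback(correct=True)


path = ["3 * 4 = 12", "12 + 5 = 18"]  # second step is wrong
fb = evaluate_path(path)
print(fb.correct, fb.faulty_step)     # False 1
```

The key design point the paper emphasizes is that this feedback is richer than a binary right/wrong signal: the faulty step and analysis are exactly what the reflection stage consumes.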
The final stage is **Reflection**. Incorrect reasoning paths, along with the error steps and analyses provided by the ORM, are fed back to a powerful language model (like GPT-4o). This model then reflects on the mistakes and generates corrected reasoning paths. These refined paths are incorporated into the training data, further enhancing the model’s ability to recognize and correct its errors, deepening its understanding of underlying reasoning flaws.
This iterative process allows MathSE to progressively improve its problem-solving skills, effectively bridging the gap between static, teacher-derived datasets and the dynamic learning process seen in human students. As the self-evolving process continues, the model solves more problems correctly, showing a clear improvement in overall accuracy.
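Put together, the three stages form a training loop roughly like the following sketch. Every function name here (fine_tune, generate_path, orm_evaluate, reflect_with_llm) is a placeholder for one of the paper’s components, passed in as a parameter, so this is a structural outline under assumptions rather than the authors’ code.

```python
# High-level sketch of the MathSE loop; all callables are placeholders
# standing in for the paper's components, injected as arguments.

def self_evolve(model, seed_data, problems, rounds,
                fine_tune, generate_path, orm_evaluate, reflect_with_llm):
    # 1) Knowledge distillation: initial fine-tuning on teacher-derived data.
    model = fine_tune(model, seed_data)
    train_data = list(seed_data)

    for _ in range(rounds):                      # 2) iterative self-evolving
        new_examples = []
        for problem in problems:
            path = generate_path(model, problem)
            fb = orm_evaluate(problem, path)     # ORM checks the whole path
            if fb.correct:
                new_examples.append((problem, path))
            else:                                # 3) reflection on errors
                fixed = reflect_with_llm(problem, path, fb)
                new_examples.append((problem, fixed))
        train_data += new_examples               # grow the training pool
        model = fine_tune(model, train_data)     # next round of fine-tuning
    return model
```

The loop makes the article’s point explicit: correct paths flow straight back into training, while incorrect ones are first repaired via ORM-guided reflection, so the dataset evolves with the model instead of staying static.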
Impressive Results
The effectiveness of MathSE was tested on several challenging benchmarks, including MathVista, MathVL-test, MathVerse, and Math-Vision. The framework demonstrated significant performance gains over existing backbone models. For instance, on the MathVL-test, MathSE-InternVL achieved an accuracy of 65%, outperforming leading open-source multimodal mathematical reasoning models. Across various benchmarks, MathSE significantly boosted the performance of base models, with average score increases of up to 15.91%.
Ablation studies confirmed the importance of each component. The ORM’s detailed error feedback proved more effective than simple binary (correct/incorrect) feedback, leading to higher accuracy. The self-evolving training data, which combines GPT-4o generated paths with model-refined paths, also outperformed datasets generated solely by GPT-4o, highlighting the value of iterative adaptability.
In conclusion, MathSE offers a novel and effective approach to improving multimodal mathematical reasoning in MLLMs. By integrating iterative fine-tuning, reward-guided feedback, and a robust reflection mechanism, the framework enables models to continuously enhance their reasoning abilities, leading to state-of-the-art performance on complex math problems. For more details, refer to the full research paper.


