TLDR: GeoVLMath is a new large vision-language model (LVLM) designed to enhance AI’s ability to solve complex solid geometry problems. It achieves this by generating textual descriptions of auxiliary lines, which are crucial for revealing hidden geometric structures. The model uses a novel cross-modal reward system to ensure these textual descriptions accurately align with geometric diagrams, without requiring image editing or precise coordinate data. Trained on the new AuxSolidMath dataset, GeoVLMath (3B/7B) demonstrates competitive and often superior performance against larger LVLMs, highlighting the effectiveness of geometry-aware supervision over mere model scale.
Solving complex geometry problems often requires a unique human intuition: drawing auxiliary lines. These are extra lines or coordinate systems added to a diagram to reveal hidden structures and simplify multi-step reasoning. However, this crucial step has been a significant challenge for large vision-language models (LVLMs), which are AI systems designed to understand both images and text.
A new research paper introduces GeoVLMath, an innovative approach that tackles this challenge head-on. Instead of trying to directly edit diagrams to draw these lines, which current image editing AI struggles to do with geometric precision, GeoVLMath generates textual descriptions of these auxiliary line constructions. This method aligns better with how LVLMs process information.
Bridging the Gap Between Text and Space
At the heart of GeoVLMath is a reinforcement learning framework designed to enhance the alignment between textual descriptions and the spatial structure of geometric diagrams. The core innovation is a ‘cross-modal reward’ system. This system evaluates how accurately a generated textual description of an auxiliary line matches a ground-truth diagram that already includes the correct auxiliary lines. This fine-grained feedback helps the model learn to create precise and relevant auxiliary line descriptions.
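To make the idea concrete, here is a minimal sketch of what a cross-modal reward could look like. It is not the paper's reward model, which is a learned component. The sketch assumes the description and the annotated diagram have already been turned into vectors by some hypothetical encoders, and it uses cosine similarity rescaled to the range 0 to 1. The function names and toy vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as carrying no signal
    return dot / (norm_u * norm_v)


def cross_modal_reward(text_embedding, diagram_embedding):
    """Hypothetical reward: rescale cosine similarity from [-1, 1] to [0, 1]."""
    return 0.5 * (cosine_similarity(text_embedding, diagram_embedding) + 1.0)


# Toy vectors standing in for encoder outputs (assumed, not real embeddings).
diagram = [0.9, 0.1, 0.4]            # diagram with the correct auxiliary lines
good_description = [0.8, 0.2, 0.5]   # description that matches the diagram
bad_description = [-0.3, 0.9, -0.2]  # description of an unrelated construction

reward_good = cross_modal_reward(good_description, diagram)
reward_bad = cross_modal_reward(bad_description, diagram)
```

In this toy setup the matching description receives the higher reward, which is the kind of graded signal the reinforcement learning stage can optimize.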
The researchers conducted a pilot study demonstrating the critical role of accurate auxiliary lines. Using correct auxiliary lines led to the highest accuracy in problem-solving, while incorrect ones resulted in the poorest performance, even worse than not using any auxiliary lines at all. This highlights the need for reliable auxiliary line generation.
Overcoming Current Limitations
Previous attempts to incorporate auxiliary lines into AI models faced significant hurdles. Direct image editing models often fail to draw lines with the necessary geometric accuracy. Other approaches, like tool-use pipelines, depend on precise coordinate positions of diagram elements, which are rarely available in real-world problems and require the LVLM to generate highly accurate code.
GeoVLMath bypasses these limitations by focusing on textual descriptions. The cross-modal reward model measures the consistency between the generated text and the ground-truth diagram, providing geometry-aware supervision without needing coordinate assumptions or image manipulation.
The Training Process and Dataset
The training of GeoVLMath follows a two-stage paradigm. First, a supervised fine-tuning (SFT) stage provides a ‘cold start’ by training the model on examples with explicit auxiliary line steps. This is followed by a reinforcement learning (RL) stage, using Group Relative Policy Optimization (GRPO), which further refines the model’s ability to construct auxiliary lines that accurately reflect the diagram’s geometry.
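The "group relative" part of GRPO can be sketched in a few lines. For each problem, the policy samples a group of candidate responses, each response receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group to obtain an advantage. The snippet below shows only that normalization step, with made-up reward values. The function name, the epsilon term, and the use of the population standard deviation are assumptions of this sketch, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four sampled responses to the same geometry problem.
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average get a positive advantage and are reinforced, while below-average ones are discouraged. No separate value network is needed to estimate a baseline.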
To support this training, the researchers developed a robust and scalable data creation pipeline, resulting in AuxSolidMath. This open-source dataset comprises 3,018 real-exam solid geometry problems, each with paired diagrams (original and auxiliary-line annotated) and aligned textual fields. AuxSolidMath is the first dataset specifically designed for auxiliary-line-based solid geometry reasoning.
Performance and Impact
GeoVLMath, available at 3B and 7B parameter scales, demonstrates competitive and often superior performance compared to strong open-source and proprietary LVLMs on auxiliary-line reasoning benchmarks. Notably, GeoVLMath-7B outperformed larger models like Qwen2.5-VL-32B-Instruct and GPT-4o on certain tasks, suggesting that geometry-aware supervision can matter more than simply scaling model parameters.
Ablation studies further confirmed the importance of the cross-modal reward and the reinforcement learning stage. Removing the cross-modal reward or replacing it with a purely textual similarity objective led to significant performance drops, emphasizing that robust auxiliary-line reasoning requires visually grounded, structure-preserving diagram-text alignment.
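One plausible reason a purely textual objective falls short can be shown with a toy example. The sketch below uses token-level Jaccard overlap as a stand-in for "textual similarity"; the article does not specify the actual objective used in the ablation, so this metric, the function name, and the example sentences are all assumptions for illustration.

```python
def token_jaccard(a, b):
    """Jaccard overlap between the lowercase token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


reference = "Connect A to the midpoint M of edge BC"

# Geometrically wrong (different edge), but almost identical wording.
wrong_line = "Connect A to the midpoint M of edge BD"

# Geometrically equivalent, but worded differently.
paraphrase = "Let M be the midpoint of BC and draw segment AM"

score_wrong = token_jaccard(reference, wrong_line)
score_paraphrase = token_jaccard(reference, paraphrase)
```

Under this metric the wrong construction scores higher than the correct paraphrase, because word overlap says nothing about where the line actually sits in the diagram. A reward grounded in the annotated diagram does not have to rely on surface wording.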
This work is a step toward AI systems that can tackle more complex geometric problems, particularly in solid geometry, by integrating auxiliary line constructions into their reasoning.


