TLDR: Researchers introduce GeoRef, a new benchmark and task for teaching AI to identify and interpret geometric elements in diagrams using natural language queries. They developed a large synthetic dataset and advanced fine-tuning methods, including a reinforcement learning approach, Group Relative Policy Optimization (GRPO), and a self-correction mechanism, which significantly improved AI’s ability to understand geometric visuals. This foundational work also enhances AI’s performance on broader geometric reasoning tasks, addressing a critical gap in multimodal AI capabilities.
Artificial intelligence has made incredible strides in understanding language and images, but when it comes to solving geometric problems, a significant challenge remains: truly understanding the diagrams. Unlike purely text-based math, geometry demands that AI models not only reason logically but also accurately interpret visual elements like points, lines, angles, and shapes, and understand their spatial relationships.
Current AI models, particularly advanced Multimodal Large Language Models (MLLMs) that combine vision and language capabilities, often struggle with this fundamental aspect. They might arrive at a correct answer, but without genuinely understanding the diagram, much like a student who guesses correctly without grasping the underlying concepts. This gap in what researchers call ‘geometric grounding’ means AI often bypasses the crucial step of interpreting the visual information.
Introducing GeoRef: A New Task for Geometric Understanding
To address this, a team of researchers from the University of Electronic Science and Technology of China and Tongji University introduced a new task called Referring Expression Comprehension (REC) for geometric problems. This task is designed to specifically evaluate whether AI models can correctly identify, interpret, and locate geometric elements in diagrams based on natural language queries. Imagine asking an AI, “Which point is the center of the circle?” and expecting it to accurately pinpoint ‘O’ in the diagram.
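To make the task concrete, here is what a single geometric REC example might look like as a data record. This is a hypothetical schema for illustration only, not GeoRef's actual annotation format: the field names (`query`, `target`, `bbox`) and values are invented.

```python
# Illustrative record for a geometric referring-expression query.
# This schema is a hypothetical sketch, NOT GeoRef's real format.

sample = {
    "image": "circle_diagram.png",      # the geometry diagram
    "query": "Which point is the center of the circle?",
    "target": {
        "element_type": "point",        # point / line / angle / shape
        "label": "O",                   # the referred element's label
        "bbox": [118, 94, 130, 106],    # example pixel box around point O
    },
}
```

A model succeeds on such an example if, given the image and the natural-language query, it returns the label (and, where required, the location) of the element the query refers to.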
To support this new task, they developed GeoRef, a benchmark dataset. Built upon existing geometric problem collections, GeoRef features high-quality annotations for a diverse range of geometric elements and relationships, covering typical middle school geometry topics. However, creating such a dataset manually is incredibly time-consuming and difficult to scale.
Synthetic Data and Advanced Training Methods
To overcome the data scarcity, the researchers devised an ingenious solution: generating a large-scale synthetic training dataset. They used a structured geometric formal language, leveraging a system called Penrose, which allows for precise control over diagram composition. This approach ensures the dataset is scalable, mathematically consistent, and covers a broad spectrum of geometric concepts.
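The key idea behind the synthetic pipeline is that when a diagram is generated from a formal description, the ground-truth answers to referring expressions come for free. The toy sketch below illustrates that idea; the "formal language" here is an invented stand-in, not actual Penrose syntax (Penrose itself compiles domain/substance/style programs into rendered diagrams), and all function names are hypothetical.

```python
# Toy sketch: derive REC query/answer pairs from a structured diagram
# spec. The spec format is invented for illustration, not Penrose's.

import random

def make_circle_spec(rng):
    """Emit a formal spec for a circle with a labeled center
    and a labeled point on the circle."""
    center, on_circle = rng.sample("OPQRABC", 2)
    return {
        "statements": [
            f"Point {center}",
            f"Point {on_circle}",
            f"Circle c centered at {center} through {on_circle}",
        ],
        "facts": {"center_of_circle": center, "point_on_circle": on_circle},
    }

def spec_to_rec_pairs(spec):
    """Read query/answer pairs directly off the spec's known facts,
    so every label is mathematically consistent by construction."""
    f = spec["facts"]
    return [
        ("Which point is the center of the circle?", f["center_of_circle"]),
        ("Name a point lying on the circle.", f["point_on_circle"]),
    ]

rng = random.Random(0)
spec = make_circle_spec(rng)
pairs = spec_to_rec_pairs(spec)
for query, answer in pairs:
    print(query, "->", answer)
```

Because the generator controls every label and relationship, the same machinery scales to arbitrarily many diagrams and concepts without any manual annotation.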
The paper, titled GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions, explores two main fine-tuning approaches for training AI models on this task: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). GRPO, a reinforcement learning method, proved significantly more effective than SFT. It samples a group of candidate answers per query, scores each with rewards for geometric correctness, and updates the model toward answers that outperform the rest of their group, letting it learn preferences more efficiently than imitation alone.
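The core of GRPO's "group relative" idea can be sketched in a few lines: rewards are normalized against the mean and standard deviation of the group they were sampled in, so no separate value model is needed. The snippet below is an illustrative sketch of that advantage computation, not the paper's code; the exact-match reward is a stand-in for whatever geometric-correctness reward the authors use.

```python
# Sketch of GRPO's group-relative advantage computation (illustrative).
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: 4 sampled answers to one query; reward 1.0 for an exact
# match with the gold label (a stand-in geometric-correctness reward).
predictions = ["O", "A", "O", "B"]
gold = "O"
rewards = [1.0 if p == gold else 0.0 for p in predictions]
advantages = group_relative_advantages(rewards)
# Correct answers get a positive advantage, incorrect ones negative;
# the policy gradient then pushes probability toward the positives.
```

Normalizing within the group means only *relative* quality matters, which keeps the reward signal informative even when absolute rewards are sparse.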
Furthermore, the team introduced a novel “verify-and-regenerate” mechanism. This clever self-correction system allows the AI to detect incorrect predictions and then re-infer answers by using its contextual reasoning history. Essentially, the AI generates an initial answer, a ‘verifier’ checks its validity and provides feedback, and then the AI ‘regenerates’ a more accurate response based on this feedback loop. This mechanism further boosted accuracy and robustness.
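The feedback loop described above can be sketched as a simple generate-verify-regenerate cycle. This is a minimal illustrative sketch, not the paper's implementation: the `model` and `verifier` callables below are hypothetical stand-ins, and the real system would prompt an MLLM with its accumulated reasoning history rather than call toy functions.

```python
# Minimal sketch of a verify-and-regenerate loop (illustrative only).

def answer_with_self_correction(model, verifier, query, max_rounds=3):
    """Generate an answer, check it, and re-infer using the history
    of rejected answers and verifier feedback."""
    history = []                         # (answer, feedback) pairs so far
    answer = model(query, history)
    for _ in range(max_rounds):
        ok, feedback = verifier(query, answer)
        if ok:
            break                        # verifier accepts the answer
        history.append((answer, feedback))
        answer = model(query, history)   # regenerate with feedback context
    return answer

# Toy demo: the "model" tries labels it hasn't tried before; the
# "verifier" knows the target (both are stand-ins for real components).
def toy_model(query, history):
    tried = {a for a, _ in history}
    for guess in ["A", "B", "O"]:
        if guess not in tried:
            return guess
    return "O"

def toy_verifier(query, answer):
    ok = answer == "O"
    return ok, "" if ok else "that point is not the circle's center"

result = answer_with_self_correction(
    toy_model, toy_verifier, "Which point is the center of the circle?"
)
print(result)  # prints "O" after two corrected attempts
```

The essential design point is that the regeneration step sees the full history of rejected answers and feedback, so each retry is conditioned on strictly more information than the last.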
Key Findings and Future Impact
The experiments revealed that even state-of-the-art MLLMs struggle with geometric REC, underscoring the necessity of explicitly evaluating and strengthening geometric grounding. However, models trained on GeoRef, especially with the GRPO and verify-and-regenerate mechanisms, showed significant improvements. For instance, GRPO alone provided a substantial performance gain over SFT, and the verify-and-regenerate mechanism further enhanced accuracy, particularly for tasks involving localized visual recognition.
Crucially, the research demonstrated that models trained on GeoRef also showed measurable improvements on downstream geometric reasoning tasks. This highlights the broader value of REC as a foundational capability for enhancing multimodal mathematical understanding in AI systems. By teaching AI to truly ‘see’ and interpret geometric diagrams, GeoRef paves the way for more robust and genuinely intelligent AI for geometric problem-solving.