TLDR: DiagramIR is an automatic pipeline that evaluates educational math diagrams generated by LLMs. It works by translating LaTeX TikZ code into an intermediate representation (IR) and then applying rule-based checks for mathematical and spatial correctness. This method shows higher agreement with human raters than LLM-as-a-Judge approaches and allows smaller, more cost-effective models to perform comparably to larger ones, making AI-powered education tools more scalable and accessible.
Large Language Models (LLMs) are becoming increasingly popular as learning tools, but their primary reliance on text limits their effectiveness in subjects like mathematics, where visual aids are crucial. While LLMs can generate educational figures, a significant challenge has been the scalable and accurate evaluation of these diagrams.
Addressing this challenge, researchers from Stanford University and KTH Royal Institute of Technology have introduced DiagramIR, an automatic and scalable evaluation pipeline designed specifically for geometric figures. The method relies on intermediate representations (IRs) of LaTeX TikZ code; TikZ is a common way to create diagrams programmatically.
The core idea behind DiagramIR is “back-translation.” This involves an LLM translating the complex TikZ code into a more structured, machine-interpretable intermediate representation. Once in this simplified IR format, a series of rule-based checks can be automatically applied to evaluate the diagram. These checks assess various aspects, including mathematical correctness (e.g., labeled angles matching drawn angles, proportions) and spatial correctness (e.g., diagram fully in frame, elements readable, no problematic overlaps).
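To make the rule-based step concrete, here is a minimal sketch of what checks over an IR could look like. It assumes the LLM back-translation has already produced the IR; the `DiagramIR`, `Point`, and `AngleLabel` classes, their field names, and the tolerance value are all hypothetical illustrations, not the schema or rules used in the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Point:
    x: float
    y: float


@dataclass
class AngleLabel:
    vertex: str     # name of the point where the angle sits
    arm_a: str      # name of a point on the first ray
    arm_b: str      # name of a point on the second ray
    degrees: float  # value printed in the diagram's label


@dataclass
class DiagramIR:
    # Hypothetical IR: named points, labeled angles, and a visible frame
    # given as (xmin, ymin, xmax, ymax).
    points: dict
    angle_labels: list = field(default_factory=list)
    frame: tuple = (0.0, 0.0, 10.0, 10.0)


def drawn_angle(ir: DiagramIR, label: AngleLabel) -> float:
    """Return the angle (0-180 degrees) actually formed by the coordinates."""
    v, a, b = (ir.points[n] for n in (label.vertex, label.arm_a, label.arm_b))
    ang_a = math.atan2(a.y - v.y, a.x - v.x)
    ang_b = math.atan2(b.y - v.y, b.x - v.x)
    deg = abs(math.degrees(ang_a - ang_b)) % 360
    return min(deg, 360 - deg)


def check_angles(ir: DiagramIR, tol: float = 1.0) -> list:
    """Mathematical check: each labeled angle should match the drawn angle."""
    issues = []
    for label in ir.angle_labels:
        actual = drawn_angle(ir, label)
        if abs(actual - label.degrees) > tol:
            issues.append(
                f"angle at {label.vertex}: labeled {label.degrees:.1f}, "
                f"drawn {actual:.1f}"
            )
    return issues


def check_in_frame(ir: DiagramIR) -> list:
    """Spatial check: every point should lie inside the visible frame."""
    xmin, ymin, xmax, ymax = ir.frame
    return [
        f"point {name} is outside the frame"
        for name, p in ir.points.items()
        if not (xmin <= p.x <= xmax and ymin <= p.y <= ymax)
    ]


# Example: a 3-4-5 right triangle with one wrong label and one stray point.
ir = DiagramIR(
    points={"A": Point(0, 0), "B": Point(4, 0), "C": Point(0, 3), "D": Point(6, 1)},
    angle_labels=[AngleLabel("A", "B", "C", 90.0), AngleLabel("B", "A", "C", 45.0)],
    frame=(0.0, 0.0, 5.0, 5.0),
)
angle_issues = check_angles(ir)
frame_issues = check_in_frame(ir)
print(angle_issues + frame_issues)
```

In this sketch the right angle at A passes, the angle at B is flagged because the coordinates give roughly 36.9 degrees rather than the labeled 45, and point D is flagged for falling outside the frame. Because the checks are plain deterministic functions, the LLM is only needed for the translation step.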
The researchers compared DiagramIR against other evaluation methods, such as “LLM-as-a-Judge,” where an LLM directly evaluates the diagram or its code. Their findings show that DiagramIR achieves higher agreement with human raters. A notable outcome is that DiagramIR enables smaller, more cost-effective models like GPT-4.1-Mini to perform comparably to much larger models such as GPT-5, at an inference cost up to 10 times lower. This cost efficiency is vital for making AI-powered education technologies accessible and scalable.
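As an illustration of what "agreement with human raters" can mean in practice, the sketch below computes Cohen's kappa between human pass/fail judgments and an automatic evaluator's judgments. This article does not state which agreement metric the authors used, so the choice of kappa is an assumption, and the labels are made-up toy data rather than results from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): 1 = diagram passes a check, 0 = it fails.
human = [1, 1, 0, 0, 1, 0, 1, 1]
automatic = [1, 1, 0, 1, 1, 0, 1, 0]
kappa = cohens_kappa(human, automatic)
print(round(kappa, 3))
```

A chance-corrected metric is useful here because most diagrams may pass a given check, so raw percent agreement can look high even when an evaluator adds little information.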
The evaluation dataset used for DiagramIR is grounded in real-world scenarios, drawing from conversational data between teachers and an AI assistant for mathematics educators. It focuses on geometric constructions, such as 2D and 3D shapes, so the pipeline is tested on the kinds of diagram requests that commonly arise in educational settings.
While DiagramIR marks a significant step forward, the authors acknowledge certain limitations. The current rubric primarily focuses on mathematical and spatial correctness, leaving out the subjective aspect of pedagogical usefulness. Future work aims to expand the intermediate representation to cover more complex diagrams and integrate the method directly into diagram-generation tools. For more details, you can read the full research paper here.
In conclusion, DiagramIR offers a promising solution for the automated evaluation of mathematical diagrams, paving the way for more reliable, affordable, and scalable AI tools in education. By combining symbolic abstraction with lightweight inference, it empowers even smaller LLMs to contribute effectively to diagram assessment.


