TLDR: FractalBench is a new benchmark that evaluates whether multimodal AI systems can synthesize fractal-drawing programs from images. It tests visual-mathematical reasoning, specifically recursive program synthesis. The study found that while MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, Qwen 2.5-VL) usually generate code that runs (about 76% of outputs), only a small fraction (about 4%) reproduce the target fractal’s mathematical structure. Models perform better on geometric transformations (Koch curves: 17-21%) but almost always fail at branching recursion (trees: <2%), which points to a lack of true recursive abstraction. Surprisingly, direct code generation often outperformed reasoning-first prompts.
Mathematical reasoning is a cornerstone of intelligence, requiring the ability to abstract symbolic rules from visual patterns and infer infinite processes from finite observations. In the rapidly evolving field of artificial intelligence, a critical question arises: can multimodal AI systems, which combine visual and language understanding, truly grasp this complex form of reasoning?
A new research paper introduces a benchmark called FractalBench, designed to diagnose visual-mathematical reasoning in leading multimodal large language models (MLLMs). The study investigates whether these AI systems can synthesize executable Python code to reproduce fractals from images, thereby evaluating their capacity to bridge visual perception with mathematical abstraction.
Why Fractals?
Fractals are an ideal testbed for this challenge. They are geometric shapes that exhibit self-similarity, meaning they look roughly the same at any scale. Despite their often intricate appearance, fractals can be compactly defined by simple recursive rules, often formalized as Iterated Function Systems (IFS). This makes them well suited to testing an AI’s ability to infer an underlying generative process from visual evidence. Successfully synthesizing fractal code demands several interconnected capabilities: recognizing patterns that repeat at different scales, inferring precise geometric transformations (like rotations and scaling), and understanding the recursive nature of the generation process rather than just memorizing visible patterns.
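To make the idea of a compact recursive rule concrete, here is a minimal Python sketch of an IFS for a Sierpiński triangle. It is an illustration only and is not taken from the paper: the corner coordinates, the function names `apply_ifs` and `sierpinski_points`, and the seed point are all assumptions chosen for the example.

```python
# Illustrative IFS for a Sierpinski triangle (assumed coordinates, not from the paper).
# Each of the three maps moves a point halfway toward one corner of the triangle.
CORNERS = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]


def apply_ifs(points):
    """Apply every contraction map to every point: one IFS iteration."""
    return [
        ((x + cx) / 2.0, (y + cy) / 2.0)
        for (x, y) in points
        for (cx, cy) in CORNERS
    ]


def sierpinski_points(iterations, seed=(0.0, 0.0)):
    """Iterate the IFS starting from a single seed point."""
    points = [seed]
    for _ in range(iterations):
        points = apply_ifs(points)
    return points
```

Three short maps are the entire definition, yet each iteration triples the number of points, so the visible detail grows quickly while the rule stays the same size. Recovering a rule of this kind from pixels alone is the task the benchmark poses.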
Introducing FractalBench
FractalBench comprises 12 canonical fractals, each presenting distinct mathematical challenges. These include: Koch curves, which test geometric transformations; Sierpiński structures, probing multi-scale self-similarity; dragon curves, evaluating space-filling navigation; and tree fractals, assessing branching recursion. The benchmark uses 610 unique test images, generated with varying depths and colors to prevent models from relying on cached visual embeddings of common black fractals, ensuring genuine visual-mathematical reasoning.
The evaluation uses a minimalist ‘MinimalTurtle’ interface, which provides basic drawing commands like `move`, `turn`, `pen_up`, and `pen_down`. This intentional constraint forces models to abstract visual-to-symbolic rules rather than relying on complex library functions or memorized syntax, thus isolating the core reasoning capability.
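As a rough sketch of what a program against such an interface might look like, the Python below defines a stand-in turtle and a recursive Koch curve. The article names only the four commands, so the class body, the method signatures (distances in arbitrary units, angles in degrees, positive meaning counterclockwise), the `segments` list, and the `koch` function are assumptions made for illustration, not the benchmark’s actual API.

```python
import math


class MinimalTurtle:
    """Hypothetical stand-in for the benchmark's interface; signatures are assumed."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0
        self.heading = 0.0      # degrees, 0 = pointing along +x
        self.drawing = True
        self.segments = []      # line segments drawn so far

    def move(self, distance):
        nx = self.x + distance * math.cos(math.radians(self.heading))
        ny = self.y + distance * math.sin(math.radians(self.heading))
        if self.drawing:
            self.segments.append(((self.x, self.y), (nx, ny)))
        self.x, self.y = nx, ny

    def turn(self, angle):
        self.heading = (self.heading + angle) % 360.0

    def pen_up(self):
        self.drawing = False

    def pen_down(self):
        self.drawing = True


def koch(turtle, length, depth):
    """Draw a Koch curve: each segment is replaced by four segments a third as long."""
    if depth == 0:
        turtle.move(length)
        return
    # Turn after each sub-curve: left 60, right 120, left 60, then no turn.
    for angle in (60, -120, 60, 0):
        koch(turtle, length / 3.0, depth - 1)
        turtle.turn(angle)
```

Calling `koch(MinimalTurtle(), 1.0, 3)` draws 64 segments and ends one unit to the right of the start. With only these primitives, everything that distinguishes one fractal from another (the turn angles, the scale factor, the shape of the recursion) has to be inferred by the model.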
Key Findings: A Striking Disconnect
The researchers evaluated four prominent MLLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash, and Qwen 2.5-VL) using three prompting strategies: Direct Code Generation, Reasoning Then Code, and Recursive Structure Focus. The results revealed a significant gap between syntactic competence and semantic understanding. While 76.1% of the generated code was syntactically valid and executed successfully, only 4.2% produced visually correct fractals. This indicates that models can generate functional Python code but often fail to infer the *correct* generative mathematical rule, instead implementing some recursive pattern that doesn’t match the target fractal.
Performance varied systematically across fractal types. Koch fractals, which primarily rely on iterative geometric transformations, achieved the highest success rates (17-21%). This suggests that models can compose basic geometric operations. However, even here, a failure rate of roughly 80% highlights a limitation: geometric intuition alone is insufficient without true recursive abstraction. Sierpiński fractals showed moderate performance (3-18%), indicating that models recognize visual similarity but struggle to infer precise scale invariance. Tree fractals, despite having simpler mathematical definitions, proved catastrophically difficult, with less than 2% accuracy. This failure points to a specific bottleneck: branching recursion, where a single parent spawns multiple independent recursive children. Models often substituted iterative loops or single-branch recursion, failing to represent the exponentially growing tree-structured computation graphs.
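The difference between branching and single-branch recursion can be sketched in a few lines of Python. This is an illustrative example rather than code from the paper or from any model output; the function names and the `spread` and `shrink` parameters are assumptions chosen for the sketch.

```python
import math


def tree_segments(x, y, heading, length, depth, spread=30.0, shrink=0.7):
    """Binary tree fractal: every branch spawns two independent child branches."""
    if depth == 0:
        return []
    nx = x + length * math.cos(math.radians(heading))
    ny = y + length * math.sin(math.radians(heading))
    segments = [((x, y), (nx, ny))]
    # Both children start from the same branch tip, with their own headings.
    segments += tree_segments(nx, ny, heading + spread, length * shrink, depth - 1, spread, shrink)
    segments += tree_segments(nx, ny, heading - spread, length * shrink, depth - 1, spread, shrink)
    return segments


def single_branch_segments(x, y, heading, length, depth, spread=30.0, shrink=0.7):
    """The failure mode described above: only one recursive child per call."""
    if depth == 0:
        return []
    nx = x + length * math.cos(math.radians(heading))
    ny = y + length * math.sin(math.radians(heading))
    return [((x, y), (nx, ny))] + single_branch_segments(
        nx, ny, heading + spread, length * shrink, depth - 1, spread, shrink
    )
```

At depth 5 the branching version yields 31 segments (2^5 - 1), while the single-branch version yields only 5. The two functions differ by a single recursive call, which is why code of the second kind can run without error and still draw a bent line instead of a tree.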
Intriguingly, the study found that direct code generation often outperformed reasoning-first approaches, which runs counter to the chain-of-thought advantage typically seen in other mathematical reasoning tasks. The researchers hypothesize that verbose intermediate reasoning might interfere with precise visual-to-code synthesis, possibly by anchoring models on high-level descriptions that are difficult to translate into exact geometric parameters.
Implications for AI
FractalBench provides a diagnostic framework for understanding visual-mathematical reasoning in AI systems. The findings suggest that current MLLMs possess geometric capabilities but fundamentally lack recursive mathematical abstraction. This work offers a contamination-resistant method for evaluating progress in AI’s ability to integrate visual perception with symbolic mathematical reasoning, with implications for domains such as educational AI, formal verification tools, and scientific discovery pipelines. For more details, see the full research paper.


