TLDR: PyVision is a novel agentic framework that enables multimodal large language models (MLLMs) to dynamically generate and execute Python code for visual reasoning tasks. Unlike traditional systems with static toolsets, PyVision allows MLLMs to invent custom tools on demand, leading to significant performance improvements across diverse benchmarks and fostering more flexible, interpretable, and adaptive problem-solving in computer vision.
In the rapidly evolving landscape of artificial intelligence, large language models (LLMs) are increasingly taking on roles as agents capable of planning, reasoning, and interacting with external tools. When it comes to visual reasoning, however, many existing approaches are constrained to predefined workflows and static toolsets, which limits their flexibility and adaptability on complex visual tasks.
A groundbreaking new framework called PyVision aims to change this by empowering multimodal LLMs (MLLMs) to autonomously generate, execute, and refine Python-based tools tailored specifically to the task at hand. This innovative approach unlocks a new level of flexible and interpretable problem-solving in the visual domain.
What is PyVision and How Does It Work?
PyVision operates as an interactive, multi-turn framework. Imagine an MLLM receiving a visual query; instead of relying on a fixed set of pre-programmed functions, PyVision enables the MLLM to write its own Python code in response. This code is then executed within an isolated environment, and the resulting output – which can be text, images, or both – is fed back to the MLLM. This creates a continuous loop where the model can iterate and refine its reasoning over multiple turns until it arrives at a final answer.
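To make this loop concrete, here is a minimal sketch of how such an interaction could be orchestrated. Everything here is illustrative: `query_mllm`, the message format, and the stub sandbox are placeholders standing in for PyVision's actual backend API and isolated runtime, which the paper does not specify at this level of detail.

```python
import contextlib
import io

def query_mllm(messages):
    """Placeholder for the backend MLLM call (e.g., GPT-4.1 via an API).
    Assumed to return either a code block to run or a final answer."""
    raise NotImplementedError

def run_in_sandbox(code, namespace):
    """Execute model-written code and capture its textual output.
    A real deployment would use an isolated process or container, not exec()."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # NOTE: untrusted code; isolation is essential
    return buffer.getvalue()

def pyvision_loop(image, question, max_turns=5):
    namespace = {"image": image}  # the query image is in scope for generated code
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = query_mllm(messages)
        if reply["type"] == "answer":      # the model has finished reasoning
            return reply["content"]
        output = run_in_sandbox(reply["content"], namespace)
        messages.append({"role": "tool", "content": output})  # feed results back
    return None
```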
The power of PyVision lies in its ability to leverage Python’s extensive ecosystem of scientific and vision libraries, such as OpenCV, Pillow, NumPy, Pandas, Scikit-learn, and Scikit-image. These libraries serve as fundamental building blocks, allowing the MLLM to construct highly adaptive tools on the fly rather than being confined to a limited, predefined toolkit.
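As a taste of what building tools from these libraries looks like, here is the kind of throwaway helper the model might emit mid-conversation, a sketch using Pillow; the file names, crop box, and contrast factor are made-up values, not from the paper.

```python
from PIL import Image, ImageEnhance

def crop_and_enhance(image, box, factor=2.0):
    """box = (left, upper, right, lower) in pixel coordinates."""
    region = image.crop(box)                              # zoom into a region of interest
    return ImageEnhance.Contrast(region).enhance(factor)  # boost faint detail

# Hypothetical usage with made-up paths and coordinates:
img = Image.open("query.png")
detail = crop_and_enhance(img, box=(120, 80, 360, 240))
detail.save("detail.png")  # handed back to the MLLM as a new visual observation
```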
The Dynamic Tooling Advantage
The research paper categorizes the types of tools PyVision can generate into several broad classes (illustrative sketches of a few of them follow the list):
- Basic Image Processing: This includes fundamental operations like cropping to focus on specific regions, rotating misaligned images, and enhancing contrast to make subtle details more visible.
- Advanced Image Processing: PyVision can dynamically create tools for more complex tasks such as segmenting specific regions, detecting objects by generating bounding boxes, and even performing Optical Character Recognition (OCR) to extract text from images without external APIs.
- Visual Prompting and Sketching: To aid its own reasoning, PyVision can annotate images with auxiliary markings, like dots for counting objects or lines for geometric analysis, essentially creating visual notes.
- Numerical and Statistical Analysis: For quantitative reasoning, PyVision can generate code to analyze pixel intensity distributions (histograms) or compute metrics like areas and lengths for symbolic reasoning.
- Long-Tail Operations: Beyond these categories, PyVision demonstrates the ability to invent novel, task-specific tools, such as directly subtracting pixel values between two images to find differences in a “spot the difference” puzzle.
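Here are minimal sketches of the single-use tools a model could write for three of these categories: visual prompting, histogram analysis, and pixel subtraction. The function names, thresholds, and parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from PIL import Image, ImageDraw

# Visual prompting: mark candidate object centers with dots so the model
# can literally count its own annotations on the returned image.
def annotate_points(image, points, radius=4):
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for x, y in points:
        draw.ellipse((x - radius, y - radius, x + radius, y + radius), fill="red")
    return out

# Numerical analysis: a pixel-intensity histogram for quantitative reasoning.
def intensity_histogram(image, bins=16):
    gray = np.asarray(image.convert("L"))
    counts, edges = np.histogram(gray, bins=bins, range=(0, 255))
    return counts, edges

# Long-tail: direct pixel subtraction for a "spot the difference" puzzle.
def difference_mask(image_a, image_b, threshold=30):
    a = np.asarray(image_a.convert("L"), dtype=np.int16)  # int16 avoids underflow
    b = np.asarray(image_b.convert("L"), dtype=np.int16)
    return np.abs(a - b) > threshold  # boolean mask of changed pixels
```

Each helper returns something the interaction loop can hand straight back to the model: an annotated image, a small array of counts, or a boolean mask.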
This dynamic tool generation allows PyVision to adapt its strategy to the unique demands of each visual task, moving beyond superficial pattern matching to grounded, verifiable visual reasoning.
Impact and Performance
The effectiveness of PyVision has been demonstrated across various benchmarks. It consistently improves the performance of strong backend models like GPT-4.1 and Claude-4.0-Sonnet. For instance, PyVision-GPT-4.1 showed a significant +7.8% gain on the V* fine-grained visual search benchmark, while PyVision-Claude-4.0-Sonnet achieved a remarkable +31.1% increase on VLMsAreBlind-mini, a symbolic vision task.
Interestingly, the study found that PyVision acts as an amplifier, boosting the backend model’s inherent strengths. If a model is stronger in abstract reasoning, PyVision enhances that; if it’s stronger in perception, PyVision amplifies its perceptual capabilities. This suggests a crucial interplay between the model’s foundational strengths and the benefits of dynamic tooling.
In essence, PyVision represents a significant step forward in multimodal reasoning. By empowering AI models to invent new computational tools on the fly, it paves the way for more versatile, autonomous, and genuinely creative AI systems capable of adapting to complex real-world visual reasoning scenarios. You can read the full research paper here.


