TLDR: A new benchmark called AVI-MATH has been introduced to test how well AI models can perform mathematical reasoning using drone images. The study found that current vision-language models struggle with complex tasks like calculating distances or areas from aerial views, often because they lack domain-specific knowledge and fail to perceive small objects. While techniques like fine-tuning and Chain-of-Thought prompting show some improvement, there’s a significant gap between open-source models and advanced ones like GPT-4o, highlighting the need for more robust AI for autonomous drone applications.
A new research paper introduces AVI-MATH, the first benchmark designed to rigorously evaluate how well vision-language models (VLMs) can perform mathematical reasoning using images captured by aerial vehicles. This is a crucial step for tasks like precise distance and area calculations, trajectory estimations, and spatial analysis in remote sensing, which are vital for autonomous drone systems.
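To give a sense of the kind of calculation involved, here is a minimal sketch of how a pixel measurement in a drone image can be turned into a real-world distance or area. It uses the standard ground sampling distance (GSD) relation for a camera pointing straight down. The function names and the camera values (altitude, focal length, sensor width) are hypothetical and chosen for illustration only; they are not taken from the paper, and tilted camera angles would need additional geometry.

```python
def ground_sampling_distance(altitude_m, focal_length_mm, sensor_width_mm, image_width_px):
    """Metres of ground covered by one pixel, assuming a nadir (straight-down) view."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)


def pixel_length_to_metres(length_px, gsd_m_per_px):
    """Convert a length measured in pixels to metres on the ground."""
    return length_px * gsd_m_per_px


def pixel_area_to_square_metres(area_px, gsd_m_per_px):
    """Convert an area measured in pixels to square metres on the ground."""
    return area_px * gsd_m_per_px ** 2


# Hypothetical camera and flight parameters (assumptions, not from the paper).
gsd = ground_sampling_distance(
    altitude_m=100.0,
    focal_length_mm=8.8,
    sensor_width_mm=13.2,
    image_width_px=3840,
)

# A vehicle that spans 115 pixels along its length in the image.
vehicle_length_m = pixel_length_to_metres(115, gsd)

# A parking bay that covers 8,000 pixels in the image.
bay_area_m2 = pixel_area_to_square_metres(8000, gsd)

print(f"GSD: {gsd:.4f} m/px")
print(f"Vehicle length: {vehicle_length_m:.2f} m")
print(f"Bay area: {bay_area_m2:.2f} m^2")
```

A question of this type requires the model to read object extents from the image, know the relation above, and carry out the arithmetic correctly, which is why it needs both perception and domain knowledge.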
Current vision-language models, despite their successes in other areas, haven’t been adequately tested in this specific domain. Existing datasets for remote sensing visual question answering (VQA) often focus on simpler visual perception tasks or basic counting, rather than complex mathematical problems that require domain-specific knowledge in geometry, logic, and algebra.
To address this gap, the researchers developed AVI-MATH, a comprehensive benchmark featuring 3,773 high-quality, vehicle-related questions derived from drone imagery. These questions span six mathematical subjects: geometry, logic, statistics, arithmetic, counting, and algebra, covering 20 distinct topics. The data was collected under various real-world drone scenarios, including different altitudes and camera angles, ensuring the problems are diverse and complex.
The study benchmarked 14 prominent VLMs and found that these models generally struggle with the reasoning tasks in AVI-MATH. Even advanced models like GPT-4o, while performing better than others, still showed significant limitations. For instance, GPT-4o achieved an overall accuracy of only 34.6%, highlighting a substantial gap in the mathematical reasoning capabilities of current VLMs when applied to aerial imagery.
The analysis revealed several key limitations. Models using older vision encoders, like CLIP-ViT, performed poorly due to constraints on processing long visual token sequences, which are essential for retaining detailed information from high-resolution images. The research also found that a correct answer doesn’t always mean the model used correct reasoning; sometimes, models arrived at the right answer without a full understanding of the underlying logic. A major cause of errors was the lack of domain-specific knowledge in remote sensing, followed by insufficient visual perception for small objects in complex drone images.
The importance of higher input resolution for these tasks was also emphasized. While AVI-MATH images are 4K resolution, models often downsample them, leading to a loss of crucial object details, especially for vehicles captured from high altitudes. The study showed that increasing resolution generally improved performance, though the gains were sometimes less than expected, possibly due to how foundation models are trained or the visual encoder’s token compression.
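The effect of downsampling can be illustrated with some simple arithmetic. The sketch below assumes a 3840-pixel-wide source image scaled uniformly down to a 336-pixel-wide model input with 14-pixel patches, which matches one common CLIP-ViT configuration; whether any of the benchmarked models used exactly these settings is an assumption, and the 40-pixel vehicle size is hypothetical.

```python
def downsampled_size(object_px, source_width_px, target_width_px):
    """Size of an object in pixels after the image is uniformly rescaled."""
    return object_px * target_width_px / source_width_px


def patch_token_count(input_px, patch_px):
    """Number of visual tokens for a square input split into square patches."""
    patches_per_side = input_px // patch_px
    return patches_per_side * patches_per_side


# Hypothetical example: a vehicle 40 px long in a 3840 px wide image,
# fed to an encoder that expects a 336 px wide input with 14 px patches.
vehicle_px_after = downsampled_size(40, source_width_px=3840, target_width_px=336)
tokens = patch_token_count(input_px=336, patch_px=14)

print(f"Vehicle length after downsampling: {vehicle_px_after} px")
print(f"Visual tokens available to the model: {tokens}")
```

Under these assumptions the vehicle shrinks to a few pixels, well under the width of a single patch, so details such as its outline or orientation are largely lost before the language model sees them.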
Furthermore, question difficulty rose with the number of reasoning steps required: questions demanding more steps proved significantly harder for all models. Interestingly, fine-tuning models with task-specific remote sensing instruction sets sometimes degraded their generalization abilities, suggesting that current fine-tuning methods may lead models to fit the training data rather than develop genuine reasoning ability.
The researchers explored techniques like Chain-of-Thought prompting and parameter-efficient fine-tuning (LoRA) using a new 215k-sample instruction set called AVI-MATH-215K. Both methods showed promise in improving VLM performance on the benchmark, with LoRA fine-tuning leading to substantial accuracy gains for some models. However, Chain-of-Thought prompting was not universally beneficial, with some models experiencing performance drops, indicating that its effectiveness depends on the model’s inherent multi-step reasoning capabilities.
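For readers unfamiliar with LoRA, the following is a minimal sketch of the general idea, not the authors' training code. Instead of updating a full weight matrix W, LoRA trains two small matrices B and A and uses W + (alpha / rank) * B A as the effective weight. The layer size, rank, and alpha values below are hypothetical and are not taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Compute W + (alpha / rank) * B @ A.

    w is d_out x d_in (frozen), b is d_out x rank, a is rank x d_in (both trainable).
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical layer size and rank, for illustration only.
full_params, lora_params = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(f"Full fine-tuning: {full_params} parameters")
print(f"LoRA (rank 8):    {lora_params} parameters")

# Tiny worked example of the low-rank update.
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
b = [[1.0], [2.0]]             # 2 x 1
a = [[3.0, 4.0]]               # 1 x 2
w_eff = lora_effective_weight(w, a, b, alpha=2.0, rank=1)
print(w_eff)
```

Because only the small matrices are trained, the number of updated parameters is a small fraction of the full matrix, which is what makes this kind of fine-tuning practical on large VLMs.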
This work not only exposes the limitations of current VLMs in mathematical reasoning for aerial vehicle imagery but also offers insights for advancing trustworthy, UAV-based VLMs for real-world applications. The code and datasets are publicly available at https://github.com/VisionXLab/avi-math, providing a resource for future research in this area.


