TLDR: AstroMMBench is the first benchmark for evaluating multimodal large language models (MLLMs) in astronomical image interpretation. It features 621 expert-reviewed multiple-choice questions across six astrophysics subfields. Evaluations of 25 MLLMs showed the open-source Ovis2-34B as the top performer, even surpassing leading closed-source models. The benchmark highlights the potential of open-source models in specialized scientific tasks and reveals varied performance across different astronomical domains, with some subfields proving more challenging for current MLLMs.
Astronomical image interpretation is a complex and crucial task for understanding the universe, but it poses a significant challenge for modern artificial intelligence, specifically Multimodal Large Language Models (MLLMs). These advanced AI models, which combine the power of language understanding with visual comprehension, have struggled to accurately interpret the specialized and intricate data found in astronomy.
To address this critical gap, researchers have introduced AstroMMBench, the first comprehensive benchmark designed specifically to evaluate how well MLLMs can understand astronomical images. This new benchmark aims to provide a standardized way to measure and guide the development of AI models for scientific applications in astronomy.
AstroMMBench is built upon 621 carefully crafted multiple-choice questions. These questions span six major subfields of astrophysics: Astrophysics of Galaxies, Cosmology and Nongalactic Astrophysics, Earth and Planetary Astrophysics, High Energy Astrophysical Phenomena, Instrumentation and Methods for Astrophysics, and Solar and Stellar Astrophysics. To ensure quality and relevance, the questions were curated and rigorously reviewed by a panel of 15 domain experts, each holding an advanced degree in astronomy or a related field.
The creation of AstroMMBench involved an innovative automated pipeline. This process began by collecting image-text pairs from recent astrophysical papers on arXiv, focusing on submissions between January and July 2024. An AI model, LLaMA3.3-70B-Instruct, was then used to refine the textual descriptions, ensuring clarity and consistency. Following this, InternVL2.5-78B generated the multiple-choice questions. A multi-stage review process, involving five other large language models and ultimately human experts, filtered these questions to ensure they required genuine visual understanding and specialized astronomical knowledge.
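This description maps naturally onto a three-stage script: refine the caption, generate a question, then filter it through review. The sketch below is purely illustrative; the function and field names are invented for this post, and the bodies are stubs standing in for the actual model calls (caption refinement, question generation, and the multi-stage LLM-plus-human review).

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    image_path: str
    question: str
    options: list[str]
    answer: str

def refine_caption(raw_caption: str) -> str:
    # Stub: the paper uses LLaMA3.3-70B-Instruct to clean and standardize captions.
    return raw_caption.strip()

def generate_mcq(image_path: str, caption: str) -> MCQ:
    # Stub: the paper uses InternVL2.5-78B to turn an image-caption pair into a question.
    return MCQ(image_path,
               f"Which statement best describes this figure? ({caption[:40]}...)",
               ["A) ...", "B) ...", "C) ...", "D) ..."], "A")

def passes_review(mcq: MCQ) -> bool:
    # Stub: the real pipeline queries five reviewer LLMs and then human experts;
    # here we only apply a trivial sanity check.
    return bool(mcq.question) and len(mcq.options) == 4

def build_benchmark(pairs: list[tuple[str, str]]) -> list[MCQ]:
    questions = []
    for image_path, raw_caption in pairs:
        caption = refine_caption(raw_caption)    # stage 1: caption refinement
        mcq = generate_mcq(image_path, caption)  # stage 2: question generation
        if passes_review(mcq):                   # stage 3: multi-stage review
            questions.append(mcq)
    return questions
```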
An extensive evaluation was conducted using AstroMMBench on 25 diverse MLLMs. This included 22 open-source models and 3 powerful closed-source models. The evaluation utilized the VLMEvalKit framework, with accuracy as the primary metric. The results revealed significant variations in performance across these models.
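VLMEvalKit handles the prompting and answer extraction; the accuracy metric itself is simply the fraction of questions answered correctly, optionally broken down per subfield. A minimal, self-contained sketch of that computation follows (the record keys `subfield`, `prediction`, and `answer` are assumptions for illustration, not the benchmark's actual schema):

```python
from collections import defaultdict

def accuracy_by_subfield(records: list[dict]) -> dict[str, float]:
    """Compute per-subfield accuracy from prediction records."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["subfield"]] += 1
        correct[rec["subfield"]] += int(rec["prediction"] == rec["answer"])
    return {sf: correct[sf] / total[sf] for sf in total}

# Example with dummy records (not real benchmark data):
records = [
    {"subfield": "IM", "prediction": "B", "answer": "B"},
    {"subfield": "CO", "prediction": "A", "answer": "C"},
]
print(accuracy_by_subfield(records))  # {'IM': 1.0, 'CO': 0.0}
```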
Remarkably, the open-source Ovis2-34B model achieved the highest overall accuracy, scoring 70.53%. This performance surpassed even leading closed-source models like ChatGPT-4o (69.07%) and Doubao-1.5-Vision-Pro (68.12%). This finding highlights the rapid advancements and strong potential of open-source MLLMs in tackling specialized scientific tasks.
The study also found a strong positive correlation (Pearson correlation coefficient r=0.82) between a model’s general multimodal capabilities (as measured by OpenCompass scores) and its performance on AstroMMBench. This suggests that models that perform well on general tasks tend to also do well in astrophysics. However, there were exceptions, indicating that domain-specific challenges in astronomy require more than just general AI prowess.
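The reported value is an ordinary Pearson correlation between each model's OpenCompass score and its AstroMMBench accuracy. A minimal illustration with dummy numbers (not the paper's data):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Dummy scores for illustration only.
opencompass_scores = [62.0, 58.5, 70.1, 65.3, 55.0]
astrommbench_acc   = [64.2, 60.0, 70.5, 66.8, 57.3]

r = correlation(opencompass_scores, astrommbench_acc)
print(f"Pearson r = {r:.2f}")
```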
Performance varied significantly across the different astrophysical subfields. Models generally performed better in areas like Instrumentation and Methods for Astrophysics (IM) and Solar and Stellar Astrophysics (SR). These subfields often involve interpreting standard astronomical plots and recognizing common objects, skills that might align with general visual training. Conversely, domains such as Cosmology and Nongalactic Astrophysics (CO) and High Energy Astrophysical Phenomena (HE) proved more challenging. These areas typically demand a deeper understanding of abstract theoretical concepts and the interpretation of highly specialized or unconventional visualizations.
AstroMMBench serves as a foundational resource and a dynamic tool to drive progress at the intersection of AI and astronomy. While the current benchmark size and task diversity have limitations, future work aims to expand it with more diverse question types and to further refine the automated question generation process. For more details, you can read the full research paper here.


