TLDR: AstroMMBench is the first benchmark for evaluating multimodal large language models (MLLMs) in astronomical image interpretation. It features 621 expert-reviewed multiple-choice questions across six astrophysics subfields. Evaluations of 25 MLLMs showed the open-source Ovis2-34B as the top performer, even surpassing leading closed-source models. The benchmark highlights the potential of open-source models in specialized scientific tasks and reveals varied performance across different astronomical domains, with some subfields proving more challenging for current MLLMs.
Astronomical image interpretation is a complex and crucial task for understanding the universe, but it poses a significant challenge for modern artificial intelligence, specifically Multimodal Large Language Models (MLLMs). These advanced AI models, which combine the power of language understanding with visual comprehension, have struggled to accurately interpret the specialized and intricate data found in astronomy.
To address this critical gap, researchers have introduced AstroMMBench, the first comprehensive benchmark designed specifically to evaluate how well MLLMs can understand astronomical images. This new benchmark aims to provide a standardized way to measure and guide the development of AI models for scientific applications in astronomy.
AstroMMBench is built upon 621 carefully crafted multiple-choice questions. These questions span six major subfields of astrophysics: Astrophysics of Galaxies, Cosmology and Nongalactic Astrophysics, Earth and Planetary Astrophysics, High Energy Astrophysical Phenomena, Instrumentation and Methods for Astrophysics, and Solar and Stellar Astrophysics. To ensure quality and relevance, the questions were curated and rigorously reviewed by a panel of 15 domain experts, each holding an advanced degree in astronomy or a related field.
The creation of AstroMMBench involved an innovative automated pipeline. This process began by collecting image-text pairs from recent astrophysical papers on arXiv, focusing on submissions between January and July 2024. An AI model, LLaMA3.3-70B-Instruct, was then used to refine the textual descriptions, ensuring clarity and consistency. Following this, InternVL2.5-78B generated the multiple-choice questions. A multi-stage review process, involving five other large language models and ultimately human experts, filtered these questions to ensure they required genuine visual understanding and specialized astronomical knowledge.
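This description maps naturally onto a three-stage script: refine the caption, generate a question, then filter it through review. The sketch below is purely illustrative; the function and field names are invented for this post, and the bodies are stubs standing in for the actual model calls (caption refinement, question generation, and the multi-stage LLM-plus-human review).

```python
from dataclasses import dataclass

@dataclass
class MCQ:
    image_path: str
    question: str
    options: list[str]
    answer: str

def refine_caption(raw_caption: str) -> str:
    # Stub: the paper uses LLaMA3.3-70B-Instruct to clean and standardize captions.
    return raw_caption.strip()

def generate_mcq(image_path: str, caption: str) -> MCQ:
    # Stub: the paper uses InternVL2.5-78B to turn an image-caption pair into a question.
    return MCQ(image_path,
               f"Which statement best describes this figure? ({caption[:40]}...)",
               ["A) ...", "B) ...", "C) ...", "D) ..."], "A")

def passes_review(mcq: MCQ) -> bool:
    # Stub: the real pipeline queries five reviewer LLMs and then human experts;
    # here we only apply a trivial sanity check.
    return bool(mcq.question) and len(mcq.options) == 4

def build_benchmark(pairs: list[tuple[str, str]]) -> list[MCQ]:
    questions = []
    for image_path, raw_caption in pairs:
        caption = refine_caption(raw_caption)    # stage 1: caption refinement
        mcq = generate_mcq(image_path, caption)  # stage 2: question generation
        if passes_review(mcq):                   # stage 3: multi-stage review
            questions.append(mcq)
    return questions
```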
An extensive evaluation was conducted using AstroMMBench on 25 diverse MLLMs. This included 22 open-source models and 3 powerful closed-source models. The evaluation utilized the VLMEvalKit framework, with accuracy as the primary metric. The results revealed significant variations in performance across these models.
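VLMEvalKit handles the prompting and answer extraction; the accuracy metric itself is simply the fraction of questions answered correctly, optionally broken down per subfield. A minimal, self-contained sketch of that computation follows (the record keys `subfield`, `prediction`, and `answer` are assumptions for illustration, not the benchmark's actual schema):

```python
from collections import defaultdict

def accuracy_by_subfield(records: list[dict]) -> dict[str, float]:
    """Compute per-subfield accuracy from prediction records."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["subfield"]] += 1
        correct[rec["subfield"]] += int(rec["prediction"] == rec["answer"])
    return {sf: correct[sf] / total[sf] for sf in total}

# Example with dummy records (not real benchmark data):
records = [
    {"subfield": "IM", "prediction": "B", "answer": "B"},
    {"subfield": "CO", "prediction": "A", "answer": "C"},
]
print(accuracy_by_subfield(records))  # {'IM': 1.0, 'CO': 0.0}
```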
Remarkably, the open-source Ovis2-34B model achieved the highest overall accuracy, scoring 70.53%. This performance surpassed even leading closed-source models like ChatGPT-4o (69.07%) and Doubao-1.5-Vision-Pro (68.12%). This finding highlights the rapid advancements and strong potential of open-source MLLMs in tackling specialized scientific tasks.
The study also found a strong positive correlation (Pearson correlation coefficient r=0.82) between a model’s general multimodal capabilities (as measured by OpenCompass scores) and its performance on AstroMMBench. This suggests that models that perform well on general tasks tend to also do well in astrophysics. However, there were exceptions, indicating that domain-specific challenges in astronomy require more than just general AI prowess.
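The reported value is an ordinary Pearson correlation between each model's OpenCompass score and its AstroMMBench accuracy. A minimal illustration with dummy numbers (not the paper's data):

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Dummy scores for illustration only.
opencompass_scores = [62.0, 58.5, 70.1, 65.3, 55.0]
astrommbench_acc   = [64.2, 60.0, 70.5, 66.8, 57.3]

r = correlation(opencompass_scores, astrommbench_acc)
print(f"Pearson r = {r:.2f}")
```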
Performance varied significantly across the different astrophysical subfields. Models generally performed better in areas like Instrumentation and Methods for Astrophysics (IM) and Solar and Stellar Astrophysics (SR). These subfields often involve interpreting standard astronomical plots and recognizing common objects, skills that might align with general visual training. Conversely, domains such as Cosmology and Nongalactic Astrophysics (CO) and High Energy Astrophysical Phenomena (HE) proved more challenging. These areas typically demand a deeper understanding of abstract theoretical concepts and the interpretation of highly specialized or unconventional visualizations.
AstroMMBench serves as a foundational resource and a dynamic tool to drive progress at the intersection of AI and astronomy. While the current benchmark size and task diversity have limitations, future work aims to expand it with more diverse question types and to further refine the automated question generation process. For more details, you can read the full research paper here.


