TLDR: BMMR is a large, bilingual, multimodal, and multidisciplinary dataset (110k college-level questions across 300 subjects) designed to evaluate and develop Large Multimodal Models (LMMs). It includes BMMR-Eval for assessment and BMMR-Train for fine-tuning. The paper also introduces BMMR-Verifier for detailed reasoning path evaluation. Experiments show current LMMs have significant room for improvement, open-source models benefit greatly from BMMR-Train, and models exhibit discipline-specific biases and common errors like overthinking and hallucination.
The rapid advancement of Large Multimodal Models (LMMs) and Large Reasoning Models (LRMs) has opened new possibilities for artificial intelligence, enabling models to process and reason over both text and visual information. These models show impressive capabilities in fields like mathematics, physics, and chemistry. However, evaluating their true knowledge and reasoning abilities across a wide range of academic disciplines has become increasingly difficult.
Existing benchmarks often fall short in balancing subject diversity, problem complexity, reasoning depth, and language coverage. Many current evaluation methods are also starting to show signs of performance saturation, meaning top models are hitting a ceiling on these tests. Furthermore, the AI community, especially open-source developers, has been lacking a comprehensive multimodal, multidisciplinary training dataset that includes diverse questions and detailed reasoning paths.
Introducing BMMR: A Comprehensive Dataset for Advanced AI Reasoning
To address these critical gaps, researchers have introduced BMMR (Bilingual Multimodal Multi-Discipline Reasoning), a large-scale dataset designed to push the boundaries of LMM evaluation and development. BMMR comprises 110,000 college-level questions covering 300 UNESCO-defined subjects, spanning eight high-level disciplines to ensure a truly multidisciplinary assessment.
The dataset is unique in several ways:
- Bilingual Support: BMMR includes questions in both English and Chinese, allowing for the evaluation of cross-lingual reasoning capabilities.
- Diverse Formats: Questions are sourced from various print and digital media, such as books, exams, and quizzes, and come in multiple formats, including multiple-choice, fill-in-the-blank, and open-ended questions. This variety helps prevent models from simply memorizing answers.
- High-Quality Curation: All data undergoes a rigorous human-in-the-loop and scalable curation process. Each question is paired with a high-quality reasoning path, ensuring that every instance demands precise cross-modal comprehension, specialized domain knowledge, and advanced reasoning skills.
BMMR is divided into two main parts: BMMR-Eval and BMMR-Train. BMMR-Eval consists of 20,458 high-quality instances specifically designed to comprehensively assess LMMs’ knowledge and reasoning across disciplines in both Chinese and English. BMMR-Train, with 88,991 instances, is intended to support further research and development, helping to extend the focus of current models beyond just mathematical reasoning to a wider array of disciplines.
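As a concrete illustration, each BMMR instance can be thought of as a question paired with metadata (language, discipline, subject) and a reasoning path. The record schema below is hypothetical — the released dataset's actual field names may differ — and the helper simply filters records by split and language:

```python
# Hypothetical BMMR record layout, for illustration only.
sample = {
    "question": "A beam is loaded as shown in the figure...",
    "image": "path/to/figure.png",
    "language": "en",              # "en" or "zh"
    "discipline": "Engineering",   # one of eight high-level disciplines
    "subject": "Structural Mechanics",
    "answer": "12 kN",
    "reasoning_path": ["Step 1: sum the vertical forces...",
                       "Step 2: solve for the reaction..."],
    "split": "eval",               # "eval" (20,458) or "train" (88,991)
}

def filter_records(records, split, language=None):
    """Select records belonging to a split, optionally restricted to one language."""
    return [
        r for r in records
        if r["split"] == split and (language is None or r["language"] == language)
    ]

print(len(filter_records([sample], "eval", language="en")))  # 1
print(len(filter_records([sample], "train")))                # 0
```

A filter like this is how one would carve out, say, the Chinese-only evaluation subset for cross-lingual comparisons.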
BMMR-Verifier: Evaluating the Reasoning Process
Beyond just checking the final answer, understanding how a model arrives at its conclusion is crucial. To enable accurate and fine-grained evaluation of reasoning paths, the researchers also propose BMMR-Verifier. This process-based, bilingual, multimodal, and multidisciplinary verifier scores each step of a model’s reasoning path, helping to identify flaws even if the final answer is correct, and preventing models from simply guessing or recalling information.
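In spirit, a process-based verifier assigns each reasoning step a soundness score and then collapses those scores into a path-level judgment. The sketch below is a minimal illustration of that aggregation idea, not BMMR-Verifier's actual scoring model; `StepScore` and the `mean`/`min` strategies are assumptions made for the example:

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    step_index: int
    score: float  # verifier confidence in [0, 1] that this step is sound

def aggregate_path_score(step_scores, strategy="mean"):
    """Collapse per-step verifier scores into one path-level score.

    Hypothetical aggregation scheme; the paper's exact method may differ.
    """
    if not step_scores:
        return 0.0
    values = [s.score for s in step_scores]
    if strategy == "mean":
        return sum(values) / len(values)
    if strategy == "min":  # a single flawed step sinks the whole path
        return min(values)
    raise ValueError(f"unknown strategy: {strategy}")

# A path whose final answer is correct, but whose third step is shaky.
path = [StepScore(0, 0.95), StepScore(1, 0.90),
        StepScore(2, 0.30), StepScore(3, 0.98)]
print(round(aggregate_path_score(path, "mean"), 4))  # 0.7825
print(aggregate_path_score(path, "min"))             # 0.3
```

The `min` strategy reflects the intuition that one flawed step invalidates the whole derivation, while `mean` gives partial credit to mostly sound paths — exactly the distinction that answer-only evaluation cannot make.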
Key Findings from Extensive Experiments
The research paper details extensive experiments conducted on 24 different LMMs and LRMs. The findings highlight several important insights:
- Significant Headroom for SOTA Models: Even state-of-the-art models like GPT-4o and Gemini-2.5-Pro show substantial room for improvement on BMMR-Eval, indicating the challenging nature of the dataset.
- Discipline Bias in Reasoning Models: Contrary to expectations, reasoning models do not consistently outperform LMMs across all disciplines. They often exhibit a bias, excelling in specific subjects like mathematical reasoning but performing less effectively in others.
- Open-Source Models Lag Behind: Open-source models generally trail their proprietary counterparts, underscoring a potential gap in the availability of diverse, high-quality training data for the open-source community.
- BMMR-Train Narrows the Gap: Fine-tuning open-source models on BMMR-Train significantly improves their performance, demonstrating the dataset’s value in advancing model capabilities. For instance, fine-tuning InternVL2.5-78B led to a 19.07% improvement in overall performance.
Further analysis using BMMR-Verifier revealed that the quality of reasoning steps is a key factor in overall model performance. Models' errors frequently fall into categories such as gaps in disciplinary knowledge, calculation mistakes, faulty derivations, and general reasoning failures. Common failure modes include 'overthinking,' where models engage in excessive deliberation, and 'hallucination,' where they generate incorrect or fabricated information, especially when the visual input is disregarded.
The introduction of BMMR and BMMR-Verifier marks a significant contribution to the AI community, providing robust tools for both evaluating and developing more capable and reliable multimodal models. The dataset and its associated findings offer valuable insights for future research aimed at building next-generation AI systems with stronger multidisciplinary reasoning abilities. You can find more details about this research in the full paper: BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset.