Introducing FinMR: A New Benchmark for Advanced Financial AI Reasoning

TLDR: FinMR is a new, high-quality, knowledge-intensive multimodal dataset designed to evaluate advanced financial reasoning capabilities of Multimodal Large Language Models (MLLMs) at a professional analyst’s standard. It comprises over 3,200 expertly annotated question-answer pairs across 15 financial topics, integrating mathematical reasoning, financial knowledge, and diverse visual interpretation tasks. Benchmarking reveals a significant performance gap between current MLLMs and human financial analysts, highlighting areas for improvement in image analysis, formula application, and contextual understanding, especially for open-source models.

In the rapidly evolving landscape of artificial intelligence, Multimodal Large Language Models (MLLMs) are making significant strides, combining language understanding with visual interpretation. However, evaluating these models in highly specialized fields like finance has been difficult because suitable datasets are lacking. FinMR, a new benchmark, is designed to fill that gap by rigorously assessing MLLMs’ capabilities in expert-level financial reasoning.

Developed by a team of researchers from the University of Auckland and Nanyang Technological University, FinMR addresses a critical gap in the AI research community. Existing datasets often fall short: they lack the professional depth of financial knowledge, the complexity of reasoning tasks, or the diversity of visual content needed to test models against the standards of a professional financial analyst.

FinMR stands out with over 3,200 meticulously curated and expertly annotated question-answer pairs. These questions span 15 diverse financial topics, ensuring broad coverage of the domain. What sets FinMR apart is its integration of sophisticated mathematical reasoning, advanced financial knowledge, and nuanced visual interpretation tasks across various image types. These range from statistical charts and time series graphs to financial tables, specialized diagrams, and even geographical maps, mirroring the complex data financial analysts encounter daily.

The creation of FinMR involved a rigorous quality control protocol, including a three-stage data curation pipeline with six annotators. This process ensured the accuracy, completeness, and clarity of the dataset, which draws its questions from college-level courses and professional certification programs such as the Chartered Financial Analyst (CFA) and Financial Risk Manager (FRM) programs. The dataset is split between expertise-based questions (67%) and math-focused questions (33%), and each question is categorized by difficulty level (easy, medium, hard) to provide a comprehensive evaluation framework.

Initial benchmarking with leading closed-source and open-source MLLMs has already revealed significant performance disparities between these models and human financial analysts. For instance, models like Gemini-2.5-Pro and Claude-3.7-Sonnet showed promising results, with Gemini-2.5-Pro achieving the best overall performance among MLLMs. However, the results also highlighted key areas for model advancement, such as precise image analysis, accurate application of complex financial formulas, and deeper contextual financial understanding. Open-source models, in particular, demonstrated a substantial performance gap compared to their closed-source counterparts, struggling with the sophisticated multimodal reasoning tasks presented in FinMR.

The research also shed light on specific challenges. In financial math reasoning, models generally scored lower than on expertise-based tasks, indicating a need for stronger logical rigor and multi-step calculation capabilities. An error analysis further identified common issues: image recognition failures accounted for a significant portion of errors, especially with domain-specific visuals that require implicit information extraction. Question misunderstanding and incorrect formula application were also prevalent, particularly in harder questions requiring cross-domain knowledge integration.
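A breakdown like the one described above, with accuracy reported separately by question type and by difficulty, can be sketched in a few lines of Python. This is only an illustrative sketch: the record fields (question_type, difficulty, correct) and the sample values are hypothetical and are not taken from the FinMR release or its evaluation code.

```python
from collections import defaultdict

# Hypothetical evaluation records. Field names and values are illustrative
# assumptions, not data or schema from the FinMR release.
records = [
    {"question_type": "expertise", "difficulty": "easy", "correct": True},
    {"question_type": "expertise", "difficulty": "hard", "correct": False},
    {"question_type": "math", "difficulty": "medium", "correct": True},
    {"question_type": "math", "difficulty": "hard", "correct": False},
    {"question_type": "math", "difficulty": "hard", "correct": False},
]


def accuracy_by(records, key):
    """Return {group: fraction of correct answers} for the given record field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


by_type = accuracy_by(records, "question_type")
by_difficulty = accuracy_by(records, "difficulty")

print(by_type)
print(by_difficulty)
```

Grouping by a single field at a time keeps the comparison simple; the same helper could be called with any other label attached to each record, such as an error category.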

FinMR is poised to be an essential benchmark tool, pushing multimodal financial reasoning toward professional analyst-level competence. The dataset and code are available for researchers to explore and contribute to the advancement of MLLMs in finance. More details are available in the research paper, “FinMR: A Knowledge-Intensive Multimodal Benchmark for Advanced Financial Reasoning.”

The authors of this significant work are Shuangyan Deng, Haizhou Peng, Jiachen Xu, Ciprian Doru Giurcăneanu, and Jiamou Liu from the University of Auckland, and Rui Mao from Nanyang Technological University.
