TLDR: The RADAR framework aims to make Multimodal Large Language Models (MLLMs) more transparent when they analyze data visualizations such as charts. It attributes the MLLM’s reasoning to specific chart regions, using bounding boxes to highlight the visual data that supports both the final answer and each intermediate mathematical reasoning step. RADAR contributes a new dataset and a reasoning-guided method that significantly improves attribution accuracy and leads to stronger answer generation, a step toward more trustworthy and interpretable AI systems for visual data analysis.
As data visualizations like charts become central to quantitative analysis and decision-making, the ability to accurately interpret them is more crucial than ever. Multimodal Large Language Models (MLLMs) have shown great promise in automating visual data analysis, from answering questions about charts to generating summaries. However, a significant challenge remains: these models often operate as “black boxes,” providing conclusions without revealing which parts of the visual data informed their decisions. This lack of transparency can hinder trust and adoption in real-world applications, especially in sensitive fields like business, medicine, and education.
A new research paper introduces RADAR, a Reasoning-Guided Attribution Framework for Explainable Visual Data Analysis, which takes a significant step towards addressing this issue. The framework aims to evaluate and enhance MLLMs’ capabilities to attribute their reasoning process by highlighting specific regions in charts and graphs that justify their answers. This makes the reasoning process transparent and verifiable for users.
The core idea behind RADAR is to identify and highlight key regions within charts using bounding boxes. This approach not only explains the final decision but also provides visibility into the intermediate mathematical reasoning steps. Previous research on attribution has largely focused on text-based or general visual question-answering, which often falls short when applied to complex mathematical chart analysis. Existing methods struggle to pinpoint relevant chart regions for intricate mathematical questions, such as comparing differences between lines across different years.
How RADAR Works
RADAR operates through a two-stage pipeline. First, given a chart, a question, and an answer, the system generates step-by-step reasoning using the InternLM-XComposer2 model. This model processes visual tokens from the chart and textual inputs, adapting to chart-specific features while maintaining strong language capabilities. Second, these generated reasoning steps, along with the original chart, question, and answer, are used to produce attribution bounding boxes. These boxes highlight the specific visual elements that correspond to both the final answer and each intermediate reasoning step.
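The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_two_stage`, `fake_reasoning`, and `fake_attribute` are hypothetical names, the two model calls are replaced by stubs, and the box coordinates are made up.

```python
from typing import Callable, Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def run_two_stage(
    chart: str,
    question: str,
    answer: str,
    generate_reasoning: Callable[[str, str, str], List[str]],
    attribute: Callable[[str, str, str, List[str]], Dict[str, list]],
) -> Dict[str, list]:
    """Stage 1 produces reasoning steps; stage 2 maps them to chart regions."""
    # Stage 1: chart + question + answer -> step-by-step reasoning.
    steps = generate_reasoning(chart, question, answer)
    # Stage 2: chart + question + answer + steps -> bounding boxes.
    attribution = attribute(chart, question, answer, steps)
    if len(attribution["step_boxes"]) != len(steps):
        raise ValueError("expected one list of boxes per reasoning step")
    return {"steps": steps, **attribution}


# Stubs standing in for the real models (hypothetical outputs).
def fake_reasoning(chart: str, question: str, answer: str) -> List[str]:
    return ["Read value A", "Read value B", "Compute A - B"]


def fake_attribute(
    chart: str, question: str, answer: str, steps: List[str]
) -> Dict[str, list]:
    box: Box = (10, 10, 20, 40)
    return {"answer_boxes": [box], "step_boxes": [[box] for _ in steps]}


result = run_two_stage(
    "chart.png", "What is A minus B?", "5", fake_reasoning, fake_attribute
)
print(result["steps"])
print(result["step_boxes"])
```

Passing the two stages in as callables keeps the sketch independent of any particular model; in practice the first callable would wrap the reasoning model and the second the attribution model.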
The framework offers two distinct levels of attribution. One is Answer-Level Attribution, which involves visually linking chart elements to the final answer using bounding boxes. For example, if the answer is a calculated product, this level would highlight all the data bars contributing to that calculation. The other is Reasoning-Level Attribution, which is more granular. For mathematical questions, the path to the answer often involves multiple steps. RADAR attributes each reasoning step to relevant chart regions, creating a traceable connection between the reasoning process and the visual elements. This means that for each calculation or comparison step, the specific data points or lines used are highlighted.
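One way to picture the two levels is as a single record that carries boxes for the final answer plus boxes for every reasoning step. The sketch below is illustrative only: the class names, the example question, the values, and the pixel coordinates are all assumptions and do not come from the RADAR dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    text: str
    boxes: List[Box] = field(default_factory=list)  # reasoning-level attribution


@dataclass
class AttributedAnswer:
    question: str
    answer: str
    answer_boxes: List[Box]      # answer-level attribution
    steps: List[ReasoningStep]   # one entry per intermediate step

    def describe(self) -> List[str]:
        """Summarize how many regions support the answer and each step."""
        lines = [f"Answer {self.answer!r} <- {len(self.answer_boxes)} region(s)"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Step {i}: {step.text} <- {len(step.boxes)} region(s)")
        return lines


# Made-up example: the product of two bars in a bar chart.
bar_2019: Box = (40, 120, 70, 300)
bar_2020: Box = (90, 60, 120, 300)

example = AttributedAnswer(
    question="What is the product of the 2019 and 2020 values?",
    answer="12",
    answer_boxes=[bar_2019, bar_2020],
    steps=[
        ReasoningStep("Read the 2019 bar: 3", [bar_2019]),
        ReasoningStep("Read the 2020 bar: 4", [bar_2020]),
        ReasoningStep("Multiply 3 * 4 = 12", [bar_2019, bar_2020]),
    ],
)

for line in example.describe():
    print(line)
```

Here the answer-level boxes cover both bars that feed the product, while each step points only at the regions it actually uses.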
A New Dataset for Explainable Chart Analysis
To enable this research, the authors contributed a semi-automatic approach to create a benchmark dataset. The dataset comprises 17,819 samples in total, counting charts, questions, detailed reasoning steps, and attribution annotations. Derived from the ChartQA dataset, it covers line and bar charts and a range of mathematical operations. The data curation strategy combined MLLM-generated reasoning and attribution annotations with human corrections to improve annotation quality.
The dataset includes 1,000 charts (500 line, 500 bar) with 2,000 question-answer pairs. Human annotators identified 3,599 reasoning steps and attributed 4,092 regions for answer-based questions and 7,128 regions for reasoning-based steps. These five counts add up to the 17,819 total (1,000 + 2,000 + 3,599 + 4,092 + 7,128) and illustrate the level of detail captured.
Performance and Impact
Experimental results show that RADAR significantly improves attribution accuracy. Compared to baseline methods such as GPT-4o, GPT-4V, and Claude 3.5 Sonnet, RADAR’s reasoning-guided approach improves attribution accuracy by an average of 15%. Specifically, it showed substantial improvements in Multi Box IoU scores for both answer-based (VQA) and reasoning-based (VQR) attribution tasks. For instance, automated reasoning improved VQA tasks by 446% to 504% and VQR tasks by 110% to 230% over baselines. When human-validated reasoning was incorporated, VQR task improvements rose to 268% for line charts and 405% for bar charts.
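To make the metric concrete, here is a sketch of an area-based intersection-over-union between a set of predicted boxes and a set of ground-truth boxes. The article does not give the paper's exact Multi Box IoU definition, so this is one plausible reading rather than the authors' formula. The helper `relative_improvement` likewise assumes the reported percentages are relative gains over a baseline score; both function names are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def _covers(boxes: List[Box], x0: float, y0: float, x1: float, y1: float) -> bool:
    """True if any box fully contains the grid cell [x0, x1] x [y0, y1]."""
    return any(b[0] <= x0 and x1 <= b[2] and b[1] <= y0 and y1 <= b[3] for b in boxes)


def multi_box_iou(pred: List[Box], gt: List[Box]) -> float:
    """Area of (union of pred) AND (union of gt), divided by area of either.

    Box edges are used to cut the plane into grid cells, so overlapping
    boxes within one set are never counted twice.
    """
    all_boxes = pred + gt
    if not all_boxes:
        return 0.0
    xs = sorted({x for b in all_boxes for x in (b[0], b[2])})
    ys = sorted({y for b in all_boxes for y in (b[1], b[3])})
    inter = union = 0.0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            in_pred = _covers(pred, x0, y0, x1, y1)
            in_gt = _covers(gt, x0, y0, x1, y1)
            cell_area = (x1 - x0) * (y1 - y0)
            if in_pred or in_gt:
                union += cell_area
            if in_pred and in_gt:
                inter += cell_area
    return inter / union if union > 0 else 0.0


def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline` (assumed reporting convention)."""
    return (new - baseline) / baseline * 100.0


# Two 2x2 boxes shifted by one unit: overlap area 2, combined area 6.
print(multi_box_iou([(0, 0, 2, 2)], [(1, 0, 3, 2)]))
```

Under this convention, a 400% improvement means the new score is five times the baseline, which helps explain how gains of several hundred percent can arise when baseline IoU scores are low.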
Furthermore, these enhanced attribution capabilities are reported to translate into stronger answer generation. The system achieved an average BERTScore of approximately 0.90, indicating close alignment with ground truth responses. This suggests a synergistic relationship in which better attribution goes hand in hand with more accurate answers.
The framework was also tested for generalization and scale. When extended to pie charts, the fully automated approach achieved an average BERTScore of around 0.9 and an average Semantic Textual Similarity (STS) of approximately 0.5 for generated answers, suggesting that the approach carries over to other visualization formats.
Looking Ahead
While RADAR represents a significant advancement, the researchers acknowledge limitations, including the challenges of human attribution, the dependency on reasoning quality, and computational requirements. Nevertheless, this work lays a strong foundation for building more trustworthy and interpretable AI systems for mathematical reasoning tasks, enabling users to verify and understand model decisions through transparent reasoning and attribution.


