TLDR: POLYCHART QA is the first large-scale multilingual benchmark for evaluating how well AI models understand charts across different languages. It features over 22,000 charts and 26,000 question-answer pairs in 10 languages. The benchmark was created using a unique pipeline that separates chart data from rendering code, enabling flexible multilingual chart generation with rigorous quality control. Experiments revealed a significant performance disparity between English and other languages, especially low-resource ones, highlighting the need for more robust multilingual vision-language models.
Charts are a fundamental way we interpret and share data across various fields, from science to daily life. With the rise of large vision-language models (LVLMs), there has been significant progress in how machines understand and reason about these visual data representations. These advanced models can answer complex questions, summarize content, and even recreate chart images based on their data.
However, a major challenge exists: most current chart understanding benchmarks and datasets are primarily focused on English. This creates a significant barrier for global audiences and limits the applicability of these models for speakers of other languages. Leading LVLMs, for instance, might perform well on an English chart question but fail when presented with the same question in Chinese, as highlighted by the researchers.
Existing multilingual and multimodal benchmarks often focus on natural images rather than structured information like charts. While some datasets include charts, they typically involve simpler tasks like character recognition, lacking the depth required for comprehensive chart reasoning across diverse languages.
To address this critical gap, researchers Yichen Xu, Liangyu Chen, Liang Zhang, Wenxuan Wang, and Qin Jin from Renmin University of China have introduced a groundbreaking new benchmark called POLYCHART QA. This is the first large-scale multilingual chart question answering benchmark, featuring 22,606 charts and 26,151 question-answer pairs across 10 different languages.
The creation of POLYCHART QA relied on a clever, decoupled pipeline that separates a chart's underlying data from the code used to render it. Because only the data needs to be translated while the rendering code is reused as-is, the same chart can be generated flexibly in any target language. The team used state-of-the-art large language models for translation and implemented rigorous quality control to ensure that the generated multilingual charts maintain linguistic and semantic consistency.
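The decoupling idea can be sketched in a few lines. This is an illustrative toy, not the authors' actual pipeline: the `render_bar_chart` and `translate_data` functions are hypothetical names, the text rendering stands in for real plotting code, and the dictionary lookup stands in for the LLM translation step the paper describes.

```python
# Toy sketch of a decoupled chart pipeline (illustrative assumption, not
# the paper's implementation): chart data and rendering logic are kept
# separate, so generating a multilingual variant only means swapping the
# data table and rerunning the unchanged rendering code.

def render_bar_chart(data):
    """Render a chart spec as a simple text bar chart.

    A real pipeline would emit matplotlib/plotly output; plain text keeps
    this sketch dependency-free.
    """
    lines = [data["title"]]
    for label, value in zip(data["labels"], data["values"]):
        lines.append(f"{label}: {'#' * value}")
    return "\n".join(lines)

def translate_data(data, translations):
    """Swap text fields via a translation table (stand-in for an LLM)."""
    return {
        "title": translations.get(data["title"], data["title"]),
        "labels": [translations.get(l, l) for l in data["labels"]],
        "values": data["values"],  # numeric values are reused untouched
    }

# English source spec
chart_en = {"title": "Sales", "labels": ["Apples", "Pears"], "values": [3, 5]}

# Hypothetical translation table; the paper uses LLMs for this step
zh = {"Sales": "销售额", "Apples": "苹果", "Pears": "梨"}

print(render_bar_chart(chart_en))                    # English chart
print(render_bar_chart(translate_data(chart_en, zh)))  # Chinese variant
```

The key property this sketch captures is that `render_bar_chart` never changes: adding a new language costs only a translated data table, which is what makes scaling to ten languages tractable.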
The benchmark covers a wide range of languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese. Collectively, these languages are spoken by over 65% of the global population. POLYCHART QA includes both real-world and synthetically generated charts, providing a diverse and carefully curated resource for evaluating and advancing multilingual chart understanding.
Experiments conducted using POLYCHART QA on various LVLMs, including both open-source and closed-source models, revealed important insights. A significant performance gap was observed between English and other languages, particularly those with fewer resources and non-Latin scripts. For example, models that performed well in English often saw their accuracy drop significantly for languages like Bengali and Urdu. This highlights persistent challenges in cross-lingual alignment and visual reasoning that existing multimodal benchmarks haven’t fully captured.
The research also explored few-shot evaluation, where models are given a small number of examples to learn from. Interestingly, few-shot prompting did not consistently improve multilingual performance, suggesting that simply providing more examples might not be enough to bridge the multilingual transfer gap in current LVLMs. Furthermore, cross-lingual inference tests showed that maintaining language consistency on the question side is more crucial than on the visual side for better performance.
In conclusion, POLYCHART QA lays a crucial foundation for developing more globally inclusive vision-language models. While the benchmark currently covers ten major languages and focuses on question answering, its flexible data pipeline allows future expansion to more languages and to other chart understanding tasks such as summarization and fact-checking. This work aims to promote language inclusivity and accessibility in AI technologies, helping to reduce English dominance in AI systems and supporting global communities in accessing AI tools in their native languages.