ChartGen: A New Approach to Understanding and Generating Data Visualizations

TLDR: ChartGen is an automated pipeline that creates a massive dataset of chart images and their corresponding Python plotting code. It starts with existing chart images, uses a vision-language model to convert them into code, and then uses a large language model to iteratively augment and diversify this code. This process generated over 222,000 unique chart-code pairs, covering 27 chart types and 11 plotting libraries. The project also includes a benchmark for evaluating models on chart-to-code reconstruction, revealing that current models still have significant room for improvement in accurately reproducing charts from images.

Understanding and interpreting data visualizations, like charts, is crucial in many fields, from scientific research to business analysis. While artificial intelligence models have made strides in answering questions about charts or summarizing them, a more challenging task remains largely unexplored: chart-to-code reconstruction. This involves taking a chart image and accurately recreating the executable plotting script that generated it. This capability is vital for evaluating how well AI models can truly understand and ground visual data in a precise, machine-readable format.

To address this gap, researchers have introduced ChartGen, a fully automated pipeline designed for code-guided synthetic chart generation. ChartGen aims to significantly scale and diversify the available resources for chart understanding research.

How ChartGen Works: A Two-Stage Process

The ChartGen pipeline operates in two main stages:

1. VLM-based Chart Image Redrawing: It begins with a collection of existing chart images, referred to as ‘seed’ images. A vision-language model (VLM), specifically phi-3.5-vision-instruct, is prompted to analyze each seed image and reconstruct it into a Python plotting script. For their work, the researchers used 13,000 unique chart images from the ChartQA dataset as their initial seeds. The primary goal here isn’t perfect replication, but rather to get an initial structured code representation of the chart’s content.

2. LLM-based Chart Code Augmentation: The Python scripts generated in the first stage are then fed into a code-focused large language model (LLM), Codestral-22B-v0.1. This LLM iteratively refines and diversifies the plotting code. Instead of just altering the visual appearance of the chart, ChartGen transforms the underlying code itself. This allows for the creation of new plotting scripts and charts with varied types, styles, data distributions, and complexities. This iterative augmentation process dramatically expands the initial dataset.
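
The two stages above can be sketched as a short loop. This is a minimal illustration, not the researchers' implementation: the function name `run_pipeline`, the `rounds` parameter, the choice to chain each augmentation on the previous output, and deduplication by exact string match are all assumptions made for the example. The two models are represented as plain callables rather than real model calls.

```python
from typing import Callable, List

# Hypothetical stand-ins for the two models named in the article:
# stage 1 is a VLM (image -> plotting script), stage 2 is a code LLM
# (plotting script -> new plotting script).
ImageToCode = Callable[[bytes], str]
AugmentCode = Callable[[str], str]


def run_pipeline(
    seed_images: List[bytes],
    image_to_code: ImageToCode,
    augment_code: AugmentCode,
    rounds: int,
) -> List[str]:
    """Return every unique script produced from the seeds, in order."""
    scripts: List[str] = []
    seen = set()
    for image in seed_images:
        # Stage 1: redraw the seed image as a plotting script.
        current = image_to_code(image)
        candidates = [current]
        # Stage 2: repeatedly augment the script (assumed: each round
        # builds on the output of the previous round).
        for _ in range(rounds):
            current = augment_code(current)
            candidates.append(current)
        # Keep only scripts not seen before (assumed: exact-match dedup).
        for script in candidates:
            if script not in seen:
                seen.add(script)
                scripts.append(script)
    return scripts
```

In a real pipeline each script would also have to be executed to render its chart image, and scripts that fail to run would presumably be discarded; that step is omitted here.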

The ChartGen-200K Dataset: A Comprehensive Resource

By applying this pipeline, the ChartGen project created a synthetic dataset called ChartGen-200K. It comprises 222,500 unique chart image-code pairs, roughly 17 times the initial 13,000 seed images. It covers 27 distinct chart types, ranging from common bar and line charts to more specialized visualizations like 3D plots, heatmaps, and sunburst diagrams. It also spans 11 different Python visualization libraries, including popular ones like matplotlib, seaborn, and plotly, which gives it broad stylistic and layout diversity.

Beyond just image and code pairs, ChartGen-200K is enriched with additional multimodal data components. Each entry includes extracted CSV tabular data, DocTags (a compact representation for semantic and structural attributes), natural language summaries, and automatically generated question-answer (QA) pairs. This makes it a comprehensive resource for various chart understanding tasks.
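
To make the shape of an entry concrete, one record could be modeled as below. This is only a sketch: the class name `ChartRecord` and every field name are hypothetical, chosen to mirror the components listed above, and do not reflect the actual schema of the released dataset.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ChartRecord:
    """One chart entry; all field names are illustrative assumptions."""

    image_path: str   # rendered chart image
    code: str         # Python plotting script that produced the image
    library: str      # plotting library used by the script
    chart_type: str   # e.g. "bar", "line", "heatmap"
    csv_data: str     # extracted tabular data, as CSV text
    doctags: str      # DocTags string (format not shown here)
    summary: str      # natural language summary of the chart
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)

    def table_rows(self) -> List[List[str]]:
        """Parse the CSV text into rows of string cells."""
        return list(csv.reader(io.StringIO(self.csv_data)))
```

Keeping the table as CSV text and parsing it on demand is a design choice for the example; a loader could just as well store parsed rows directly.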

Compared to previous datasets for chart-to-code research, ChartGen-200K is significantly larger and more diverse, supporting a greater number of chart types and plotting back-ends. This scale and breadth are crucial for training robust multimodal AI models.

Evaluating Chart Redrawing Capabilities

To assess how well vision-language models can perform chart redrawing, the researchers curated a dedicated evaluation set of 4,300 chart image-code pairs from the larger ChartGen-200K corpus. The task involves a model taking a chart image and producing a Python plotting script that closely matches the original, faithfully reconstructing its visual content and style.

The evaluation employs a two-pronged strategy using GPT-4o as an automated judge. It compares both the predicted code and the resulting rendered images. For code comparison, scores are given for ‘data fidelity’ (how well the underlying data values match) and ‘semantic/style consistency’ (how well chart types, orientations, labels, and colors are preserved). For image comparison, the model-generated chart is visually compared to the ground-truth image for overall similarity.
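
The scoring scheme can be sketched as a per-sample judge call followed by averaging. The judge is represented by a hypothetical callable rather than a real GPT-4o request, and the names `JudgeScores` and `average_scores` are invented for this example. The 0 to 1 scale for data fidelity and the 0 to 10 scale for image similarity follow the results reported in the article; the scale of the style score is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class JudgeScores:
    data_fidelity: float      # 0..1, how well data values match
    style_consistency: float  # scale assumed to be 0..1
    image_similarity: float   # 0..10, overall visual similarity


# One sample: (reference_code, predicted_code, reference_image, predicted_image)
Sample = Tuple[str, str, bytes, bytes]
# Hypothetical judge interface; a real judge would prompt an LLM.
Judge = Callable[[str, str, bytes, bytes], JudgeScores]


def average_scores(samples: List[Sample], judge: Judge) -> JudgeScores:
    """Score every sample with the judge and average each metric."""
    if not samples:
        raise ValueError("need at least one sample to average")
    scored = [judge(*sample) for sample in samples]
    return JudgeScores(
        data_fidelity=mean(s.data_fidelity for s in scored),
        style_consistency=mean(s.style_consistency for s in scored),
        image_similarity=mean(s.image_similarity for s in scored),
    )
```

How predictions that fail to execute enter the averages is not stated in the article, so this sketch simply averages over whatever samples it is given.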

Current Performance and Future Outlook

Six open-weight vision-language models, ranging from 3 billion to 26 billion parameters, were evaluated on the ChartGen benchmark. The models can produce syntactically valid code, as indicated by moderate execution rates, but accurately capturing numerical values, relationships, and stylistic elements remains a significant challenge. The best model achieved a data fidelity score of 0.58 out of 1 and an image similarity score of 7.48 out of 10, which leaves substantial room for improvement in chart-to-code reconstruction and vision-conditioned code generation.
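
An execution rate of the kind mentioned above can be approximated by running each generated script in a fresh interpreter and counting the ones that exit cleanly. The article does not define the metric precisely, so the definition used here (exit code 0 within a timeout) and the function names are assumptions.

```python
import subprocess
import sys
from typing import Iterable


def runs_cleanly(script: str, timeout_s: float = 30.0) -> bool:
    """Return True if the script exits with code 0 before the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],  # run in a separate interpreter
            capture_output=True,             # keep script output off our console
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def execution_rate(scripts: Iterable[str], timeout_s: float = 30.0) -> float:
    """Fraction of scripts that run without error (0.0 for an empty input)."""
    script_list = list(scripts)
    if not script_list:
        return 0.0
    passed = sum(runs_cleanly(s, timeout_s) for s in script_list)
    return passed / len(script_list)
```

Model-generated code is untrusted, so in practice a check like this belongs inside a sandbox or container rather than on a developer machine.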

ChartGen represents a major step forward in creating large-scale, multimodal datasets for chart understanding. By releasing the pipeline, prompts, and the dataset under an open license, the researchers aim to accelerate progress towards more robust automated chart understanding. The researchers acknowledge that the pipeline may inherit biases from its underlying AI models, and they point to future work on addressing these biases and on further expanding the dataset’s scale and reasoning capabilities. For more technical details, you can refer to the full research paper: ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation.

Ananya Rao, https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
