TLDR: A new system uses multi-modal Large Language Models (LLMs) to automatically identify and explain design flaws in data visualizations, accepting both image and code inputs. It applies chart-specific rules to provide constructive feedback and corrected code, aiming to educate users on best practices. While highly effective for objective structural errors like non-zero baselines, it shows limitations with more subjective stylistic issues.
Creating effective data visualizations is a blend of art and science, a skill often not formally taught in data science programs. This gap leads to many practitioners struggling to produce graphics that clearly and efficiently convey their intended message. To address this, community initiatives like #MakeoverMonday encourage improving existing charts. Building on this concept, a new research paper explores how multi-modal large language models (LLMs) can automate this “visualization makeover” process.
The Challenge of Good Visualizations
Data visualization is crucial for communication across various fields, but formal training in design best practices is often lacking. These practices are constantly evolving, making them difficult to teach in traditional settings. The paper highlights that while LLMs have been used for generating visualizations or detecting errors in textual inputs, their potential for critically evaluating and improving existing visualizations has been less explored.
How the System Works
The researchers propose a system that takes a plot as input, either as an image file or the code used to generate it. Primed with a list of visualization best practices, an LLM is then employed to semi-automatically generate constructive criticism, aiming to produce a “better” plot. The core of this system lies in prompt engineering a pre-trained model, combining user-specified guidelines with the LLM’s inherent knowledge of data visualization practices from its training data.
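As a rough illustration of the prompt-engineering step described above, the sketch below assembles a critique prompt from a list of user-specified guidelines and, optionally, the plotting code. The function name `build_critique_prompt` and the prompt wording are assumptions made for illustration, not the paper's actual prompt, and no model call is shown.

```python
def build_critique_prompt(guidelines, chart_code=None):
    """Assemble a text prompt (hypothetical wording, not the paper's prompt).

    An image input would be attached to the model request separately.
    """
    lines = [
        "You are a data visualization reviewer.",
        "Critique the chart against these best practices:",
    ]
    # Number the user-specified guidelines so feedback can cite them.
    lines += [f"{i}. {g}" for i, g in enumerate(guidelines, start=1)]
    if chart_code is not None:
        lines += ["", "Plotting code:", chart_code,
                  "", "Explain each flaw, then return corrected code."]
    else:
        lines += ["", "The chart is attached as an image. Explain each flaw."]
    return "\n".join(lines)


prompt = build_critique_prompt(
    ["Avoid dual axes for line charts", "No more than 7 pie slices"],
    chart_code="plt.pie(values)",
)
print(prompt)
```

Keeping the guidelines as plain data rather than hard-coding them into the prompt is what would let a user swap in their own interpretation of best practices.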
Unlike other tools that focus on generating valid visualization scripts from raw data, this system emphasizes educating the user on how to improve their existing data visualizations based on an interpretation of best practices. It identifies both “grammatical” errors, such as inappropriate use of dual axes, and “style” errors, like the misuse of 3D effects, and provides targeted suggestions for improvement.
The system’s workflow is modular and multi-stage. First, it detects the chart type from the input. Based on this, it evaluates relevant properties against predefined thresholds. Then, it loads and applies chart-specific visualization rules stored in a structured JSON file (e.g., “No more than 7 pie slices,” “Avoid dual axes for line charts”). The LLM then analyzes the chart against these rules and thresholds, identifying design flaws and generating natural-language feedback. If the input was code, the system can also generate a corrected version. The final feedback is presented through a user-friendly web interface.
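The rule-loading and threshold stages might look something like the following minimal sketch. The JSON schema, the property names (`slice_count`, `y_axis_count`), and the functions `load_rules` and `check_chart` are all hypothetical, since the article does not show the real rule file. In the described system the LLM performs the analysis, so the deterministic check here is only a stand-in for that step.

```python
import json

# Hypothetical rule file contents; the actual schema is not given in the article.
RULES_JSON = """
{
  "pie": [
    {"id": "max_slices", "property": "slice_count", "max": 7,
     "message": "No more than 7 pie slices"}
  ],
  "line": [
    {"id": "no_dual_axes", "property": "y_axis_count", "max": 1,
     "message": "Avoid dual axes for line charts"}
  ]
}
"""


def load_rules(chart_type, raw=RULES_JSON):
    """Return the rules for one chart type, or an empty list if none exist."""
    return json.loads(raw).get(chart_type, [])


def check_chart(chart_type, properties):
    """Compare extracted chart properties against each rule's threshold."""
    violations = []
    for rule in load_rules(chart_type):
        value = properties.get(rule["property"])
        # Skip rules whose property could not be extracted from the chart.
        if value is not None and value > rule["max"]:
            violations.append(rule["message"])
    return violations


print(check_chart("pie", {"slice_count": 9}))
print(check_chart("line", {"y_axis_count": 1}))
```

Selecting rules by detected chart type keeps irrelevant checks (for example, slice counts on a line chart) out of the analysis.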
Evaluating Performance
To assess the system’s accuracy, a quantitative evaluation was performed using a synthetic dataset of 72 visualization images, encompassing 12 distinct error types. These errors included issues like improper scale, non-zero baselines, overuse of gridlines, and inappropriate color choices. The evaluation focused on the system’s ability to detect these visual issues, using standard multi-label classification metrics such as precision, recall, and F1-score, as well as Mean Absolute Error (MAE) for predicted error counts.
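For readers unfamiliar with multi-label metrics, here is a small sketch of how per-error-type precision, recall, and F1 can be computed when each image may carry several error labels. The label names and the three toy images are invented for illustration and are not taken from the paper's dataset.

```python
def per_label_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} for multi-label predictions.

    y_true and y_pred are lists of label sets, one set per image.
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data (invented): ground-truth and predicted error labels per image.
y_true = [{"non_zero_baseline"}, {"dual_axis", "colour"}, {"colour"}]
y_pred = [{"non_zero_baseline"}, {"dual_axis"}, {"colour", "dual_axis"}]

scores = per_label_scores(
    y_true, y_pred, ["non_zero_baseline", "dual_axis", "colour"]
)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```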
Key Findings
The results showed that the system performed exceptionally well in detecting error types with clear and well-defined visual patterns. It achieved perfect F1-scores for “Non-Zero Baselines” and “Dual Axis Issues,” and high scores for “Too Many Slices in Pie Charts” and “Improper Scale or Axis Range.” However, more stylistic or ambiguous error types, such as “Inappropriate Colour Choices” and “Overlapping Data Elements,” were more frequently misclassified, indicating areas for improvement.
On average, the system’s predicted error count per image deviated from the true count by about 0.44 (the MAE). It also showed a slight tendency to underestimate the number of errors. When comparing performance on images with a single error versus multiple errors, the MAE increased for multi-error images, highlighting the challenge of overlapping issues.
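The count-based measures can be sketched as follows: MAE is the average absolute gap between predicted and true error counts, and the average signed gap indicates whether the system tends to under- or over-count. The counts below are made up for illustration, and the function name `count_metrics` is an assumption.

```python
def count_metrics(true_counts, pred_counts):
    """Return (mae, bias); a negative bias means errors are under-counted."""
    diffs = [p - t for t, p in zip(true_counts, pred_counts)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    bias = sum(diffs) / len(diffs)
    return mae, bias


# Toy counts (invented): number of errors per image, true vs. predicted.
true_counts = [1, 1, 2, 3]
pred_counts = [1, 1, 1, 2]

mae, bias = count_metrics(true_counts, pred_counts)

# Split into single-error and multi-error images to compare MAE.
single = [(t, p) for t, p in zip(true_counts, pred_counts) if t == 1]
multi = [(t, p) for t, p in zip(true_counts, pred_counts) if t > 1]
single_mae, _ = count_metrics(*zip(*single))
multi_mae, _ = count_metrics(*zip(*multi))

print(f"overall MAE={mae}, bias={bias}")
print(f"single-error MAE={single_mae}, multi-error MAE={multi_mae}")
```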
Looking Ahead
This research demonstrates the significant potential of LLMs in automating visualization critique. While highly effective for objective structural flaws, there’s room to improve accuracy for more interpretive or stylistic issues. Future work aims to incorporate data-aware reasoning, expand the rule base for more complex chart types, and enhance visual robustness through advanced computer vision integration.


