TLDR: VisCoder2 introduces a new framework for building advanced visualization coding agents. It comprises VisCode-Multi-679K, a large dataset of 679K executable visualization code pairs across 12 languages with multi-round correction dialogues; VisPlotBench, a comprehensive benchmark spanning 8 languages for evaluating initial generation and multi-round self-debug; and the VisCoder2 models, which significantly outperform open-source baselines and approach proprietary models like GPT-4.1, achieving an 82.4% execution pass rate with iterative self-debugging.
Large language models (LLMs) have shown great promise in generating code, including code for data visualizations. However, current systems often struggle with practical challenges such as supporting multiple programming languages, ensuring reliable code execution, and iteratively correcting errors. These limitations stem from datasets and benchmarks that typically focus on single-round code generation and a limited number of languages.
A new research paper introduces VisCoder2, a framework designed to overcome these hurdles. It consists of three components: a large-scale dataset, a multi-language benchmark, and a family of visualization coding agents. You can read the full paper here: VisCoder2: Building Multi-Language Visualization Coding Agents.
VisCode-Multi-679K: A Dataset for Multi-Language Visualization
At the heart of VisCoder2 is VisCode-Multi-679K, a large-scale dataset containing 679,000 validated and executable visualization code samples. What makes this dataset unique is its multi-language coverage, spanning 12 programming languages, and its inclusion of multi-turn correction dialogues. This means the dataset not only provides examples of correct code but also shows how models can learn to revise faulty code based on execution feedback.
The dataset was built by combining code from open-source corpora such as the-stack-v2, svg-diagrams, and CoSyn-400K. These sources provide a mix of real-world and synthetically generated visualization code. A pipeline of filtering, code block extraction, and runtime validation ensures that all samples are executable and produce valid visual outputs. Additionally, 66,000 multi-turn dialogues from the Code-Feedback dataset were integrated to train models in iterative debugging, a crucial skill for real-world coding agents.
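To make the extraction and runtime-validation steps concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names, the out.svg output filename, and the 10-second timeout are all assumptions chosen for illustration, and it only handles Python samples, whereas the real dataset spans 12 languages.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Build the fence marker programmatically so this listing stays easy to embed.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"\w*\n(.*?)" + TICKS, re.DOTALL)


def extract_code_blocks(text: str) -> list[str]:
    """Return the bodies of fenced code blocks found in `text`."""
    return [body.strip() for body in FENCE.findall(text)]


def is_valid_sample(code: str, output_name: str = "out.svg",
                    timeout: float = 10.0) -> bool:
    """Run `code` in a scratch directory (hypothetical validation rule).

    A sample is accepted only if the script exits cleanly and leaves a
    non-empty file named `output_name` behind.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        out = Path(tmp) / output_name
        return result.returncode == 0 and out.exists() and out.stat().st_size > 0
```

Checking for a non-empty output file, rather than only a zero exit code, reflects the article's point that samples must both execute and produce a visual output; a stricter validator would presumably also inspect the rendered image.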
VisPlotBench: A Benchmark for Comprehensive Evaluation
To systematically evaluate visualization coding agents, the researchers developed VisPlotBench. This benchmark covers eight programming languages and features 888 diverse visualization tasks. Unlike previous benchmarks that often focus on a single language or a narrow range of chart types, VisPlotBench includes imperative libraries, declarative grammars, markup-based formats, and symbolic notations across 13 visual categories, from common bar and line charts to more specialized music notation and network diagrams.
VisPlotBench uses a standardized protocol: execute, render, and score. It assesses not only the initial code generation but also the model’s ability to self-debug through multiple rounds of feedback. This multi-round evaluation is vital for understanding how agents perform in iterative development workflows.
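The multi-round evaluation can be pictured as a loop that executes a candidate, feeds any error back to the model, and retries. The sketch below is an illustration under stated assumptions, not the benchmark's actual harness: ExecResult, self_debug, run_python, and toy_model are hypothetical names, the default of three debug rounds is an assumption, and the render and score stages are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ExecResult:
    ok: bool
    error: str = ""


# generate(task, previous_code, error_message) -> new candidate code
Generator = Callable[[str, Optional[str], Optional[str]], str]
Executor = Callable[[str], ExecResult]


def self_debug(task: str, generate: Generator, execute: Executor,
               max_rounds: int = 3) -> Tuple[str, int, bool]:
    """Return (final_code, debug_rounds_used, passed)."""
    code = generate(task, None, None)        # initial generation
    for round_idx in range(max_rounds + 1):
        result = execute(code)
        if result.ok:
            return code, round_idx, True
        if round_idx == max_rounds:
            break                            # debug budget exhausted
        # Hand the failing code and its error message back to the model.
        code = generate(task, code, result.error)
    return code, max_rounds, False


def run_python(code: str) -> ExecResult:
    """Toy executor: run Python source in-process and capture any exception."""
    try:
        exec(compile(code, "<candidate>", "exec"), {})
        return ExecResult(ok=True)
    except Exception as exc:
        return ExecResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def toy_model(task: str, prev_code: Optional[str], error: Optional[str]) -> str:
    """Stand-in for an LLM: the first attempt has a typo, the retry fixes it."""
    if prev_code is None:
        return "values = [1, 2, 3]\ntotal = sum(value)"
    return "values = [1, 2, 3]\ntotal = sum(values)"


final_code, rounds, passed = self_debug("sum a list", toy_model, run_python)
```

In this toy run the first candidate fails with a NameError and the second one succeeds, so the loop reports one debug round. A real harness would run each language's toolchain in a sandbox instead of calling exec in-process.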
VisCoder2: A Family of Visualization Coding Agents
The researchers trained a family of multi-language visualization models, also named VisCoder2, using the VisCode-Multi-679K dataset. These models, built on Qwen2.5-Coder-Instruct backbones at various scales (up to 32B parameters), demonstrate significant improvements over existing open-source baselines. Notably, VisCoder2 approaches the performance of proprietary models like GPT-4.1.
Experiments show that VisCoder2 achieves an 82.4% overall execution pass rate at the 32B scale when iterative self-debug is enabled. This iterative correction mechanism proved particularly beneficial for symbolic and compiler-dependent languages such as LilyPond, LaTeX, and Asymptote, where syntax and compilation errors are common. The ability to self-debug allows the models to resolve frequent failures and produce valid outputs, highlighting that feedback-driven refinement is a critical component for reliable multi-language visualization.
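For readers unfamiliar with the metric, an execution pass rate is the fraction of tasks whose generated code runs successfully. The sketch below shows one plausible way to aggregate it per language and overall. It assumes the overall figure is weighted by task count, which the article does not specify, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rates(records: Iterable[Tuple[str, bool]]) -> Tuple[Dict[str, float], float]:
    """Compute per-language and overall execution pass rates.

    `records` holds one (language, passed) pair per benchmark task.
    The overall rate here is task-weighted (an assumption).
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for language, passed in records:
        totals[language] += 1
        passes[language] += int(passed)
    per_language = {lang: passes[lang] / totals[lang] for lang in totals}
    overall = sum(passes.values()) / sum(totals.values()) if totals else 0.0
    return per_language, overall
```

Given four made-up records, two passing Python tasks plus one passing and one failing LaTeX task, the function returns 1.0 for Python, 0.5 for LaTeX, and 0.75 overall. These toy numbers are for illustration only and are not results from the paper.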
Key Insights and Future Directions
The research highlights two main insights: the necessity of broad multi-language coverage, especially for challenging symbolic languages, and the indispensable role of iterative refinement. Self-debug consistently delivers substantial gains across models, particularly for languages prone to structural and semantic errors.
While VisCoder2 represents a significant leap forward, the researchers acknowledge that the dataset still has some imbalances, with common ecosystems like Python and Vega-Lite being better represented than symbolic or domain-specific languages. Expanding benchmark coverage to an even broader set of visualization frameworks is also a future goal. This work lays a strong foundation for building more robust and reliable visualization coding agents that can assist in real-world data analysis and reporting workflows.


