
JUPITER and NbQA: Advancing LLMs in Multi-Step Data Analysis

TLDR: A new research paper introduces NbQA, a large-scale dataset of real-world, multi-step data analysis tasks extracted from Jupyter notebooks, and JUPITER, a framework that uses value-guided Monte Carlo Tree Search to enhance LLMs’ data analysis capabilities. This approach allows fine-tuned open-source models to achieve performance comparable to or better than proprietary models like GPT-4o on complex data analysis benchmarks, demonstrating improved multi-step reasoning, tool-use, and generalization.

Large language models (LLMs) are increasingly being used to automate data science tasks, but they often struggle with complex, multi-step reasoning and effective tool use. This limitation prevents them from fully tackling the intricate challenges of real-world data analysis. A new research paper introduces a scalable approach to address this, featuring a new dataset called NbQA and a framework named JUPITER.

The core of this work lies in improving how LLMs handle multi-step data analysis. The researchers developed a pipeline to extract high-quality, tool-based data analysis tasks and their executable solutions directly from real-world Jupyter notebooks and associated data files. This process led to the creation of NbQA, a large-scale dataset designed to reflect authentic tool-use patterns in practical data science scenarios.

Understanding NbQA: A Dataset from Real-World Notebooks

NbQA is a significant contribution, comprising 38,635 task–solution pairs. To build it, the team started by crawling approximately 1.6 million Jupyter notebooks and 3.2 million data files from GitHub. These were then rigorously filtered for quality, successful execution, and data complexity, ensuring that only high-quality, diverse examples were retained. GPT-4o mini was also used to score notebook quality and identify the machine learning models present.
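To make the filtering stage concrete, here is a minimal sketch of what an execution-based notebook filter could look like. The specific criteria (minimum cell count, monotonically increasing execution counts) are illustrative assumptions, not the paper's actual pipeline:

```python
import json

def basic_notebook_filter(nb_json: str, min_code_cells: int = 3) -> bool:
    """Illustrative filter: keep notebooks whose code cells all executed
    and that reach a minimum size. Thresholds are hypothetical."""
    nb = json.loads(nb_json)
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    if len(code_cells) < min_code_cells:
        return False
    counts = [c.get("execution_count") for c in code_cells]
    # A cell that never ran has no execution count.
    if any(c is None for c in counts):
        return False
    # A notebook run top-to-bottom has monotonically increasing counts.
    return counts == sorted(counts)
```

A real pipeline would additionally re-execute notebooks in a sandbox and check data-file availability, but even cheap structural checks like this prune a large fraction of a 1.6-million-notebook crawl.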

For fine-grained processing, GPT-4o was employed to extract 1 to 3 representative subtasks from each notebook, covering categories like Summary Statistics, Distribution Analysis, Feature Engineering, and Machine Learning. These tasks were annotated with explicit constraints and standardized output formats for automatic evaluation. Crucially, the solutions for these tasks were not generated by an LLM from scratch but extracted directly from the original notebook’s code and outputs, ensuring they reflect genuine expert workflows.
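Standardized output formats are what make automatic evaluation possible: if every solution must emit its answer in a fixed, machine-parseable form, a grader can check it without an LLM judge. The sketch below assumes a hypothetical `@key[value]` answer convention and a numeric-tolerance constraint; the dataset's exact format may differ:

```python
import re

def check_constrained_answer(output: str, key: str,
                             expected: float, tol: float = 1e-2) -> bool:
    """Illustrative checker for a standardized answer line such as
    '@mean_age[34.56]'. The @key[value] convention is an assumption
    made for this example."""
    m = re.search(rf"@{re.escape(key)}\[([-+0-9.eE]+)\]", output)
    if not m:
        return False
    # Compare within the tolerance stated in the task's constraints.
    return abs(float(m.group(1)) - expected) <= tol
```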

Introducing JUPITER: Data Analysis as a Search Problem

To further enhance LLMs’ multi-step reasoning, the researchers developed JUPITER. This framework formulates data analysis as a search problem within the Jupyter notebook paradigm. Imagine a tree where each node represents a notebook state, including thoughts, code, and execution results. JUPITER uses Monte Carlo Tree Search (MCTS) to explore various solution paths, generating diverse trajectories.
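The tree-search idea can be sketched with a generic MCTS loop. In this toy version, `expand` proposes successor states (in JUPITER these would be LLM-generated thought/code steps) and `rollout` scores a state (JUPITER uses execution feedback and a learned value model instead); the UCT constant and loop structure are standard textbook choices, not the paper's exact configuration:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # e.g. the notebook so far: thoughts, code, results
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

def uct_select(node, c=1.4):
    # Pick the child maximizing the UCT score (exploitation + exploration).
    return max(node.children,
               key=lambda ch: ch.value_sum / max(ch.visits, 1)
               + c * math.sqrt(math.log(node.visits + 1) / max(ch.visits, 1)))

def mcts(root, expand, rollout, n_iters=100):
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend the tree while children exist.
        while node.children:
            node = uct_select(node)
        # 2. Expansion: propose successor notebook states.
        for s in expand(node.state):
            node.children.append(Node(s, parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: score the reached state.
        reward = rollout(node.state)
        # 4. Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits) if root.children else root
```

The important structural point is that each edge is one notebook step, so a path from the root to a leaf is a complete, executable multi-step analysis.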

A key component of JUPITER is its ‘value model’. This model is trained using the trajectories collected during the MCTS process, learning to predict the expected quality of different notebook states. During inference, JUPITER leverages this value model to guide its search, efficiently identifying promising branches and collecting executable multi-step plans with minimal search steps. This intelligent guidance helps the model navigate the vast and often sparse search space of data analysis tasks more effectively.
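At inference time, a trained value model lets the search skip random rollouts entirely: the frontier is simply ranked by predicted state quality. The best-first sketch below captures that idea; `value_fn` stands in for the learned value model, and the algorithmic details are an illustrative simplification, not JUPITER's exact procedure:

```python
import heapq

def value_guided_search(initial_state, expand, value_fn, is_solved, max_steps=50):
    """Best-first search: the value model (value_fn) ranks frontier
    states so the most promising branch is expanded first."""
    counter = 0  # tie-breaker so heapq never compares states directly
    frontier = [(-value_fn(initial_state), counter, initial_state)]
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        if is_solved(state):
            return state
        for nxt in expand(state):
            counter += 1
            heapq.heappush(frontier, (-value_fn(nxt), counter, nxt))
    return None
```

Because every pop picks the highest-value state seen so far, a well-calibrated value model reaches a solution in far fewer expansions than uniform exploration, which matters when each expansion is an LLM call plus code execution.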

Impressive Performance and Generalization

The experimental results demonstrate the effectiveness of NbQA and JUPITER. Fine-tuning models like Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct on NbQA significantly improved their performance on the InfiAgent-DABench benchmark. When combined with JUPITER’s value-guided search, these open-source models achieved remarkable accuracy, with Qwen2.5-14B-Instruct solving 86.38% of tasks. This performance matches or even surpasses that of proprietary models like GPT-4o and other advanced agent frameworks.

Beyond specific benchmarks, JUPITER also showed strong generalization capabilities. Evaluations on DSBench, a dataset of data modeling tasks, revealed that the trained value model could effectively assist search even without further fine-tuning on the specific task format. Similarly, on the AIME 2025 benchmark, which consists of out-of-domain math competition problems, JUPITER enhanced the model’s multi-step tool-use and numerical reasoning abilities, demonstrating its transferability.

In summary, the NbQA dataset provides a rich resource for training LLMs on authentic data analysis workflows, while the JUPITER framework offers a powerful, value-guided search mechanism to navigate complex problem-solving. Together, they represent a significant step forward in empowering LLMs with advanced data analysis capabilities, allowing open-source models to compete with and even exceed the performance of commercial systems. You can read the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach out to him at: [email protected]
