TLDR: A new research paper introduces NbQA, a large-scale dataset of real-world, multi-step data analysis tasks extracted from Jupyter notebooks, and JUPITER, a framework that uses value-guided Monte Carlo Tree Search to enhance LLMs’ data analysis capabilities. This approach allows fine-tuned open-source models to match or exceed proprietary models like GPT-4o on complex data analysis benchmarks, demonstrating improved multi-step reasoning, tool use, and generalization.
Large language models (LLMs) are increasingly being used to automate data science tasks, but they often struggle with complex, multi-step reasoning and effective tool use. This limitation prevents them from fully tackling the intricate challenges of real-world data analysis. A new research paper introduces a scalable approach to address this, featuring a new dataset called NbQA and a framework named JUPITER.
The core of this work lies in improving how LLMs handle multi-step data analysis. The researchers developed a pipeline to extract high-quality, tool-based data analysis tasks and their executable solutions directly from real-world Jupyter notebooks and associated data files. This process led to the creation of NbQA, a large-scale dataset designed to reflect authentic tool-use patterns in practical data science scenarios.
Understanding NbQA: A Dataset from Real-World Notebooks
NbQA is a significant contribution, comprising 38,635 task–solution pairs. To build it, the team started by crawling approximately 1.6 million Jupyter notebooks and 3.2 million data files from GitHub. These were then rigorously filtered for quality, successful execution, and data complexity, so that only high-quality, diverse examples were retained. GPT-4o mini was also used to score notebook quality and to identify the machine learning models present in each notebook.
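The article does not include the pipeline’s code, but the described filter (re-run each notebook, then score it) maps naturally onto a simple loop. The sketch below is a minimal, hypothetical version: the `nbclient`-based executor is a standard way to re-execute notebooks, while `llm_quality_score` and the `min_score` threshold are stand-ins for the GPT-4o mini scoring step, not the paper’s actual implementation.

```python
# Hypothetical sketch of the execution/quality filter; helper names and
# thresholds are assumptions, not NbQA's published pipeline.
import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError

def executes_cleanly(path: str) -> bool:
    """Re-run a notebook end-to-end; keep it only if no cell errors."""
    nb = nbformat.read(path, as_version=4)
    try:
        NotebookClient(nb, timeout=60).execute()
        return True
    except CellExecutionError:
        return False

def llm_quality_score(source: str) -> int:
    """Stand-in for the GPT-4o mini scorer described above (assumed to
    return a 1-5 quality rating)."""
    return 5  # stub: wire up an LLM client here

def filter_notebooks(paths: list[str], min_score: int = 4) -> list[str]:
    """Keep notebooks that both execute successfully and score well."""
    return [
        p for p in paths
        if executes_cleanly(p)
        and llm_quality_score(open(p, encoding="utf-8").read()) >= min_score
    ]
```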
For fine-grained task construction, GPT-4o was employed to extract one to three representative subtasks from each notebook, covering categories such as Summary Statistics, Distribution Analysis, Feature Engineering, and Machine Learning. Each task was annotated with explicit constraints and a standardized output format so that solutions can be evaluated automatically. Crucially, the solutions for these tasks were not generated by an LLM from scratch; they were extracted directly from the original notebook’s code and outputs, ensuring they reflect genuine expert workflows.
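To make the resulting records concrete, here is one illustrative way a task–solution pair could be represented. The field names (`question`, `constraints`, `output_format`, and so on) are assumptions drawn from the description above, not NbQA’s actual schema.

```python
# Illustrative NbQA-style record; field names are assumptions based on
# the article's description, not the dataset's real schema.
from dataclasses import dataclass, field

@dataclass
class NbQATask:
    question: str              # subtask extracted by GPT-4o
    category: str              # e.g. "Feature Engineering"
    constraints: list[str]     # explicit constraints on the solution
    output_format: str         # standardized format for auto-evaluation
    data_files: list[str] = field(default_factory=list)

@dataclass
class NbQASolution:
    code: str                  # taken from the source notebook's cells
    output: str                # the notebook's recorded execution output

example = NbQATask(
    question="Compute the mean and median house price.",
    category="Summary Statistics",
    constraints=["Round results to two decimal places."],
    output_format="mean=<value>, median=<value>",
    data_files=["housing.csv"],
)
```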
Introducing JUPITER: Data Analysis as a Search Problem
To further enhance LLMs’ multi-step reasoning, the researchers developed JUPITER. This framework formulates data analysis as a search problem within the Jupyter notebook paradigm. Imagine a tree where each node represents a notebook state, including thoughts, code, and execution results. JUPITER uses Monte Carlo Tree Search (MCTS) to explore various solution paths, generating diverse trajectories.
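As a concrete illustration, a node in such a tree could be modeled as below. The fields mirror the description (thought, code, execution result) plus the visit count and value estimate that MCTS bookkeeping needs; the actual JUPITER implementation may differ.

```python
# Hypothetical notebook-state node for the search tree; field names are
# assumptions mirroring the article's description, not JUPITER's code.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class NotebookState:
    thought: str                          # model's reasoning for this step
    code: str                             # code cell proposed at this step
    result: str                           # output from executing the cell
    parent: NotebookState | None = None
    children: list[NotebookState] = field(default_factory=list)
    visits: int = 0                       # MCTS visit count
    value: float = 0.0                    # running value estimate

    def expand(self, thought: str, code: str, result: str) -> NotebookState:
        """Attach one candidate next step (a new notebook cell) as a child."""
        child = NotebookState(thought, code, result, parent=self)
        self.children.append(child)
        return child
```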
A key component of JUPITER is its ‘value model’. This model is trained on the trajectories collected during the MCTS process, learning to predict the expected quality of different notebook states. During inference, JUPITER uses this value model to guide the search, efficiently identifying promising branches and collecting executable multi-step plans in fewer search steps. This guidance helps the model navigate the vast search space of data analysis tasks, where useful reward signals are sparse, far more effectively.
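To show how the value model can steer inference, here is a minimal best-first sketch. The `propose_steps`, `value_fn`, and `is_solution` callables are assumed placeholders for the LLM step generator, the trained value model, and the task’s completion check; JUPITER’s actual inference procedure combines this kind of guidance with MCTS rather than plain best-first search.

```python
# Minimal value-guided best-first search; an illustration of the idea,
# not JUPITER's exact inference algorithm.
from typing import Callable, Optional, TypeVar

S = TypeVar("S")  # a notebook state, e.g. the NotebookState class above

def value_guided_search(
    root: S,
    propose_steps: Callable[[S], list[S]],   # LLM proposes candidate next cells
    value_fn: Callable[[S], float],          # trained value model
    is_solution: Callable[[S], bool],        # does this state solve the task?
    max_steps: int = 20,
) -> Optional[S]:
    """Expand the most promising state until an executable multi-step
    plan is found or the search budget runs out."""
    frontier = [root]
    for _ in range(max_steps):
        if not frontier:
            break
        state = max(frontier, key=value_fn)  # highest-value state wins
        frontier.remove(state)
        for child in propose_steps(state):
            if is_solution(child):
                return child
            frontier.append(child)
    return None
```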
Impressive Performance and Generalization
The experimental results demonstrate the effectiveness of NbQA and JUPITER. Fine-tuning Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct on NbQA significantly improved their performance on the InfiAgent-DABench benchmark. When combined with JUPITER’s value-guided search, these open-source models achieved remarkable accuracy, with Qwen2.5-14B-Instruct solving 86.38% of tasks. This performance matches or even surpasses that of proprietary models like GPT-4o and other advanced agent frameworks.
Beyond specific benchmarks, JUPITER also showed strong generalization. Evaluations on DSBench, a dataset of data modeling tasks, revealed that the trained value model could effectively assist the search even without further fine-tuning on that task format. Similarly, on the AIME 2025 benchmark, which consists of out-of-domain math competition problems, JUPITER improved the model’s multi-step tool use and numerical reasoning, demonstrating its transferability.
In summary, the NbQA dataset provides a rich resource for training LLMs on authentic data analysis workflows, while the JUPITER framework offers a powerful, value-guided search mechanism to navigate complex problem-solving. Together, they represent a significant step forward in empowering LLMs with advanced data analysis capabilities, allowing open-source models to compete with and even exceed the performance of commercial systems. You can read the full research paper here.


