TLDR: A new research paper introduces NbQA, a large-scale dataset of real-world, multi-step data analysis tasks extracted from Jupyter notebooks, and JUPITER, a framework that uses value-guided Monte Carlo Tree Search to enhance LLMs’ data analysis capabilities. This approach allows fine-tuned open-source models to match or exceed proprietary models like GPT-4o on complex data analysis benchmarks, demonstrating improved multi-step reasoning, tool use, and generalization.
Large language models (LLMs) are increasingly being used to automate data science tasks, but they often struggle with complex, multi-step reasoning and effective tool use. This limitation prevents them from fully tackling the intricate challenges of real-world data analysis. A new research paper introduces a scalable approach to address this, featuring a new dataset called NbQA and a framework named JUPITER.
The core of this work lies in improving how LLMs handle multi-step data analysis. The researchers developed a pipeline to extract high-quality, tool-based data analysis tasks and their executable solutions directly from real-world Jupyter notebooks and associated data files. This process led to the creation of NbQA, a large-scale dataset designed to reflect authentic tool-use patterns in practical data science scenarios.
Understanding NbQA: A Dataset from Real-World Notebooks
NbQA is a significant contribution, comprising 38,635 task–solution pairs. To build it, the team started by crawling approximately 1.6 million Jupyter notebooks and 3.2 million data files from GitHub. These were then rigorously filtered for quality, successful execution, and data complexity, so that only high-quality, diverse examples were retained. GPT-4o mini was also used to score notebook quality and to identify the machine learning models present in each notebook.
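The article does not include the pipeline’s code, but the described filter (re-run each notebook, then score it) maps naturally onto a simple loop. The sketch below is a minimal, hypothetical version: the `nbclient`-based executor is a standard way to re-execute notebooks, while `llm_quality_score` and the `min_score` threshold are stand-ins for the GPT-4o mini scoring step, not the paper’s actual implementation.

```python
# Hypothetical sketch of the execution/quality filter; helper names and
# thresholds are assumptions, not NbQA's published pipeline.
import nbformat
from nbclient import NotebookClient
from nbclient.exceptions import CellExecutionError

def executes_cleanly(path: str) -> bool:
    """Re-run a notebook end-to-end; keep it only if no cell errors."""
    nb = nbformat.read(path, as_version=4)
    try:
        NotebookClient(nb, timeout=60).execute()
        return True
    except CellExecutionError:
        return False

def llm_quality_score(source: str) -> int:
    """Stand-in for the GPT-4o mini scorer described above (assumed to
    return a 1-5 quality rating)."""
    return 5  # stub: wire up an LLM client here

def filter_notebooks(paths: list[str], min_score: int = 4) -> list[str]:
    """Keep notebooks that both execute successfully and score well."""
    return [
        p for p in paths
        if executes_cleanly(p)
        and llm_quality_score(open(p, encoding="utf-8").read()) >= min_score
    ]
```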
For fine-grained task construction, GPT-4o was employed to extract one to three representative subtasks from each notebook, covering categories such as Summary Statistics, Distribution Analysis, Feature Engineering, and Machine Learning. Each task was annotated with explicit constraints and a standardized output format so that solutions can be evaluated automatically. Crucially, the solutions for these tasks were not generated by an LLM from scratch; they were extracted directly from the original notebook’s code and outputs, ensuring they reflect genuine expert workflows.
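To make the resulting records concrete, here is one illustrative way a task–solution pair could be represented. The field names (`question`, `constraints`, `output_format`, and so on) are assumptions drawn from the description above, not NbQA’s actual schema.

```python
# Illustrative NbQA-style record; field names are assumptions based on
# the article's description, not the dataset's real schema.
from dataclasses import dataclass, field

@dataclass
class NbQATask:
    question: str              # subtask extracted by GPT-4o
    category: str              # e.g. "Feature Engineering"
    constraints: list[str]     # explicit constraints on the solution
    output_format: str         # standardized format for auto-evaluation
    data_files: list[str] = field(default_factory=list)

@dataclass
class NbQASolution:
    code: str                  # taken from the source notebook's cells
    output: str                # the notebook's recorded execution output

example = NbQATask(
    question="Compute the mean and median house price.",
    category="Summary Statistics",
    constraints=["Round results to two decimal places."],
    output_format="mean=<value>, median=<value>",
    data_files=["housing.csv"],
)
```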
Introducing JUPITER: Data Analysis as a Search Problem
To further enhance LLMs’ multi-step reasoning, the researchers developed JUPITER. This framework formulates data analysis as a search problem within the Jupyter notebook paradigm. Imagine a tree where each node represents a notebook state, including thoughts, code, and execution results. JUPITER uses Monte Carlo Tree Search (MCTS) to explore various solution paths, generating diverse trajectories.
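As a concrete illustration, a node in such a tree could be modeled as below. The fields mirror the description (thought, code, execution result) plus the visit count and value estimate that MCTS bookkeeping needs; the actual JUPITER implementation may differ.

```python
# Hypothetical notebook-state node for the search tree; field names are
# assumptions mirroring the article's description, not JUPITER's code.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class NotebookState:
    thought: str                          # model's reasoning for this step
    code: str                             # code cell proposed at this step
    result: str                           # output from executing the cell
    parent: NotebookState | None = None
    children: list[NotebookState] = field(default_factory=list)
    visits: int = 0                       # MCTS visit count
    value: float = 0.0                    # running value estimate

    def expand(self, thought: str, code: str, result: str) -> NotebookState:
        """Attach one candidate next step (a new notebook cell) as a child."""
        child = NotebookState(thought, code, result, parent=self)
        self.children.append(child)
        return child
```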
A key component of JUPITER is its ‘value model’. This model is trained on the trajectories collected during the MCTS process, learning to predict the expected quality of different notebook states. During inference, JUPITER uses this value model to guide the search, efficiently identifying promising branches and collecting executable multi-step plans in fewer search steps. This guidance helps the model navigate the vast search space of data analysis tasks, where useful reward signals are sparse, far more effectively.
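To show how the value model can steer inference, here is a minimal best-first sketch. The `propose_steps`, `value_fn`, and `is_solution` callables are assumed placeholders for the LLM step generator, the trained value model, and the task’s completion check; JUPITER’s actual inference procedure combines this kind of guidance with MCTS rather than plain best-first search.

```python
# Minimal value-guided best-first search; an illustration of the idea,
# not JUPITER's exact inference algorithm.
from typing import Callable, Optional, TypeVar

S = TypeVar("S")  # a notebook state, e.g. the NotebookState class above

def value_guided_search(
    root: S,
    propose_steps: Callable[[S], list[S]],   # LLM proposes candidate next cells
    value_fn: Callable[[S], float],          # trained value model
    is_solution: Callable[[S], bool],        # does this state solve the task?
    max_steps: int = 20,
) -> Optional[S]:
    """Expand the most promising state until an executable multi-step
    plan is found or the search budget runs out."""
    frontier = [root]
    for _ in range(max_steps):
        if not frontier:
            break
        state = max(frontier, key=value_fn)  # highest-value state wins
        frontier.remove(state)
        for child in propose_steps(state):
            if is_solution(child):
                return child
            frontier.append(child)
    return None
```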
Impressive Performance and Generalization
The experimental results demonstrate the effectiveness of NbQA and JUPITER. Fine-tuning Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct on NbQA significantly improved their performance on the InfiAgent-DABench benchmark. When combined with JUPITER’s value-guided search, these open-source models achieved remarkable accuracy, with Qwen2.5-14B-Instruct solving 86.38% of tasks. This performance matches or even surpasses that of proprietary models like GPT-4o and other advanced agent frameworks.
Beyond specific benchmarks, JUPITER also showed strong generalization. Evaluations on DSBench, a dataset of data modeling tasks, revealed that the trained value model could effectively assist the search even without further fine-tuning on that task format. Similarly, on the AIME 2025 benchmark, which consists of out-of-domain math competition problems, JUPITER improved the model’s multi-step tool use and numerical reasoning, demonstrating its transferability.
In summary, the NbQA dataset provides a rich resource for training LLMs on authentic data analysis workflows, while the JUPITER framework offers a powerful, value-guided search mechanism to navigate complex problem-solving. Together, they represent a significant step forward in empowering LLMs with advanced data analysis capabilities, allowing open-source models to compete with and even exceed the performance of commercial systems. You can read the full research paper here.


