NVIDIA and Boston University Unveil Nemotron-CC-Math: A New Era for AI's Mathematical Prowess

TLDR: Researchers developed Nemotron-CC-Math, a 133-billion-token high-quality math pretraining dataset from Common Crawl. Their novel pipeline uses a text-based browser (Lynx) and an LLM (Phi-4) to accurately extract and standardize mathematical expressions and code, overcoming limitations of prior methods. Pretraining models with Nemotron-CC-Math significantly boosts performance in math, code, and general reasoning tasks, setting a new standard for open-source math corpora.

Large Language Models (LLMs) are becoming increasingly powerful, and a key area of development is improving their ability to reason, especially in mathematics. Pretraining LLMs on vast amounts of data is common, but the quality and structure of that data strongly affect performance. A new research paper introduces Nemotron-CC-Math, a high-quality mathematical pretraining dataset designed to overcome the limitations of previous efforts.

The Challenge of Mathematical Data

Existing math-focused datasets, often built from sources like Common Crawl, have struggled with quality. Extracting mathematical content from the web is hard because equations appear in many formats (MathJax, KaTeX, MathML) and HTML-to-text conversion is inherently messy. The result is often degraded text, lost mathematical structure, and missing equations or code blocks. Because many state-of-the-art models rely on proprietary datasets that are not publicly available, these weaknesses in open corpora have hindered open development of LLMs with robust mathematical reasoning capabilities.

Nemotron-CC-Math: A Novel Extraction Pipeline

NVIDIA researchers, along with Boston University, have developed Nemotron-CC-Math, a massive 133-billion-token mathematical corpus. What sets this dataset apart is its innovative, domain-agnostic pipeline. Unlike prior methods that used brittle extraction rules, this new pipeline is specifically designed for robust scientific text extraction from Common Crawl, a vast archive of web data.

The pipeline works in several stages:

Layout-Aware Rendering: It uses a text-based browser called Lynx to render HTML documents into plain text. This is crucial because Lynx preserves mathematical equations and symbols, as well as code formatting, by mimicking how a human would see the page. This avoids the common issue of equations being missed or corrupted by traditional HTML parsers.
LLM-Based Cleaning: After rendering, a lightweight Large Language Model (Phi-4, a 14-billion-parameter model) refines the output. This LLM removes boilerplate content like navigation bars and redundant headers, standardizes diverse mathematical notations into a consistent LaTeX representation, and corrects inconsistencies or typographical errors. This intelligent cleaning stage ensures high fidelity and structural integrity.
Quality Filtering and Deduplication: The cleaned data then undergoes a quality classification step, using a classifier to identify and retain only high-quality pages. This process creates subsets like Nemotron-CC-Math-4+ (52 billion tokens) for the highest quality. Finally, fuzzy deduplication is applied to remove near-duplicate documents, which is vital for efficient and stable model training.
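
The rendering, filtering, and deduplication stages can be illustrated with a minimal Python sketch. It is not the researchers' code. It assumes the `lynx` binary is installed, treats quality scores as integers already produced by some classifier, and swaps in a simple MinHash comparison for the fuzzy deduplication, since the article does not name the exact algorithm. All function names, the width of 200 columns, the score threshold of 4 (chosen to echo the "4+" subset name), and the 0.8 similarity cutoff are hypothetical. The LLM cleaning stage is left out.

```python
import hashlib
import re
import subprocess


def render_with_lynx(html_path: str, width: int = 200) -> str:
    """Render a local HTML file to plain text with the Lynx text browser.

    Requires the `lynx` binary on PATH. The width is an illustrative choice.
    """
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", f"-width={width}", html_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def shingles(text: str, k: int = 5) -> set:
    """Split text into overlapping k-word shingles (case and spacing normalized)."""
    words = re.findall(r"\S+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64) -> list:
    """Build a MinHash signature: the minimum hash of all shingles, per seed."""
    shingle_set = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def filter_and_dedup(scored_docs, min_score: int = 4, sim_threshold: float = 0.8) -> list:
    """Keep documents scoring at least min_score, then drop near-duplicates.

    scored_docs is a list of (text, score) pairs; the scores are assumed to
    come from a separate quality classifier that is not shown here.
    """
    kept_texts, kept_sigs = [], []
    for text, score in scored_docs:
        if score < min_score:
            continue  # below the quality threshold
        sig = minhash_signature(text)
        if any(estimated_similarity(sig, other) >= sim_threshold for other in kept_sigs):
            continue  # near-duplicate of a document already kept
        kept_texts.append(text)
        kept_sigs.append(sig)
    return kept_texts
```

Comparing every signature against every kept signature is quadratic in the number of documents, so a corpus at Common Crawl scale would need a sub-quadratic scheme such as locality-sensitive hashing. The sketch only shows the idea of matching documents by approximate overlap rather than exact equality.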

Unprecedented Scale and Quality

The result is a dataset substantially larger than previous open-source math datasets. Nemotron-CC-Math-4+ alone contains 5.5 times more tokens than FineMath-4+, which was previously considered the highest-quality open math pretraining dataset. The full Nemotron-CC-Math-3+ corpus contains 133 billion tokens across more than 100 million documents. The dataset is predominantly mathematics but also covers computer science, physics, statistics, and economics.

Demonstrated Performance Gains

To validate the effectiveness of Nemotron-CC-Math, the researchers pretrained a Nemotron-T 8B model using their new corpus. The results were impressive, showing substantial improvements across various benchmarks:

Mathematics: Gains of +4.8 to +12.6 on the MATH benchmark.
Code Generation: Gains of +4.6 to +14.3 on MBPP+, a code generation benchmark, attributed to the pipeline’s ability to retain high-quality code snippets with their original syntax and structure.
General Reasoning: Enhanced general-domain performance on MMLU and MMLU-Stem, indicating that high-quality mathematical data also strengthens broader reasoning skills.

These experiments demonstrate that Nemotron-CC-Math sets a new state of the art among open math pretraining corpora, yielding measurable gains in math, code, and general reasoning. The modular and domain-agnostic nature of the pipeline also means it can be applied to extract high-quality technical content from other scientific fields.

Supporting Open-Source Advancement

In a move to foster community progress, the researchers have openly released their code and datasets. This commitment to open science allows other researchers to reproduce their findings and build upon this foundational work, accelerating advancements in AI’s mathematical and reasoning capabilities. You can find more details in the original research paper.
