TLDR: Researchers developed Nemotron-CC-Math, a 133-billion-token high-quality math pretraining dataset from Common Crawl. Their novel pipeline uses a text-based browser (Lynx) and an LLM (Phi-4) to accurately extract and standardize mathematical expressions and code, overcoming limitations of prior methods. Pretraining models with Nemotron-CC-Math significantly boosts performance in math, code, and general reasoning tasks, setting a new standard for open-source math corpora.
Large Language Models (LLMs) are becoming increasingly powerful, and a key area of development is improving their ability to reason, especially in mathematics. Pretraining on vast amounts of data is common, but the quality and structure of that data strongly affect the resulting model. A new research paper introduces Nemotron-CC-Math, a high-quality mathematical pretraining dataset designed to overcome the limitations of previous efforts.
The Challenge of Mathematical Data
Existing math-focused datasets, often built from sources like Common Crawl, have struggled with quality. The process of extracting mathematical content from the web is complex due to various formats (MathJax, KaTeX, MathML) and the inherent messiness of HTML-to-text conversion. This often leads to degraded quality, loss of mathematical structure, and the omission of crucial equations or code blocks. This problem has hindered the development of LLMs with robust mathematical reasoning capabilities, as many state-of-the-art models rely on proprietary datasets that are not publicly available.
Nemotron-CC-Math: A Novel Extraction Pipeline
NVIDIA researchers, along with Boston University, have developed Nemotron-CC-Math, a massive 133-billion-token mathematical corpus. What sets this dataset apart is its innovative, domain-agnostic pipeline. Unlike prior methods that used brittle extraction rules, this new pipeline is specifically designed for robust scientific text extraction from Common Crawl, a vast archive of web data.
The pipeline works in several stages:
- Layout-Aware Rendering: It uses a text-based browser called Lynx to render HTML documents into plain text. This is crucial because Lynx preserves mathematical equations and symbols, as well as code formatting, by mimicking how a human would see the page. This avoids the common issue of equations being missed or corrupted by traditional HTML parsers.
- LLM-Based Cleaning: After rendering, a lightweight Large Language Model (Phi-4, a 14-billion-parameter model) refines the output. This LLM removes boilerplate content like navigation bars and redundant headers, standardizes diverse mathematical notations into a consistent LaTeX representation, and corrects inconsistencies or typographical errors. This intelligent cleaning stage ensures high fidelity and structural integrity.
- Quality Filtering and Deduplication: The cleaned data then goes through a quality classification step, in which a classifier scores each page and only high-scoring pages are retained. The score threshold defines the subsets: Nemotron-CC-Math-3+ is the full 133-billion-token corpus, and Nemotron-CC-Math-4+ (52 billion tokens) is the stricter, highest-quality subset. Finally, fuzzy deduplication removes near-duplicate documents, which is vital for efficient and stable model training.
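The rendering, filtering, and deduplication stages can be illustrated with a minimal Python sketch. This is not the researchers' code. The function names, the integer score scale, and the 0.8 similarity threshold are assumptions made for illustration, and exact Jaccard similarity over word shingles stands in for whatever fuzzy deduplication method the pipeline actually uses (large-scale systems typically rely on approximate techniques such as MinHash instead of pairwise comparison). The LLM cleaning stage is omitted. The Lynx call requires the lynx command-line browser to be installed.

```python
import subprocess


def render_with_lynx(html_path: str, width: int = 200) -> str:
    """Render a local HTML file to plain text with the Lynx CLI.

    -dump writes the rendered page to stdout, -nolist drops the trailing
    link list, and -width sets the column width used for the dump.
    """
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", f"-width={width}", html_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_and_dedup(docs, min_score=4, sim_threshold=0.8):
    """Keep documents scoring at least min_score, then drop near-duplicates.

    docs is an iterable of (text, quality_score) pairs. The scores are
    assumed to come from a separate quality classifier (hypothetical here).
    """
    kept_texts = []
    kept_shingles = []
    for text, score in docs:
        if score < min_score:
            continue  # fails the quality threshold
        current = shingles(text)
        if any(jaccard(current, seen) >= sim_threshold for seen in kept_shingles):
            continue  # near-duplicate of a document already kept
        kept_texts.append(text)
        kept_shingles.append(current)
    return kept_texts
```

In this sketch, setting min_score to 4 mirrors the idea behind the 4+ subset and setting it to 3 mirrors the 3+ corpus. The pairwise comparison is quadratic in the number of kept documents, so it only serves to show the logic on small samples.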
Unprecedented Scale and Quality
The result is a dataset that is substantially larger than previous open-source math datasets. Nemotron-CC-Math-4+ alone contains 5.5 times more tokens than FineMath-4+, which was previously considered the highest-quality open math pretraining dataset. The full Nemotron-CC-Math-3+ corpus contains 133 billion tokens across more than 100 million documents. The dataset is predominantly mathematics but also covers computer science, physics, statistics, and economics.
Demonstrated Performance Gains
To validate the effectiveness of Nemotron-CC-Math, the researchers pretrained a Nemotron-T 8B model using their new corpus. The results were impressive, showing substantial improvements across various benchmarks:
- Mathematics: Gains of +4.8 to +12.6 on the MATH benchmark.
- Code Generation: Gains of +4.6 to +14.3 on MBPP+, a code-generation benchmark, attributed to the pipeline’s ability to retain high-quality code snippets with their original syntax and structure.
- General Reasoning: Enhanced general-domain performance on MMLU and MMLU-Stem, indicating that high-quality mathematical data also strengthens broader reasoning skills.
These experiments demonstrate that Nemotron-CC-Math sets a new state of the art among open math pretraining corpora, yielding measurable gains in math, code, and general reasoning. The modular and domain-agnostic nature of the pipeline also means it can be applied to extract high-quality technical content from other scientific fields.
Supporting Open-Source Advancement
In a move to foster community progress, the researchers have openly released their code and datasets. This commitment to open science allows other researchers to reproduce their findings and build upon this foundational work, accelerating advancements in AI’s mathematical and reasoning capabilities. You can find more details in the original research paper.


