TLDR: A large-scale study investigated the impact of synthetic data on LLM pre-training, revealing that strategic mixtures of synthetic and natural data can significantly accelerate convergence (5-10x speedup with 1/3 rephrased synthetic data). The effectiveness is highly dependent on the synthetic data type and mixture ratio, with optimal ratios for rephrased data converging around 30%. Surprisingly, larger generator models (e.g., 70B parameters) do not always produce superior synthetic data compared to moderately sized ones (e.g., 8B parameters). The study also provides nuanced evidence on ‘model collapse,’ showing no degradation with rephrased synthetic data but patterns consistent with collapse for textbook-style synthetic data mixtures.
Rapid advances in Large Language Models (LLMs) rely heavily on vast amounts of high-quality training data, but the supply of such natural data is becoming increasingly limited. This has led researchers to explore synthetic data – text generated by existing models or automated systems – as a way to augment or even replace human-written content during pre-training.
A recent large-scale study, detailed in the paper “Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls,” examines the effectiveness, conditions, and potential drawbacks of using synthetic data. Conducted by researchers from FAIR at Meta, Virginia Tech, and Cerebras Systems, together with an independent consultant, the investigation trained over 1,000 LLMs, consumed more than 100,000 GPU hours, and provides a unified protocol for understanding synthetic data’s role.
Conditional Benefits and Optimal Mixtures
The study found that synthetic data is not a one-size-fits-all solution but offers significant benefits when used strategically. Pre-training on rephrased synthetic data alone was not faster than pre-training on natural web text. However, a mixture of 1/3 rephrased synthetic data and 2/3 natural web text reached the same validation loss 5-10 times faster, especially at larger data budgets. This suggests that synthetic data acts as a powerful accelerator rather than a standalone replacement.
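To make the speedup figure concrete, here is a minimal sketch of how a "tokens to reach the same validation loss" ratio could be computed from two loss curves. The function names and the curve values are hypothetical and only illustrative; they are not taken from the paper, and the linear interpolation between checkpoints is a simplifying assumption.

```python
def tokens_to_reach(curve, target_loss):
    """Return the token count at which a loss curve first reaches target_loss.

    `curve` is a list of (tokens_seen, validation_loss) pairs sorted by tokens.
    Linear interpolation between checkpoints is an assumption of this sketch.
    Returns None if the curve never reaches the target.
    """
    prev_tokens, prev_loss = None, None
    for tokens, loss in curve:
        if loss <= target_loss:
            if prev_tokens is None or prev_loss == loss:
                return float(tokens)
            frac = (prev_loss - target_loss) / (prev_loss - loss)
            return prev_tokens + frac * (tokens - prev_tokens)
        prev_tokens, prev_loss = tokens, loss
    return None


def convergence_speedup(baseline_curve, mixture_curve, target_loss):
    """Ratio of tokens the baseline needs to tokens the mixture needs."""
    base = tokens_to_reach(baseline_curve, target_loss)
    mix = tokens_to_reach(mixture_curve, target_loss)
    if base is None or mix is None:
        return None
    return base / mix


# Made-up curves, for illustration only (not results from the study).
natural_only = [(10**9, 3.4), (10**10, 3.0), (10**11, 2.8)]
one_third_rephrased = [(10**9, 3.2), (10**10, 2.8), (10**11, 2.6)]
print(convergence_speedup(natural_only, one_third_rephrased, 2.8))  # 10.0 for these made-up curves
```

A ratio above 1 means the mixture needed fewer tokens than the natural-data baseline to hit the target loss; a result of None means at least one run never got there within its budget.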
The type of synthetic data and its proportion in the training mixture are critical. The research explored two main paradigms: web rephrasing (creating high-quality or question-answering styles from existing web content) and synthetic textbooks (generating entirely new, dense educational content). While rephrased data showed clear advantages in mixtures, pre-training solely on textbook-style synthetic data resulted in notably higher loss across many downstream domains, particularly with smaller data budgets.
Through a fine-grained grid search, the researchers found that the best ratio of synthetic data in a mixture depends on the data type, model scale, and data budget. For high-quality rephrased data, the optimal mixture consistently hovered around 30% synthetic data combined with 70% CommonCrawl. For question-answering rephrased data, the optimal ratio tended to decrease with larger models and data sizes, also converging towards 30%. Textbook-style data showed benefits primarily at larger scales, with optimal ratios generally remaining below those for rephrased data.
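The mixing and grid-search idea can be sketched in a few lines. This is an illustrative sketch only: `sample_mixture`, `pick_best_ratio`, the toy document lists, and the loss values are all hypothetical, and the study's actual data pipeline is not described here.

```python
import random


def sample_mixture(natural_docs, synthetic_docs, synthetic_fraction, n_samples, seed=0):
    """Draw documents so that roughly `synthetic_fraction` come from the synthetic pool."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    batch = []
    for _ in range(n_samples):
        # Choose the source per sample, then a document within that source.
        source = synthetic_docs if rng.random() < synthetic_fraction else natural_docs
        batch.append(rng.choice(source))
    return batch


def pick_best_ratio(loss_by_ratio):
    """Return the synthetic fraction with the lowest validation loss."""
    return min(loss_by_ratio, key=loss_by_ratio.get)


# Toy pools standing in for CommonCrawl text and rephrased text.
natural = ["web-doc-a", "web-doc-b", "web-doc-c"]
synthetic = ["rephrased-doc-a", "rephrased-doc-b"]
batch = sample_mixture(natural, synthetic, synthetic_fraction=0.3, n_samples=10)

# Hypothetical grid-search results: one validation loss per trained mixture.
hypothetical_losses = {0.0: 3.10, 0.1: 3.02, 0.3: 2.95, 0.5: 3.00, 1.0: 3.20}
print(pick_best_ratio(hypothetical_losses))
```

In a real grid search each entry in the loss table would come from a separate training run, and the table would be repeated per data type, model size, and data budget, since the article notes the best ratio shifts with all three.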
Generator Model Impact and Model Collapse
A surprising finding concerned the capability of the generator model used to create synthetic data. It’s often assumed that larger, more capable generator models would produce superior synthetic data. However, the study challenged this intuition. While a certain baseline capability was beneficial (e.g., synthetic data from Llama-3-8B outperformed Llama-3-3B), increasing the generator size further to Llama-3-70B did not consistently yield better synthetic data for pre-training. In some cases, the Llama-3-70B generator even led to worse evaluation results for high-quality rephrased data, suggesting that factors beyond sheer scale, such as instruction-following fidelity or diversity of generated outputs, play a crucial role.
The research also contributed mixed evidence on the theoretical concern of “model collapse,” where recursive training on model-generated data could degrade performance. For single-round training, rephrased synthetic data showed no degradation in performance at foreseeable scales. In fact, mixtures with rephrased data were projected to achieve a lower irreducible loss (the theoretical minimum loss) than natural data alone, with 33% high-quality rephrased data + 67% CommonCrawl showing the lowest projected irreducible loss. Conversely, training on mixtures containing purely generated, textbook-style synthetic data did show patterns consistent with predictions of model collapse, resulting in notably higher loss.
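For readers unfamiliar with the term, an irreducible loss is usually the constant floor in a fitted scaling law. The sketch below assumes a common power-law form, L(D) = E + B * D^(-beta), where D is the number of training tokens and E is the floor; this form, the function name, and the fitting procedure (scanning candidate floors and fitting a line in log-log space) are assumptions for illustration and may differ from what the paper actually fits.

```python
import math


def fit_power_law_with_floor(tokens, losses, steps=2000):
    """Fit L(D) = E + B * D**(-beta) by scanning candidate floors E.

    For each candidate E below the smallest observed loss, log(L - E) is
    regressed on log(D) by ordinary least squares; the E with the smallest
    squared residual wins. Returns (E, B, beta).
    """
    xs = [math.log(d) for d in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    min_loss = min(losses)
    best = None
    for i in range(steps):
        e = min_loss * i / steps  # candidate floor in [0, min_loss)
        ys = [math.log(l - e) for l in losses]
        mean_y = sum(ys) / n
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, e, math.exp(intercept), -slope)
    _, e, b, beta = best
    return e, b, beta


# Synthetic points generated from a known curve, purely to exercise the fit.
demo_tokens = [10 ** k for k in (8, 9, 10, 11)]
demo_losses = [2.0 + 50.0 * d ** -0.3 for d in demo_tokens]
floor, scale, exponent = fit_power_law_with_floor(demo_tokens, demo_losses)
print(round(floor, 2), round(exponent, 2))
```

Comparing the fitted floor E across data mixtures is one way to read a claim like "lowest projected irreducible loss": the mixture whose curve flattens out at the smallest E is projected to end up best given unlimited data, under the assumed functional form.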
Practical Guidance
The study underscores that synthetic data is not a magic bullet but a powerful tool that requires careful, empirically informed deployment. Its benefits depend on the generation method, the mixture strategy, and even the choice of generator model. For LLM developers, the practical takeaway is to mix strategically and to understand how different synthetic data types behave, in order to accelerate pre-training convergence and potentially achieve better final performance without succumbing to model collapse.


