TL;DR: This research explores using Contrastive Decoding (CD) to generate high-quality synthetic data for training language models, especially in low-resource settings. It finds that synthetic data, particularly from CD, significantly improves model performance on reasoning-oriented tasks, while traditional sampling benefits linguistic competence. The study highlights that the contrastive mechanism, rather than simple masking, is crucial for these gains, and that using an earlier model checkpoint as the ‘bad’ model is an effective strategy.
Large language models (LLMs) have demonstrated remarkable capabilities, but their demand for vast amounts of text is pushing the limits of publicly available data. This challenge has led researchers to explore synthetic data generation as a promising way to expand training corpora. However, simply generating text from existing models can introduce noise and factual errors, and can even lead to a phenomenon known as “model collapse” over successive generations of models.
A recent study, Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling, investigates a novel approach: leveraging Contrastive Decoding (CD) to create higher-quality synthetic data for training new language models from scratch. CD is conventionally an inference-time strategy that improves the quality of generated text by contrasting a ‘GOOD’ (stronger) model with a ‘BAD’ (weaker) model, amplifying the preferences of the stronger model.
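At the token level, CD scores each candidate by the gap between the two models’ log-probabilities, restricted to tokens the GOOD model itself considers plausible. Below is a minimal PyTorch sketch of that rule; the threshold value `alpha` and the use of sampling (the original CD paper searches over the scores, whereas generating a whole corpus presumably samples from them) are assumptions, not details taken from the paper.

```python
import math
import torch

def contrastive_decode_step(good_logits, bad_logits, alpha=0.1):
    """Sample one token via Contrastive Decoding.

    good_logits / bad_logits: (batch, vocab) next-token logits from the
    GOOD and BAD models. alpha is the adaptive-plausibility threshold of
    the original CD formulation; the exact value used in this study is
    an assumption.
    """
    good_logp = torch.log_softmax(good_logits, dim=-1)
    bad_logp = torch.log_softmax(bad_logits, dim=-1)

    # Adaptive plausibility: keep only tokens the GOOD model rates at least
    # alpha times its top probability, so CD cannot reward implausible
    # tokens merely because the BAD model dislikes them.
    cutoff = math.log(alpha) + good_logp.max(dim=-1, keepdim=True).values
    plausible = good_logp >= cutoff

    # Contrastive score: amplify tokens the GOOD model prefers and the
    # BAD model does not.
    scores = (good_logp - bad_logp).masked_fill(~plausible, float("-inf"))
    return torch.distributions.Categorical(logits=scores).sample()
```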
The Core Idea: CD as a Corpus Generator
The researchers, Jannek Ulm, Kevin Du, and Vésteinn Snæbjarnarson, repurposed this inference-time technique for data generation. Their goal was to determine whether the benefits of CD in generating coherent, informative text carry over when it is used to synthesize entire corpora for pre-training. They conducted experiments in a controlled setting that prioritizes data efficiency, adhering to the BabyLM Challenge’s strict budget of 100 million words.
Experimental Approach
The experimental pipeline involved several key steps:
- Starting with an original corpus (100 million tokens, a modified BabyLM corpus called TinyBabyLM).
- Training baseline language models on this original corpus.
- Generating synthetic corpora (each 100 million tokens) using two main strategies: Contrastive Decoding and standard (non-contrastive) ancestral sampling.
- Training new models on a mixture of the original and the newly generated synthetic corpora.
- Evaluating these models on a suite of downstream tasks, including those testing linguistic competence and reasoning skills.
For Contrastive Decoding, the choice of the ‘BAD’ model was crucial. The study explored three methods: using smaller models, earlier checkpoints of the same model, or applying attention dropout to the ‘GOOD’ model during inference. The ‘GOOD’ model was selected as the best performing checkpoint from the baseline training runs.
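With the earlier-checkpoint strategy, both models come from a single training run, so generation needs only two forward passes per token. A minimal sketch using Hugging Face transformers, with hypothetical checkpoint paths and reusing `contrastive_decode_step` from the sketch above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical paths: the best checkpoint plays 'GOOD', an earlier
# checkpoint of the same training run plays 'BAD'.
tok = AutoTokenizer.from_pretrained("checkpoints/final")
good = AutoModelForCausalLM.from_pretrained("checkpoints/final").eval()
bad = AutoModelForCausalLM.from_pretrained("checkpoints/early").eval()

@torch.no_grad()
def generate_synthetic(prompt, max_new_tokens=256):
    """Generate one synthetic document token by token with CD."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        good_logits = good(ids).logits[:, -1, :]  # GOOD model's next-token logits
        bad_logits = bad(ids).logits[:, -1, :]    # BAD model's next-token logits
        next_id = contrastive_decode_step(good_logits, bad_logits)
        ids = torch.cat([ids, next_id.unsqueeze(-1)], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)
```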
Key Findings and Insights
The research yielded several significant findings:
Firstly, mixing synthetic data with real data consistently improved performance across the board compared to training solely on real text. This underscores the potential of synthetic data to augment limited real-world corpora.
Secondly, and most notably, Contrastive Decoding proved particularly effective for tasks requiring reasoning skills. Models trained with CD-generated synthetic data showed stronger gains on benchmarks like BLiMP Supplement, Entity Tracking, EWoK (Elements of World Knowledge), and WUG (morphology evaluation). In contrast, synthetic data generated through traditional non-contrastive sampling led to better performance on the language modeling objective (lower perplexity) and core linguistic competence tasks like BLiMP.
Thirdly, the study confirmed that the ‘contrastive scoring’ mechanism – subtracting the weaker model’s log-probabilities from the stronger model’s – was the key driver of CD’s benefits, not merely restricting token choices to a plausible set. This suggests that CD actively shapes the synthetic text towards more constrained, reasoning-relevant trajectories.
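That ablation is easy to picture in code: keep CD’s plausibility mask but drop the contrastive term, so generation simply samples from the GOOD model’s own truncated distribution. A sketch of such a masking-only baseline (whether this matches the authors’ exact ablation setup is an assumption):

```python
import math
import torch

def plausibility_only_step(good_logits, alpha=0.1):
    """Masking-only baseline: restrict to the GOOD model's plausible
    tokens but keep its own probabilities; no BAD-model subtraction."""
    good_logp = torch.log_softmax(good_logits, dim=-1)
    cutoff = math.log(alpha) + good_logp.max(dim=-1, keepdim=True).values
    scores = good_logp.masked_fill(good_logp < cutoff, float("-inf"))
    return torch.distributions.Categorical(logits=scores).sample()
```

Per the study, synthetic corpora generated this way do not match the reasoning gains of full CD, which isolates the subtraction itself as the active ingredient.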
Finally, among the various ways to instantiate a ‘BAD’ model, using an earlier checkpoint of the same model emerged as the most effective and practical choice. This method requires no additional model training, making it an operationally attractive option for generating a strong contrastive signal.
The researchers also found that mixing 30% synthetic data with 70% real data yielded the strongest overall performance, and that light truncation (e.g., top-k = 200) during generation could provide additional headroom, especially for CD.
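How the mix is assembled is not spelled out above, so the following is only one plausible procedure: draw whole documents from each pool until per-pool token budgets are met, then shuffle. The document pools, the `(text, n_tokens)` representation, and the fixed total budget are all assumptions. The top-k = 200 truncation, by contrast, would live inside the decoding step, zeroing out all but the 200 highest-scoring tokens before sampling.

```python
import random

def mix_corpora(real_docs, synth_docs, synth_ratio=0.3,
                total_tokens=100_000_000, seed=0):
    """Assemble a corpus in which ~synth_ratio of tokens are synthetic.

    real_docs / synth_docs: lists of (text, n_tokens) pairs.
    """
    rng = random.Random(seed)

    def take(docs, budget):
        # Draw shuffled documents until the token budget is exhausted.
        picked, used = [], 0
        for text, n in rng.sample(docs, len(docs)):
            if used + n > budget:
                break
            picked.append(text)
            used += n
        return picked

    mixed = (take(real_docs, total_tokens * (1 - synth_ratio))
             + take(synth_docs, total_tokens * synth_ratio))
    rng.shuffle(mixed)
    return mixed
```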
Implications for Future Language Model Development
These findings suggest a practical division of labor for synthetic data generation: Contrastive Decoding is beneficial when the downstream applications demand multi-step inference, state maintenance, or world knowledge, while vanilla sampling is more suited for minimizing perplexity or improving core grammatical regularities. This work opens avenues for future research, including exploring iterative applications of CD and addressing limitations related to scale, bias, and factuality in synthetic data.


