TLDR: A new study shows that current methods for detecting benchmark contamination in large reasoning models (LRMs) are easily circumvented. Reinforcement learning (RL) training can effectively conceal earlier contamination, and even extensive contamination with Chain-of-Thought (CoT) data on advanced LRMs leaves few traces that existing techniques can detect. This fragility undermines the integrity of AI leaderboards and calls for new, more robust detection protocols that account for LRMs’ generalization abilities rather than focusing solely on memorization.
The competitive landscape of large reasoning models (LRMs) drives developers to achieve top rankings on performance leaderboards. A concerning practice, known as benchmark contamination, involves incorporating evaluation benchmarks directly into a model’s training data. This leads to artificially inflated performance, compromising the fairness and trustworthiness of these public rankings. Despite the existence of numerous methods designed to detect such contamination, a recent study highlights a surprising vulnerability: these detection mechanisms are alarmingly easy to bypass.
Researchers from the University of Illinois Urbana-Champaign and the University of Washington have conducted the first comprehensive study into benchmark contamination specifically within LRMs. Their findings expose a critical weakness in how these advanced AI models are currently evaluated.
Contamination During Model Evolution (Pre-LRM Stage)
The study examined two scenarios in which contamination can occur. The first, “Stage I (pre-LRM),” covers contamination introduced while a base model is developed into an LRM through supervised fine-tuning (SFT) and reinforcement learning (RL). Contamination introduced during the SFT phase is initially detectable by existing methods. However, the research revealed that even a brief period of Group Relative Policy Optimization (GRPO) training can significantly obscure these signals. Empirical experiments and theoretical analysis point to Proximal Policy Optimization (PPO)-style importance sampling and clipping objectives as the root cause of this concealment, which suggests that a wide range of RL methods may share this ability to hide contamination. Detection performance, measured by AUROC (area under the ROC curve), declined consistently as the number of RL training steps increased, and some detection methods performed little better than random guessing after only 156 steps. Importantly, this concealment does not mean the model “forgets” the contaminated data: the inflated performance persists, but the evidence of contamination becomes minimal.
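To make the terms “importance sampling” and “clipping” concrete, here is a minimal sketch of the per-token clipped surrogate term used in PPO-style objectives. It is not the paper's exact GRPO objective: the function names and the clipping range `eps = 0.2` are illustrative assumptions, and details such as group-normalized advantages and any KL penalty are left out.

```python
import math


def importance_ratio(logp_new: float, logp_old: float) -> float:
    """Importance-sampling ratio pi_new(token) / pi_old(token), from log-probs."""
    return math.exp(logp_new - logp_old)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    eps = 0.2 is an assumed, commonly used default, not a value from the study.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `eps = 0.2`, a ratio of 1.5 paired with a positive advantage contributes at most 1.2 times that advantage, so the update gains nothing from pushing the token's probability further in that direction. The study attributes the concealment effect to objectives of this general form.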
Contamination in Advanced LRMs (Post-LRM Stage)
The second scenario, “Stage II (post-LRM),” focused on contamination applied to already advanced LRMs as a final SFT step, specifically with Chain-of-Thought (CoT) reasoning data. In this setting, most existing contamination detection methods performed close to random guessing. Even without exposure to non-member samples, contaminated LRMs showed increased confidence when responding to unseen samples drawn from a distribution similar to that of the training set. This observation challenges the core assumption of many current detection techniques, which presume that benchmark contamination is primarily about memorizing specific samples. Instead, LRMs appear to internalize the underlying knowledge and reasoning processes during contamination, enabling them to generalize to distributionally similar questions and thereby evade memorization-based detectors.
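As a rough illustration of why “near random” matters, the sketch below scores samples the way a simple loss-based detector might (lower loss means more likely a training member) and computes AUROC by pairwise comparison. The function names are hypothetical and the loss values are made-up toy numbers, not results from the study.

```python
def loss_based_scores(losses: list[float]) -> list[float]:
    """Membership score = negative loss, so lower loss ranks as 'more member-like'."""
    return [-loss for loss in losses]


def auroc(member_scores: list[float], nonmember_scores: list[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy case 1: members have clearly lower loss, so the detector separates them.
separable = auroc(loss_based_scores([0.4, 0.5, 0.6]),
                  loss_based_scores([0.9, 1.0, 1.2]))

# Toy case 2: unseen samples get the same losses as members, so the detector
# cannot tell them apart and lands at chance level.
indistinguishable = auroc(loss_based_scores([0.5, 0.9]),
                          loss_based_scores([0.5, 0.9]))
```

In the first toy case the AUROC is 1.0; in the second it is 0.5, the chance level. The second case mirrors the situation described above, where a contaminated model is just as confident on distributionally similar unseen questions as on the contaminated ones.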
These findings highlight an urgent need for more advanced contamination detection methods and robust evaluation protocols tailored to LRMs. Relying on log-probabilities, and on the assumption that training samples will consistently incur lower loss than unseen samples, is proving insufficient. The researchers propose two key directions. First, model developers should release more intermediate training checkpoints so that potential benchmark contamination can be monitored and regulated at each training stage. Second, researchers developing contamination detection methods must move beyond memorization-driven approaches and explicitly account for the long Chain-of-Thought reasoning and generalization capabilities inherent in LRMs.
For the full details of the research and its implications, see the paper: On the Fragility of Benchmark Contamination Detection in Reasoning Models.


