TLDR: This research introduces Token Probability Deviation (TBD), a novel method for detecting if a question was used in the distillation training of reasoning models. Addressing the challenge of partial data availability, TBD analyzes the probability patterns of generated tokens, noting that models produce more predictable tokens for seen questions and lower-probability tokens for unseen ones. The method quantifies this deviation to distinguish between member and non-member data. Experiments show TBD significantly outperforms existing baselines across diverse models and datasets, enhancing transparency and fairness in AI evaluation.
Rapid advances in Large Language Models (LLMs) have produced impressive capabilities, particularly in complex reasoning tasks such as mathematics and coding. However, these models carry a significant computational cost, which makes them hard to deploy in resource-constrained environments. Reasoning distillation has emerged as a response: it transfers the reasoning abilities of large models to smaller, more efficient language models (SLMs).
While reasoning distillation is a powerful paradigm, it introduces a critical concern: benchmark contamination. This occurs when evaluation data is inadvertently included in the distillation datasets used for training, potentially inflating the performance metrics of the distilled models and giving a misleading impression of their true capabilities. This issue highlights a pressing need for methods to detect such contaminated data.
A new research paper, titled “Detecting Distillation Data from Reasoning Models,” addresses this challenge by formally defining the task of distillation data detection. This task is uniquely difficult because, during detection, only the question is available, without access to the corresponding reasoning steps or answers that were part of the original distillation data. Traditional methods for detecting training data often rely on having the complete input-output pairs, which is not feasible in this scenario due to the non-deterministic nature of model generation and the proprietary status of many datasets.
The researchers propose a novel and effective method called Token Probability Deviation (TBD). This method is inspired by a key observation: distilled models tend to generate highly predictable, or “near-deterministic,” tokens when responding to questions they have encountered during their distillation training. In contrast, for questions they haven’t seen, they produce a greater number of lower-probability tokens, indicating less certainty in their generation process.
TBD quantifies this difference by measuring how much the probabilities of the generated output tokens deviate from a high reference probability. The method therefore assigns lower scores to questions that were part of the distillation data (seen questions) and higher scores to questions that were not (unseen questions), so a threshold on the score can separate members from non-members.
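To make the scoring idea concrete, here is a minimal sketch of a deviation score. It is not the paper's exact formula, which this article does not reproduce. It assumes the score is the average shortfall of each generated token's probability below a reference value, that alpha acts as an exponent on that shortfall, and that truncation simply keeps the first few generated tokens. The names `deviation_score`, `ref_prob`, and `max_tokens` are hypothetical, and the probability lists are made-up illustrations rather than real model outputs.

```python
from typing import Optional, Sequence


def deviation_score(
    token_probs: Sequence[float],
    ref_prob: float = 1.0,              # assumed reference probability
    alpha: float = 1.0,                 # assumed to act as an exponent
    max_tokens: Optional[int] = None,   # assumed truncation: first N tokens
) -> float:
    """Average shortfall of token probabilities below ref_prob (illustrative)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    probs = list(token_probs[:max_tokens])
    if not probs:
        raise ValueError("need at least one token probability")
    # Tokens at or above the reference contribute nothing to the score.
    gaps = [max(0.0, ref_prob - p) ** alpha for p in probs]
    return sum(gaps) / len(probs)


# Made-up probabilities of the tokens a model generated for two questions.
seen = [0.99, 1.00, 0.98, 0.97]      # near-deterministic generation
unseen = [0.90, 0.40, 0.95, 0.30]    # several low-probability tokens

print(deviation_score(seen), deviation_score(unseen))
```

Under these assumptions the near-deterministic sequence gets the lower score, which matches the direction described above: low scores suggest a seen question, high scores an unseen one.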
Extensive experiments validated TBD’s effectiveness across various models and datasets. The results show that TBD significantly outperforms existing baseline methods in detecting distillation data. For instance, on the S1 dataset, when applied to a distilled model fine-tuned from Qwen2.5-32B-Instruct, TBD achieved an AUC (Area Under the Receiver Operating Characteristic curve) of 0.918 and a TPR@1% FPR (True Positive Rate at 1% False Positive Rate) of 0.470. In other words, it correctly flagged 47.0% of member questions while misclassifying only 1% of non-member questions, which indicates strong detection performance even when false positives are tightly constrained.
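The two metrics can be computed from any set of detection scores. The sketch below is a plain-Python illustration, not the paper's evaluation code. It assumes that lower scores indicate membership, as with TBD, and the function names and example scores are hypothetical.

```python
from typing import Sequence


def auc_lower_is_member(
    member_scores: Sequence[float], non_member_scores: Sequence[float]
) -> float:
    """Probability that a random member scores below a random non-member."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5  # ties count as half
    return wins / (len(member_scores) * len(non_member_scores))


def tpr_at_fpr(
    member_scores: Sequence[float],
    non_member_scores: Sequence[float],
    max_fpr: float = 0.01,
) -> float:
    """Fraction of members flagged while at most max_fpr of non-members are."""
    ranked = sorted(non_member_scores)
    allowed = int(max_fpr * len(ranked))  # tolerated false positives
    if allowed >= len(ranked):
        return 1.0  # every question may be flagged
    # Flag a question as a member only if it scores strictly below this value.
    threshold = ranked[allowed]
    flagged = sum(1 for m in member_scores if m < threshold)
    return flagged / len(member_scores)


# Made-up scores for illustration only.
members = [0.01, 0.02, 0.03, 0.50]
non_members = [0.20, 0.30, 0.40, 0.60]

print(auc_lower_is_member(members, non_members))
print(tpr_at_fpr(members, non_members, max_fpr=0.01))
```

The pairwise AUC loop is quadratic in the number of questions, which is fine for a small illustration; a library routine would be the usual choice at scale.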
The study also explored the robustness of TBD, showing its consistent performance across different model sizes (from 7B to 32B parameters) and various distillation datasets. The method’s performance was also found to be stable with varying distillation data sizes and truncation lengths (the number of generated tokens considered for scoring). Furthermore, a tunable parameter within TBD, denoted as alpha, allows for flexible adjustment to prioritize different evaluation metrics, making it practical for real-world applications.
In conclusion, the Token Probability Deviation method offers a practical and effective solution for identifying distillation data in reasoning models. By leveraging the unique token generation patterns of distilled models, it addresses the critical challenge of partial data availability, contributing to greater transparency and fairness in the evaluation of advanced AI systems.


