TLDR: A new research paper introduces a label-free method called uncertainty-based filtering to create high-quality synthetic datasets for training large reasoning models (LRMs) in biology. By using a model’s own confidence (measured by self-consistency and predictive perplexity) to filter synthetic reasoning traces, the approach significantly improves LRM performance in tasks like biological perturbation prediction, reducing the reliance on expensive wet-lab data and enabling more efficient AI development in label-scarce domains.
Training advanced artificial intelligence models, especially Large Reasoning Models (LRMs), to understand complex biological processes has always faced a significant hurdle: the scarcity and high cost of ground-truth labels. In fields like biology, obtaining accurate experimental data often requires expensive and time-consuming wet-lab experiments. This bottleneck limits the potential of AI in crucial areas such as drug discovery and disease modeling.
A new research paper, titled “Towards Label-Free Biological Reasoning: Synthetic Dataset Creation via Uncertainty Filtering,” proposes an innovative solution to this challenge. The authors, Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, and Kaspar Märtens from Novo Nordisk and the University of Oxford, introduce a label-free method that allows AI models to essentially ‘self-curate’ their training data.
The Core Problem: Expensive Biological Labels
Imagine trying to predict how a new drug or a gene modification will affect a cell. This task, known as cellular perturbation prediction, is fundamental to understanding diseases and developing new treatments. However, getting the ‘right answer’ (the ground-truth label) for each prediction often means conducting costly and labor-intensive lab experiments. This makes it incredibly difficult to generate the large, high-quality datasets needed to train powerful AI models effectively.
A Smart Solution: Uncertainty-Based Filtering
The researchers’ breakthrough lies in using the AI model’s own confidence as a substitute for external labels. Instead of relying on human-annotated or experimentally derived labels, their method, called uncertainty-based filtering, leverages established uncertainty metrics like self-consistency and predictive perplexity. Here’s how it works:
First, the AI model generates multiple possible reasoning steps, or “chain-of-thought” (CoT) traces, along with a predicted outcome for a given biological scenario. For example, it might predict whether a gene’s expression will go up, down, or remain unchanged after a specific perturbation.
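To make the sampling step concrete, here is a minimal Python sketch. The prompt wording, the `llm_generate` helper, and the sample count are illustrative placeholders standing in for whatever inference setup you use, not the paper's actual implementation:

```python
# Minimal sketch: sample multiple chain-of-thought traces for one perturbation.

def llm_generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical stand-in for a call into an LLM inference backend."""
    raise NotImplementedError

def sample_traces(gene: str, perturbation: str, n_samples: int = 8) -> list[str]:
    """Draw several diverse reasoning traces for the same question."""
    prompt = (
        f"After applying the perturbation '{perturbation}', will the expression "
        f"of gene {gene} go UP, DOWN, or stay UNCHANGED? Think step by step, "
        f"then finish with 'Answer: UP', 'Answer: DOWN', or 'Answer: UNCHANGED'."
    )
    # Temperature > 0 so repeated samples yield genuinely different traces.
    return [llm_generate(prompt, temperature=0.8) for _ in range(n_samples)]
```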
Next, the model evaluates its own confidence in each of these generated traces. It uses a combined metric called CoCoA, which assesses how consistent the different reasoning traces are with each other (self-consistency) and how ‘surprised’ the model is by its own predictions (predictive perplexity). Traces where the model is highly confident (low uncertainty) are considered more reliable.
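A simplified sketch of how these two signals can be computed and fused is below. The exact CoCoA formula in the paper may weight or combine them differently; this toy version simply multiplies agreement by a perplexity-based confidence:

```python
import math

def self_consistency(answers: list[str], answer: str) -> float:
    """Fraction of sampled answers that agree with a given answer."""
    return sum(a == answer for a in answers) / len(answers)

def perplexity(token_logprobs: list[float]) -> float:
    """Predictive perplexity from the per-token log-probabilities of a trace."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def combined_confidence(answers: list[str], answer: str,
                        token_logprobs: list[float]) -> float:
    """Illustrative CoCoA-style score (higher = more confident).

    Fuses cross-trace agreement with the model's own token-level
    confidence (1 / perplexity); the paper's metric combines the same
    two signals, but not necessarily with this exact formula.
    """
    return self_consistency(answers, answer) * (1.0 / perplexity(token_logprobs))
```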
Finally, only the most confident, low-uncertainty traces are selected to form a high-quality synthetic dataset. This filtered dataset is then used to fine-tune the LRM, teaching it to reason more accurately without ever needing a single ground-truth label from a lab experiment.
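Continuing the sketch, the filtering step itself can be as simple as ranking candidates by their confidence score and keeping the top slice. The candidate schema and the `keep_fraction` knob are illustrative assumptions, not values from the paper:

```python
def build_filtered_dataset(candidates: list[dict], keep_fraction: float = 0.25):
    """Keep only the most confident (lowest-uncertainty) traces.

    `candidates` is a list of dicts with keys 'prompt', 'trace', 'answer',
    and 'score' (higher = more confident), as produced by the sketches above.
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    # Format as (input, target) pairs for supervised fine-tuning:
    # the reasoning trace (with its final answer) becomes the training target.
    return [(c["prompt"], c["trace"]) for c in kept]
```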
Key Findings and Impact
The results of this label-free approach are compelling. When applied to biological perturbation prediction, the traces retained by the uncertainty filter were consistently more accurate than those it discarded. LRMs trained on this self-curated data significantly outperformed models trained on unfiltered or randomly sampled synthetic data, and substantially narrowed the performance gap to models trained with expensive ground-truth labels.
The study also highlighted the importance of ‘per-class filtering,’ meaning that uncertainty should be assessed and filtered separately for different types of predictions (e.g., ‘upregulated’ vs. ‘downregulated’ genes). This ensures a balanced and high-quality dataset across all possible outcomes. Furthermore, combining different uncertainty signals, as done with the CoCoA metric, proved more effective than using any single signal alone.
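Per-class filtering is a small change to the earlier sketch: rank and cut within each predicted class rather than over the whole candidate pool. The code below is an illustrative version of the idea, not the paper's implementation:

```python
from collections import defaultdict

def per_class_filter(candidates: list[dict], keep_fraction: float = 0.25):
    """Apply the confidence filter separately within each predicted class.

    Filtering globally can flood the dataset with whichever class the
    model happens to be most confident about; filtering per class keeps
    the label distribution balanced across UP / DOWN / UNCHANGED.
    """
    by_class = defaultdict(list)
    for c in candidates:
        by_class[c["answer"]].append(c)

    kept = []
    for label, group in by_class.items():
        group.sort(key=lambda c: c["score"], reverse=True)
        kept.extend(group[: max(1, int(len(group) * keep_fraction))])
    return kept
```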
This research suggests that AI models can become more autonomous in their learning, capable of identifying and utilizing high-quality training data from their own outputs. This has profound implications for fields where data labeling is a major bottleneck, potentially accelerating discoveries in drug development, personalized medicine, and our fundamental understanding of biological systems. To learn more about this approach, see the full paper, “Towards Label-Free Biological Reasoning: Synthetic Dataset Creation via Uncertainty Filtering.”