TLDR: A recent workshop convened by the Chan Zuckerberg Initiative (CZI) addressed the critical need for standardized benchmarking of AI models in biology, particularly for ‘Virtual Cells.’ The paper highlights key challenges including data heterogeneity, reproducibility issues, the difficulty of defining biologically relevant evaluation metrics, a fragmented research ecosystem, and inherent biases in data. To overcome these, it recommends investing in high-quality data, developing standardized tools, adopting multi-faceted evaluation approaches, creating centralized platforms for resource sharing, fostering interdisciplinary community engagement, and ensuring benchmarks are continuously updated to remain relevant. These efforts are crucial for building trustworthy and impactful AI models that can advance biological discovery and therapeutic development.
Artificial intelligence (AI) holds immense potential to revolutionize our understanding of biology, from simulating cellular functions to accelerating drug discovery. However, unlocking this potential requires a crucial element: robust and standardized ways to evaluate these AI models. Without proper benchmarks, it’s difficult to compare different models, ensure their reliability, and truly understand their impact on complex biological problems.
A recent workshop, convened by the Chan Zuckerberg Initiative (CZI), brought together leading experts in machine learning and computational biology. These specialists, spanning fields like imaging, transcriptomics, proteomics, and genomics, gathered to address the significant gaps in how AI models are currently benchmarked in biology. Their discussions focused on the ambitious goal of building ‘Virtual Cells’ – computational representations that integrate vast amounts of multi-scale, multi-modal biological data to simulate cellular behavior.
Key Challenges Identified
The workshop highlighted several major hurdles preventing the widespread adoption and trustworthiness of AI in biology:
- Data Issues: Unlike AI fields that can draw on massive, clean datasets, biology often works with data that is smaller, highly varied, imbalanced, and noisy. This is due to high data generation costs, complex acquisition, and curation challenges. Data sparsity, ‘batch effects’ (variations from different experiments or instruments), and a lack of standardized formats make data sharing and comparison difficult. Privacy concerns, especially with personal genomic information, add another layer of complexity. A critical problem is ‘data leakage,’ where training data inadvertently contaminates evaluation data, leading to overly optimistic performance estimates.
- Reproducibility: The complexity of AI code, often involving diverse tools and implementations, makes it hard to debug and maintain. Poor documentation and a lack of incentives for researchers to prioritize reproducibility further compound this issue, hindering the validation and replication of results.
- Biological Relevance of Metrics: Many machine learning tasks have clear, quantifiable outcomes. However, biological hypotheses are often multifaceted. Relying on single performance metrics can distort model development and misdirect research efforts. Translating complex biological questions into clear, quantifiable metrics that accurately reflect real-world biological context remains a significant challenge.
- Fragmented Ecosystem: Data, leaderboards, models, publications, and resources are scattered across numerous platforms. This fragmentation makes it difficult for researchers to discover relevant benchmarks or compare results effectively, slowing down collaborative progress.
- Biases: Biological datasets can suffer from various biases, including technical, measurement, and systematic biases from study design or recruitment. For instance, large biomedical data repositories often show genetic-ancestry and gender biases. There’s also a tendency for research to focus on well-characterized genes, neglecting other important areas, and a publication bias towards positive findings that skews the scientific literature.
- Benchmark Maintenance: Both biological technologies and AI models evolve rapidly. This necessitates a dynamic approach to benchmarking, where evaluation metrics and datasets are continuously updated to remain relevant.
- Community Development: Developing and adopting standardized benchmarks requires broad collaboration across academia, industry, and non-profits. Differing incentives and a lack of shared language across these diverse domains can hinder collective efforts.
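The data leakage problem described under Data Issues can be made concrete with a small sketch. It assumes each sample (for example, a cell) carries a group label such as a donor or batch ID, and that samples from the same group are correlated enough that they should never sit on both sides of a train/evaluation split. The function names `group_split` and `leaked_groups` and the donor IDs are hypothetical, chosen for illustration; this is not the workshop's method.

```python
import random
from collections import defaultdict


def group_split(sample_groups, test_fraction=0.2, seed=0):
    """Split sample IDs so that no group (e.g. donor, batch) spans both sets.

    sample_groups: dict mapping sample ID -> group ID (hypothetical layout).
    """
    groups = defaultdict(list)
    for sample_id, group_id in sample_groups.items():
        groups[group_id].append(sample_id)

    # Shuffle whole groups, not individual samples, with a fixed seed
    # so the split can be reproduced.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = max(1, round(len(group_ids) * test_fraction))
    test_groups = set(group_ids[:n_test])

    train = [s for g in group_ids if g not in test_groups for s in groups[g]]
    test = [s for g in group_ids if g in test_groups for s in groups[g]]
    return train, test


def leaked_groups(train, test, sample_groups):
    """Return the set of groups that appear in both train and test."""
    train_groups = {sample_groups[s] for s in train}
    test_groups = {sample_groups[s] for s in test}
    return train_groups & test_groups


# Toy data: 10 donors, 3 cells each (illustrative only).
sample_groups = {f"d{d}_c{c}": f"d{d}" for d in range(10) for c in range(3)}
train_ids, test_ids = group_split(sample_groups)
print("groups shared between train and test:",
      leaked_groups(train_ids, test_ids, sample_groups))
```

Splitting by group rather than by sample is one common way to keep evaluation data independent. A naive per-sample split of the same toy data can place cells from one donor in both sets, which `leaked_groups` would flag.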
Recommendations for Progress
To overcome these challenges and accelerate the development of robust AI models for biology, the workshop participants proposed several key recommendations:
- Invest in High-Quality Data: Prioritize creating well-annotated datasets specifically designed for benchmarking. These datasets should align with existing biological ontologies to ensure interoperability and be released with clear guidelines to prevent data leakage.
- Develop Standardized Tooling: Focus on improving code management through modular design, standardized documentation, and incentives for reproducible workflows. This includes using containerization for technical replicability, effective data splitting for statistical replicability, and comprehensive documentation for conceptual replicability.
- Utilize a Multi-Faceted Evaluation Approach: Incorporate diverse metrics that capture various aspects of model performance, such as accuracy, robustness, and generalizability. These metrics should be chosen in close collaboration with domain experts (e.g., cell biologists, physicians, machine learning researchers) and stakeholders to ensure biological relevance.
- Create a Centralized Platform: Establish a hub for sharing benchmarks, datasets, models, and evaluation tools. This platform would foster collaboration, facilitate knowledge transfer, and accelerate advancements by providing easy access to community-recommended resources.
- Engage an Interdisciplinary Community: Define the community broadly and involve experts from computational modeling, software engineering, data science, biology, and clinical practice. Regular forums for discussion and transparent reporting of progress, similar to the successful CASP challenge in protein structure prediction, are crucial.
- Ensure Benchmarks Remain Relevant: Implement a flexible approach to benchmarking, with clear mechanisms for updating or deprecating benchmarks that no longer reflect the field’s advancements.
- Foster Cross-Sector Collaboration: Promote shared resources and best practices through collaborative discussions, including distributed and asynchronous options to be inclusive of the international community.
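To illustrate the multi-faceted evaluation recommendation above, here is a minimal sketch that reports several metrics side by side instead of a single score. It assumes a classification task (for example, cell-type labels) with a batch label per sample. The three metrics and the function names are illustrative choices, not ones prescribed by the workshop; in practice they would be selected together with domain experts.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; less dominated by the majority class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def worst_group_accuracy(y_true, y_pred, groups):
    """Lowest accuracy over groups (e.g. batches): a simple robustness proxy."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    return min(accuracy(ts, ps) for ts, ps in by_group.values())


def evaluation_report(y_true, y_pred, groups):
    """Collect the metrics into one report rather than a single number."""
    return {
        "accuracy": accuracy(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy(y_true, y_pred),
        "worst_group_accuracy": worst_group_accuracy(y_true, y_pred, groups),
    }


# Toy example: a model that always predicts the majority class "T".
y_true = ["T"] * 8 + ["B"] * 2
y_pred = ["T"] * 10
batches = ["batch1"] * 5 + ["batch2"] * 5
print(evaluation_report(y_true, y_pred, batches))
```

In this toy case plain accuracy looks respectable, while balanced accuracy and the per-batch minimum show that the model never identifies the rare class and does worse on one batch. That gap is the kind of distortion a single metric can hide.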
In conclusion, rigorous benchmarking is fundamental for building credible and impactful AI models in biology. The CZI workshop emphasized that achieving the vision of AI-driven Virtual Cells requires a cultural shift: prioritizing rigorous evaluation, investing in high-quality data, adopting standardized tools, using multi-faceted evaluation metrics, and building interoperable, community-driven platforms. This path demands sustained, interdisciplinary collaboration to bridge the expertise of experimentalists, computer scientists, clinicians, and engineers. By embracing these principles, the scientific community can accelerate innovation and ensure that AI models accurately represent the complex biological realities they aim to capture, helping researchers understand, diagnose, and treat human disease. For more details, you can refer to the full research paper: Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop.


