TL;DR: An AI system combining Large Language Models and Tree Search automates the creation of expert-level scientific software for “scorable tasks.” It has achieved superhuman performance in diverse fields like bioinformatics, epidemiology, geospatial analysis, neuroscience, time series forecasting, and numerical analysis, significantly accelerating scientific discovery by rapidly generating and optimizing solutions.
Scientific discovery, a cornerstone of human progress, often faces a significant bottleneck: the slow and labor-intensive process of creating specialized software for computational experiments. To overcome this challenge, researchers from Google DeepMind and Google Research have developed a groundbreaking AI system.
This innovative system is designed to automatically generate and refine expert-level scientific software. Its core methodology involves a powerful combination of a Large Language Model (LLM) and Tree Search (TS). The LLM acts as a creative engine, proposing and rewriting software solutions, while the Tree Search systematically explores a vast landscape of possibilities, intelligently navigating towards solutions that maximize a predefined quality metric. This approach allows the system to continuously improve the software it creates, often by integrating complex research ideas from various external sources.
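The LLM-plus-Tree-Search loop described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the `llm_rewrite` function stands in for the LLM's rewriting step (here it merely perturbs a parameter vector), and the metric is a toy objective. The structure, however, mirrors the idea: a best-first search that repeatedly expands promising candidates with LLM-proposed rewrites and keeps whichever candidate maximizes the quality metric.

```python
import heapq
import random

def llm_rewrite(solution):
    # Stand-in for the LLM "creative engine": the real system rewrites
    # candidate *programs*; here we just perturb a parameter vector.
    return [x + random.gauss(0, 0.1) for x in solution]

def score(solution):
    # Predefined quality metric the search maximizes
    # (toy objective: negative squared distance to a target).
    target = [1.0, 2.0, 3.0]
    return -sum((a - b) ** 2 for a, b in zip(solution, target))

def tree_search(root, num_expansions=200, children_per_node=3):
    # Best-first tree search: repeatedly pop the highest-scoring node,
    # ask the "LLM" for rewritten variants, and track the best found.
    frontier = [(-score(root), root)]  # max-heap via negated scores
    best, best_score = root, score(root)
    for _ in range(num_expansions):
        _, node = heapq.heappop(frontier)
        for _ in range(children_per_node):
            child = llm_rewrite(node)
            s = score(child)
            if s > best_score:
                best, best_score = child, s
            heapq.heappush(frontier, (-s, child))
    return best, best_score

random.seed(0)
best, best_score = tree_search([0.0, 0.0, 0.0])
```

Because each expansion branches into several rewrites, the search explores a tree of solution variants rather than a single refinement chain, which is what lets it escape local optima that a lone iterative rewriter would get stuck in.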
The researchers introduce the concept of “empirical software” – software specifically designed to achieve the highest possible score on a measurable quality metric. Tasks that can be solved with such software are termed “scorable tasks.” The paper highlights two key hypotheses: first, that scorable tasks are widespread across nearly all scientific, applied mathematics, and engineering fields; and second, that developing empirical software for these tasks is typically a slow and arduous process, often relying on intuition rather than systematic exploration.
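A scorable task can be made concrete with a small sketch. The following is an illustrative example, not taken from the paper: the task pairs held-out data with a machine-computable metric, and any candidate program (“empirical software”) is judged solely by the scalar score it achieves. Here the hypothetical task is a three-step forecast scored by negated mean absolute error, with two toy candidate programs.

```python
def mae_score(predictions, actuals):
    # Negate mean absolute error so that higher scores are better.
    return -sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(actuals)

def evaluate(candidate_software, history, actuals):
    # The search system only needs this scalar to rank candidate programs.
    return mae_score(candidate_software(history), actuals)

def naive(history):
    # Candidate 1: repeat the last observed value.
    return [history[-1]] * 3

def drift(history):
    # Candidate 2: extrapolate the average trend of the history.
    step = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + (i + 1) * step for i in range(3)]

history = [10.0, 12.0, 14.0, 16.0]
actuals = [18.0, 20.0, 22.0]
# drift extrapolates this linear series exactly, so it outscores naive.
```

The point of the abstraction is that nothing about `evaluate` cares how a candidate works internally, so an automated search can compare radically different programs on equal footing.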
The effectiveness of this AI system has been demonstrated across a wide array of scientific benchmarks, achieving results that often surpass human-developed methods. For instance, in the field of bioinformatics, the system discovered 40 new methods for analyzing single-cell data, outperforming the best human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that proved more accurate than the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations.
Beyond these, the system has also produced state-of-the-art software for complex tasks such as geospatial analysis, predicting neural activity in zebrafish brains, time series forecasting, and numerically solving difficult integrals. Its success stems from its ability to tirelessly and exhaustively search for high-quality solutions at an unprecedented scale, often identifying “needle-in-the-haystack” solutions that humans might miss.
A crucial aspect of the system’s performance is its capacity to incorporate and recombine research ideas. This includes drawing insights from highly cited papers, specialized textbooks, and even automatically generated ideas from other LLM-driven search strategies like Gemini Deep Research and AI co-scientist. By synthesizing the strengths of existing approaches and generating novel hybrid strategies, the AI system consistently achieves superior performance.
The implications of this technology are profound. By dramatically accelerating the creation of scientific software – reducing development time from weeks or months to mere hours or days – the system represents a significant leap towards accelerating scientific progress across various disciplines. The authors believe that fields where solutions can be objectively scored by machines are on the verge of a revolutionary acceleration in discovery. You can read the full research paper here: An AI system to help scientists write expert-level empirical software.