TLDR: FML-bench is a new benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental ML problems, moving beyond engineering-focused assessments. It uses real-world codebases and a five-dimensional evaluation framework (Utility, Diversity, Academic Contribution Rate, Cost, Step Success Rate). Key findings indicate that agents with broad research exploration strategies outperform those with narrow, deep exploration, and there’s a positive correlation between idea diversity and performance. The benchmark provides insights for developing more effective AI research agents.
Large language models (LLMs) are increasingly taking on complex tasks, including machine learning research itself. LLM-based agents can now propose new ideas and run experiments autonomously, with the aim of accelerating scientific discovery. Evaluating the scientific capabilities of these agents, however, remains a significant challenge. Existing benchmarks often emphasize the engineering side of machine learning, such as optimizing code or managing data pipelines, over the fundamental research problems that drive innovation.
This is where a new benchmark called FML-bench comes in. Developed by researchers from the National University of Singapore, Tsinghua University, and the University of Minnesota, FML-bench is designed to provide a more comprehensive and academically rigorous evaluation of automatic machine learning research agents. It addresses the limitations of previous benchmarks by focusing on fundamental research problems, offering greater task diversity, and scaling to real-world GitHub repositories.
What is FML-bench?
FML-bench is a benchmark for evaluating how well AI agents can perform machine learning research. It includes 8 diverse and fundamental machine learning research problems, moving beyond simple application-oriented tasks. The benchmark is built on four key principles:
- Fundamental ML Problems: It focuses on core scientific challenges, such as how models generalize to new data or learn from limited examples, rather than just achieving high scores on leaderboards.
- Real-World Codebases: Tasks are based on existing research repositories, mimicking how real scientists adapt and build upon previous code.
- Extensibility: The design allows for easy integration of new machine learning GitHub repositories, making it highly adaptable.
- Low Coding Barrier: Agents start with provided baseline code, allowing them to concentrate on algorithmic and architectural advancements rather than building entire codebases from scratch.
Diverse Research Challenges
The 8 tasks within FML-bench cover a broad spectrum of critical machine learning areas:
- Generalization: How well models perform on unseen data or different environments.
- Data Efficiency: Learning effectively from very few examples.
- Representation Learning: Discovering meaningful features from data.
- Continual Learning: Retaining knowledge over time without forgetting previous learning.
- Causality: Understanding cause-and-effect relationships.
- Robustness and Reliability: Ensuring models are resilient to attacks or corrupted data.
- Privacy: Protecting sensitive information from being leaked.
- Fairness and Bias: Ensuring equitable performance across different groups.
A Unified Evaluation Framework
To assess agents holistically, FML-bench introduces five complementary metrics:
- Utility: Measures the empirical performance improvement of the agent’s proposed solution.
- Diversity: Quantifies the variety of code modifications and hypotheses an agent explores.
- Academic Contribution Rate: Distinguishes between scientifically meaningful changes (e.g., new algorithms) and purely engineering adjustments (e.g., hyperparameter tuning).
- Cost: Accounts for computational resources and time spent.
- Step Success Rate: Reflects the agent’s reliability in producing valid, bug-free results across multiple steps.
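As a rough illustration of how these five metrics could be computed from an agent's run log, here is a minimal Python sketch. The `Step` record, its field names, and every formula below are assumptions made for illustration; FML-bench's actual definitions are given in the paper and may differ. In particular, the token-overlap diversity measure is only a simple stand-in.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Optional


@dataclass
class Step:
    """One agent iteration (hypothetical record layout, not FML-bench's schema)."""
    diff: str                # text describing the code modification
    succeeded: bool          # ran to completion and produced a valid result
    is_academic: bool        # labelled algorithmic/architectural rather than engineering
    score: Optional[float]   # task metric, assumed higher-is-better; None if the step failed
    cost: float              # e.g. API spend or GPU-hours for this step


def step_success_rate(steps: List[Step]) -> float:
    """Fraction of steps that produced a valid result."""
    return sum(s.succeeded for s in steps) / len(steps)


def academic_contribution_rate(steps: List[Step]) -> float:
    """Fraction of steps labelled as scientific rather than engineering changes."""
    return sum(s.is_academic for s in steps) / len(steps)


def utility(steps: List[Step], baseline: float) -> float:
    """Best improvement over the baseline score among successful steps."""
    scores = [s.score for s in steps if s.succeeded and s.score is not None]
    return max(scores) - baseline if scores else 0.0


def total_cost(steps: List[Step]) -> float:
    """Total resources spent across all steps, successful or not."""
    return sum(s.cost for s in steps)


def _jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def diversity(steps: List[Step]) -> float:
    """Mean pairwise Jaccard distance between the token sets of the modifications."""
    token_sets = [set(s.diff.split()) for s in steps]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    return sum(_jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Keeping each metric as a separate function makes it easy to swap in a different definition, for example an embedding-based diversity score, without touching the others.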
Key Discoveries from the Benchmark
The researchers evaluated several state-of-the-art automatic research agents, including TheAIScientist, AIDE, and Claude Code, using different LLMs like Gemini-2.5-Pro and GPT-5. A central finding was that agents employing broad research exploration strategies, like TheAIScientist, consistently outperformed those that focused on narrow but deep exploration. This suggests that generating a wider variety of ideas more reliably leads to successful methods than repeatedly refining a single one. There was a clear positive correlation between the diversity of ideas and the improvement in performance.
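One simple way to check a relationship like the one between idea diversity and performance improvement is a Pearson correlation over per-run measurements. The sketch below is an assumption about how such a check might be done, not the paper's analysis procedure, and the numbers in it are placeholders rather than results from FML-bench.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only; they are NOT results from FML-bench.
diversity_scores = [0.2, 0.4, 0.6, 0.8]    # hypothetical per-run diversity
utility_gains = [0.01, 0.03, 0.02, 0.05]   # hypothetical per-run improvement
r = pearson(diversity_scores, utility_gains)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 would indicate that runs exploring more varied ideas also tended to achieve larger gains. With only a handful of runs, such an estimate is noisy and should be read with caution.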
Interestingly, Gemini-2.5-Pro generally outperformed GPT-5 in this evaluation. The study also noted that general-purpose, command-line interface (CLI) style agents, such as Claude Code, often struggled with multi-step tasks due to premature termination, indicating they might be less suitable for complex, iterative machine learning research compared to specialized agents.
In conclusion, FML-bench provides a robust and practical foundation for evaluating the capabilities of research agents. By highlighting the importance of exploration breadth in the research process, its findings offer guidance for designing more effective, generalizable, and scientifically productive AI research agents. The benchmark's code is open-sourced in the project's GitHub repository, and the full research paper provides further details.


