TLDR: Scale AI and the Center for AI Safety (CAIS) have introduced the Remote Labor Index (RLI), a new benchmark evaluating AI agents’ ability to complete real-world freelance projects. The initial findings show a low automation rate of just 2.5% across diverse tasks, suggesting AI’s current role is more about augmentation than widespread job replacement, though steady progress is noted.
Scale AI, a leader in data for artificial intelligence, has unveiled the Remote Labor Index (RLI) in collaboration with the Center for AI Safety (CAIS). The benchmark is designed to measure empirically how well AI agents perform real-world, economically valuable remote work. Introduced to bridge the gap between AI’s performance on isolated research benchmarks and its actual impact on labor automation, the index evaluates AI agents across a diverse range of freelance projects.
The initial findings from the RLI indicate that current state-of-the-art AI agents achieve a maximum automation rate of only 2.5% on these complex, end-to-end projects. This low success rate suggests that contemporary AI systems are not yet capable of autonomously completing the vast majority of professional tasks to a client-ready standard. As stated in the research, ‘The fear of imminent, widespread automation is not supported by the data; the 97.5% failure rate shows that AI is not yet capable of autonomously performing complex, professional work.’
The RLI dataset comprises 240 real-world projects spanning 23 domains, including game development, product design, architecture, data analysis, and video animation. These projects were sourced from 358 verified freelancers on the Upwork platform, representing over 6,000 hours of human work valued at a combined total of $143,991. Each project includes a clear brief, input files, a human-produced deliverable, and economic data on completion time and cost.
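To put these totals in per-project terms, the short Python sketch below derives a few averages from the figures quoted above. It is only a back-of-the-envelope illustration: the constant and function names are made up for this example, "over 6,000 hours" is treated as a lower bound, and the count of automated projects assumes the 2.5% rate applies to all 240 projects, which the article does not state explicitly.

```python
# Figures quoted in the article. "Over 6,000 hours" is treated as a lower bound,
# so the derived hours are a minimum and the hourly value is a maximum.
NUM_PROJECTS = 240
TOTAL_HOURS_LOWER_BOUND = 6_000
TOTAL_VALUE_USD = 143_991
BEST_AUTOMATION_RATE = 0.025  # 2.5%, the best agent's reported rate


def summarize():
    """Return simple ratios derived from the article's totals."""
    return {
        "avg_cost_usd": TOTAL_VALUE_USD / NUM_PROJECTS,
        "avg_hours_min": TOTAL_HOURS_LOWER_BOUND / NUM_PROJECTS,
        "implied_hourly_usd_max": TOTAL_VALUE_USD / TOTAL_HOURS_LOWER_BOUND,
        # Assumption: the rate is measured over all 240 projects.
        "projects_automated": BEST_AUTOMATION_RATE * NUM_PROJECTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the average project is worth roughly $600 and took at least 25 hours of human work, and a 2.5% automation rate corresponds to about 6 of the 240 projects.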
Despite the low absolute automation rate, the RLI also reveals a ‘steady relative improvement’ in AI capabilities. Elo scores, which rank agents relative to one another, show that newer frontier models consistently place higher than older ones. This indicates that while full project automation is still distant, measurable progress is being made in AI’s ability to tackle complex tasks. The 2.5% success rate, though small, is still meaningful. According to the research, it shows that ‘AI is already at a professional level for some generative tasks (creating images, audio, or code from scratch).’
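For readers unfamiliar with Elo scores, the sketch below shows the standard Elo update as used in chess-style rating systems. The article does not describe how the RLI computes its Elo scores, so this illustrates the general idea rather than the benchmark's actual procedure. The function names, the K-factor of 32, and the example ratings are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (A, B) ratings after one comparison.

    score_a is 1.0 if A's output is preferred, 0.0 if B's is, and 0.5 for a tie.
    k (the K-factor) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the rating total is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


if __name__ == "__main__":
    # Hypothetical ratings: a newer agent (1000) beats an older one (1000) once.
    newer, older = update_ratings(1000.0, 1000.0, score_a=1.0)
    print(newer, older)
```

Because each update depends only on head-to-head outcomes, a rating of this kind can register relative improvement between agents even when nearly all of them fail a project outright, which is consistent with the pattern of low absolute success and steady relative gains described above.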
The developers emphasize that the RLI aims to ground discussions about AI automation in empirical evidence, providing a common basis for tracking progress and enabling stakeholders to proactively navigate the impacts of AI-driven labor automation. The benchmark highlights a critical gap between AI’s skill on isolated tasks and the end-to-end reliability required for real-world client briefs, suggesting that the immediate impact of AI is likely to be augmentation rather than mass replacement.
The RLI has limitations. It relies on rigorous manual evaluation, which is time-consuming and expensive, and its projects do not cover the entire digital economy. There is also a risk of benchmark contamination if future models inadvertently train on the publicly released projects. Even so, the RLI provides a valuable tool for guiding and measuring the next phase of AI development, which focuses on building agents that can move from simple prompts to complex project execution.


