
The Remote Labor Index: A New Measure for AI Automation

TLDR: The Remote Labor Index (RLI) is a new benchmark of 240 real-world, economically valuable remote work projects designed to measure AI automation. Sourced from freelance platforms, RLI projects are complex and diverse. Current frontier AI agents achieve a low automation rate of 2.5%, indicating they are far from autonomously performing most remote labor. However, relative performance scores show steady improvement among models. The RLI provides an empirical basis for tracking AI’s impact on the workforce.

A new study introduces the Remote Labor Index (RLI), a groundbreaking benchmark designed to empirically measure how well AI can automate real-world remote work. This index aims to provide a clear, standardized way to track AI’s impact on the workforce, moving beyond theoretical benchmarks to evaluate AI agents on economically valuable projects.

The RLI is unique because it comprises 240 complete projects sourced directly from online freelance platforms. These aren’t simplified tasks; they represent actual work performed by human professionals, complete with original project briefs and gold-standard human deliverables. This approach ensures the benchmark is grounded in real economic transactions and captures the true diversity and complexity of the remote labor market, including areas like game development, product design, architecture, and data analysis.

To create the RLI, researchers engaged with 358 experienced freelancers, collecting 550 initial projects. These projects underwent a rigorous cleaning and filtering process to ensure they were self-contained, reproducible, and met specific criteria, such as not requiring physical labor or client interaction. The final dataset spans 23 categories of work from the Upwork taxonomy and involves a wide variety of file formats, making it far more diverse than previous AI benchmarks.

The projects within the RLI are also significantly more complex than typical benchmark tasks: human professionals spent an average of 28.9 hours (median 11.5 hours) to complete them. Projects cost an average of $632.60 (median $200), and the full dataset represents more than 6,000 hours of work valued at over $140,000. This demonstrates the substantial economic value and difficulty captured by the RLI.

Researchers evaluated several leading AI agents, including ChatGPT agent, GPT-5, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, and Manus, on the RLI. The results revealed that current AI agents perform near the floor, with the highest-performing agent achieving an automation rate of only 2.5%. This means that less than 3% of the projects were completed by AI at a quality level comparable to or exceeding human work, indicating a significant gap between current AI capabilities and the demands of real-world remote labor.
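The headline metric is straightforward: the fraction of projects on which an agent's deliverable is judged at or above the quality of the human deliverable. The paper does not publish its evaluation code here, so the function and names below are an illustrative sketch of how such a rate could be computed, not the authors' implementation.

```python
def automation_rate(judgments):
    """Fraction of projects where the AI deliverable was judged
    at or above the quality of the gold-standard human deliverable.

    judgments: list of booleans, one per project (hypothetical input
    format assumed for illustration).
    """
    return sum(judgments) / len(judgments)

# At a 2.5% automation rate on 240 projects, only 6 deliverables
# would meet the human-quality bar.
rate = automation_rate([True] * 6 + [False] * 234)
```

At this scale, a single additional acceptable deliverable moves the rate by roughly 0.4 percentage points, which is why the paper pairs this floor-level metric with a relative score.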

Despite the low absolute automation rates, the study also used an Elo-based scoring system to measure the relative performance of different AI agents. This metric showed that models are steadily improving, with newer frontier models generally achieving higher scores than older ones. This suggests that while full automation is still distant, AI capabilities are progressing, and the RLI is sensitive enough to track these granular shifts.
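Elo-style ratings are derived from head-to-head comparisons: two agents' deliverables for the same project are compared, and the winner's rating rises while the loser's falls. The study does not specify its exact rating parameters, so the update rule below is a standard Elo sketch with an assumed K-factor, shown only to illustrate the mechanism.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One pairwise Elo update after comparing two agents' deliverables
    on the same project. 'winner' is "a", "b", or "tie".
    k=32 is a conventional default, not a value from the paper.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner == "a" else 0.0 if winner == "b" else 0.5
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new
```

Because ratings shift even when neither deliverable clears the human-quality bar, this kind of relative score can register steady model-to-model improvement while the absolute automation rate stays near zero.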

Qualitative analysis of AI failures highlighted common issues such as technical and file integrity problems (corrupt or empty files), incomplete or malformed deliverables, poor overall quality, and inconsistencies across generated files. Successful AI deliverables, though few, were predominantly in creative projects like audio and image generation, as well as writing and data retrieval tasks, where current AI strengths are more pronounced.


The RLI provides an essential empirical foundation for understanding AI automation. It moves beyond specialized skill evaluations to assess end-to-end project completion in economically valuable contexts. This benchmark will be crucial for researchers, policymakers, and the public to monitor AI’s evolving capabilities and proactively address its potential impacts on the future of work. You can find more details about this research in the full paper: Remote Labor Index: Measuring AI Automation of Remote Work.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
