TL;DR: SCUBA is a new benchmark for evaluating AI computer-use agents on realistic customer relationship management (CRM) workflows within the Salesforce platform. It features 300 tasks derived from interviews with real users across administrator, sales, and service personas, testing abilities such as UI navigation, data manipulation, and troubleshooting. Initial evaluations show wide performance gaps between agent designs and models, with closed-source models outperforming open-source ones. Augmenting agents with human demonstrations raises success rates while cutting time and cost, highlighting both the challenges and the promise of AI in complex enterprise software.
Salesforce AI Research has unveiled a groundbreaking new benchmark called SCUBA, short for Salesforce Computer Use Benchmark. This innovative tool is designed to rigorously evaluate how well AI agents can automate complex customer relationship management (CRM) workflows within the Salesforce platform. It addresses a crucial gap in existing benchmarks, which often don’t fully capture the intricacies and challenges of real-world enterprise software environments.
SCUBA is built on a foundation of realism, featuring 300 task instances that were directly derived from interviews with actual Salesforce users. These tasks span three key personas: platform administrators, sales representatives, and service agents, ensuring a comprehensive evaluation across critical business functions. The benchmark tests a wide array of enterprise-essential abilities, including navigating complex user interfaces, manipulating data, automating workflows, retrieving information, and troubleshooting common issues.
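To make the persona and ability dimensions concrete, here is a minimal sketch of how one such task instance might be represented. The field names and the sample task are illustrative assumptions for this article, not SCUBA's published task format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: field names and values are assumptions,
# not SCUBA's actual task schema.
@dataclass
class ScubaTask:
    task_id: str
    persona: str            # "administrator", "sales", or "service"
    abilities: List[str]    # e.g. ["ui_navigation", "data_manipulation"]
    instruction: str        # natural-language goal issued to the agent

example_task = ScubaTask(
    task_id="admin-042",
    persona="administrator",
    abilities=["workflow_automation", "troubleshooting"],
    instruction="Fix the broken approval process on the Opportunity object.",
)
```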
To ensure authenticity, SCUBA operates within live Salesforce sandbox environments. It supports parallel execution, allowing for efficient testing of numerous agents simultaneously. Furthermore, it provides fine-grained evaluation metrics that go beyond simple success rates, capturing milestone progress, latency (time taken), and operational costs. This multi-dimensional approach offers a clearer picture of an agent’s real-world viability.
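A rough sketch of what such a fine-grained result record could look like is shown below. The fields mirror the metrics described above (success, milestone progress, latency, cost), but the exact names and units are assumptions, not SCUBA's reporting format.

```python
from dataclasses import dataclass

# Illustrative result record; field names and units are assumptions.
@dataclass
class TaskResult:
    task_id: str
    success: bool
    milestones_hit: int      # milestones the agent completed
    milestones_total: int    # milestones defined for the task
    latency_s: float         # wall-clock time for the run, in seconds
    cost_usd: float          # model/API spend attributed to the run

    @property
    def milestone_progress(self) -> float:
        """Fraction of milestones reached, giving partial credit on failed runs."""
        return self.milestones_hit / self.milestones_total if self.milestones_total else 0.0

result = TaskResult("admin-042", success=False, milestones_hit=2,
                    milestones_total=3, latency_s=412.0, cost_usd=0.87)
print(f"progress={result.milestone_progress:.0%}, latency={result.latency_s}s")
```

Tracking milestone progress alongside latency and cost is what lets a benchmark like this distinguish an agent that nearly finishes a task cheaply from one that fails immediately, even when both count as unsuccessful.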
The research paper details experiments with a diverse set of AI agents, tested in both ‘zero-shot’ (without prior examples) and ‘demonstration-augmented’ settings. The findings reveal significant performance disparities. In the zero-shot scenario, open-source models, even those strong in related benchmarks like OSWorld, achieved less than a 5% success rate on SCUBA tasks. In contrast, methods built on closed-source models managed up to a 39% task success rate.
A key discovery was the profound impact of human demonstrations. When agents were augmented with these demonstrations, task success rates improved significantly, reaching up to 50%. This improvement also came with tangible benefits in efficiency, reducing task completion time by 13% and costs by 16%. This highlights the potential for AI agents to learn from human expertise to navigate complex business software more effectively.
The study also compared ‘browser-use’ agents, which observe the web page’s DOM text alongside Set-of-Mark annotated screenshots, against ‘computer-use’ agents, which rely primarily on screenshots of the entire desktop. Browser-use agents generally achieved higher success rates, attributed to the stronger planning capabilities of their underlying foundation models and their richer observation space, while computer-use agents often ran with lower latency. The research suggests that a browser-use agent powered by Gemini-2.5-Pro, especially with demonstrations, offers a strong balance between success rates, latency, and costs.
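The contrast between the two observation styles can be sketched as two simple record types. The field names below are assumptions made for illustration and do not reflect any specific agent framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: field names are assumptions, not a real framework's API.
@dataclass
class BrowserObservation:
    dom_text: str              # serialized DOM / accessibility-tree text
    som_screenshot: bytes      # screenshot annotated with Set-of-Mark element labels

@dataclass
class ComputerObservation:
    desktop_screenshot: bytes                 # raw pixels of the full desktop
    cursor_xy: Optional[Tuple[int, int]] = None  # cursor position, if the harness exposes it
```

The richer browser-style observation hands the planner explicit element identities to reason over, which is consistent with the higher success rates reported, whereas the leaner desktop-screenshot observation is closer to how a human sees the screen and keeps each step lighter, in line with the lower latency observed for computer-use agents.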
Despite these advances, the paper points out persistent challenges for computer-use agents, particularly in generalization, planning, and ‘grounding’ (accurately identifying and interacting with specific UI elements). Improving these areas is crucial for broader adoption. SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems. You can read the full research paper for more details here.