TL;DR: SCUBA is a new benchmark for evaluating AI computer-use agents on realistic customer relationship management (CRM) workflows within the Salesforce platform. It features 300 tasks derived from interviews with real users across administrator, sales, and service personas, testing abilities such as UI navigation, data manipulation, and troubleshooting. Initial evaluations show wide performance gaps between agent designs and models, with closed-source models outperforming open-source ones. Augmenting agents with human demonstrations raises success rates while cutting time and cost, highlighting both the challenges and the promise of AI in complex enterprise software.
Salesforce AI Research has unveiled a groundbreaking new benchmark called SCUBA, short for Salesforce Computer Use Benchmark. This innovative tool is designed to rigorously evaluate how well AI agents can automate complex customer relationship management (CRM) workflows within the Salesforce platform. It addresses a crucial gap in existing benchmarks, which often don’t fully capture the intricacies and challenges of real-world enterprise software environments.
SCUBA is built on a foundation of realism, featuring 300 task instances that were directly derived from interviews with actual Salesforce users. These tasks span three key personas: platform administrators, sales representatives, and service agents, ensuring a comprehensive evaluation across critical business functions. The benchmark tests a wide array of enterprise-essential abilities, including navigating complex user interfaces, manipulating data, automating workflows, retrieving information, and troubleshooting common issues.
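To make the persona and ability dimensions concrete, here is a minimal sketch of how one such task instance might be represented. The field names and the sample task are illustrative assumptions for this article, not SCUBA's published task format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: field names and values are assumptions,
# not SCUBA's actual task schema.
@dataclass
class ScubaTask:
    task_id: str
    persona: str            # "administrator", "sales", or "service"
    abilities: List[str]    # e.g. ["ui_navigation", "data_manipulation"]
    instruction: str        # natural-language goal issued to the agent

example_task = ScubaTask(
    task_id="admin-042",
    persona="administrator",
    abilities=["workflow_automation", "troubleshooting"],
    instruction="Fix the broken approval process on the Opportunity object.",
)
```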
To ensure authenticity, SCUBA operates within live Salesforce sandbox environments. It supports parallel execution, allowing for efficient testing of numerous agents simultaneously. Furthermore, it provides fine-grained evaluation metrics that go beyond simple success rates, capturing milestone progress, latency (time taken), and operational costs. This multi-dimensional approach offers a clearer picture of an agent’s real-world viability.
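A rough sketch of what such a fine-grained result record could look like is shown below. The fields mirror the metrics described above (success, milestone progress, latency, cost), but the exact names and units are assumptions, not SCUBA's reporting format.

```python
from dataclasses import dataclass

# Illustrative result record; field names and units are assumptions.
@dataclass
class TaskResult:
    task_id: str
    success: bool
    milestones_hit: int      # milestones the agent completed
    milestones_total: int    # milestones defined for the task
    latency_s: float         # wall-clock time for the run, in seconds
    cost_usd: float          # model/API spend attributed to the run

    @property
    def milestone_progress(self) -> float:
        """Fraction of milestones reached, giving partial credit on failed runs."""
        return self.milestones_hit / self.milestones_total if self.milestones_total else 0.0

result = TaskResult("admin-042", success=False, milestones_hit=2,
                    milestones_total=3, latency_s=412.0, cost_usd=0.87)
print(f"progress={result.milestone_progress:.0%}, latency={result.latency_s}s")
```

Tracking milestone progress alongside latency and cost is what lets a benchmark like this distinguish an agent that nearly finishes a task cheaply from one that fails immediately, even when both count as unsuccessful.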
The research paper details experiments with a diverse set of AI agents, tested in both ‘zero-shot’ (without prior examples) and ‘demonstration-augmented’ settings. The findings reveal significant performance disparities. In the zero-shot scenario, open-source models, even those strong in related benchmarks like OSWorld, achieved less than a 5% success rate on SCUBA tasks. In contrast, methods built on closed-source models managed up to a 39% task success rate.
A key discovery was the profound impact of human demonstrations. When agents were augmented with these demonstrations, task success rates improved significantly, reaching up to 50%. This improvement also came with tangible benefits in efficiency, reducing task completion time by 13% and costs by 16%. This highlights the potential for AI agents to learn from human expertise to navigate complex business software more effectively.
The study also compared ‘browser-use’ agents, which observe the web page’s DOM text alongside Set-of-Mark annotated screenshots, against ‘computer-use’ agents, which rely primarily on screenshots of the entire desktop. Browser-use agents generally achieved higher success rates, attributed to the stronger planning capabilities of their underlying foundation models and their richer observation space, while computer-use agents often ran with lower latency. The research suggests that a browser-use agent powered by Gemini-2.5-Pro, especially with demonstrations, offers a strong balance between success rates, latency, and costs.
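The contrast between the two observation styles can be sketched as two simple record types. The field names below are assumptions made for illustration and do not reflect any specific agent framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: field names are assumptions, not a real framework's API.
@dataclass
class BrowserObservation:
    dom_text: str              # serialized DOM / accessibility-tree text
    som_screenshot: bytes      # screenshot annotated with Set-of-Mark element labels

@dataclass
class ComputerObservation:
    desktop_screenshot: bytes                 # raw pixels of the full desktop
    cursor_xy: Optional[Tuple[int, int]] = None  # cursor position, if the harness exposes it
```

The richer browser-style observation hands the planner explicit element identities to reason over, which is consistent with the higher success rates reported, whereas the leaner desktop-screenshot observation is closer to how a human sees the screen and keeps each step lighter, in line with the lower latency observed for computer-use agents.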
Despite these advances, the paper points out persistent challenges for computer-use agents, particularly in generalization, planning, and ‘grounding’ (accurately identifying and interacting with specific UI elements). Improving these areas is crucial for broader adoption. SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems. You can read the full research paper for more details here.