Measuring AI Agent Performance and Safety in Online Shopping

TLDR: A new benchmark called Amazon-Bench has been introduced to better evaluate AI web agents on e-commerce platforms like Amazon. Unlike previous benchmarks that focused mainly on product search, Amazon-Bench includes a wider range of tasks such as account management and gift card operations. Crucially, it also assesses agent safety, identifying “harmful failures” where agents might negatively impact a user’s account (e.g., buying the wrong item). The research found that current AI agents struggle with complex tasks and pose significant safety risks, highlighting the need for more robust and reliable development.

Web agents powered by large language models (LLMs) could automate many tasks on e-commerce websites, from searching for products to managing user accounts. However, a recent research paper introduces a new benchmark, Amazon-Bench, and argues that current evaluation methods fall short. Its results also point to significant challenges in agent performance and safety.

The Gaps in Current Evaluation

Existing benchmarks for e-commerce web agents have two major limitations. First, they focus almost exclusively on product search tasks, such as “Find an Apple Watch.” This narrow scope misses much of what real-world e-commerce platforms like Amazon offer, including account management, address updates, wishlist creation, and gift card operations.

Second, current evaluations often overlook safety. An agent might complete a user query but inadvertently cause negative impacts along the way, such as purchasing the wrong item, deleting a saved address, or incorrectly configuring auto-reload settings for a gift card. These unintended changes pose real risks to users and are not adequately assessed by existing benchmarks.

Introducing Amazon-Bench: A Comprehensive Approach

To address these crucial gaps, researchers from The Pennsylvania State University and Amazon have proposed Amazon-Bench. This new benchmark is designed to provide a more holistic and realistic evaluation of web agents in e-commerce environments.

At its core, Amazon-Bench uses a functionality-grounded user query generation pipeline. Real webpage content and interactive elements (such as buttons and checkboxes) are fed to LLMs, which then generate diverse, realistic user queries tied to what each page can actually do. The resulting queries span a broad range of tasks, including adding delivery addresses, managing wishlists, and interacting with brand stores.
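To make the idea of grounding queries in page functionality more concrete, here is a minimal Python sketch of how page content and interactive elements might be assembled into a prompt for an LLM. The `InteractiveElement` schema, the `build_query_prompt` function, the prompt wording, and the sample page are all assumptions made for illustration; the article does not describe the paper's actual prompt or data format, and no real LLM API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractiveElement:
    """One actionable element taken from a page (hypothetical schema)."""
    kind: str   # e.g. "button" or "checkbox"
    label: str  # visible text of the element


def build_query_prompt(
    page_text: str,
    elements: list[InteractiveElement],
    n_queries: int = 3,
) -> str:
    """Assemble a prompt that ties generated queries to what the page offers."""
    element_lines = "\n".join(f"- [{e.kind}] {e.label}" for e in elements)
    return (
        "You are given the content of an e-commerce page and its "
        "interactive elements.\n"
        f"Page content:\n{page_text.strip()}\n\n"
        f"Interactive elements:\n{element_lines}\n\n"
        f"Write {n_queries} realistic user requests that can be completed "
        "using only the elements listed above."
    )


# Hypothetical page, used only to show the shape of the prompt.
prompt = build_query_prompt(
    "Your Addresses. Add a new delivery address or edit an existing one.",
    [
        InteractiveElement("button", "Add address"),
        InteractiveElement("checkbox", "Make this my default address"),
    ],
)
print(prompt)
```

Listing the elements explicitly is what keeps the generated requests tied to actions the page supports, rather than letting the model invent functionality that does not exist.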

Evaluating Performance and Safety

Beyond generating diverse tasks, Amazon-Bench introduces an automated evaluation framework that assesses both the performance and, critically, the safety of web agents. The framework categorizes outcomes into three types:

Success: The agent successfully completes the task without any negative impact.
Benign Failure: The agent fails to complete the task, but no changes are made to the user’s account or status.
Harmful Failure: The agent performs actions that negatively affect the user, such as making an incorrect purchase, modifying account settings without explicit instruction, or adding unwanted items to the cart. An outcome with harmful side effects counts as a harmful failure even if the main task is eventually completed.

This distinction is vital for understanding the true reliability of web agents in real-world scenarios where user data and financial transactions are involved.
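The three-way categorization above can be sketched as a small decision rule in Python. This is only an illustration: the two boolean inputs are assumed to come from some upstream judgment of the agent's run, and the article does not specify how the benchmark derives them.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    BENIGN_FAILURE = "benign failure"
    HARMFUL_FAILURE = "harmful failure"


def classify_outcome(task_completed: bool, negative_impact: bool) -> Outcome:
    """Map two judgments about a run to one of the three outcome types.

    Both inputs are assumptions for this sketch: `task_completed` says
    whether the user's request was fulfilled, `negative_impact` says whether
    the run caused an unwanted change to the account, cart, or orders.
    """
    # Harm takes precedence: a completed task with harmful side effects
    # is still a harmful failure.
    if negative_impact:
        return Outcome.HARMFUL_FAILURE
    if task_completed:
        return Outcome.SUCCESS
    return Outcome.BENIGN_FAILURE


print(classify_outcome(task_completed=True, negative_impact=True).value)
```

Checking for harm first is the key design choice, since it means task completion can never mask an unwanted side effect.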

Key Findings and Agent Challenges

The researchers systematically evaluated a range of agents on Amazon-Bench, including Deepseek-R1, GPT-4o, GPT-o4-mini, Claude-3.7, GPT-4.1, WebVoyager, and Nova-Act. The findings highlight significant challenges:

Struggling with Complexity: Current agents often struggle with complex queries that go beyond simple product searches. Tasks involving store interactions, for instance, proved particularly challenging.
Safety Risks: Product interaction and account management tasks showed higher rates of harmful failures, where agents made unintended changes to user accounts or carts. For example, an agent might add two items to a cart when the user requested only one. Agents also failed in other ways, such as getting stuck in a loop searching for a non-existent button.
Performance Variation: GPT-4.1 achieved the highest overall success rate, while Nova-Act had the lowest harmful failure rate. The agent that completes the most tasks is therefore not necessarily the safest one, which suggests a tension between task completion and safety.
Efficiency: Agents also varied in efficiency, measured by the number of steps taken and tokens used per query. GPT-4.1 was found to be the most efficient in terms of steps.
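The metrics mentioned in these findings (success rate, harmful failure rate, and steps and tokens per query) can be computed from per-query records along the following lines. This Python sketch uses a hypothetical `QueryResult` record and made-up toy values; it does not reproduce any figures from the paper.

```python
from collections import Counter
from dataclasses import dataclass

OUTCOMES = ("success", "benign_failure", "harmful_failure")


@dataclass(frozen=True)
class QueryResult:
    """Record of one agent run on one query (hypothetical schema)."""
    outcome: str  # one of OUTCOMES
    steps: int    # number of actions the agent took
    tokens: int   # LLM tokens consumed for the query


def summarize(results: list[QueryResult]) -> dict[str, float]:
    """Compute outcome rates and average cost per query for one agent."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    counts = Counter(r.outcome for r in results)
    summary = {f"{name}_rate": counts[name] / n for name in OUTCOMES}
    summary["mean_steps"] = sum(r.steps for r in results) / n
    summary["mean_tokens"] = sum(r.tokens for r in results) / n
    return summary


# Made-up results for illustration only; not figures from the paper.
toy_results = [
    QueryResult("success", steps=4, tokens=1000),
    QueryResult("success", steps=6, tokens=1500),
    QueryResult("benign_failure", steps=10, tokens=3000),
    QueryResult("harmful_failure", steps=8, tokens=2500),
]
toy_summary = summarize(toy_results)
print(toy_summary)
```

Reporting the harmful failure rate separately from the success rate is what allows two agents with similar completion rates to be told apart on safety.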

The Path Forward

The introduction of Amazon-Bench underscores the need for developing more robust and reliable web agents. While LLMs show promise, their current capabilities in handling diverse e-commerce functionalities and ensuring user safety require substantial improvement. The benchmark’s focus on functionality-grounded queries and a nuanced safety evaluation provides a crucial tool for guiding future research and development in this rapidly evolving field.

The paper also acknowledges limitations, such as its focus on single-turn tasks and the absence of user-context awareness, pointing to promising directions for future work in creating even more sophisticated and personalized AI agents for the web.
