Tailoring LLM Evaluation: A Dataset for Responsible AI in E-commerce

TLDR: This research introduces a novel, use-case specific dataset designed to evaluate large language models (LLMs) for Responsible AI dimensions like fairness, safety, veracity, and quality within a real-world application: generating product descriptions. Unlike general benchmarks, this dataset focuses on specific attributes (identity groups, gendered adjectives, product categories, including high-risk ones) to uncover performance disparities and risks relevant to e-commerce, demonstrating how LLMs perform differently across various demographic cohorts and product types.

Large Language Models (LLMs) are becoming increasingly integrated into various applications, from content generation to customer service. However, evaluating these powerful AI systems for responsible performance, particularly in areas like fairness, safety, and veracity, presents a significant challenge. Traditional evaluation methods often rely on high-level, general tasks that don’t adequately capture the nuances of specific real-world applications.

Addressing the Gap in LLM Evaluation

A new research paper, “A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text,” by Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, and Vanessa Murdock, tackles this very issue. The authors highlight that what constitutes “fair” or “safe” can vary dramatically depending on the AI application. For instance, the safety requirements for an application generating children’s Halloween costume descriptions would differ greatly from those for one summarizing horror films.

To address this, the researchers have constructed a unique dataset specifically designed for a common text generation use case: creating plain-text product descriptions from a list of features. This dataset is curated with an e-commerce seller in mind, aiming to provide a more realistic and granular assessment of LLM performance in Responsible AI (RAI) dimensions.

How the Dataset Was Built

The dataset’s construction is meticulous, focusing on capturing attributes crucial for evaluating fairness and safety. It involves:

Query Templates: These templates combine product adjectives, product categories, and identity groups (e.g., “cute products for women,” “strong products in electronics for Latino people”).

Identity Groups: Drawing from the ToxiGen dataset, 13 identity groups are included, covering attributes like race, ethnicity, age, religion, disability status, sexual orientation, and gender identity.

Gendered Adjectives: A small set of adjectives (e.g., “superior,” “adorable,” “sexy”) were selected based on their association with gendered word clusters.

Product Categories: Eight categories associated with “man” and eight with “woman” were chosen from the Amazon.com catalog. Crucially, this includes six “high-risk” categories (e.g., Sexual Wellness, Shooting) identified for their potential to generate toxic language.

These methods generated 382 unique search queries, which were then submitted to Amazon.com to retrieve product details. After cleaning, the dataset comprises 7047 rows, representing 5145 unique products. Each row includes the product title, description, feature bullets (considered ground truth), and the query template used to retrieve it, all labeled for fairness attributes.
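The template expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the two template shapes are inferred from the example queries quoted earlier, the short value lists are hypothetical samples rather than the paper's full lists, and the function name `build_queries` is invented for this sketch.

```python
from itertools import product

# Hypothetical sample values for illustration only; the paper's lists are larger.
ADJECTIVES = ["cute", "strong"]
CATEGORIES = ["electronics", "appliances"]
IDENTITY_GROUPS = ["women", "Latino people"]


def build_queries(adjectives, categories, groups):
    """Expand two assumed template shapes into labeled search queries."""
    rows = []
    # Assumed template 1: "<adjective> products for <group>"
    for adj, group in product(adjectives, groups):
        rows.append({
            "query": f"{adj} products for {group}",
            "adjective": adj,
            "category": None,
            "group": group,
        })
    # Assumed template 2: "<adjective> products in <category> for <group>"
    for adj, cat, group in product(adjectives, categories, groups):
        rows.append({
            "query": f"{adj} products in {cat} for {group}",
            "adjective": adj,
            "category": cat,
            "group": group,
        })
    return rows


queries = build_queries(ADJECTIVES, CATEGORIES, IDENTITY_GROUPS)
for q in queries[:3]:
    print(q["query"])
```

Keeping the adjective, category, and identity group alongside each query string is what lets every retrieved product row carry its fairness labels later on.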

Evaluating LLMs with the New Dataset

To demonstrate the dataset’s utility, the researchers conducted a sample analysis using the Llama 3.2 11B model. They evaluated four key Responsible AI dimensions:

Quality: Measured by semantic accuracy, comparing LLM output to human-written ground truth descriptions.

Veracity: Assessed by BERTScore precision and recall, focusing on the accuracy and completeness of factual information.

Safety: Determined by a toxicity score (from the “unbiased” model of the Detoxify classifier), indicating the likelihood of harmful or toxic content.

Fairness: Evaluated using a “cohort disparity” meta-metric, comparing performance (toxicity and accuracy) across different identity groups, product categories, and adjectives.
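To make the precision and recall idea behind the veracity metric concrete, here is a deliberately simplified sketch. It uses exact token overlap rather than BERTScore, which matches tokens through contextual embeddings, so it should be read as a stand-in for intuition only. The function name and the example strings are hypothetical.

```python
from collections import Counter


def token_precision_recall(candidate: str, reference: str):
    """Exact-token overlap precision and recall (a simplified stand-in for BERTScore)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


reference = "stainless steel bottle with leakproof lid"

# Extra, unsupported words lower precision but leave recall intact.
p_extra, r_extra = token_precision_recall(
    "stainless steel bottle with leakproof lid and free shipping", reference
)

# Omitted details lower recall but leave precision intact.
p_short, r_short = token_precision_recall("stainless steel bottle", reference)

print(f"extra words:   precision={p_extra:.2f} recall={r_extra:.2f}")
print(f"omitted words: precision={p_short:.2f} recall={r_short:.2f}")
```

In this toy setup, hallucinated content shows up as lost precision and missing content as lost recall, which is the reading the two scores are given in the evaluation.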

Key Findings and Insights

The evaluation revealed several important insights:

High Overall Quality: The model showed high semantic similarity to human-written descriptions, with a mean accuracy of 0.9496.

Veracity Variations: While generally good, some LLM outputs included “hallucinated” words or omitted information, leading to variations in precision and recall.

Contextual Safety: Mean toxicity was low overall, but high-risk categories showed significant spikes. For example, “Sexual Wellness” products generated descriptions with high “sexually explicit” toxicity scores, while “Shooting” products scored high in “threat” sub-types. This highlights that toxicity definitions need to be aligned with the specific use case.

Fairness Disparities: The dataset effectively uncovered significant fairness gaps. Toxicity differed 21-fold between the least toxic category (Appliances) and the most toxic (Sexual Wellness). More strikingly, results also differed across identity groups; for instance, products associated with the “Women” group produced significantly higher scores for sexually explicit language, even though their overall toxicity was moderate.
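A fold difference like the one reported for product categories can be computed by grouping scores by cohort and comparing cohort means. The sketch below assumes the disparity is the ratio of the highest to the lowest cohort mean; the paper's exact definition of its cohort disparity meta-metric may differ. The function names, cohort labels, and scores are all hypothetical illustration values, not results from the study.

```python
from collections import defaultdict
from statistics import mean


def cohort_means(rows, cohort_key, metric_key):
    """Average a metric within each cohort (e.g. product category or identity group)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[cohort_key]].append(row[metric_key])
    return {cohort: mean(values) for cohort, values in buckets.items()}


def disparity_ratio(means):
    """Assumed disparity definition: highest cohort mean divided by lowest."""
    lowest, highest = min(means.values()), max(means.values())
    if lowest == 0:
        raise ValueError("lowest cohort mean is zero; ratio is undefined")
    return highest / lowest


# Made-up scores for two placeholder cohorts.
rows = [
    {"category": "A", "toxicity": 0.01},
    {"category": "A", "toxicity": 0.03},
    {"category": "B", "toxicity": 0.10},
    {"category": "B", "toxicity": 0.30},
]

means = cohort_means(rows, "category", "toxicity")
ratio = disparity_ratio(means)
print(means)
print(f"fold difference: {ratio:.1f}")
```

The same two functions work for any labeled attribute in the dataset: pass a different `cohort_key` to compare identity groups or adjectives, and a different `metric_key` to compare accuracy instead of toxicity.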

The study also suggested that for certain use cases, smaller LLMs might perform close enough to larger models to offer resource savings without significant performance degradation.

Looking Ahead

This work provides a valuable resource for the research community, offering a concrete method and dataset for evaluating LLMs in a use-case specific manner. While the dataset relies on human-written ground truth (which may contain inherent biases), it supports a wide range of evaluation metrics. The authors acknowledge limitations, such as the binary gender associations and the implicit nature of product-identity group links derived from search engine results, and suggest future extensions to cover multimodal or multilingual components.

Ultimately, this research underscores the importance of application-specific evaluation for Responsible AI, enabling developers to identify and address performance disparities and risks in LLM-generated text for realistic end-user applications.
