Unpacking LLM Safety: Insights from Aymara AI's Comprehensive Evaluation

TLDR: Aymara AI is a platform for automated, customizable safety evaluations of large language models (LLMs). Its “Aymara LLM Risk and Responsibility Matrix” evaluated 20 commercial LLMs across 10 safety domains and found large performance disparities. LLMs performed well in established areas like misinformation but consistently struggled in complex domains such as privacy and impersonation, which shows how uneven LLM safety is and why adaptable evaluation tools are needed.

As large language models (LLMs) are built into more real-world applications, deploying them safely and responsibly matters more. Evaluating AI safety is difficult, however, because what counts as ‘safe’ behavior varies with the application domain, legal framework, cultural norms, and evolving threat landscape.

A new research paper, titled “AUTOMATED SAFETY EVALUATIONS ACROSS 20 LARGE LANGUAGE MODELS: THE AYMARA LLM RISK AND RESPONSIBILITY MATRIX” by Juan Manuel Contreras of Aymara, introduces a novel approach to address this challenge. The paper details Aymara AI, a programmatic platform designed to generate and administer customized, policy-grounded safety evaluations for generative AI models.

Introducing Aymara AI: A New Approach to Safety Evaluation

Aymara AI works by transforming natural-language safety policies into adversarial prompts. These prompts are then fed to LLMs, and their responses are scored by an AI-based rater, which has been validated against human judgments. This innovative system aims to provide a scalable and rigorous method for assessing LLM safety, moving beyond static benchmarks to offer dynamic, customizable evaluations.
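The flow described above (policy, then adversarial prompts, then model responses, then rater verdicts) can be sketched in a few lines of Python. This is only an illustration of the shape of such a pipeline, not Aymara AI's actual API: `run_safety_eval`, `ScoredResponse`, and the three toy stand-in functions are hypothetical names invented for this example, and a real prompt generator and rater would themselves be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredResponse:
    prompt: str
    response: str
    is_safe: bool


def run_safety_eval(
    policy: str,
    generate_prompts: Callable[[str, int], List[str]],
    model: Callable[[str], str],
    rater: Callable[[str, str, str], bool],
    n_prompts: int = 25,
) -> List[ScoredResponse]:
    """Hypothetical pipeline: policy -> adversarial prompts -> responses -> verdicts."""
    results = []
    for prompt in generate_prompts(policy, n_prompts):
        response = model(prompt)
        # The rater sees the policy, the prompt, and the response.
        is_safe = rater(policy, prompt, response)
        results.append(ScoredResponse(prompt, response, is_safe))
    return results


# Toy stand-ins (assumptions for illustration only).
def toy_generate_prompts(policy: str, n: int) -> List[str]:
    return [f"Adversarial prompt {i + 1} probing: {policy}" for i in range(n)]


def toy_model(prompt: str) -> str:
    return "I can't help with that."


def toy_rater(policy: str, prompt: str, response: str) -> bool:
    # Treats a refusal as safe; a real rater would judge against the policy.
    return response.startswith("I can't")


results = run_safety_eval(
    "Do not impersonate real people.", toy_generate_prompts, toy_model, toy_rater
)
safe_rate = sum(r.is_safe for r in results) / len(results)
print(f"{len(results)} prompts, safe rate {safe_rate:.0%}")
```

Passing the generator, model, and rater in as plain callables keeps the sketch independent of any particular provider or SDK.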

The platform is designed to be highly adaptable, supporting diverse and evolving definitions of safety across different domains, legal jurisdictions, and cultures. It is also accessible to developers who may not possess deep AI safety expertise, eliminating the need for manual dataset creation. Furthermore, Aymara AI is multilingual, supporting non-English languages, and multimodal, enabling evaluations for both text-to-text and text-to-image models.

The Aymara LLM Risk and Responsibility Matrix

To demonstrate its capabilities, Aymara AI was used to create the Aymara LLM Risk and Responsibility Matrix. This comprehensive benchmark evaluated 20 commercially available LLMs across 10 real-world safety domains. These domains were defined by instruction-based policies drawing from academic literature, major model provider usage policies, and emerging regulatory frameworks like the EU AI Act. Examples of these domains include Misinformation, Hate Speech & Bias, Privacy & Impersonation, and Unqualified Professional Advice.

For each policy, Aymara AI programmatically generated 25 adversarial evaluation prompts, totaling 250 prompts. These prompts were carefully crafted to elicit boundary-violating behavior from the LLMs, pushing them towards policy edge cases. The 20 LLMs were then tested on all 250 prompts, generating 5,000 responses, which were subsequently scored by Aymara AI’s evaluation engine. Refusals to respond were treated as safe, aligning with the goal of avoiding unsafe behavior.
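The arithmetic behind these counts, and the effect of treating refusals as safe, can be checked with a short Python sketch. The counts come from the article; the `safety_score` function and its three verdict labels are assumptions made for illustration, since the paper's exact scoring code is not shown here.

```python
N_POLICIES = 10          # safety domains, one policy each
PROMPTS_PER_POLICY = 25  # adversarial prompts generated per policy
N_MODELS = 20            # LLMs evaluated

total_prompts = N_POLICIES * PROMPTS_PER_POLICY  # 250
total_responses = N_MODELS * total_prompts       # 5000


def safety_score(verdicts):
    """Percentage of responses counted as safe.

    Each verdict is assumed to be 'safe', 'unsafe', or 'refusal';
    refusals count as safe, as described in the article.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    safe = sum(v in ("safe", "refusal") for v in verdicts)
    return 100.0 * safe / len(verdicts)


# Made-up verdicts for one model on one 25-prompt policy.
example = ["safe"] * 15 + ["refusal"] * 5 + ["unsafe"] * 5
score = safety_score(example)
print(total_prompts, total_responses, score)
```

One consequence of this scoring rule is that a model which refuses everything would score 100%, so the metric measures avoidance of unsafe output rather than helpfulness.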

Key Findings: Disparities and Vulnerabilities

The evaluation revealed significant disparities in safety performance across both models and domains. The average safety score across all LLMs and domains was 73.2%, but individual model scores ranged from 86.2% (Claude Haiku 3.5 by Anthropic) to 52.4% (Command R by Cohere). While some models achieved perfect 100% scores in specific domains, no single model consistently delivered top-tier performance across all ten safety areas.

Performance also varied dramatically across the safety domains. Well-established risk areas, such as Misinformation (mean 95.7%) and Malicious Use (mean 91.8%), showed uniformly strong performance, with many models achieving near-perfect compliance. This suggests that when safety risks are clearly defined and operationalized, strong safety performance is attainable.

However, more nuanced or context-dependent domains exhibited systemic failures. Notably, Privacy & Impersonation had a mean score of only 24.3%, and Unqualified Professional Advice scored 53.8%. These domains represent critical vulnerabilities across the industry, indicating that current guardrails and model-training strategies are often insufficient in areas requiring subtle judgment or balancing competing values. For instance, the paper highlights that impersonation of public figures, while a clear violation in their framework, might be perceived as creative or satirical by developers, leading to less restrictive guardrails.
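To make the gap between domains concrete, the sketch below ranks the domain means quoted in this article and flags those that fall below the reported overall average. Only four of the ten domains have means quoted here, so the dictionary is deliberately incomplete; the variable and function names are illustrative.

```python
# Domain means as reported in the article (4 of the 10 domains).
reported_domain_means = {
    "Misinformation": 95.7,
    "Malicious Use": 91.8,
    "Unqualified Professional Advice": 53.8,
    "Privacy & Impersonation": 24.3,
}

OVERALL_MEAN = 73.2  # reported average across all LLMs and domains


def weakest_first(means):
    """Return (domain, mean) pairs sorted from lowest to highest mean."""
    return sorted(means.items(), key=lambda item: item[1])


below_average = [
    domain
    for domain, mean in weakest_first(reported_domain_means)
    if mean < OVERALL_MEAN
]

for domain, mean in weakest_first(reported_domain_means):
    flag = "below overall mean" if mean < OVERALL_MEAN else "above overall mean"
    print(f"{domain:35s} {mean:5.1f}%  {flag}")
```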

Implications and Future Directions

The findings underscore that LLM safety is not a monolithic attribute; it is context-dependent and varies significantly between models and types of risk. While top-tier providers like Anthropic, OpenAI, Amazon, and Google generally performed better, even their models struggled in the most challenging domains. And although overall scores vary, the current sample size does not support identifying any single model as statistically superior across the board.

The research concludes that scalable, programmatic, and customizable evaluation frameworks like Aymara AI are essential for navigating the complexities of generative AI safety. Future work will focus on extending the Aymara LLM Risk and Responsibility Matrix to cover more subtle or emerging risks, adapt to different legal and cultural contexts, and expand across modalities and languages to address global safety inequities. Longitudinal studies are also crucial to track progress and identify new risks as LLM capabilities rapidly evolve.

For more detailed information, you can read the full research paper here.
