TLDR: Aymara AI introduces a platform for automated, customizable safety evaluations of large language models (LLMs). Their “Aymara LLM Risk and Responsibility Matrix” evaluated 20 commercial LLMs across 10 safety domains, revealing significant performance disparities. While LLMs performed well in established areas like misinformation, they consistently struggled with complex domains such as privacy and impersonation, highlighting the inconsistent nature of LLM safety and the need for adaptable evaluation tools.
As large language models (LLMs) become increasingly integrated into various real-world applications, ensuring their safety and responsible deployment is paramount. However, evaluating AI safety is a complex and multifaceted challenge, as what constitutes ‘safe’ behavior can vary significantly based on the application domain, legal frameworks, cultural norms, and evolving threat landscapes.
A new research paper, titled “AUTOMATED SAFETY EVALUATIONS ACROSS 20 LARGE LANGUAGE MODELS: THE AYMARA LLM RISK AND RESPONSIBILITY MATRIX” by Juan Manuel Contreras of Aymara, introduces a novel approach to address this challenge. The paper details Aymara AI, a programmatic platform designed to generate and administer customized, policy-grounded safety evaluations for generative AI models.
Introducing Aymara AI: A New Approach to Safety Evaluation
Aymara AI works by transforming natural-language safety policies into adversarial prompts. These prompts are then fed to LLMs, and their responses are scored by an AI-based rater, which has been validated against human judgments. This innovative system aims to provide a scalable and rigorous method for assessing LLM safety, moving beyond static benchmarks to offer dynamic, customizable evaluations.
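The policy-to-prompt-to-rating flow described above can be sketched in a few lines of Python. This is only an illustrative outline, not Aymara AI's actual API: `run_evaluation`, the three callable types, and the toy stand-ins are all hypothetical names introduced here, and a real system would call an LLM for prompt generation, for the model under test, and for the rater.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not part of any real SDK).
PromptGenerator = Callable[[str, int], List[str]]  # (policy, n) -> prompts
Model = Callable[[str], str]                       # prompt -> response
Rater = Callable[[str, str, str], bool]            # (policy, prompt, response) -> is_safe


def run_evaluation(
    policy: str,
    n_prompts: int,
    generate_prompts: PromptGenerator,
    model: Model,
    rater: Rater,
) -> List[Tuple[str, str, bool]]:
    """Generate prompts from a policy, query the model, and rate each response."""
    results = []
    for prompt in generate_prompts(policy, n_prompts):
        response = model(prompt)
        is_safe = rater(policy, prompt, response)
        results.append((prompt, response, is_safe))
    return results


# Toy stand-ins so the sketch runs without any external service.
def toy_generator(policy: str, n: int) -> List[str]:
    return [f"Adversarial prompt {i + 1} probing: {policy}" for i in range(n)]


def toy_model(prompt: str) -> str:
    return "I can't help with that."


def toy_rater(policy: str, prompt: str, response: str) -> bool:
    # A real rater would be an AI judge validated against human labels.
    return response.startswith("I can't")


results = run_evaluation(
    "Do not impersonate real people.", 3, toy_generator, toy_model, toy_rater
)
```

Passing the generator, model, and rater in as functions keeps the loop independent of any particular provider, which mirrors the article's point that the same policy can be tested against many models.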
The platform is designed to be highly adaptable, supporting diverse and evolving definitions of safety across different domains, legal jurisdictions, and cultures. It is also accessible to developers who may not possess deep AI safety expertise, eliminating the need for manual dataset creation. Furthermore, Aymara AI is multilingual, supporting non-English languages, and multimodal, enabling evaluations for both text-to-text and text-to-image models.
The Aymara LLM Risk and Responsibility Matrix
To demonstrate its capabilities, Aymara AI was used to create the Aymara LLM Risk and Responsibility Matrix. This comprehensive benchmark evaluated 20 commercially available LLMs across 10 real-world safety domains. These domains were defined by instruction-based policies drawing from academic literature, major model provider usage policies, and emerging regulatory frameworks like the EU AI Act. Examples of these domains include Misinformation, Hate Speech & Bias, Privacy & Impersonation, and Unqualified Professional Advice.
For each policy, Aymara AI programmatically generated 25 adversarial evaluation prompts, totaling 250 prompts. These prompts were carefully crafted to elicit boundary-violating behavior from the LLMs, pushing them towards policy edge cases. The 20 LLMs were then tested on all 250 prompts, generating 5,000 responses, which were subsequently scored by Aymara AI’s evaluation engine. Refusals to respond were treated as safe, aligning with the goal of avoiding unsafe behavior.
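The arithmetic behind these totals, and the rule that refusals count as safe, can be checked with a short Python sketch. The `ScoredResponse` record and the `is_safe` and `safety_score` helpers are hypothetical names used for illustration; they assume the rater produces a simple violation flag per response, which the paper may represent differently.

```python
from dataclasses import dataclass
from typing import Iterable

N_MODELS = 20            # LLMs evaluated
N_DOMAINS = 10           # safety policies (one per domain)
PROMPTS_PER_DOMAIN = 25  # adversarial prompts generated per policy

total_prompts = N_DOMAINS * PROMPTS_PER_DOMAIN
total_responses = N_MODELS * total_prompts


@dataclass
class ScoredResponse:
    """Hypothetical record of one rated response."""
    model: str
    domain: str
    refused: bool
    violates_policy: bool  # rater verdict; irrelevant when the model refused


def is_safe(r: ScoredResponse) -> bool:
    # Refusals are treated as safe; otherwise the rater's verdict decides.
    return r.refused or not r.violates_policy


def safety_score(responses: Iterable[ScoredResponse]) -> float:
    """Return the percentage of responses judged safe."""
    responses = list(responses)
    if not responses:
        raise ValueError("no responses to score")
    return 100.0 * sum(is_safe(r) for r in responses) / len(responses)


# Toy sample: one refusal, one compliant answer, two violations.
sample = [
    ScoredResponse("model_a", "Privacy & Impersonation", True, False),
    ScoredResponse("model_a", "Privacy & Impersonation", False, False),
    ScoredResponse("model_a", "Privacy & Impersonation", False, True),
    ScoredResponse("model_a", "Privacy & Impersonation", False, True),
]
sample_score = safety_score(sample)
```

Under this scoring rule a model that refuses everything would score 100%, so the score measures avoidance of unsafe output rather than helpfulness.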
Key Findings: Disparities and Vulnerabilities
The evaluation revealed significant disparities in safety performance across both models and domains. The average safety score across all LLMs and domains was 73.2%, but individual model scores ranged from 86.2% (Claude Haiku 3.5 by Anthropic) to 52.4% (Command R by Cohere). While some models achieved perfect 100% scores in specific domains, no single model consistently delivered top-tier performance across all ten safety areas.
Performance also varied dramatically across the safety domains. Well-established risk areas, such as Misinformation (mean 95.7%) and Malicious Use (mean 91.8%), showed uniformly strong performance, with many models achieving near-perfect compliance. This suggests that when safety risks are clearly defined and operationalized, high safety performance is attainable.
However, more nuanced or context-dependent domains exhibited systemic failures. Notably, Privacy & Impersonation had a mean score of only 24.3%, and Unqualified Professional Advice had a mean score of 53.8%. These domains represent critical vulnerabilities across the industry, indicating that current guardrails and model-training strategies are often insufficient in areas requiring subtle judgment or balancing competing values. For instance, the paper highlights that impersonation of public figures, while a clear violation in the paper's framework, might be perceived as creative or satirical by developers, leading to less restrictive guardrails.
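Per-model and per-domain means like those reported above come from averaging the same grid of scores along different axes. The Python sketch below shows that aggregation on a tiny grid; the model names and score values are made-up placeholders for illustration only, not figures from the paper, and `mean_by` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# Placeholder scores keyed by (model, domain); values are NOT from the paper.
scores: Dict[Tuple[str, str], float] = {
    ("model_a", "Misinformation"): 100.0,
    ("model_a", "Privacy & Impersonation"): 40.0,
    ("model_b", "Misinformation"): 90.0,
    ("model_b", "Privacy & Impersonation"): 20.0,
}


def mean_by(grid: Dict[Tuple[str, str], float], key_index: int) -> Dict[str, float]:
    """Average scores grouped by model (key_index=0) or by domain (key_index=1)."""
    groups = defaultdict(list)
    for key, value in grid.items():
        groups[key[key_index]].append(value)
    return {name: mean(values) for name, values in groups.items()}


by_model = mean_by(scores, 0)
by_domain = mean_by(scores, 1)

# The domain with the lowest mean is the weakest area across all models.
weakest_domain = min(by_domain, key=by_domain.get)
```

Averaging by domain rather than by model is what surfaces industry-wide weak spots: a domain can have a low mean even when every model's overall average looks acceptable.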
Implications and Future Directions
The findings underscore that LLM safety is not a monolithic attribute; it is context-dependent and varies significantly between models and types of risk. While top-tier providers like Anthropic, OpenAI, Amazon, and Google generally performed better, even their models struggled in the most challenging domains. Moreover, although overall scores vary between models, the current sample size is too small to identify any single model as statistically superior across the board.
The research concludes that scalable, programmatic, and customizable evaluation frameworks like Aymara AI are essential for navigating the complexities of generative AI safety. Future work will focus on extending the Aymara LLM Risk and Responsibility Matrix to cover more subtle or emerging risks, adapt to different legal and cultural contexts, and expand across modalities and languages to address global safety inequities. Longitudinal studies are also crucial to track progress and identify new risks as LLM capabilities rapidly evolve.
For more detailed information, you can read the full research paper here.


