TLDR: This research paper presents a fully automated framework to assess the ethical reasoning capabilities of 16 Large Language Models (LLMs) in a zero-shot setting. Using 30 real-world ethically charged scenarios, LLMs were prompted to identify applicable ethical theories, assess moral acceptability, and explain their reasoning. The results show high agreement rates (73.3% Theory Consistency Rate, 86.7% Binary Agreement Rate) among LLMs, with divergences concentrated in ethically ambiguous cases, often mirroring human expert disagreements. Qualitative analysis revealed lexically diverse but conceptually coherent and theory-aligned explanations. The findings suggest LLMs can serve as ethical inference engines in software engineering pipelines, enabling scalable, auditable, and adaptive integration of user-aligned ethical reasoning for tasks like decision auditing, autonomy triage, and agent personalization.
In today’s rapidly evolving digital landscape, autonomous systems are becoming an integral part of our daily lives, making decisions on our behalf. While these systems offer immense opportunities, they also present significant challenges, particularly in ensuring they operate ethically and align with human values. A recent research paper explores how Large Language Models (LLMs) can contribute to solving this complex problem by automating ethical profiling in software engineering.
The core challenge lies in designing software systems that not only meet technical requirements but also account for ethical considerations. Traditionally, ensuring ethical alignment has relied on manual input from users to generate their ‘ethical profiles’, structured representations of their ethical preferences. However, this manual approach is limited in scope and adaptability, since ethical preferences can vary greatly with context. It is impractical to rely solely on users to provide input for every possible situation, which highlights the need for automation.
This is where Large Language Models come into play. Recent advancements in generative AI have positioned LLMs as powerful tools capable of engaging in ethical reasoning. The paper, titled “Advancing Automated Ethical Profiling in SE: a Zero-Shot Evaluation of LLM Reasoning,” investigates whether these models can effectively reason about ethically significant content in real-life scenarios. The authors, Patrizio Migliarini, Mashal Afzal Memon, Marco Autili, and Paola Inverardi, propose a lightweight, fully automated framework to evaluate this potential.
The study involved presenting 16 different LLMs with 30 ethically charged scenarios. For each scenario, the models were given three tasks: identify the most applicable ethical theory (utilitarianism, deontology, or virtue ethics), assess whether the action described was morally acceptable according to that theory (yes/no), and provide a brief explanation for the choice. To establish a baseline, the same process was replicated with three human experts: professors with extensive knowledge in applied ethics and philosophy.
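To make the setup concrete, here is a minimal sketch of what a single evaluation prompt might look like. The wording and the JSON output contract are illustrative assumptions, not the authors’ verbatim protocol:

```python
# Hypothetical per-scenario prompt; the exact phrasing and JSON schema
# are illustrative assumptions, not the paper's verbatim protocol.
PROMPT_TEMPLATE = """You are given an ethically charged scenario.

Scenario: {scenario}

1. Which ethical theory best applies: utilitarianism, deontology, or virtue ethics?
2. According to that theory, is the action morally acceptable? Answer yes or no.
3. Briefly explain your reasoning.

Respond as JSON: {{"theory": ..., "acceptable": ..., "explanation": ...}}"""

def build_prompt(scenario: str) -> str:
    """Render the zero-shot prompt for one scenario (no examples, no memory)."""
    return PROMPT_TEMPLATE.format(scenario=scenario)
```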
The quantitative results were quite promising. LLMs achieved an average Theory Consistency Rate (TCR) of 73.3%, meaning they largely agreed on which ethical theory best applied to a scenario. Even more impressively, they showed an 86.7% Binary Agreement Rate (BAR) on moral acceptability. This indicates that LLMs can consistently interpret moral scenarios and produce structured, theory-informed judgments, even without specific fine-tuning for ethical tasks.
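The summary does not spell out the exact formulas, but agreement rates like these are often computed as the share of models that match the majority answer per scenario, averaged across scenarios. The sketch below shows one plausible reading; the paper’s precise definitions of TCR and BAR may differ:

```python
from collections import Counter

def majority_agreement_rate(answers_by_scenario: list[list[str]]) -> float:
    """For each scenario, take the fraction of models giving the modal answer,
    then average across scenarios. One plausible reading of TCR/BAR;
    the paper's exact definitions may differ."""
    rates = []
    for answers in answers_by_scenario:
        modal_count = Counter(answers).most_common(1)[0][1]
        rates.append(modal_count / len(answers))
    return sum(rates) / len(rates)

# Toy example: 3 scenarios, 4 models each.
theories = [["deontology"] * 4,
            ["utilitarianism", "utilitarianism", "virtue ethics", "utilitarianism"],
            ["virtue ethics", "deontology", "virtue ethics", "virtue ethics"]]
print(f"TCR ≈ {majority_agreement_rate(theories):.1%}")  # 83.3% on this toy data
```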
Interestingly, the study found that LLM agreement patterns often mirrored those of human experts. Scenarios that caused strong agreement among experts also tended to show high agreement among LLMs, and vice versa. This suggests that when LLMs disagree, it often reflects an inherent ethical ambiguity in the scenario itself, rather than just random noise. This insight is crucial for software engineering, as it implies that disagreements among LLMs could serve as a signal to escalate complex decisions for human review.
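In an engineering pipeline, that signal is straightforward to operationalize: auto-handle a decision only when model consensus clears a threshold, and route everything else to a human. A hypothetical sketch, where the threshold value and the string interface are illustrative choices:

```python
from collections import Counter

ESCALATION_THRESHOLD = 0.8  # illustrative value; would need empirical calibration

def triage(verdicts: list[str]) -> str:
    """Auto-approve only when enough models agree on the yes/no verdict;
    otherwise escalate the case to human review."""
    top_answer, top_count = Counter(verdicts).most_common(1)[0]
    if top_count / len(verdicts) >= ESCALATION_THRESHOLD:
        return f"auto:{top_answer}"
    return "escalate:human-review"

print(triage(["yes", "yes", "yes", "yes", "no"]))  # auto:yes (4/5 consensus)
print(triage(["yes", "no", "yes", "no", "yes"]))   # escalate:human-review
```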
Beyond numerical agreement, the researchers also conducted a qualitative analysis of the LLMs’ free-text explanations. They found that while the explanations were lexically diverse (models used different words and phrasings), they were conceptually coherent: in over 90% of cases, the explanation aligned with the ethical theory the model had selected. The models demonstrated a flexible use of moral vocabulary, often blending principles from different ethical traditions in ways that resemble human reasoning.
The implications for software engineering are significant. This research supports the potential viability of LLMs as ‘ethical inference engines’ within software development pipelines. They could be used for ‘decision auditing’ by generating clear, theory-grounded rationales for system actions, enhancing transparency. They could also enable ‘autonomy triage,’ where systems automatically handle straightforward ethical decisions but flag ambiguous cases for human oversight. Furthermore, the ability to extract consistent moral structures from language could lead to ‘agent personalization,’ allowing autonomous systems to adapt their behavior based on learned ethical user profiles.
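As an illustration of the decision-auditing idea, a pipeline could persist each system action together with the model’s theory label, verdict, and rationale. The record structure below is a hypothetical sketch; none of these field names come from the paper:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EthicalAuditRecord:
    """A hypothetical audit-log entry pairing a system action with an
    LLM-generated, theory-grounded rationale; field names are illustrative."""
    action: str
    theory: str       # e.g. "deontology"
    acceptable: bool
    rationale: str
    model: str

record = EthicalAuditRecord(
    action="share anonymized usage data with a research partner",
    theory="utilitarianism",
    acceptable=True,
    rationale="Aggregate benefits to users outweigh the minimal privacy risk.",
    model="example-llm-v1",
)
print(json.dumps(asdict(record), indent=2))  # persist alongside the decision trail
```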
While the findings are promising, the authors acknowledge certain limitations. The study focused on three major ethical theories, used concise and decontextualized scenarios, and employed a zero-shot approach without memory or clarification. Future work will explore broader ethical theories, richer input formats, and dynamic ethical profiling that adapts over time. The paper emphasizes that agreement does not equate to normative correctness, and LLM-based profiling should ideally be part of hybrid systems that combine automation with human oversight.
This work marks a significant step towards integrating sophisticated ethical reasoning into software systems, paving the way for more adaptive, traceable, and user-aligned ethical cognition in the digital world. You can read the full research paper here.


