
Unpacking AI’s Decisions: Can Large Language Models Explain Themselves?

TL;DR: A study investigates if LLM-generated explanations can improve model classification performance. It finds that these automated explanations are competitive with human ones and consistently boost the performance of pre-trained language models (PLMs). However, for other LLMs, the impact is mixed; explanations can sometimes hinder performance, suggesting that the type of explanation and the model’s internal reasoning play a crucial role.

Large Language Models (LLMs) and other powerful language models have become incredibly adept at various tasks, but their inner workings often remain a mystery. This “black-box” nature makes it hard to understand why they make certain predictions, which is a significant hurdle, especially in sensitive applications. Traditionally, to make these models more transparent, researchers rely on human-written explanations. However, gathering these human explanations is a time-consuming, expensive, and labor-intensive process, making it difficult to scale up.

A recent study by researchers at the Technical University of Munich explores an innovative solution: using LLMs themselves to generate these crucial textual explanations. The paper, titled “Can LLM-Generated Textual Explanations Enhance Model Classification Performance? An Empirical Study,” investigates whether these automatically generated explanations can not only match the quality of human-written ones but also improve the performance of other language models on classification tasks.

Automating Explanation Generation

The researchers developed an automated framework that leverages several state-of-the-art LLMs to create high-quality textual explanations. They used a diverse set of models, including GPT-4o mini, Mixtral-7B, Gemma2-9B, and Llama3-70B, to generate explanations for two different Natural Language Inference (NLI) datasets: e-SNLI and HealthFC. NLI is a fundamental task in natural language processing where models determine the logical relationship between two pieces of text (a premise and a hypothesis).

Explanations were generated in two settings: “zero-shot,” where the LLMs received no examples, and “few-shot,” where they were given a few human-written examples to guide their generation. A key instruction given to the LLMs was to avoid hinting at the correct classification label in their explanations, ensuring the explanations were genuinely about reasoning rather than just repeating the answer.
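
The paper’s exact prompts are not reproduced in this article, so the snippet below is only a rough sketch of the generation step: the prompt wording, the few-shot formatting, and the use of the OpenAI Python client with GPT-4o mini are illustrative assumptions rather than the authors’ implementation.

```python
# Hedged sketch of explanation generation. Prompt wording, function names,
# and few-shot formatting are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def build_prompt(premise: str, hypothesis: str, few_shot_examples=None) -> str:
    """Ask for a free-text explanation without revealing the gold label."""
    prompt = (
        "Explain the relationship between the premise and the hypothesis.\n"
        "Do NOT state or hint at the classification label (entailment, "
        "neutral, or contradiction) in your explanation.\n\n"
    )
    # Few-shot setting: prepend a handful of human-written examples.
    for ex in few_shot_examples or []:
        prompt += (
            f"Premise: {ex['premise']}\nHypothesis: {ex['hypothesis']}\n"
            f"Explanation: {ex['explanation']}\n\n"
        )
    prompt += f"Premise: {premise}\nHypothesis: {hypothesis}\nExplanation:"
    return prompt


def generate_explanation(premise: str, hypothesis: str, few_shot_examples=None) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_prompt(premise, hypothesis, few_shot_examples)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()
```

Leaving out `few_shot_examples` corresponds to the zero-shot setting; passing a few human-written examples corresponds to the few-shot setting.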

Evaluating the Quality of LLM-Generated Explanations

To assess the quality of these LLM-generated explanations, the study employed a comprehensive suite of Natural Language Generation (NLG) metrics. These included traditional measures like BLEU and ROUGE, which look at word overlap, and more modern semantic metrics like BERTScore and MAUVE. They also used an “LLM-as-judge” framework called G-Eval, where another LLM (GPT-3.5-turbo in this case) evaluated the human-likeness, clarity, coherence, and structure of the generated explanations. The findings showed that GPT-4o mini generally produced the highest quality explanations for e-SNLI, while Llama3-70B performed best on HealthFC. Interestingly, the study found that providing a few examples (few-shot setting) only marginally improved the explanation quality, and larger model size didn’t always guarantee better explanations.
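
As a rough illustration of how the reference-based metrics work, the sketch below scores a hypothetical generated explanation against a human-written reference using the Hugging Face evaluate library; the example strings and the choice of this particular tooling are assumptions, not the paper’s actual evaluation code.

```python
# Hedged sketch of automatic explanation scoring; example texts and the
# choice of the Hugging Face `evaluate` library are illustrative assumptions.
import evaluate

generated = ["Someone who is standing on a street corner cannot be sitting at home."]
references = ["The person is standing, therefore they cannot be sitting."]

bleu = evaluate.load("bleu").compute(predictions=generated, references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=generated, references=references)
bertscore = evaluate.load("bertscore").compute(predictions=generated, references=references, lang="en")

print(f"BLEU: {bleu['bleu']:.3f}")
print(f"ROUGE-L: {rouge['rougeL']:.3f}")
print(f"BERTScore F1: {sum(bertscore['f1']) / len(bertscore['f1']):.3f}")
```

MAUVE, by contrast, is computed over the whole corpus of generated texts, and G-Eval replaces these reference-based scores with ratings produced by a judge LLM.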

Impact on Model Performance: PLMs vs. LLMs

The core of the study investigated how incorporating these LLM-generated explanations affected the performance of both pre-trained language models (PLMs) like BERT, DeBERTa, RoBERTa, and ModernBERT, and other LLMs (GPT-4o mini, Qwen 2.5, and Llama3.3-70B) on NLI tasks.

For PLMs, the results were largely positive: both human-generated and LLM-generated explanations consistently improved predictive performance compared to having no explanations. On the HealthFC dataset, LLM-generated explanations even led to better performance than human explanations in some cases. This suggests that PLMs, when fine-tuned with these explanations, can effectively learn and utilize the additional information provided.
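
How the explanation reaches a PLM classifier is straightforward in principle: the explanation text is supplied alongside the premise-hypothesis pair during fine-tuning. The sketch below shows one plausible way to encode such an input with a DeBERTa checkpoint; the concatenation format, the checkpoint, and the sequence length are assumptions, not the paper’s exact recipe.

```python
# Hedged sketch of feeding an explanation to a PLM classifier; the input
# format and the checkpoint are assumptions, not the study's exact setup.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "microsoft/deberta-v3-base"  # DeBERTa is one of the PLM families used in the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(premise: str, hypothesis: str, explanation: str):
    # Premise and hypothesis go in the first segment, the explanation in the
    # second, so the classifier can attend to the extra rationale text.
    first = f"{premise} {tokenizer.sep_token} {hypothesis}"
    return tokenizer(first, explanation, truncation=True, max_length=256, return_tensors="pt")

batch = encode(
    "A person is standing on a busy street corner.",
    "The person is sitting at home.",
    "Someone who is standing on a street corner cannot be sitting at home.",
)
logits = model(**batch).logits  # standard fine-tuning (e.g. with the Trainer API) proceeds from here
```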

However, the impact on classifier LLMs was more nuanced. In most scenarios, providing LLM-generated explanations did not lead to better performance for the LLM classifiers compared to the baseline without explanations. In fact, for the e-SNLI dataset, these explanations often hurt performance. The researchers suggest this might be because the LLMs used as classifiers were not explicitly trained on the explanations and their internal reasoning processes (akin to Chain-of-Thought) might clash with the provided external explanations. Human explanations were generally more beneficial for LLMs, though their effectiveness still varied significantly between datasets and models.
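
For the LLM classifiers, by contrast, the with-explanation condition amounts to inserting the rationale into the classification prompt at inference time. A minimal sketch, again with assumed prompt wording and GPT-4o mini standing in as the classifier:

```python
# Hedged sketch of the LLM-as-classifier setting; the prompt wording and the
# way the explanation is passed as extra context are assumptions.
from openai import OpenAI

client = OpenAI()

def classify(premise: str, hypothesis: str, explanation: str | None = None) -> str:
    prompt = (
        "Decide whether the hypothesis is entailed by, neutral to, or contradicted "
        "by the premise. Answer with one word: entailment, neutral, or contradiction.\n\n"
        f"Premise: {premise}\nHypothesis: {hypothesis}\n"
    )
    if explanation:  # with-explanation condition; omit for the no-explanation baseline
        prompt += f"Explanation: {explanation}\n"
    prompt += "Label:"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().lower()
```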

Different Types of Explanations Matter

The study also highlighted that the type of explanation plays a crucial role. The e-SNLI dataset features “logic-based” explanations that clarify the reasoning process (e.g., “The person is standing, therefore they cannot be sitting”). The HealthFC dataset, on the other hand, uses “summary-style” explanations that provide additional context and background knowledge (e.g., “Analyzed studies have found a positive effect of the drug on the illness”). The logic-based explanations seemed to interfere with the LLMs’ inherent reasoning, while the summary-style explanations were more helpful, providing useful context.


Conclusion and Future Directions

This research demonstrates the significant potential of using LLMs to automatically generate textual explanations. These automated rationales can be competitive with human annotations and can notably improve the performance of PLMs. While their impact on other LLMs is more complex and task-dependent, the study opens new avenues for augmenting datasets with explanations and enhancing model classification performance. Future work will explore a wider range of datasets, refine prompt engineering, and incorporate newer evaluation metrics to further validate and expand on these findings.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
