Evaluating How Large Language Models Justify Their Decisions

TLDR: RACE (Reasoning Alignment for Completeness of Explanations) is a new framework that measures how well Large Language Model (LLM) explanations align with the predictive signals of a transparent logistic regression model. The study found that correct LLM predictions consistently show higher coverage of supporting features, while incorrect predictions are linked to higher coverage of contradicting features. The study used several matching techniques, including fuzzy edit-distance matching. These showed that LLMs both reuse key features verbatim and paraphrase them, and that the strength of this alignment varies across text classification tasks.

As machine learning becomes more deeply integrated into critical domains, the demand for transparent and understandable artificial intelligence (AI) has grown significantly. Large Language Models (LLMs) can now readily produce natural-language explanations for their decisions. However, a crucial question remains: do these explanations actually reflect the information that drives the LLM’s predictions?

A new research paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework designed to address this question. Authored by Avinash Patil of Juniper Networks Inc., the framework evaluates how well LLM-generated explanations align with interpretable feature importance scores derived from a logistic regression baseline.

Understanding RACE: Bridging LLM Explanations and Traditional Models

The core idea behind RACE is to compare the free-text explanations an LLM provides with the most influential words or features identified by a simpler, more transparent model: a logistic regression classifier. This model is trained on the same data, and its learned weights can be inspected directly. As a result, it shows which lexical features (words or short phrases) strongly support or contradict a particular classification.

The framework first prompts an LLM (DeepSeek-R1 in this study) to make a prediction and provide a rationale. Separately, a logistic regression model identifies its top-k most influential features and categorizes them as ‘supporting’ (positive influence) or ‘contradicting’ (negative influence) for the predicted class.
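The summary does not specify how the logistic regression's features are extracted or ranked. The sketch below therefore makes several assumptions: a bag-of-words model, a binary task, and per-instance importance taken as weight × word count for words that appear in the text. The names `top_k_features`, `tokenize`, and the toy `weights` dictionary are hypothetical and only illustrate how supporting and contradicting features could be separated for the predicted class.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified normalization (assumption): lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_features(text, weights, bias, k=3):
    """Predict with a toy binary logistic regression over word counts and return
    (label, supporting, contradicting) features for the predicted label.
    Hypothetical helper; not the paper's implementation."""
    counts = Counter(tokenize(text))
    # Per-instance contribution of each known word to the logit: weight * count.
    contrib = {w: weights[w] * c for w, c in counts.items() if w in weights}
    logit = bias + sum(contrib.values())
    # sigmoid(logit) >= 0.5 exactly when logit >= 0, so compare the logit directly.
    label = 1 if logit >= 0 else 0
    # Re-sign contributions so that positive means "pushes toward the predicted label".
    sign = 1 if label == 1 else -1
    signed = {w: sign * v for w, v in contrib.items()}
    # Strongest supporters first, then strongest contradictors first.
    supporting = sorted((w for w, v in signed.items() if v > 0), key=lambda w: -signed[w])[:k]
    contradicting = sorted((w for w, v in signed.items() if v < 0), key=lambda w: signed[w])[:k]
    return label, supporting, contradicting


# Toy weights invented for illustration (positive = evidence for label 1).
weights = {"great": 2.0, "excellent": 1.5, "plot": 0.1, "boring": -1.8, "awful": -2.5}
print(top_k_features("A great cast, but a boring plot.", weights, bias=0.0))
# prints (1, ['great', 'plot'], ['boring'])
```

Re-signing the contributions relative to the predicted label means the same code handles either class. A real multi-class setup, such as AG NEWS, would instead use the weight row for the predicted class.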

How Explanations Are Matched

To assess the alignment between LLM rationales and these identified features, RACE employs three distinct matching strategies:

Token-aware matching: Features are compared at the lemma level after standard text normalization, such as lowercasing and punctuation removal.
Exact string matching: A stricter method that requires an exact match of the feature string within the LLM’s explanation.
Edit-distance matching: A more flexible approach that allows for small character-level deviations, capturing paraphrases or near-synonymous overlaps. This is particularly useful for identifying when an LLM rephrases a key feature rather than using it verbatim.
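The three strategies can be sketched in a few lines. This sketch makes four assumptions. `crude_lemma` is a deliberately simple suffix-stripper standing in for a real lemmatizer. "Exact" is read as a raw, case-sensitive substring test. Edit distance is Levenshtein distance between the feature and each same-length window of explanation tokens. The threshold `max_dist=2` is hypothetical; the paper's actual settings are not given here.

```python
import re


def normalize(text):
    # Lowercase and replace punctuation with spaces, then split into tokens.
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def crude_lemma(token):
    # Hypothetical stand-in for a real lemmatizer: strip one common suffix.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def exact_match(feature, explanation):
    # Strictest reading: the feature string must appear verbatim.
    return feature in explanation


def token_match(feature, explanation):
    # Lemmatized feature tokens must appear as a contiguous run in the explanation.
    feat = [crude_lemma(t) for t in normalize(feature)]
    expl = [crude_lemma(t) for t in normalize(explanation)]
    n = len(feat)
    return n > 0 and any(expl[i:i + n] == feat for i in range(len(expl) - n + 1))


def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_match(feature, explanation, max_dist=2):
    # Compare the feature against every explanation window with the same token count.
    feat = normalize(feature)
    n = len(feat)
    if n == 0:
        return False
    target = " ".join(feat)
    expl = normalize(explanation)
    return any(levenshtein(target, " ".join(expl[i:i + n])) <= max_dist
               for i in range(len(expl) - n + 1))


rationale = "The reviewer calls the acting excelent and moving."
print(exact_match("excellent", rationale), token_match("excellent", rationale),
      edit_match("excellent", rationale))
# prints False False True
```

The example shows why the fuzzy strategy finds more overlap: a near-miss spelling is invisible to the exact and lemma-based tests but falls within a small edit distance. A fixed absolute threshold is crude, though. With `max_dist=2`, short words such as "bad" and "bag" also match, so a length-relative threshold may be preferable.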

By using these methods, RACE calculates a ‘coverage’ score, indicating how many of the identified supporting or contradicting features are present in the LLM’s explanation.
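A minimal sketch of the coverage score follows. It assumes coverage is the fraction of the top-k features that a matcher finds in the explanation. The summary only says "how many", so a raw count is an equally plausible reading. The function name `coverage` and the simple case-insensitive `substring_match` are illustrative assumptions, not RACE's own code. In practice, any of the three matching strategies would be plugged in as the `match` argument.

```python
from typing import Callable, Sequence


def coverage(features: Sequence[str], explanation: str,
             match: Callable[[str, str], bool]) -> float:
    """Fraction of the given features that `match` finds in the explanation.
    Returns 0.0 for an empty feature list (an assumed convention)."""
    if not features:
        return 0.0
    hits = sum(1 for f in features if match(f, explanation))
    return hits / len(features)


def substring_match(feature: str, explanation: str) -> bool:
    # Illustrative matcher: case-insensitive substring test.
    return feature.lower() in explanation.lower()


supporting = ["goal", "match", "coach", "league"]
contradicting = ["stock", "earnings"]
rationale = "The text mentions a late goal and the coach's reaction after the match."
print(coverage(supporting, rationale, substring_match))     # prints 0.75
print(coverage(contradicting, rationale, substring_match))  # prints 0.0
```

Supporting and contradicting features are scored separately. This separation is what allows the two kinds of coverage to be compared across correct and incorrect predictions.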

Key Findings: What the Research Revealed

The empirical study applied RACE to four widely used text classification datasets: WIKIONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS. The results revealed several consistent patterns:

A consistent asymmetry was observed: when an LLM made a correct prediction, its rationales showed higher coverage of supporting features. Conversely, incorrect predictions were strongly associated with elevated coverage of contradicting features. This suggests that LLM explanations tend to highlight misleading evidence when errors occur.
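The asymmetry comes from splitting per-example coverage scores by whether the prediction was correct and averaging each group. A minimal sketch of that aggregation follows. The record fields `correct`, `support_cov`, and `contra_cov` are hypothetical names, and the three records are toy values invented to exercise the code. They are not results from the paper.

```python
from statistics import mean


def coverage_by_outcome(records):
    """Average (supporting, contradicting) coverage for correct vs. incorrect
    predictions. Each record is a dict with hypothetical keys 'correct' (bool),
    'support_cov' and 'contra_cov' (floats in [0, 1])."""
    summary = {}
    for outcome, name in ((True, "correct"), (False, "incorrect")):
        group = [r for r in records if r["correct"] is outcome]
        if group:  # skip an outcome with no examples rather than divide by zero
            summary[name] = (mean(r["support_cov"] for r in group),
                             mean(r["contra_cov"] for r in group))
    return summary


# Toy records for illustration only (not the paper's data).
records = [
    {"correct": True, "support_cov": 0.6, "contra_cov": 0.1},
    {"correct": True, "support_cov": 0.4, "contra_cov": 0.1},
    {"correct": False, "support_cov": 0.2, "contra_cov": 0.5},
]
print(coverage_by_outcome(records))
```

The reported pattern corresponds to the "correct" group having the higher supporting coverage and the "incorrect" group having the higher contradicting coverage.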

The study also found that while exact and token-aware matching revealed significant surface-level overlap, edit-distance matching consistently boosted coverage. This indicates that LLM rationales often incorporate close variants or paraphrases of predictive features, demonstrating a mix of direct lexical alignment and flexible reformulation in their reasoning.

The strength of this alignment varied by task. Topical classification tasks (like WIKIONTOLOGY and AG NEWS) showed the clearest separation between correct and incorrect predictions, reflecting the strong lexical grounding of their categories. Sentiment analysis (IMDB) and fine-grained emotion recognition (GOEMOTIONS) exhibited weaker alignment, possibly because these tasks rely on more diffuse or subtle linguistic cues.

Implications for Trustworthy AI

These findings offer valuable insight into the faithfulness and limitations of LLM-generated rationales. They show that LLM explanations do capture semantically relevant evidence, but they can also amplify misleading cues in error cases. RACE provides a quantitative basis for evaluating reasoning completeness. LLMs often reuse or paraphrase key predictive features, but the reliability of their explanations can be task-dependent. Those explanations may sometimes be post-hoc justifications rather than faithful accounts of the reasoning behind a prediction.

The research suggests that future work should explore richer models of feature importance and incorporate contextual embeddings for semantic alignment to develop even more comprehensive measures of explanation faithfulness.
