Evaluating How Large Language Models Justify Their Decisions

TLDR: A new framework called RACE (Reasoning Alignment for Completeness of Explanations) has been introduced to quantitatively assess how well Large Language Model (LLM) explanations align with the predictive signals of a transparent logistic regression model. The study found that correct LLM predictions consistently show higher coverage of supporting features, while incorrect predictions are linked to increased coverage of contradicting features. Using various matching techniques, including fuzzy edit-distance, the research revealed that LLMs both directly reuse and flexibly paraphrase key features, with the strength of this alignment varying across different text classification tasks.

As machine learning becomes more integrated into critical areas, the demand for transparent and understandable artificial intelligence (AI) has grown. Large Language Models (LLMs) can now generate fluent natural-language explanations for their decisions. However, a crucial question remains: do these explanations truly reflect the underlying information that drives the LLM’s predictions?

A new research paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework designed to address this question. Authored by Avinash Patil of Juniper Networks Inc., the framework evaluates how well LLM-generated explanations align with interpretable feature importance scores derived from a logistic regression baseline model.

Understanding RACE: Bridging LLM Explanations and Traditional Models

The core idea behind RACE is to compare the free-text explanations provided by an LLM with the most influential words or features identified by a simpler, more transparent model, specifically a logistic regression classifier. This traditional model, trained on the same data, can clearly show which lexical features (words or short phrases) strongly support or contradict a particular classification.

The framework works by first prompting an LLM (like DeepSeek-R1, used in this study) to make a prediction and provide a rationale. Simultaneously, a logistic regression model identifies its top-k most influential features, categorizing them as ‘supporting’ (positive influence) or ‘contradicting’ (negative influence) for the predicted class.
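The feature-selection step can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function name `top_k_features`, the toy coefficients, and the choice to consider only features that occur in the document are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def top_k_features(
    weights: Dict[str, float],
    doc_tokens: List[str],
    k: int = 5,
) -> Tuple[List[str], List[str]]:
    """Return the top-k supporting and top-k contradicting features.

    `weights` maps each lexical feature to a (hypothetical) logistic
    regression coefficient for the predicted class. Positive weights
    support the prediction; negative weights contradict it.
    """
    present = set(doc_tokens)
    # Assumption: only features that occur in the document are considered.
    scored = [(f, w) for f, w in weights.items() if f in present]

    # Strongest positive weights first.
    supporting = sorted((p for p in scored if p[1] > 0), key=lambda p: -p[1])
    # Most negative weights first.
    contradicting = sorted((p for p in scored if p[1] < 0), key=lambda p: p[1])

    return [f for f, _ in supporting[:k]], [f for f, _ in contradicting[:k]]


# Toy coefficients for a "positive sentiment" class (illustrative values only).
toy_weights = {"brilliant": 2.1, "moving": 1.4, "fine": 0.3, "dull": -1.8, "slow": -0.9}
tokens = "a brilliant but slow film with a moving finale".split()

supporting, contradicting = top_k_features(toy_weights, tokens, k=2)
print(supporting)     # ['brilliant', 'moving']
print(contradicting)  # ['slow']
```

In practice the coefficients would come from a trained classifier, and each feature would likely be weighted by its value in the document (for example a TF-IDF score) rather than by simple presence.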

How Explanations Are Matched

To assess the alignment between LLM rationales and these identified features, RACE employs three distinct matching strategies:

  • Token-aware matching: This involves a lemma-level match after standard text normalization, like lowercasing and removing punctuation.
  • Exact string matching: A stricter method that requires an exact match of the feature string within the LLM’s explanation.
  • Edit-distance matching: A more flexible approach that allows for small character-level deviations, capturing paraphrases or near-synonymous overlaps. This is particularly useful for identifying when an LLM rephrases a key feature rather than using it verbatim.

By using these methods, RACE calculates a ‘coverage’ score, indicating how many of the identified supporting or contradicting features are present in the LLM’s explanation.
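The three strategies and the coverage score can be sketched as follows. This is a hedged illustration, not the paper's code: the plural-stripping `crude_lemma` is a rough stand-in for a real lemmatizer, the edit-distance threshold `max_dist=2` is an arbitrary choice, and coverage is assumed here to be the fraction of features matched.

```python
import string
from typing import Callable, Iterable, List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def crude_lemma(token: str) -> str:
    """Rough stand-in for lemmatization: strip one trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(feature: str, rationale: str) -> bool:
    """The feature string must appear verbatim in the rationale."""
    return feature in rationale


def token_match(feature: str, rationale: str) -> bool:
    """Every normalized, lemmatized feature token must occur in the rationale."""
    lemmas = {crude_lemma(t) for t in normalize(rationale)}
    feature_tokens = normalize(feature)
    return bool(feature_tokens) and all(crude_lemma(t) in lemmas for t in feature_tokens)


def fuzzy_match(feature: str, rationale: str, max_dist: int = 2) -> bool:
    """Some rationale token lies within `max_dist` edits of the feature."""
    target = feature.lower()
    return any(levenshtein(target, t) <= max_dist for t in normalize(rationale))


def coverage(features: Iterable[str], rationale: str,
             matcher: Callable[[str, str], bool]) -> float:
    """Fraction of features that the matcher finds in the rationale."""
    features = list(features)
    if not features:
        return 0.0
    return sum(1 for f in features if matcher(f, rationale)) / len(features)


rationale = "The reviewer praises the acting and calls the ending deeply moving."
features = ["moving", "endings", "praised", "weak"]

print(coverage(features, rationale, exact_match))  # 0.25
print(coverage(features, rationale, token_match))  # 0.5
print(coverage(features, rationale, fuzzy_match))  # 0.75
```

On this toy input each looser strategy recovers one more feature: only "moving" appears verbatim, plural stripping also recovers "endings", and the edit-distance matcher additionally accepts "praised" as a near variant of "praises". Fuzzy thresholds need care, since short unrelated words can also fall within a couple of edits of each other.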

Key Findings: What the Research Revealed

The empirical study applied RACE across four widely used text classification datasets: WIKIONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS. The results uncovered several consistent and insightful patterns:

A consistent asymmetry was observed: when an LLM made a correct prediction, its rationales showed higher coverage of supporting features. Conversely, incorrect predictions were strongly associated with elevated coverage of contradicting features. This suggests that LLM explanations tend to highlight misleading evidence when errors occur.
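A comparison of this kind could be computed by grouping per-example coverage scores by whether the prediction was correct. The sketch below assumes a hypothetical record layout (`predicted`, `gold`, `supporting`, `contradicting`), and the numbers in `toy_records` are invented purely to exercise the function; they are not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional


def mean_coverage_by_outcome(
    records: List[dict],
) -> Dict[str, Dict[str, Optional[float]]]:
    """Average supporting/contradicting coverage for correct vs. incorrect predictions."""
    groups: Dict[str, Dict[str, List[float]]] = {
        "correct": {"supporting": [], "contradicting": []},
        "incorrect": {"supporting": [], "contradicting": []},
    }
    for r in records:
        outcome = "correct" if r["predicted"] == r["gold"] else "incorrect"
        groups[outcome]["supporting"].append(r["supporting"])
        groups[outcome]["contradicting"].append(r["contradicting"])

    # None marks a group with no examples, so an empty group is not read as 0.0.
    return {
        outcome: {kind: (mean(vals) if vals else None) for kind, vals in kinds.items()}
        for outcome, kinds in groups.items()
    }


# Invented values for illustration only (not the paper's results).
toy_records = [
    {"predicted": "pos", "gold": "pos", "supporting": 0.5, "contradicting": 0.0},
    {"predicted": "neg", "gold": "neg", "supporting": 0.75, "contradicting": 0.25},
    {"predicted": "pos", "gold": "neg", "supporting": 0.25, "contradicting": 0.5},
]

summary = mean_coverage_by_outcome(toy_records)
print(summary["correct"])    # {'supporting': 0.625, 'contradicting': 0.125}
print(summary["incorrect"])  # {'supporting': 0.25, 'contradicting': 0.5}
```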

The study also found that while exact and token-aware matching revealed significant surface-level overlap, edit-distance matching consistently boosted coverage. This indicates that LLM rationales often incorporate close variants or paraphrases of predictive features, demonstrating a mix of direct lexical alignment and flexible reformulation in their reasoning.

The strength of this alignment varied by task. Topical classification tasks (like WIKIONTOLOGY and AG NEWS) showed the clearest separation between correct and incorrect predictions, reflecting the strong lexical grounding of their categories. Sentiment analysis (IMDB) and fine-grained emotion recognition (GOEMOTIONS) exhibited weaker alignment, possibly because these tasks rely on more diffuse or subtle linguistic cues.

Implications for Trustworthy AI

These findings offer insight into the faithfulness and limitations of LLM-generated rationales. They show that LLM explanations do capture semantically relevant evidence, but can also amplify misleading cues in error cases. RACE provides a quantitative basis for evaluating reasoning completeness: while LLMs often reuse or paraphrase key predictive features, the reliability of their explanations is task-dependent, and an explanation may sometimes be a post-hoc justification rather than genuinely faithful reasoning.

The research suggests that future work should explore richer models of feature importance and incorporate contextual embeddings for semantic alignment to develop even more comprehensive measures of explanation faithfulness.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]