TLDR: A new framework called RACE (Reasoning Alignment for Completeness of Explanations) has been introduced to quantitatively assess how well Large Language Model (LLM) explanations align with the predictive signals of a transparent logistic regression model. The study found that correct LLM predictions consistently show higher coverage of supporting features, while incorrect predictions are linked to increased coverage of contradicting features. Using various matching techniques, including fuzzy edit-distance, the research revealed that LLMs both directly reuse and flexibly paraphrase key features, with the strength of this alignment varying across different text classification tasks.
As machine learning becomes more integrated into critical areas, the demand for transparent and understandable artificial intelligence (AI) has grown significantly. Large Language Models (LLMs) can now generate fluent natural-language explanations for their decisions. However, a crucial question remains: do these explanations truly reflect the underlying information that drives the LLM’s predictions?
A new research paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework designed to address this question. Authored by Avinash Patil of Juniper Networks Inc., the framework evaluates how well LLM-generated explanations align with interpretable feature importance scores derived from a logistic regression baseline model.
Understanding RACE: Bridging LLM Explanations and Traditional Models
The core idea behind RACE is to compare the free-text explanations provided by an LLM with the most influential words or features identified by a simpler, more transparent model, specifically a logistic regression classifier. This traditional model, trained on the same data, can clearly show which lexical features (words or short phrases) strongly support or contradict a particular classification.
The framework works by first prompting an LLM (like DeepSeek-R1, used in this study) to make a prediction and provide a rationale. Simultaneously, a logistic regression model identifies its top-k most influential features, categorizing them as ‘supporting’ (positive influence) or ‘contradicting’ (negative influence) for the predicted class.
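To make the feature-selection step concrete, here is a minimal sketch in Python. It assumes the logistic regression coefficients for the predicted class are already available as a plain dictionary, and that the top-k features are chosen by absolute weight before being split by sign. Both choices, along with the name `top_k_features`, are illustrative assumptions and not details confirmed by the paper.

```python
from typing import Dict, List, Tuple


def top_k_features(weights: Dict[str, float], k: int = 5) -> Tuple[List[str], List[str]]:
    """Return (supporting, contradicting) features among the k most influential.

    `weights` maps each lexical feature to its logistic regression coefficient
    for the predicted class. This input format is a hypothetical convenience.
    """
    # Rank by absolute coefficient so the most influential features come first.
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)[:k]
    # Positive weights push toward the predicted class; negative ones push away.
    supporting = [feature for feature, weight in ranked if weight > 0]
    contradicting = [feature for feature, weight in ranked if weight < 0]
    return supporting, contradicting
```

In practice the coefficients would come from a classifier trained on the same data as the LLM is evaluated on, typically over a bag-of-words or n-gram vocabulary.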
How Explanations Are Matched
To assess the alignment between LLM rationales and these identified features, RACE employs three distinct matching strategies:
- Token-aware matching: This involves a lemma-level match after standard text normalization, like lowercasing and removing punctuation.
- Exact string matching: A stricter method that requires an exact match of the feature string within the LLM’s explanation.
- Edit-distance matching: A more flexible approach that allows for small character-level deviations, capturing paraphrases or near-synonymous overlaps. This is particularly useful for identifying when an LLM rephrases a key feature rather than using it verbatim.
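The three strategies above can be sketched as small Python functions. This is a rough illustration under stated assumptions, not the paper's implementation: the function names are invented, the `lemmatize` argument defaults to the identity function as a placeholder for a real lemmatizer, and the edit-distance threshold `max_dist=1` is an arbitrary choice.

```python
import string
from typing import Callable, List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def _windows(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-token spans, joined back into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def exact_match(feature: str, explanation: str) -> bool:
    """Strict: the feature string must appear verbatim in the explanation."""
    return feature in explanation


def token_match(feature: str, explanation: str,
                lemmatize: Callable[[str], str] = lambda t: t) -> bool:
    """Token-level match after normalization; plug in a lemmatizer if available."""
    feat = [lemmatize(t) for t in normalize(feature)]
    expl = [lemmatize(t) for t in normalize(explanation)]
    return bool(feat) and " ".join(feat) in _windows(expl, len(feat))


def fuzzy_match(feature: str, explanation: str, max_dist: int = 1) -> bool:
    """Flexible: accept any same-length token span within max_dist edits."""
    feat = normalize(feature)
    if not feat:
        return False
    target = " ".join(feat)
    spans = _windows(normalize(explanation), len(feat))
    return any(levenshtein(target, span) <= max_dist for span in spans)
```

With these definitions, a feature such as "excellent" would be missed by the exact and token-level matchers if the explanation spells it "excelent", but the edit-distance matcher would still count it.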
By using these methods, RACE calculates a ‘coverage’ score, indicating how many of the identified supporting or contradicting features are present in the LLM’s explanation.
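One plausible way to compute such a score is as the fraction of features that a matcher finds in the explanation. The sketch below assumes that definition; the name `coverage`, the default case-insensitive substring matcher, and the convention of returning 0.0 for an empty feature list are all illustrative assumptions.

```python
from typing import Callable, Iterable, Optional


def coverage(features: Iterable[str], explanation: str,
             matcher: Optional[Callable[[str, str], bool]] = None) -> float:
    """Fraction of `features` that `matcher` finds in `explanation`."""
    feats = list(features)
    if not feats:
        # Assumed convention: no features to cover means zero coverage.
        return 0.0
    if matcher is None:
        # Default stand-in matcher: case-insensitive substring test.
        matcher = lambda feature, text: feature.lower() in text.lower()
    hits = sum(1 for feature in feats if matcher(feature, explanation))
    return hits / len(feats)
```

Calling this once with the supporting features and once with the contradicting features gives the two per-explanation scores that can then be compared between correct and incorrect predictions.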
Key Findings: What the Research Revealed
The empirical study applied RACE across four widely used text classification datasets: WIKIONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS. The results uncovered several consistent and insightful patterns:
A consistent asymmetry was observed: when an LLM made a correct prediction, its rationales showed higher coverage of supporting features. Conversely, incorrect predictions were strongly associated with elevated coverage of contradicting features. This suggests that LLM explanations tend to highlight misleading evidence when errors occur.
The study also found that while exact and token-aware matching revealed significant surface-level overlap, edit-distance matching consistently boosted coverage. This indicates that LLM rationales often incorporate close variants or paraphrases of predictive features, demonstrating a mix of direct lexical alignment and flexible reformulation in their reasoning.
The strength of this alignment varied by task. Topical classification tasks (like WIKIONTOLOGY and AG NEWS) showed the clearest separation between correct and incorrect predictions, reflecting the strong lexical grounding of their categories. Sentiment analysis (IMDB) and fine-grained emotion recognition (GOEMOTIONS) exhibited weaker alignment, possibly because these tasks rely on more diffuse or subtle linguistic cues.
Implications for Trustworthy AI
These findings offer valuable insights into the faithfulness and limitations of LLM-generated rationales. They show that LLM explanations do capture semantically relevant evidence, but can also amplify misleading cues in error cases. RACE provides a quantitative basis for evaluating reasoning completeness: while LLMs often reuse or paraphrase key predictive features, the reliability of their explanations can be task-dependent, and an explanation may sometimes be a post-hoc justification rather than genuinely faithful reasoning.
The research suggests that future work should explore richer models of feature importance and incorporate contextual embeddings for semantic alignment to develop even more comprehensive measures of explanation faithfulness.


