TLDR: A new research paper explores the relationship between Chain-of-Thought (CoT) reasoning traces in Large Language Models (LLMs) and their interpretability to humans. The study found a significant disconnect: fine-tuning on complex DeepSeek R1 traces led to the highest LLM performance, yet human participants rated those traces as the least interpretable. Conversely, human-friendly, algorithmically generated traces resulted in weaker LLM performance. This suggests that what makes an LLM perform well is not necessarily what makes its reasoning understandable to people, and the authors advocate separate approaches for optimizing performance and for user interpretability.
A recent study delves into a fascinating paradox at the heart of Large Language Models (LLMs) and their reasoning processes: Do the complex ‘Chain-of-Thought’ (CoT) traces that help LLMs perform better also make them easier for humans to understand? The findings suggest a surprising disconnect, indicating that what makes an AI powerful isn’t necessarily what makes it transparent to us.
The research, titled “Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?”, was conducted by Siddhant Bhambri, Upasana Biswas, and Subbarao Kambhampati from Arizona State University. Their work challenges a common, often implicit, assumption that the intermediate reasoning steps generated by LLMs should be semantically meaningful and interpretable to end-users.
The Role of Reasoning Traces in LLMs
Chain-of-Thought (CoT) traces are intermediate steps that LLMs generate before arriving at a final answer. These traces have been instrumental in boosting the performance of models like DeepSeek R1 across various tasks. They serve not only to guide the model’s inference but also as crucial signals for training smaller models through a process called Supervised Fine-Tuning (SFT).
However, the researchers questioned whether these traces *must* be interpretable to improve an LLM’s task performance. They explored this question within the domain of Open Book Question-Answering, using LLaMA and Qwen models.
Investigating Different Trace Types
The study involved fine-tuning LLMs on four distinct types of reasoning traces:
- DeepSeek R1 traces: The original, often verbose, traces generated by the DeepSeek R1 model.
- LLM-generated summaries of R1 traces: More concise versions created by another LLM (GPT-4o-mini).
- LLM-generated post-hoc explanations of R1 traces: Explanations of the R1 traces, also generated by GPT-4o-mini.
- Algorithmically generated verifiably correct traces: Traces that are semantically correct and derived directly from the provided facts.
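To make the setup concrete, the following is a minimal sketch of how one question could be turned into several fine-tuning records that differ only in the reasoning trace. It is not the authors' pipeline: the prompt template, the field names (`prompt`, `completion`), the trace-type keys, and the toy facts and traces are all assumptions made for illustration.

```python
# Hypothetical sketch: build one SFT record per trace type for the same question.
# The template and all names here are illustrative assumptions, not from the paper.

def build_sft_variants(facts, question, answer, traces_by_type):
    """Return {trace_type: {"prompt": ..., "completion": ...}}.

    Every variant shares the same prompt and final answer; only the
    intermediate reasoning trace placed before the answer changes.
    """
    facts_block = "\n".join(f"- {fact}" for fact in facts)
    prompt = f"Facts:\n{facts_block}\n\nQuestion: {question}\n"
    variants = {}
    for trace_type, trace in traces_by_type.items():
        completion = f"Reasoning: {trace}\nAnswer: {answer}"
        variants[trace_type] = {"prompt": prompt, "completion": completion}
    return variants


# Toy inputs (placeholders, not data from the study).
example_variants = build_sft_variants(
    facts=["A is larger than B.", "B is larger than C."],
    question="Is A larger than C?",
    answer="yes",
    traces_by_type={
        "r1": "Let me think. A > B. Also B > C. Wait, so A > B > C. So A > C.",
        "summary": "A > B and B > C, so A > C.",
        "explanation": "The answer follows from chaining the two comparisons.",
        "algorithmic": "Step 1: A > B. Step 2: B > C. Step 3: A > C.",
    },
)
```

Holding the prompt and answer fixed means that any difference between models fine-tuned on the resulting datasets can be attributed to the trace alone.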
To measure the human perspective, a human-subject study was conducted with 125 participants. These participants rated the interpretability of each trace type based on attributes like predictability, comprehensibility, and faithfulness, and also assessed the cognitive workload involved in understanding them.
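Ratings of this kind are typically summarized per trace type and per attribute. The sketch below shows one simple way to do that, assuming each response is a numeric score; the rating scale, the tuple layout, the function name `mean_ratings`, and the placeholder values are assumptions for illustration and do not come from the study.

```python
# Hypothetical sketch: average participant scores per (trace type, attribute).
from collections import defaultdict
from statistics import mean


def mean_ratings(responses):
    """responses: iterable of (trace_type, attribute, score) tuples.

    Returns a dict mapping (trace_type, attribute) to the mean score.
    """
    buckets = defaultdict(list)
    for trace_type, attribute, score in responses:
        buckets[(trace_type, attribute)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Placeholder responses with neutral labels (not results from the paper).
placeholder_responses = [
    ("trace_a", "comprehensibility", 2),
    ("trace_a", "comprehensibility", 4),
    ("trace_b", "comprehensibility", 5),
    ("trace_b", "predictability", 3),
]
summary = mean_ratings(placeholder_responses)
```

The same grouping works for cognitive workload scores if they are recorded as an additional attribute.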
A Striking Mismatch
The results revealed a significant and counter-intuitive finding. While fine-tuning LLMs on the original DeepSeek R1 traces consistently led to the strongest performance improvements in the models, these very traces were judged by human participants to be the *least* interpretable. Users found R1 traces to be the most mentally demanding, requiring more effort and causing more frustration to understand.
Conversely, the algorithmically generated verifiably correct traces were rated as the most interpretable by humans: they were the easiest to follow and comprehend and imposed the lowest cognitive workload. Yet these traces yielded the weakest improvements in the LLMs’ task accuracy. Summarized and post-hoc explained R1 traces fell in the middle. They were more interpretable than the raw R1 traces but did not match the performance boost those raw traces provided.
Implications for Future AI Development
These findings highlight a crucial distinction: the internal ‘reasoning’ that helps an LLM achieve high performance is not necessarily the same as what humans find understandable or cognitively interpretable. The verbose and complex nature of R1 traces, while providing rich training signals for models, seems to be poorly aligned with human expectations for clarity and interpretability.
The researchers conclude that it is useful to decouple the intermediate tokens used by LLMs from the interpretability required by end-users. This suggests two key takeaways for the future of AI: CoT-style traces should primarily be optimized for improving model performance, and separate, dedicated efforts should focus on generating user-friendly explanations for the model’s answers. This could lead to AI systems that are both highly capable and genuinely understandable to humans.
You can read the full research paper here: Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?


