TLDR: SalaMAnder is a new framework that uses Shapley values to explain how Chain-of-Thought (CoT) reasoning works in large language models (LLMs) on math problems. Its attribution method, SalaMA, measures the contribution of individual mathematical expressions within the reasoning steps and uses stratified sampling to keep the computation efficient. The framework also proposes a metric called CoSP, which correlates with an LLM’s performance, offering insights into prompt design and into why CoT is effective.
Large Language Models (LLMs) solve complex mathematical problems markedly better when guided by Chain-of-Thought (CoT) reasoning, which breaks a problem into a series of intermediate steps, much as a human would. Despite this success, the mechanisms that make CoT effective remain poorly understood.
A new research paper, titled “SalaMAnder: Shapley-based Mathematical Expression Attribution and Metric for Chain-of-Thought Reasoning,” by Yue Xin, Chen Shen, Shaotian Yan, Xiaosong Yuan, Yaoming Wang, Xiaofeng Zhang, Chenxi Huang, and Jieping Ye, introduces a framework that addresses this question. SalaMAnder provides a theoretically grounded way to quantify how much each part of a CoT reasoning process contributes to an LLM’s final answer.
Unpacking the Black Box: SalaMA and CoSP
At the heart of SalaMAnder are two key components: SalaMA and CoSP.
SalaMA (Shapley-based Mathematical Expression Attribution) attributes contributions within the CoT. Instead of looking at individual words or tokens, which can be ambiguous, SalaMA treats ‘mathematical expressions’ as the fundamental units, which keeps the analysis semantically meaningful and consistent. Computing exact Shapley values is expensive because it requires evaluating the model on every subset of expressions, and the number of subsets grows exponentially with the number of expressions. To make this tractable, the researchers developed a stratified sampling algorithm that estimates the values from a sample of subsets, drastically reducing computation time and making the analysis practical.
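To make the idea of stratified sampling concrete, here is a minimal Python sketch of a generic stratified Shapley estimator. It is not the paper's algorithm, whose details are not given in this article. The function name `stratified_shapley`, the `value_fn` callback (standing in for "run the model with this subset of expressions and score the result"), and the `samples_per_stratum` parameter are all illustrative assumptions. Each stratum is one coalition size, and the per-stratum averages are weighted equally, which matches the standard Shapley definition.

```python
import random
from typing import Callable, FrozenSet, List


def stratified_shapley(
    n_players: int,
    value_fn: Callable[[FrozenSet[int]], float],
    samples_per_stratum: int = 20,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by sampling coalitions stratum by stratum.

    Hypothetical sketch: a 'player' is one mathematical expression, and
    value_fn(S) is the reward obtained when only the expressions in S are kept.
    """
    rng = random.Random(seed)
    estimates = []
    for i in range(n_players):
        others = [j for j in range(n_players) if j != i]
        stratum_means = []
        # One stratum per coalition size 0 .. n-1 (coalitions exclude player i).
        for size in range(n_players):
            total = 0.0
            for _ in range(samples_per_stratum):
                coalition = frozenset(rng.sample(others, size))
                # Marginal contribution of player i to this coalition.
                total += value_fn(coalition | {i}) - value_fn(coalition)
            stratum_means.append(total / samples_per_stratum)
        # Every coalition size carries equal weight in the Shapley value.
        estimates.append(sum(stratum_means) / n_players)
    return estimates
```

The cost here is on the order of n² × `samples_per_stratum` calls to `value_fn`, instead of the 2ⁿ subsets an exact computation would need. For an additive toy game, where each player adds a fixed amount, the estimate recovers those amounts up to floating-point error, which makes a convenient sanity check.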
The contribution of each mathematical expression is determined by a ‘reward function’ that considers both the model’s confidence in its prediction and the correctness of the final answer. This balanced approach provides a sensitive and accurate measure of an expression’s impact.
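The article does not give the reward formula, so the following Python sketch shows only one hypothetical way to combine confidence and correctness: the model's confidence counts positively when the answer is right and negatively when it is wrong. The function name `reward`, its parameters, and the signed-confidence rule are assumptions for illustration, not the paper's definition.

```python
def reward(confidence: float, predicted: str, gold: str) -> float:
    """Hypothetical reward: signed confidence in the predicted answer.

    confidence is assumed to be the model's probability for its own
    prediction, in [0, 1]. This is NOT the paper's formula.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    is_correct = predicted.strip() == gold.strip()
    # A confident correct answer scores high; a confident wrong answer scores low.
    return confidence if is_correct else -confidence
```

A reward of this kind is more sensitive than plain accuracy, because it distinguishes a hesitant correct answer from a confident one.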
CoSP (Cardinality of Shapley Positives) is a metric derived from the SalaMA process. It counts the number of mathematical expressions within a CoT demonstration that have a positive average contribution to the model’s reasoning performance. Variants of the metric additionally penalize expressions whose contribution is negative or zero. A higher CoSP score indicates that more of the reasoning steps are beneficial, suggesting better overall model performance.
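The counting described above can be sketched in a few lines of Python. The input layout (one row of per-expression Shapley values per test sample) and the linear `penalty` term are assumptions made for illustration; the article does not specify how the penalty is applied. With `penalty=0.0` the function counts only positive contributions.

```python
from typing import Sequence


def cosp(shapley_per_sample: Sequence[Sequence[float]], penalty: float = 0.0) -> float:
    """Count expressions whose average Shapley value is positive.

    shapley_per_sample[s][j] is the (assumed) Shapley value of expression j
    measured on test sample s. The linear penalty is a hypothetical choice.
    """
    n_samples = len(shapley_per_sample)
    n_expr = len(shapley_per_sample[0])
    # Average each expression's contribution over all test samples.
    means = [
        sum(row[j] for row in shapley_per_sample) / n_samples
        for j in range(n_expr)
    ]
    positives = sum(1 for m in means if m > 0)
    non_positives = n_expr - positives
    return positives - penalty * non_positives
```

For instance, with three expressions whose average contributions are positive, negative, and zero, the no-penalty score is 1, and a penalty of 0.5 per non-positive expression brings it down to 0.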
The researchers have theoretically proven and experimentally validated that the CoSP metric has a strong, monotonic correlation with an LLM’s accuracy. This means that as CoSP increases, so does the model’s performance, providing a clear link between the quality of reasoning steps and the success of the model.
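One common way to check a monotonic relationship like this is Spearman's rank correlation, which equals 1 when one quantity always rises with the other. The sketch below is a plain-Python implementation with average ranks for ties. Choosing Spearman is an assumption here, since the article does not say which statistic the authors report, and the function names are illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

In practice one would pass the CoSP score of each demonstration set as `x` and the accuracy measured with that set as `y`; a value near 1 would indicate the monotonic trend described above.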
Real-World Validation and New Insights
The SalaMAnder framework was evaluated on popular LLMs including LLaMA-2-13B-chat, LLaMA-3-8B-Instruct, and Qwen2.5-7B-Instruct, across mathematical benchmarks including GSM8K, MathQA, and AQuA. The results consistently showed that CoSP-0 (the variant of CoSP that only considers positive contributions) was the most effective metric for interpreting model performance.
Further experiments revealed more about how CoT reasoning behaves. For instance, modifying expressions that SalaMA identified as low-contribution had a much smaller impact on the model’s overall performance than altering high-contribution expressions. This suggests that LLMs can be robust to minor errors or non-informative steps as long as the core, high-contributing reasoning elements are intact. The study also found that removing or replacing certain expressions, or even introducing small errors, did not always lead to a significant drop in performance, echoing findings from previous research.
Qualitative analysis showed that demonstrations with richer logical structures and more relevant intermediate steps tend to have higher CoSP values, leading to better model reasoning. Conversely, simpler computations that don’t convey a meaningful reasoning pattern result in lower positive contributions.
Implications for LLM Development
The SalaMAnder framework has direct implications for the development and optimization of LLMs. By showing which parts of a CoT demonstration are most effective, it provides guidelines for constructing better prompts. This could move prompt engineering from a trial-and-error process to a more principled, mathematically informed approach. It also provides a theoretical explanation for why existing few-shot CoT methods work, unifying insights from prior studies.
While currently focused on mathematical reasoning, the researchers aim to expand SalaMAnder’s application to a broader range of tasks in the future. This work is a step towards making LLM reasoning more transparent, interpretable, and ultimately more powerful.


