SalaMAnder: A Deeper Look into How LLMs Solve Math Problems

TLDR: SalaMAnder is a new framework that uses Shapley values to explain how Chain-of-Thought (CoT) reasoning works in large language models (LLMs) on math problems. It measures the contribution of individual mathematical expressions within the reasoning steps and introduces an efficient method, SalaMA, for computing those contributions. The framework also proposes a metric called CoSP, which correlates reliably with an LLM’s performance, offering insights into prompt design and into why CoT is effective.

Large Language Models (LLMs) have shown remarkable improvements in solving complex mathematical problems when guided by Chain-of-Thought (CoT) reasoning. This approach involves breaking down a problem into a series of intermediate steps, much like a human would. However, despite its success, the exact mechanisms that make CoT so effective have remained somewhat of a mystery.

A new research paper, titled “SalaMAnder: Shapley-based Mathematical Expression Attribution and Metric for Chain-of-Thought Reasoning,” by Yue Xin, Chen Shen, Shaotian Yan, Xiaosong Yuan, Yaoming Wang, Xiaofeng Zhang, Chenxi Huang, and Jieping Ye, introduces a novel framework to shed light on this very question. The SalaMAnder framework provides a theoretically sound and mathematically rigorous way to quantify how much each part of a CoT reasoning process contributes to an LLM’s final answer.

Unpacking the Black Box: SalaMA and CoSP

At the heart of SalaMAnder are two key components: SalaMA and CoSP.

SalaMA (Shapley-based Mathematical Expression Attribution) is designed to attribute contributions within the CoT. Instead of looking at individual words or tokens, which can be ambiguous, SalaMA treats ‘mathematical expressions’ as the fundamental units, which keeps the analysis semantically meaningful and consistent. These contributions are Shapley values, and computing them exactly requires evaluating every subset of expressions, a cost that grows exponentially with the number of expressions. To overcome this, the researchers developed an efficient stratified sampling algorithm that substantially reduces computation time and makes the analysis practical for real-world applications.

The contribution of each mathematical expression is determined by a ‘reward function’ that considers both the model’s confidence in its prediction and the correctness of the final answer. This balanced approach provides a sensitive and accurate measure of an expression’s impact.
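To make the idea concrete, here is a minimal sketch of a generic stratified-sampling Shapley estimator. It is not the paper's SalaMA algorithm: the function names, the sampling budget, and the toy reward are all assumptions made for illustration. In a real setting, the reward would come from querying the LLM with only the kept expressions in the demonstration and scoring its confidence and correctness; here a made-up additive function stands in for it.

```python
import random
from typing import Callable, FrozenSet, List


def stratified_shapley(
    n_players: int,
    value_fn: Callable[[FrozenSet[int]], float],
    samples_per_stratum: int = 20,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by sampling coalitions stratified by size.

    For each expression i, coalitions of the other expressions are grouped
    into strata by size k = 0..n-1. Each stratum's mean marginal
    contribution is estimated from a few random coalitions, and the
    strata are averaged with equal weight 1/n.
    """
    rng = random.Random(seed)
    estimates = []
    for i in range(n_players):
        others = [j for j in range(n_players) if j != i]
        stratum_means = []
        for k in range(n_players):  # coalition sizes 0 .. n-1
            total = 0.0
            for _ in range(samples_per_stratum):
                coalition = frozenset(rng.sample(others, k))
                # Marginal contribution of adding expression i.
                total += value_fn(coalition | {i}) - value_fn(coalition)
            stratum_means.append(total / samples_per_stratum)
        estimates.append(sum(stratum_means) / n_players)
    return estimates


# Hypothetical stand-in for the reward: expressions 0 and 1 help,
# expression 2 slightly hurts. These weights are invented for the demo.
TOY_WEIGHTS = {0: 0.4, 1: 0.3, 2: -0.1}


def toy_reward(kept: FrozenSet[int]) -> float:
    return sum(TOY_WEIGHTS[j] for j in kept)


if __name__ == "__main__":
    print(stratified_shapley(3, toy_reward, samples_per_stratum=5))
```

Because the toy reward is additive, each expression's marginal contribution is the same in every coalition, so the estimate recovers the weights regardless of which coalitions are sampled. With a real LLM-based reward, the estimate would vary with the sampling budget.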

CoSP (Cardinality of Shapley Positives) is a metric derived from the SalaMA process. It essentially counts the number of mathematical expressions within a CoT demonstration that have a positive average contribution to the model’s reasoning performance. It can also penalize expressions that have a negative or zero contribution. A higher CoSP score indicates that more of the reasoning steps are beneficial, suggesting better overall model performance.
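The counting described above can be sketched in a few lines. This is an illustration under assumptions, not the paper's exact definition: the input is assumed to be each expression's Shapley value already averaged over test questions, and the `penalty` parameter and its linear form are hypothetical. Setting `penalty=0` counts only positive contributions, which matches how this article describes the positives-only variant.

```python
from typing import Sequence


def cosp(avg_shapley_values: Sequence[float], penalty: float = 0.0) -> float:
    """Count expressions with positive average contribution.

    `penalty` (a hypothetical knob) is subtracted once for every
    expression whose average contribution is zero or negative.
    """
    positives = sum(1 for v in avg_shapley_values if v > 0)
    non_positives = len(avg_shapley_values) - positives
    return positives - penalty * non_positives


# Made-up average contributions for a four-expression demonstration.
demo_values = [0.4, 0.3, -0.1, 0.0]
positives_only_score = cosp(demo_values)            # counts positives only
penalized_score = cosp(demo_values, penalty=0.5)    # also penalizes the rest
```

Comparing the score of several candidate demonstrations would then be one way to rank them before using them as few-shot prompts.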

The researchers have theoretically proven and experimentally validated that the CoSP metric has a strong, monotonic correlation with an LLM’s accuracy. This means that as CoSP increases, so does the model’s performance, providing a clear link between the quality of reasoning steps and the success of the model.

Real-World Validation and New Insights

The SalaMAnder framework was put to the test using popular LLMs like LLaMA-2-13B-chat, LLaMA-3-8B-Instruct, and Qwen2.5-7B-Instruct, across a variety of mathematical benchmarks including GSM8K, MathQA, and AQUA. The results consistently showed that CoSP-0 (a version of CoSP that only considers positive contributions) was the most effective metric for interpreting model performance.

Further experiments revealed additional insights into CoT reasoning. For instance, modifying expressions with low contributions had a much smaller impact on the model’s overall performance than altering high-contribution expressions. This suggests that LLMs can sometimes be robust to minor errors or non-informative steps if the core, high-contributing reasoning elements are intact. The study also found that removing or replacing certain expressions, or even introducing small errors, did not always lead to a significant drop in performance, echoing findings from previous research.

Qualitative analysis showed that demonstrations with richer logical structures and more relevant intermediate steps tend to have higher CoSP values, leading to better model reasoning. Conversely, simpler computations that don’t convey a meaningful reasoning pattern result in lower positive contributions.

Implications for LLM Development

The SalaMAnder framework offers significant implications for the development and optimization of LLMs. By providing a clear understanding of which parts of a CoT are most effective, it offers rigorous guidelines for constructing better prompts. This could move prompt engineering from a trial-and-error process to a more principled, mathematically informed approach. It also provides a theoretical explanation for why existing few-shot CoT methods work, unifying insights from prior studies.

While currently focused on mathematical reasoning, the researchers aim to extend SalaMAnder to a broader range of tasks in the future. This work is a step towards making LLM reasoning more transparent, more interpretable, and ultimately more effective.
