Enhancing LLM Numerical Reasoning in Tables with a Decompose-Sanitize-Reason Framework

TLDR: TABDSR is a new framework that improves the ability of Large Language Models (LLMs) to perform complex numerical reasoning on tabular data. It breaks complex questions into sub-questions, cleans messy tables, and then generates executable code to calculate the answers. The framework outperforms existing methods on the benchmarks tested and improves results across several LLMs, including strong ones such as GPT-4o and DeepSeek-V3.

Large Language Models (LLMs) are powerful, but they often struggle with complex numerical reasoning, especially when dealing with real-world tabular data. This is a significant challenge because analyzing tables is crucial in many fields, from finance to healthcare. The difficulties arise from complex questions, messy data, and the LLMs’ inherent limitations in performing precise calculations.

To tackle these issues, researchers have introduced a new framework called TABDSR. This innovative system aims to improve how LLMs handle numerical reasoning tasks involving tables. TABDSR is designed to mimic how humans approach complex problems, breaking them down into manageable steps.

How TABDSR Works: A Three-Agent Approach

The TABDSR framework operates through a collaborative pipeline of three specialized “agents”:

Query Decomposer Agent: Complex questions often require multiple steps of reasoning. Instead of trying to answer everything at once, this agent breaks down a complicated question into simpler, more manageable sub-questions. It focuses solely on the question, ignoring the table initially, to ensure that important details in the query are not overlooked.
Table Sanitizer Agent: Real-world tables are rarely perfectly clean. They can have messy formats, missing values, or non-numeric characters mixed with numbers (like “1.24(approx)”). This agent cleans and organizes the table data. It handles issues like merging multi-level headers, identifying and removing irrelevant rows, and cleaning cell content by removing symbols and converting data to consistent numerical formats. It even has a reflection mechanism to correct its own cleaning errors.
PoT-based Reasoner Agent: Once the question is broken down and the table is clean, this agent steps in. It uses a “Program-of-Thought” (PoT) approach, which means it generates executable Python code (specifically using the Pandas library) to perform the necessary calculations on the sanitized table. This bypasses the LLMs’ weakness in direct numerical computation, ensuring accurate results for each sub-question. The answers from these sub-questions are then logically reassembled to provide the final answer to the original complex query.
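
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: in TABDSR each step is carried out by a prompted LLM, whereas here the sub-questions and the "generated" program are hard-coded, and the table, the function names (clean_cell, sanitize, run_program), and the cleaning rule are all hypothetical.

```python
import math
import re

import pandas as pd

# Hypothetical raw table with the kinds of noise described above.
raw = pd.DataFrame({
    "Region": ["North", "South", "Total"],
    "Revenue": ["$1,200", "1.24(approx)", "1,201.24"],
    "Cost": ["800", "n/a", "800"],
})

# Step 1 (Query Decomposer): sub-questions, hard-coded here for illustration.
sub_questions = [
    "What is the revenue of the North region?",
    "What is the revenue of the South region?",
    "What is the difference between the two?",
]


def clean_cell(value):
    """Return the first number found in a noisy cell, or NaN if there is none."""
    text = str(value).replace(",", "")
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else math.nan


def sanitize(table, numeric_columns, drop_labels=("Total",)):
    """Step 2 (Table Sanitizer): drop aggregate rows, make columns numeric."""
    kept = table[~table["Region"].isin(drop_labels)].copy()
    for column in numeric_columns:
        kept[column] = kept[column].map(clean_cell)
    return kept.reset_index(drop=True)


# Step 3 (PoT Reasoner): Pandas code of the kind a model might generate.
generated_code = """
north = df.loc[df["Region"] == "North", "Revenue"].iloc[0]
south = df.loc[df["Region"] == "South", "Revenue"].iloc[0]
answer = north - south
"""


def run_program(code, table):
    """Execute generated code against the sanitized table and return `answer`.

    exec is used for brevity; untrusted generated code should run in a sandbox.
    """
    scope = {"df": table}
    exec(code, scope)
    return scope["answer"]


clean = sanitize(raw, ["Revenue", "Cost"])
result = run_program(generated_code, clean)
print(clean)
print(round(result, 2))
```

The point of the design is that the arithmetic is done by the Python interpreter on cleaned numeric columns, so the model only has to write the program, not compute the numbers itself.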

Addressing Data Leakage with CALTAB151

One significant challenge in evaluating Table Question Answering (TQA) methods is “data leakage” in existing datasets: a model may score well simply because it saw the same or similar data during training. To provide a less biased and more reliable evaluation, the researchers developed a new dataset called CALTAB151. This dataset is specifically designed for complex numerical reasoning over tables and was created using an annotation framework that combines LLM-generated queries with human-verified answers. It includes challenges such as numerical perturbations, cell noise, structural randomization, and multi-hop questions, making it a robust benchmark for testing true reasoning capabilities.
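
To make those perturbation types concrete, here is a small sketch of what each one could look like when applied to a toy table. It is an assumption-laden illustration, not the procedure used to build CALTAB151: the function names, the 10% perturbation scale, and the noise suffixes are all hypothetical.

```python
import random


def perturb_numbers(rows, column, rng, scale=0.1):
    """Numerical perturbation: shift each value by up to +/- scale (relative)."""
    return [
        {**row, column: round(row[column] * (1 + rng.uniform(-scale, scale)), 2)}
        for row in rows
    ]


def add_cell_noise(rows, column, rng, suffixes=("(approx)", "*", " USD")):
    """Cell noise: turn each number into a string with a stray suffix."""
    return [{**row, column: f"{row[column]}{rng.choice(suffixes)}"} for row in rows]


def shuffle_structure(rows, rng):
    """Structural randomization: shuffle both column order and row order."""
    columns = list(rows[0])
    rng.shuffle(columns)
    shuffled = [{name: row[name] for name in columns} for row in rows]
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so the variant is reproducible
table = [
    {"item": "A", "price": 10.0},
    {"item": "B", "price": 20.0},
    {"item": "C", "price": 30.0},
]

perturbed = perturb_numbers(table, "price", rng)
noisy = add_cell_noise(perturbed, "price", rng)
variant = shuffle_structure(noisy, rng)
print(variant)
```

Because the values and layout change while the question stays answerable, a model that merely memorized a published table would no longer get the right answer for free.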

Impressive Performance and Transferability

Experiments show that TABDSR consistently outperforms existing methods across various benchmarks, including TAT-QA, TableBench, and the new CALTAB151 dataset. It achieved significant accuracy improvements, demonstrating its effectiveness in enhancing LLM performance for complex tabular numerical reasoning. What’s more, TABDSR proved to be highly transferable. Even when applied to powerful LLMs like GPT-4o and DeepSeek-V3, it continued to improve their performance, indicating that it offers genuine enhancements rather than just compensating for weaker models.

Conclusion

TABDSR offers a robust and practical solution for improving numerical reasoning in Table Question Answering. By systematically decomposing complex questions and sanitizing noisy tabular data, it allows LLMs to make better use of their inherent reasoning capabilities. Because the framework is prompt-based, it is easy to adopt and reduces the need for extensive data annotation or specialized training. It has broad implications for real-world applications in finance, business intelligence, healthcare, and e-commerce. For more details, see the full research paper.
