TLDR: TABDSR is a new framework that significantly improves Large Language Models’ (LLMs) ability to perform complex numerical reasoning on tabular data. It works by breaking down complex questions, cleaning messy tables, and then generating executable code to calculate answers. The framework consistently outperforms existing methods and is effective across various LLMs, even powerful ones, demonstrating its robust approach to handling real-world data challenges.
Large Language Models (LLMs) are powerful, but they often struggle with complex numerical reasoning, especially when dealing with real-world tabular data. This is a significant challenge because analyzing tables is crucial in many fields, from finance to healthcare. The difficulties arise from complex questions, messy data, and the LLMs’ inherent limitations in performing precise calculations.
To tackle these issues, researchers have introduced a new framework called TABDSR. This innovative system aims to improve how LLMs handle numerical reasoning tasks involving tables. TABDSR is designed to mimic how humans approach complex problems, breaking them down into manageable steps.
How TABDSR Works: A Three-Agent Approach
The TABDSR framework operates through a collaborative pipeline of three specialized “agents”:
- Query Decomposer Agent: Complex questions often require multiple steps of reasoning. Instead of trying to answer everything at once, this agent breaks down a complicated question into simpler, more manageable sub-questions. It focuses solely on the question, ignoring the table initially, to ensure that important details in the query are not overlooked.
- Table Sanitizer Agent: Real-world tables are rarely perfectly clean. They can have messy formats, missing values, or non-numeric characters mixed with numbers (like “1.24(approx)”). This agent cleans and organizes the table data. It handles issues like merging multi-level headers, identifying and removing irrelevant rows, and cleaning cell content by removing symbols and converting data to consistent numerical formats. It even has a reflection mechanism to correct its own cleaning errors.
- PoT-based Reasoner Agent: Once the question is broken down and the table is clean, this agent steps in. It uses a “Program-of-Thought” (PoT) approach, which means it generates executable Python code (specifically using the Pandas library) to perform the necessary calculations on the sanitized table. This bypasses the LLMs’ weakness in direct numerical computation, ensuring accurate results for each sub-question. The answers from these sub-questions are then logically reassembled to provide the final answer to the original complex query.
Addressing Data Leakage with CALTAB151
One significant challenge in evaluating TQA methods is “data leakage” in existing datasets, where models might perform well simply because they’ve seen similar data during training. To provide a more unbiased and reliable evaluation, the researchers developed a new dataset called CALTAB151. This dataset is specifically designed for complex numerical reasoning over tables and was created using a unique annotation framework that combines LLM-generated queries with human-verified answers. It includes various challenges like numerical perturbations, cell noise, structural randomization, and multi-hop questions, making it a robust benchmark for testing true reasoning capabilities.
Impressive Performance and Transferability
Experiments show that TABDSR consistently outperforms existing methods across various benchmarks, including TAT-QA, TableBench, and the new CALTAB151 dataset. It achieved significant accuracy improvements, demonstrating its effectiveness in enhancing LLM performance for complex tabular numerical reasoning. What’s more, TABDSR proved to be highly transferable. Even when applied to powerful LLMs like GPT-4o and DeepSeek-V3, it continued to improve their performance, indicating that it offers genuine enhancements rather than just compensating for weaker models.
Also Read:
- Adaptive Reasoning for Language Models: Balancing Performance and Cost
- DecompSR: Unpacking How Language Models Reason About Space
Conclusion
TABDSR offers a robust and practical solution for improving numerical reasoning in Table Question Answering. By systematically decomposing complex questions and sanitizing noisy tabular data, it allows LLMs to better utilize their inherent reasoning capabilities. This prompt-based framework is easy to adopt, reducing the need for extensive data annotation or specialized training, and has broad implications for real-world applications in finance, business intelligence, healthcare, and e-commerce. For more details, you can read the full research paper here.


