THOR: Bridging LLM Reasoning and Precise Computation for Math Problems

TLDR: THOR is a new framework that enhances Large Language Models’ (LLMs) mathematical reasoning by integrating external tools. It addresses key challenges in tool-integrated reasoning (TIR) through three main components: TIRGen, a data pipeline for creating high-quality tool-use data; a hierarchical reinforcement learning strategy that optimizes both overall problem-solving and specific code generation steps; and a self-correction mechanism during inference that uses immediate tool feedback to fix errors. THOR achieves state-of-the-art performance on various mathematical and code benchmarks, demonstrating strong generalization and efficiency.

Large Language Models (LLMs) have advanced rapidly in many areas, including mathematical reasoning. However, they often struggle with tasks requiring high precision, such as complex numerical calculations or formal symbolic manipulations. This is where integrating external tools, like code interpreters, becomes crucial: it bridges the gap between an LLM's reasoning capabilities and the need for exact computation.

Despite recent progress in combining LLMs with tools, researchers have faced three main hurdles: creating high-quality datasets for tool-integrated reasoning, optimizing models at a fine-grained level (individual steps rather than only whole solutions), and improving how models use tools during inference. A new framework called THOR (Tool-Integrated Hierarchical Optimization via RL) has been proposed to tackle these challenges.

Building Better Tool-Integrated Data with TIRGen

One of THOR’s core innovations is TIRGen, a multi-agent pipeline designed to construct high-quality datasets of tool-integrated reasoning paths. Think of it as a collaborative effort between two AI agents: an ‘Actor’ that generates natural language reasoning steps, and a ‘Critic’ that identifies which of these steps can be solved using code. The Critic then converts these parts into executable Python code, runs it, and uses the precise results to refine the reasoning path. This iterative process creates a dataset that is well-aligned with how the model actually thinks and uses tools, making it highly effective for training.
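
The Actor/Critic loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names build_tir_path, run_python, toy_actor and toy_critic are hypothetical, the two agents are replaced by hand-written stand-ins instead of LLM calls, and code is run with a plain exec rather than a sandbox.

```python
import contextlib
import io
from typing import Callable, Optional


def run_python(code: str) -> str:
    """Execute a snippet and return what it printed (illustration only, not sandboxed)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def build_tir_path(
    problem: str,
    actor: Callable[[str, list], Optional[str]],
    critic: Callable[[str], Optional[str]],
    max_steps: int = 8,
) -> list:
    """Alternate Actor and Critic until the Actor stops or max_steps is reached."""
    path = []
    for _ in range(max_steps):
        step = actor(problem, path)   # Actor proposes a natural-language step
        if step is None:              # Actor signals that it is finished
            break
        code = critic(step)           # Critic returns code if the step is computable
        if code is None:
            path.append({"type": "text", "content": step})
        else:
            # Replace the step with executable code plus its exact result
            path.append({"type": "code", "content": code, "output": run_python(code)})
    return path


# Hypothetical stand-ins for the two agents (a real pipeline would call LLMs here)
def toy_actor(problem: str, path: list) -> Optional[str]:
    steps = [
        "We need the sum of the first 100 positive integers.",
        "COMPUTE: sum(range(1, 101))",
    ]
    return steps[len(path)] if len(path) < len(steps) else None


def toy_critic(step: str) -> Optional[str]:
    prefix = "COMPUTE: "
    if step.startswith(prefix):
        return "print(" + step[len(prefix):] + ")"
    return None


path = build_tir_path("What is 1 + 2 + ... + 100?", toy_actor, toy_critic)
print(path[1]["output"])  # 5050
```

The resulting path mixes plain reasoning steps with code steps whose outputs come from actual execution, which is the property the article attributes to TIRGen data.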

Hierarchical Learning for Precision and Problem Solving

THOR introduces a sophisticated reinforcement learning (RL) strategy for fine-grained optimization. The key insight here is that the success of an intermediate tool call is a strong indicator of whether the final answer will be correct. Based on this, THOR optimizes on two levels:

Trajectory-level Optimization: This focuses on the overall problem-solving ability, rewarding the model for generating correct final answers to mathematical problems.
Step-level Optimization: This is a more granular approach, specifically targeting and correcting errors in code generation steps. If a tool call fails, the model learns to improve its code generation for similar situations, directly enhancing its precision.
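
To make the two levels concrete, here is a minimal sketch of how the two reward signals could be computed for one sampled solution. The article does not give THOR's actual reward values or how they are combined, so everything here is an assumption: the ToolCall and Trajectory classes, the function names, the exact-match answer check, and the +1/-1 step rewards are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    code: str
    succeeded: bool  # True if the interpreter ran the code without error


@dataclass
class Trajectory:
    tool_calls: List[ToolCall]
    final_answer: str


def trajectory_reward(traj: Trajectory, reference: str) -> float:
    """Trajectory level: reward the whole solution only if the final answer matches."""
    return 1.0 if traj.final_answer.strip() == reference.strip() else 0.0


def step_rewards(traj: Trajectory) -> List[float]:
    """Step level: one signal per tool call, based on execution success (assumed values)."""
    return [1.0 if call.succeeded else -1.0 for call in traj.tool_calls]


traj = Trajectory(
    tool_calls=[
        ToolCall(code="print(6 * 7)", succeeded=True),
        ToolCall(code="print(undefined_name)", succeeded=False),
    ],
    final_answer="42",
)

print(trajectory_reward(traj, "42"))  # 1.0
print(step_rewards(traj))             # [1.0, -1.0]
```

The point of the split is that a failed tool call yields a training signal tied to the specific code step, even when the final answer alone would not reveal where the solution went wrong.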

Self-Correction for Robust Inference

During the inference phase (when the model is solving new problems), THOR incorporates a self-correction mechanism. If a tool call fails, the model doesn’t just give up. Instead, it uses the immediate feedback from the failed execution to dynamically revise its reasoning path. It can backtrack to the problematic step and regenerate a new reasoning suffix and action, exploring alternative solutions until a successful path is found. This significantly boosts the model’s robustness and overall performance, ensuring it can recover from errors on the fly.
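
The retry behaviour can be sketched as a small loop that regenerates a step from the same prefix whenever execution fails. This is a simplified illustration, not THOR's inference code: the names execute, run_step_with_retries and toy_generate are hypothetical, the generator is a hand-written stand-in for the model, and the retry limit of 3 is an assumed value.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple


def execute(code: str) -> Tuple[bool, str]:
    """Run a snippet; return (success, printed output or error message). Not sandboxed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue().strip()


def run_step_with_retries(
    prefix: List[str],
    generate: Callable[[List[str], Optional[str]], str],
    max_retries: int = 3,
) -> Optional[Tuple[str, str, int]]:
    """Regenerate the step from the same prefix until a tool call succeeds."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        code = generate(prefix, feedback)  # feedback is None on the first attempt
        ok, output = execute(code)
        if ok:
            return code, output, attempt
        feedback = output                  # pass the error back for the next attempt
    return None                            # no successful path within the budget


# Hypothetical generator: first attempt is buggy, second one uses the feedback
def toy_generate(prefix: List[str], feedback: Optional[str]) -> str:
    return "print(1 / 0)" if feedback is None else "print(10 // 2)"


result = run_step_with_retries(["Divide 10 by 2."], toy_generate)
print(result)  # ('print(10 // 2)', '5', 2)
```

Only the failing step and what follows it are regenerated; the reasoning prefix before the failure is reused, which keeps the cost of recovery low compared with restarting the whole solution.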

Impressive Performance and Generalization

THOR has been rigorously evaluated on a wide range of challenging mathematical benchmarks, including MATH500, AIME, AMC, Minerva Math, and Olympiad Bench. It has achieved state-of-the-art performance among models of comparable size, demonstrating strong generalization across both reasoning and non-reasoning models. Furthermore, THOR also shows consistent improvements on code generation benchmarks like HumanEval and MBPP, validating its versatility across different reasoning domains. The framework also manages to reduce inference overhead, making it computationally efficient.

The researchers behind THOR are making their code publicly available at https://github.com/JingMog/THOR. This work represents a significant step forward in enabling LLMs to tackle complex mathematical problems with both advanced reasoning and precise computation.
