TLDR: INTEGRAL BENCH is a new benchmark of 317 graduate-level definite integral problems, complete with symbolic and numerical solutions and manual difficulty ratings, designed to assess Large Language Models (LLMs) in advanced mathematical reasoning. Evaluations of nine state-of-the-art LLMs reveal that while larger models generally perform better, model architecture and training methodology are critical, with some smaller models outperforming larger ones. Performance significantly declines with increasing problem difficulty, highlighting current limitations in complex mathematical reasoning. Common failure modes include output truncation, circular reasoning, format violations, refusal to provide symbolic answers, and inconsistencies between symbolic and numerical results.
Large Language Models (LLMs) have shown impressive capabilities across many domains, but advanced mathematical reasoning, particularly in complex areas like integral calculus, remains a significant challenge for them. A new research paper introduces INTEGRAL BENCH, a specialized benchmark designed to rigorously evaluate how well LLMs can solve definite integral problems.
Mathematical reasoning is widely regarded as a demanding form of human intelligence, which makes it a crucial test for LLMs. While existing benchmarks like MATH and GSM8K assess general mathematical skills, they often lack the depth and specific focus needed for comprehensive evaluation of integral problems. Definite integrals are particularly challenging because they require sophisticated multi-step reasoning, including breaking down complex expressions, recognizing patterns for simplification, and recalling various integration methods.
The creators of INTEGRAL BENCH identified several limitations in current evaluation frameworks for integrals: insufficient challenging problems, a lack of specific metrics for symbolic versus numerical solution accuracy, and inadequate difficulty gradation. To address these gaps, INTEGRAL BENCH was developed, featuring 317 carefully selected graduate-level definite integral problems. These problems are sourced from advanced textbooks and competitions, and each comes with both symbolic and numerical ground truth solutions, allowing for precise evaluation of LLM-generated answers.
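To make the idea of a numerical ground truth concrete, here is a minimal Python sketch of how a candidate answer could be compared against a reference value. It is not the benchmark's actual evaluation code: the quadrature routine, the tolerance values, and the example integral are all assumptions chosen for illustration.

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1  # Simpson's rule needs an even number of subintervals
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def matches_numerically(candidate, reference, rel_tol=1e-6):
    """True if a candidate value agrees with the reference value.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    return math.isclose(candidate, reference, rel_tol=rel_tol, abs_tol=1e-9)

# Illustrative problem (not taken from the benchmark):
# the integral of x * exp(-x) over [0, 1] has the closed form 1 - 2/e.
reference = simpson(lambda x: x * math.exp(-x), 0.0, 1.0)
symbolic_value = 1 - 2 / math.e

print(matches_numerically(symbolic_value, reference))
```

Having both forms of the answer lets a grader accept a correct closed form and separately check whether the reported decimal value is right.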
A unique aspect of INTEGRAL BENCH is its manual annotation of difficulty ratings, ranging from 1 (easiest) to 5 (most difficult), which enables a fine-grained analysis of model performance across varying complexity levels. The benchmark also uses a novel term-rewriting method to create problem variations, helping to prevent dataset contamination while maintaining mathematical accuracy.
The construction of INTEGRAL BENCH involved a systematic process, balancing cost, difficulty, and relevance. This included collecting problems from graduate-level textbooks and integral competitions, manually annotating them with ground truth answers and metadata, converting problem images to LaTeX using OCR, and instantiating parameters for problems with free variables. Human experts played a crucial role in verifying the correctness of solutions and assigning difficulty ratings.
The researchers evaluated nine state-of-the-art LLMs, including Claude 3.7, GPT-4.1, and Qwen3-235B-A22B, on INTEGRAL BENCH. The findings revealed several key insights. Generally, larger models performed better, with Qwen3-235B-A22B achieving the highest accuracy for both numerical (50.16%) and symbolic (56.15%) solutions. However, model size alone was not the sole determinant of performance; the 32B QwQ model surprisingly outperformed larger models like GPT-4.1 and Claude 3.7, highlighting the significant impact of architecture and training methodology.
A strong negative correlation was observed between problem difficulty and model accuracy across all evaluated models. While LLMs performed well on easier problems (difficulty 1-2), their accuracy dropped sharply on the most challenging ones (difficulty 4-5), often approaching zero. This finding validates the benchmark’s difficulty annotations and points to current limitations in LLMs’ ability to handle complex mathematical reasoning.
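The kind of analysis described above can be sketched in a few lines of Python. The records below are made-up toy data, not results from the paper, and the function names are hypothetical; the sketch only shows how per-difficulty accuracy and a correlation coefficient might be computed from graded answers.

```python
import math
from collections import defaultdict

def accuracy_by_difficulty(records):
    """Map each difficulty rating to the fraction of correct answers."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for difficulty, is_correct in records:
        totals[difficulty] += 1
        correct[difficulty] += int(is_correct)
    return {d: correct[d] / totals[d] for d in sorted(totals)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy (difficulty, answered_correctly) pairs, invented for illustration only.
toy_records = [
    (1, True), (1, True),
    (2, True), (2, False),
    (3, False), (3, True),
    (4, False), (4, False),
    (5, False), (5, False),
]

per_level = accuracy_by_difficulty(toy_records)
r = pearson(list(per_level.keys()), list(per_level.values()))
print(per_level, r)
```

A coefficient well below zero on real graded data would correspond to the pattern the study reports: accuracy falling as the difficulty rating rises.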
Analysis of inference-time scaling showed that models rapidly gained accuracy during initial token consumption, then plateaued after reaching model-specific “sweet spots.” This suggests varying efficiencies in how models extract and process information during extended reasoning tasks.
The study also identified common failure modes in LLM responses. These included output truncation, where models stopped generating solutions prematurely due to verbose reasoning; circular reasoning patterns, where models got stuck in repetitive computations; format violations, where correct answers were presented in unparsable formats; and refusal to provide symbolic answers, even for problems with known analytical solutions. The most prevalent issue was symbolic-numerical inconsistency, where models provided correct symbolic solutions but incorrect numerical evaluations, indicating a weakness in accurate numerical computation despite strong symbolic manipulation skills.
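A symbolic-numerical inconsistency can be detected mechanically by evaluating the symbolic answer and comparing it with the number the model reported. The Python sketch below assumes the symbolic answer has already been converted to a Python-syntax expression, which is a simplification; the allowed names, the tolerance, and the example answers are all hypothetical.

```python
import math

# Names a symbolic answer may use in this sketch (an assumption, not a standard).
ALLOWED_NAMES = {
    "pi": math.pi,
    "e": math.e,
    "sqrt": math.sqrt,
    "log": math.log,
    "exp": math.exp,
    "sin": math.sin,
    "cos": math.cos,
}

def evaluate_symbolic(expr):
    """Evaluate a Python-syntax expression such as 'pi**2 / 6' to a float.

    eval() is used only for brevity; it is not safe for untrusted input.
    """
    return float(eval(expr, {"__builtins__": {}}, ALLOWED_NAMES))

def is_consistent(symbolic_expr, reported_value, rel_tol=1e-6):
    """True if the reported number matches the value of the symbolic answer."""
    value = evaluate_symbolic(symbolic_expr)
    return math.isclose(value, reported_value, rel_tol=rel_tol, abs_tol=1e-9)

# Hypothetical model outputs: same symbolic answer, two different reported values.
print(is_consistent("pi / 4", 0.7853981634))  # reported value agrees
print(is_consistent("pi / 4", 0.7071))        # reported value disagrees
```

A check like this only flags the mismatch. Deciding which of the two answers is wrong still requires comparison against the ground truth.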
While INTEGRAL BENCH provides a robust framework, the authors acknowledge limitations such as the reliance on human verification, the variability introduced by LLM inference randomness, and potential issues with numerical stability. Future work aims to expand the dataset using more automated methods, explore its use for fine-tuning LLMs, and integrate external computational tools to augment LLM capabilities. This benchmark is a valuable resource for guiding future architectural improvements in mathematical LLMs and advancing automated mathematical reasoning. More details are available in the INTEGRAL BENCH research paper.


