TLDR: ExtremBench is a new benchmark dataset of 93 mathematical extremal problems, derived from Chinese Mathematical Olympiad inequality exercises, designed to evaluate Large Language Models’ (LLMs) optimization reasoning capabilities. The research reveals that LLMs’ performance on extremal problems often doesn’t correlate with their scores on general mathematical benchmarks, highlighting a critical gap in current evaluation methods and the need for domain-specific assessments.
Large Language Models (LLMs) have shown impressive reasoning abilities, especially in mathematics, often by producing intermediate reasoning steps before giving a final answer. However, how these reasoning skills actually work is not fully understood. One crucial area of mathematical reasoning is optimization: finding the maximum or minimum value of a quantity under specific constraints. This skill is vital for many real-world applications such as planning, control systems, resource allocation, and even optimizing prompts for AI.
Despite its importance, current mathematical benchmarks for LLMs, such as GSM8K, MATH-500, and AIME, largely overlook optimization reasoning. These benchmarks tend to focus more on algebraic manipulation and basic arithmetic, leaving the complex demands of extremal problems unevaluated. Extremal problems require a unique set of skills, including identifying boundaries, understanding trade-offs, and recognizing critical points where optimal solutions occur.
Introducing ExtremBench: A New Benchmark for Optimization Reasoning
To address this significant gap, researchers have introduced ExtremBench, a specialized benchmark dataset designed to systematically evaluate LLMs’ ability to solve mathematical extremal problems. This dataset was carefully created from inequality exercises used in the Chinese Mathematical Olympiad. These proof-style problems were transformed into 93 standardized extrema-finding tasks, making them suitable for automated evaluation while retaining their original mathematical complexity.
For instance, a problem asking to “prove that A ≤ B” under certain conditions is reformulated as “find the maximum of A – B” with the same conditions. This innovative conversion allows for numerical verification of answers, which is crucial for training and evaluating advanced AI models.
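As a rough illustration of why the reformulated form is easy to check automatically, the sketch below uses a textbook inequality that is not taken from ExtremBench: for positive x, y with x + y = 1, prove that xy ≤ 1/4. Following the conversion described above, this becomes "find the maximum of xy − 1/4", whose answer is a single number. The function names, the grid search, and the tolerance are all assumptions made for this example, not the benchmark's actual evaluation code.

```python
import math

def objective(x: float) -> float:
    """A - B with A = x*y and B = 1/4, using y = 1 - x from the constraint x + y = 1."""
    y = 1.0 - x
    return x * y - 0.25

def grid_max(f, lo: float, hi: float, steps: int = 10_000):
    """Estimate the maximum of f on the open interval (lo, hi) with a uniform grid."""
    best_x, best_val = None, -math.inf
    for i in range(1, steps):  # skip the endpoints: x and y must stay positive
        x = lo + (hi - lo) * i / steps
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

def is_correct(model_answer: float, reference: float, tol: float = 1e-6) -> bool:
    """Hypothetical grader: accept a numeric answer within a small tolerance."""
    return math.isclose(model_answer, reference, rel_tol=tol, abs_tol=tol)

# Numerically estimate the reference maximum, then grade two candidate answers.
best_x, best_val = grid_max(objective, 0.0, 1.0)
print(f"max of A - B is about {best_val:.6f} at x = {best_x:.4f}")
print(is_correct(0.0, best_val))   # a model answering 0 would be accepted
print(is_correct(0.1, best_val))   # a model answering 0.1 would be rejected
```

A proof of the original inequality would need a human or a formal checker to grade, whereas the extremal version reduces grading to a numeric comparison. A grid search is only a sanity check here; real problems with several variables would need a proper optimizer or a known closed-form answer.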
Key Findings: A Disconnect in Mathematical Abilities
Extensive evaluations were conducted across various state-of-the-art open-source LLM families, including Qwen3, GPT-OSS, and DeepSeek. The results revealed surprising discrepancies in how LLMs perform on extremal problems compared to their performance on general mathematical benchmarks. Here are some key insights:
- Models that excel in general mathematical reasoning, like GPT-OSS-120B-High (scoring over 90% on AIME25), showed a plateau in ExtremBench performance, hovering around 70%. This suggests that strong general math skills don’t automatically translate to proficiency in optimization tasks.
- Larger models did not consistently outperform smaller ones on ExtremBench. For example, Qwen3-14B achieved similar performance to Qwen3-235B despite having significantly fewer parameters. This indicates that extremal-solving ability might depend more on specific training data or architectural choices than on raw model scale.
- The Qwen3-Thinking variants demonstrated the strongest performance on ExtremBench (75-80%), even with moderate scores on AIME25. Conversely, DeepSeek-R1 models consistently showed lower performance on both benchmarks.
These findings underscore that solving extremal problems represents a distinct mathematical competency that existing benchmarks fail to capture. This highlights a critical blind spot in current evaluation practices and emphasizes the need for specialized frameworks like ExtremBench for a comprehensive assessment of LLM mathematical capabilities. For more detailed information, you can refer to the full research paper: Max It or Miss It: Benchmarking LLM On Solving Extremal Problems.
Future Directions
The introduction of ExtremBench opens several avenues for future research. The methodology of converting hard-to-verify proofs into numerically verifiable problems could be applied to other mathematical domains, such as combinatorics, geometry, and analysis. Expanding ExtremBench to include more complex optimization scenarios, like multi-objective or discrete optimization, would further enhance its evaluative power. Additionally, investigating the underlying reasons for the observed discrepancies could provide valuable insights into how LLMs process different types of mathematical knowledge, potentially leading to more targeted training strategies.


