
ORCA Benchmark Reveals Large Language Models Struggle with Real-World Calculations

TLDR: The ORCA Benchmark, a new evaluation tool, assesses large language models (LLMs) on 500 real-world quantitative reasoning tasks across various domains like finance, physics, and health. It found that leading LLMs like ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 achieved only 45-63% accuracy. The most common errors were rounding and calculation mistakes, highlighting a persistent gap between linguistic reasoning and precise numerical computation. The study suggests hybrid AI architectures, combining LLMs for problem decomposition with dedicated computational tools for accuracy, as a promising solution.

A new benchmark called ORCA (Omni Research on Calculation in AI) has been introduced to evaluate how accurately large language models (LLMs) perform real-world calculations. Developed by researchers including Claudia Herambourg, Dawid Siuda, Anna Szczepanek, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak, this benchmark aims to bridge the gap between an LLM’s language understanding and its ability to deliver precise numerical results in everyday scenarios.

Unlike traditional math datasets that focus on academic or competition-style problems, ORCA uses 500 natural-language tasks drawn from various real-life domains such as finance, physics, health, and statistics. Each task is verified against Omni Calculator’s computational engine, ensuring a single, deterministic correct answer. This approach provides a clear measure of an LLM’s computational reliability, rather than just its ability to generate plausible-sounding explanations.
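To make the idea of a single, deterministic correct answer concrete, here is a minimal sketch of how a model's numeric answer could be checked against a reference value. The article does not describe ORCA's actual scoring rule, so the 1% relative tolerance, the function names, and the loan task below are all assumptions chosen for illustration.

```python
import math


def is_correct(model_answer: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Return True if model_answer is within rel_tol (relative) of reference.

    The 1% default is an assumed tolerance, not the benchmark's rule.
    """
    return math.isclose(model_answer, reference, rel_tol=rel_tol, abs_tol=1e-9)


def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Standard fixed-rate loan payment formula (illustrative task)."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)


# Hypothetical task: 200,000 borrowed at 6% per year for 30 years.
reference = monthly_payment(200_000, 0.06, 30)

close_answer = is_correct(1199.10, reference)   # a correctly rounded answer
far_answer = is_correct(1150.00, reference)     # a plausible-sounding but wrong answer
print(round(reference, 2), close_answer, far_answer)
```

A check like this rewards only the final number, which is why a fluent but numerically wrong explanation would still count as a failure.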

LLM Performance: A Reality Check

The study tested five leading LLMs: ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2. Overall accuracy across these models ranged from 45.2% to 63%. Gemini 2.5 Flash achieved the highest accuracy at 63%, closely followed by Grok 4 at 62.8%. DeepSeek V3.2 came in third with 52%, while ChatGPT-5 and Claude Sonnet 4.5 trailed at 49.4% and 45.2%, respectively. In other words, even the most advanced LLMs currently get between roughly 37% and 55% of real-world calculation tasks wrong, indicating that progress in natural language understanding doesn’t automatically translate to consistent computational accuracy.

Understanding the Errors

A detailed analysis of incorrect responses revealed that the majority of errors stemmed from two main categories: precision or rounding issues (34.7%) and calculation mistakes (33.4%). Together, these mechanical errors accounted for over two-thirds of all failures. This suggests that while LLMs might follow a correct reasoning path, they often falter in the final numerical execution. Less frequent but still significant errors included using the wrong formula or method (13.4%) and making incorrect assumptions (11.8%). Errors like hallucinations or outright refusals were rare, indicating that the deterministic nature of the benchmark constrained the models’ generative freedom.
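A precision or rounding error can arise even when the formula is right. The sketch below is an illustrative example, not a task from the benchmark: it computes compound growth once at full precision and once with the growth factor rounded to two decimals before the final multiplication. The principal, rate, and term are assumed values.

```python
# Illustrative values (assumptions, not taken from ORCA).
principal = 1000.0
rate = 0.05
years = 10

# Full-precision path: round only when presenting the result.
exact = principal * (1 + rate) ** years

# Early-rounding path: the intermediate growth factor is rounded first,
# which is the kind of slip that yields a "nearly right" final number.
growth_rounded = round((1 + rate) ** years, 2)
early_rounded = principal * growth_rounded

drift = early_rounded - exact
print(f"exact={exact:.2f} early_rounded={early_rounded:.2f} drift={drift:.2f}")
```

The reasoning path is identical in both cases; only the point at which rounding happens differs, yet the final answers disagree by more than one unit.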

Domain-Specific Strengths and Weaknesses

The models’ performance varied significantly across different domains. They generally scored highest in ‘Math & Conversions’ and ‘Statistics & Probability’, with some models achieving over 70% accuracy. Gemini 2.5 Flash, for instance, excelled in ‘Statistics & Probability’ (80.6%) and ‘Math & Conversions’ (83%). Conversely, performance was weakest in ‘Physics’, ‘Biology & Chemistry’, and ‘Health & Sports’, where most models scored below 50%. DeepSeek V3.2 showed a notable specialization, performing strongly in computational and mathematical tasks but struggling significantly in domains like ‘Biology & Chemistry’ (only 10.5% accuracy). Grok 4 and Gemini 2.5 Flash demonstrated strong performance in ‘Finance & Economics’.

Collaboration for Better Accuracy

The study also looked at how the models’ successes and failures correlated. Pairwise correlations were moderate (0.38 to 0.65), and no two models showed near-perfect alignment in their error patterns. This suggests that each model possesses distinct strengths and weaknesses, implying that combining different LLMs or using hybrid architectures could lead to more robust and accurate solutions. For example, an LLM could be used to understand and decompose a problem, while a dedicated computational backend handles the precise numerical calculations.
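One way to quantify how two models' successes and failures line up is the Pearson correlation of their per-task correctness vectors (the phi coefficient for binary data). The article does not say which measure the study used, so this choice is an assumption, and the two result vectors below are made up for illustration.

```python
from math import sqrt


def phi(a: list[int], b: list[int]) -> float:
    """Pearson correlation of two equal-length 0/1 correctness vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical per-task results (1 = correct, 0 = incorrect) for two models.
model_a = [1, 1, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 0, 1, 1, 0, 1, 0]

correlation = phi(model_a, model_b)

# Tasks solved by at least one model: an upper bound on what an
# ideal combination of the two could achieve on this toy data.
either_correct = sum(1 for x, y in zip(model_a, model_b) if x or y)
print(round(correlation, 3), either_correct, len(model_a))
```

When the correlation is well below 1, the set of tasks solved by at least one model is larger than either model's own set, which is the intuition behind combining models.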

The Path Forward

The ORCA benchmark underscores a critical limitation of current LLMs: they can articulate logical procedures but often fail to execute them with exact computational precision. The findings reinforce the idea that for reliable quantitative reasoning, especially in real-world applications, hybrid systems that integrate language models with specialized computational tools are likely the most promising direction. This approach would leverage LLMs’ reasoning and problem-decomposition capabilities while ensuring numerical accuracy through dedicated engines. Further details are available in the full research paper.
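The hybrid pattern described above can be sketched as two stages: a language model turns the question into a formula plus named inputs, and a deterministic evaluator does the arithmetic. This is a minimal sketch under stated assumptions, not the architecture proposed in the paper: `fake_llm_decompose` is a hypothetical stand-in for a real model call, and the evaluator supports only basic arithmetic.

```python
import ast
import operator

# Arithmetic operators the evaluator is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}


def evaluate(expression: str, variables: dict[str, float]) -> float:
    """Deterministically evaluate an arithmetic expression over named inputs."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))


def fake_llm_decompose(question: str) -> dict:
    """Hypothetical stand-in for an LLM call that returns a structured plan."""
    return {
        "expression": "weight_kg / height_m ** 2",
        "variables": {"weight_kg": 70.0, "height_m": 1.75},
    }


plan = fake_llm_decompose("What is the BMI of a 70 kg person who is 1.75 m tall?")
result = evaluate(plan["expression"], plan["variables"])
print(round(result, 2))
```

The design choice here is that the model never produces the final number itself; it only proposes the formula and inputs, so calculation and rounding mistakes are confined to a component that can be tested in isolation.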

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media.
