ORCA Benchmark Reveals Large Language Models Struggle with Real-World Calculations

TLDR: The ORCA Benchmark, a new evaluation tool, assesses large language models (LLMs) on 500 real-world quantitative reasoning tasks across various domains like finance, physics, and health. It found that leading LLMs like ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 achieved only 45-63% accuracy. The most common errors were rounding and calculation mistakes, highlighting a persistent gap between linguistic reasoning and precise numerical computation. The study suggests hybrid AI architectures, combining LLMs for problem decomposition with dedicated computational tools for accuracy, as a promising solution.

A new benchmark called ORCA (Omni Research on Calculation in AI) has been introduced to evaluate how accurately large language models (LLMs) perform real-world calculations. Developed by researchers including Claudia Herambourg, Dawid Siuda, Anna Szczepanek, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak, this benchmark aims to bridge the gap between an LLM’s language understanding and its ability to deliver precise numerical results in everyday scenarios.

Unlike traditional math datasets that focus on academic or competition-style problems, ORCA uses 500 natural-language tasks drawn from various real-life domains such as finance, physics, health, and statistics. Each task is verified against Omni Calculator’s computational engine, ensuring a single, deterministic correct answer. This approach provides a clear measure of an LLM’s computational reliability, rather than just its ability to generate plausible-sounding explanations.
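To make the idea of a single, deterministic correct answer concrete, here is a minimal Python sketch of how one model answer might be scored against a reference value. The task, the `is_correct` helper, and the 1% relative tolerance are illustrative assumptions; the article does not describe ORCA's actual scoring rules.

```python
import math

def is_correct(model_answer: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Return True when the model's number is within rel_tol of the reference.

    The 1% tolerance is an assumption for illustration, not ORCA's rule.
    """
    return math.isclose(model_answer, reference, rel_tol=rel_tol)

# Hypothetical task: monthly payment on a 200,000 loan at 6% annual
# interest, repaid over 30 years (standard annuity formula).
principal, annual_rate, years = 200_000, 0.06, 30
r = annual_rate / 12          # monthly interest rate
n = years * 12                # number of monthly payments
reference = principal * r / (1 - (1 + r) ** -n)

close_answer = is_correct(1199.10, reference)   # answer near the reference
far_answer = is_correct(1150.00, reference)     # answer off by about 4%
print(round(reference, 2), close_answer, far_answer)
```

Because the reference value comes from a fixed formula, the check produces the same verdict on every run, no matter how convincing the model's explanation sounds.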

LLM Performance: A Reality Check

The study tested five leading LLMs: ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2. Overall accuracy across these models ranged from 45.2% to 63%. Gemini 2.5 Flash achieved the highest accuracy at 63%, closely followed by Grok 4 at 62.8%. DeepSeek V3.2 came in third with 52%, while ChatGPT-5 and Claude Sonnet 4.5 showed lower but comparable performance at 49.4% and 45.2%, respectively. In other words, even the most advanced LLMs currently get between roughly 37% and 55% of real-world calculation tasks wrong, indicating that progress in natural language understanding doesn’t automatically translate to consistent computational accuracy.

Understanding the Errors

A detailed analysis of incorrect responses revealed that the majority of errors stemmed from two main categories: precision or rounding issues (34.7%) and calculation mistakes (33.4%). Together, these mechanical errors accounted for 68.1% of all failures, or just over two-thirds. This suggests that while LLMs might follow a correct reasoning path, they often falter in the final numerical execution. Less frequent but still significant errors included using the wrong formula or method (13.4%) and making incorrect assumptions (11.8%). Errors like hallucinations or outright refusals were rare, indicating that the deterministic nature of the benchmark constrained the models’ generative freedom.
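A precision or rounding error can arise even when the formula is right. The Python sketch below is a hypothetical example, not a task from the benchmark: it rounds a monthly interest rate to three decimals before compounding and compares the result with the full-precision calculation.

```python
def future_value(principal: float, monthly_rate: float, months: int) -> float:
    """Compound a principal monthly at a fixed rate."""
    return principal * (1 + monthly_rate) ** months

principal = 10_000
months = 120                          # 10 years of monthly compounding

exact_rate = 0.05 / 12                # 5% annual rate, kept at full precision
rounded_rate = round(exact_rate, 3)   # rounded too early, becomes 0.004

exact = future_value(principal, exact_rate, months)
early_rounded = future_value(principal, rounded_rate, months)

# Both calculations use the same, correct formula; only the
# intermediate rounding differs.
print(f"full precision: {exact:.2f}")
print(f"early rounding: {early_rounded:.2f}")
print(f"difference:     {exact - early_rounded:.2f}")
```

In this example the two results differ by a few hundred currency units on a 10,000 principal, even though the reasoning path is identical. An answer like this could be counted wrong despite following the right method.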

Domain-Specific Strengths and Weaknesses

The models’ performance varied significantly across different domains. They generally scored highest in ‘Math & Conversions’ and ‘Statistics & Probability’, with some models achieving over 70% accuracy. Gemini 2.5 Flash, for instance, excelled in ‘Statistics & Probability’ (80.6%) and ‘Math & Conversions’ (83%). Conversely, performance was weakest in ‘Physics’, ‘Biology & Chemistry’, and ‘Health & Sports’, where most models scored below 50%. DeepSeek V3.2 showed a notable specialization, performing strongly in computational and mathematical tasks but struggling significantly in domains like ‘Biology & Chemistry’ (only 10.5% accuracy). Grok 4 and Gemini 2.5 Flash demonstrated strong performance in ‘Finance & Economics’.

Collaboration for Better Accuracy

The study also looked at how the models’ successes and failures correlated with one another. The overlap between pairs of models was moderate (correlation values from 0.38 to 0.65), and no two models showed near-perfect alignment in their error patterns. This suggests that each model has distinct strengths and weaknesses, implying that combining different LLMs or using hybrid architectures could lead to more robust and accurate solutions. For example, an LLM could be used to understand and decompose a problem, while a dedicated computational backend handles the precise numerical calculations.
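As a rough illustration of what such an overlap figure can mean, the Python sketch below computes a Pearson correlation between two models' per-task results, coded as 1 for correct and 0 for incorrect. The data is made up, and the choice of Pearson correlation is an assumption; the article does not say which measure the study used.

```python
from math import sqrt

def pearson(xs: list[int], ys: list[int]) -> float:
    """Pearson correlation; on 0/1 data this equals the phi coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-task outcomes for two models on eight tasks.
model_a = [1, 1, 0, 1, 0, 0, 1, 1]
model_b = [1, 0, 0, 1, 1, 0, 1, 1]

overlap = pearson(model_a, model_b)

# Tasks solved by at least one of the two models.
solved_by_either = sum(1 for a, b in zip(model_a, model_b) if a or b)

print(f"correlation: {overlap:.2f}")
print(f"solved by either model: {solved_by_either} of {len(model_a)}")
```

In this toy data each model solves five of the eight tasks, but together they cover six. A correlation well below 1 leaves room for that kind of gain from combining models.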

The Path Forward

The ORCA benchmark underscores a critical limitation of current LLMs: they can articulate logical procedures but often fail to execute them with exact computational precision. The findings reinforce the idea that for reliable quantitative reasoning, especially in real-world applications, hybrid systems that integrate language models with specialized computational tools are likely the most promising direction. This approach would leverage the LLMs’ advanced reasoning and problem-decomposition capabilities while ensuring numerical accuracy through dedicated engines.
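The division of labor in such a hybrid system can be sketched in a few lines of Python. Here `decompose` is a hypothetical stand-in for an LLM call that turns a question into a formula, and `evaluate` is a small deterministic arithmetic backend. Both names and the BMI example are assumptions for illustration, not part of the study.

```python
import ast
import operator

# Arithmetic operators the backend is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def evaluate(expression: str) -> float:
    """Deterministically evaluate an arithmetic expression without eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

def decompose(question: str) -> str:
    """Hypothetical placeholder for an LLM that maps a question to a formula."""
    # A real system would call a language model here; this stub returns
    # the body-mass-index formula: weight in kg / (height in m) squared.
    return "70 / 1.75 ** 2"

question = "What is the BMI of a 70 kg person who is 1.75 m tall?"
bmi = evaluate(decompose(question))
print(f"{bmi:.2f}")
```

The language model only has to choose the formula. The arithmetic runs in ordinary code, so the numerical result is reproducible and free of the rounding and calculation slips described above.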
