TLDR: ABench-Physics is a novel benchmark designed to rigorously evaluate Large Language Models (LLMs) in physics. It comprises 400 static, high-difficulty problems (Phy A) and 100 dynamic problems (Phy B) with an automatic variation engine to prevent memorization and test generalization. Evaluations of state-of-the-art LLMs show significant performance gaps, with even top models struggling (e.g., 43.0% on Phy A) and a substantial average accuracy drop of 22.5% on dynamic problems, indicating a reliance on pattern matching over genuine physical understanding. The benchmark highlights the need for LLMs with more robust scientific reasoning.
Large Language Models (LLMs) have shown remarkable capabilities in fields such as mathematics and programming. However, their understanding and reasoning abilities in physics remain largely unexplored. Physics problems are distinctive because they demand not only precise calculation but also a deep grasp of concepts and the ability to build physical models.
Existing benchmarks for evaluating LLMs in physics often fall short. Many are too easy, use multiple-choice formats, or are static, meaning models can simply memorize answers rather than genuinely understand the underlying principles. This can lead to inflated performance metrics that don’t truly reflect a model’s physical modeling skills or its ability to generalize knowledge to new situations.
To address these limitations, researchers Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, and Junbo Zhao have introduced ABench-Physics. This new benchmark is designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. Full details are available in the ABench-Physics research paper.
What is ABench-Physics?
ABench-Physics is composed of two main parts:
- Phy A: This is a static set of 400 highly difficult problems. These problems are at a graduate or Olympiad level, providing a consistent and challenging baseline for evaluating model performance.
- Phy B: This is a dynamic subset of 100 problems. Its key innovation is an automatic variation engine that can alter numerical values within the problems. This dynamic design prevents models from relying on memorization and instead forces them to demonstrate genuine physical modeling and computational ability across changing conditions.
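To make the idea of numerical variation concrete, here is a minimal sketch of how such an engine could work. It is not the authors' implementation: the `ProblemTemplate` class, the `make_variant` function, and the free-fall example are all hypothetical, and the sketch assumes that each problem carries a solver that recomputes the reference answer from the sampled values.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ProblemTemplate:
    """Hypothetical container for one variable-value problem."""
    text: str                               # statement with {placeholders}
    ranges: Dict[str, Tuple[float, float]]  # allowed range per placeholder
    solve: Callable[..., float]             # recomputes the reference answer


def make_variant(template: ProblemTemplate, rng: random.Random):
    """Sample new numerical values and return (statement, answer, values)."""
    values = {
        name: round(rng.uniform(low, high), 2)
        for name, (low, high) in template.ranges.items()
    }
    return template.text.format(**values), template.solve(**values), values


# Illustrative problem only; it is not taken from the benchmark.
free_fall = ProblemTemplate(
    text=("A ball is dropped from rest from a height of {h} m. "
          "Ignoring air resistance and taking g = 9.81 m/s^2, "
          "how many seconds does it take to reach the ground?"),
    ranges={"h": (5.0, 50.0)},
    solve=lambda h: math.sqrt(2 * h / 9.81),
)

rng = random.Random(0)  # fixed seed so the variants are reproducible
statement, answer, values = make_variant(free_fall, rng)
```

Because the reference answer is recomputed for every variant, a model that has memorized the answer to one instance gains nothing on the next.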
Unlike previous benchmarks that might use multiple-choice questions or require expression-based answers, ABench-Physics focuses exclusively on numerical calculation problems. Answers are evaluated with a strict 1% tolerance for error. For the dynamic Phy B set, a model only gets credit if it correctly solves all variations of a given problem, making it a stringent test of adaptability and generalization.
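The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual grading code: the function names are hypothetical, the 1% tolerance is assumed to be relative to the reference value, and the handling of a zero reference is a guess.

```python
from typing import Sequence


def within_tolerance(predicted: float, reference: float,
                     rel_tol: float = 0.01) -> bool:
    """True if the prediction is within rel_tol (1%) of the reference."""
    if reference == 0:
        # Assumption: fall back to an absolute bound when the reference is 0.
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) <= rel_tol * abs(reference)


def dynamic_problem_correct(predictions: Sequence[float],
                            references: Sequence[float],
                            rel_tol: float = 0.01) -> bool:
    """A dynamic problem counts only if every variation is answered correctly."""
    if len(predictions) != len(references):
        return False
    return all(
        within_tolerance(p, r, rel_tol)
        for p, r in zip(predictions, references)
    )
```

Under this rule a single wrong variation zeroes out the whole problem, which is why scores on the dynamic set can fall well below scores on the corresponding static instances.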
Key Findings
The researchers evaluated several state-of-the-art LLMs using ABench-Physics. The results revealed significant performance gaps, highlighting ongoing limitations in physical reasoning, especially when it comes to generalizing to dynamic variants.
- Even top models like Gemini 2.5 Pro performed poorly on the static, high-difficulty Phy A problems, achieving a maximum accuracy of only 43.0%. This indicates a substantial gap in advanced physics reasoning for current LLMs.
- More notably, when moving from static to dynamic problems (Phy B), accuracy dropped by 22.5% on average across the evaluated models. This finding strongly suggests that models often rely on memorization rather than a stable understanding of physical principles.
- Interestingly, models fine-tuned with reinforcement learning (RL) showed smaller performance losses on dynamic questions compared to those trained with supervised fine-tuning (SFT). This suggests that RL training might better equip LLMs to handle numerical variations that are outside their initial training distribution.
Why ABench-Physics Matters
ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs. By focusing on numerical answers, strict evaluation criteria, and dynamic problem variations, it pushes models beyond simple pattern matching towards deeper and more robust scientific understanding. This benchmark is a valuable tool for the research community, encouraging the development of LLMs with stronger scientific reasoning skills.


