TLDR: A new benchmark, INPHYRE, reveals that large multimodal models (LMMs) struggle with inductive physical reasoning. They show a limited understanding of universal physical laws, fail to adapt to new physics from visual demonstrations, and rely heavily on language cues over visual input when attempting to infer new physical rules. This raises concerns about their trustworthiness in safety-critical applications.
Understanding How AI Models Reason About Physics
Large multimodal models (LMMs) are advanced artificial intelligence systems that learn from both visual and textual information. They are designed to understand and predict outcomes in various scenarios, including those governed by physical laws. For instance, an LMM might predict the result of a car collision based on a video. These models encode universal physical laws, like momentum conservation, as ‘parametric knowledge’ during their training. This allows them to answer questions about physical events from visual input.
However, a significant challenge arises when the real-world situation deviates from the physical laws the LMMs were trained on. Imagine an LMM trained on dry road conditions being asked to predict a car crash on a snowy road. Humans can quickly adapt their understanding of physics to new conditions by observing a few examples – a skill called ‘inductive physical reasoning’. This ability is crucial for LMMs, especially in critical applications like autonomous driving, where unseen physical environments can lead to dangerous mispredictions.
Despite its importance, existing benchmarks for LMMs primarily evaluate their ‘parametric knowledge’ – essentially, how well they recall and apply the physical laws they’ve already learned. They don’t assess the models’ capacity for inductive physical reasoning, which involves inferring new physical laws from limited visual examples.
Introducing INPHYRE: A New Benchmark for Inductive Physical Reasoning
To address this gap, researchers Gautam Sreekumar and Vishnu Naresh Boddeti from Michigan State University proposed INPHYRE (Inductive Physical Reasoning). This is the first visual question answering benchmark specifically designed to measure inductive physical reasoning in LMMs. INPHYRE evaluates LMMs by presenting them with algorithmically generated synthetic videos of collision events. Some of these scenarios intentionally violate universal physical laws. The LMMs must then infer the underlying, altered physics from demonstration videos and use this new understanding to predict outcomes.
The benchmark focuses on fundamental laws of mechanics, such as momentum and energy conservation. It uses ‘impossible scenarios’, situations that defy real-world physics, to test how well LMMs adapt to physical laws they did not encounter during training; the goal is not to make the models more useful in unrealistic situations. The scenarios fall into three categories: ‘momentum conservation violation’, ‘inconsistent physics’ (where visual properties affect physical laws), and ‘miscellaneous’ (to check for visual biases).
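To make the distinction between a ‘regular’ and an ‘irregular’ collision concrete, here is a minimal Python sketch of a one-dimensional, perfectly elastic collision between two objects. It is an illustration only: the function names, the masses and velocities, and the made-up ‘both objects stop’ outcome are assumptions for this example and are not taken from the INPHYRE generator, whose implementation is not described here.

```python
def elastic_collision_1d(m1, v1, m2, v2):
    """Post-collision velocities for a 1D perfectly elastic collision.

    Derived from conservation of momentum and of kinetic energy.
    """
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return u1, u2


def momentum(m1, v1, m2, v2):
    """Total linear momentum of the two-object system."""
    return m1 * v1 + m2 * v2


def kinetic_energy(m1, v1, m2, v2):
    """Total kinetic energy of the two-object system."""
    return 0.5 * m1 * v1 ** 2 + 0.5 * m2 * v2 ** 2


def violates_momentum(before, after, tol=1e-9):
    """True if total momentum differs between the two states."""
    return abs(momentum(*before) - momentum(*after)) > tol


# Hypothetical values: a 2 kg object at 3 m/s hits a 1 kg object at rest.
before = (2.0, 3.0, 1.0, 0.0)

# Regular scenario: velocities follow the elastic-collision formulas.
u1, u2 = elastic_collision_1d(*before)
regular_after = (2.0, u1, 1.0, u2)

# Irregular scenario (invented for illustration): both objects simply stop.
irregular_after = (2.0, 0.0, 1.0, 0.0)

print(violates_momentum(before, regular_after))    # False
print(violates_momentum(before, irregular_after))  # True
```

In the regular case the total momentum (6 kg·m/s) and kinetic energy (9 J) are the same before and after the collision. In the invented irregular case both drop to zero, which is the kind of rule change a model would have to infer from demonstrations rather than recall from training.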
Key Discoveries from INPHYRE
The INPHYRE benchmark was used to evaluate 13 different LMMs, revealing several critical insights into their physical reasoning capabilities:
1. Limited Parametric Knowledge
Surprisingly, LMMs showed limited parametric knowledge of universal physical laws. Even in ‘regular’ scenarios that followed true physical laws, many models struggled to apply basic principles like momentum conservation. While they could often state these laws as factual information, they frequently failed to apply them correctly in reasoning tasks, sometimes even hallucinating irrelevant assumptions or contradictory statements.
2. Exemplars Only Help When Physics Align
Demonstration samples (exemplars) that included both videos and question-answer pairs did improve LMMs’ predictions in regular scenarios. This shows that LMMs can use examples to augment their existing knowledge. However, this improvement was largely observed when the exemplars aligned with the models’ pre-trained understanding of universal physical laws.
3. The Challenge of Inductive Physical Reasoning
When presented with ‘irregular’ scenarios where the demonstration samples violated universal physical laws, LMMs demonstrated weak inductive physical reasoning. Most models showed a significant drop in accuracy compared to their performance in regular scenarios. For example, in scenarios designed to test if LMMs could differentiate between volume and mass, almost all models struggled considerably.
4. The Pervasive Language Bias
Perhaps the most concerning finding was the strong language bias in LMMs’ inductive physical reasoning. When exemplars included only videos (without accompanying question-answer text), the models’ performance dropped significantly. This suggests that LMMs primarily rely on the textual content of the exemplars to infer new physical laws, largely ignoring the visual inputs. This raises serious questions about the trustworthiness of LMMs in situations where visual cues are paramount for understanding unseen physical laws.
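The two exemplar conditions compared here (videos with question-answer text versus videos alone) can be sketched as a small prompt-building helper. This is a hypothetical illustration: the `Exemplar` class, the `build_prompt` function, the dictionary layout, and the file names are all assumptions made for this example, not the paper's actual prompt format or any particular model's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Exemplar:
    """One demonstration sample (hypothetical structure)."""
    video_path: str
    question: str
    answer: str


def build_prompt(
    exemplars: List[Exemplar],
    query_video: str,
    query_question: str,
    include_text: bool = True,
) -> List[Dict[str, str]]:
    """Interleave exemplar videos (and optionally their Q&A text) before the query."""
    parts: List[Dict[str, str]] = []
    for ex in exemplars:
        parts.append({"type": "video", "path": ex.video_path})
        if include_text:
            # Video + text condition: the exemplar also carries its Q&A pair.
            parts.append({"type": "text", "text": f"Q: {ex.question}\nA: {ex.answer}"})
    # The query itself always contains a video and a question.
    parts.append({"type": "video", "path": query_video})
    parts.append({"type": "text", "text": f"Q: {query_question}\nA:"})
    return parts


# Placeholder exemplars; paths and wording are invented for illustration.
demos = [
    Exemplar("demo_1.mp4", "What happens to the red ball after the collision?", "It stops."),
    Exemplar("demo_2.mp4", "What happens to the blue ball after the collision?", "It stops."),
]

with_text = build_prompt(demos, "query.mp4", "What happens to the green ball?")
video_only = build_prompt(demos, "query.mp4", "What happens to the green ball?", include_text=False)

print(len(with_text), len(video_only))  # 6 4
```

In the video-only condition, the altered physics can only be recovered from the frames themselves, so a large accuracy gap between the two prompts points to the model reading the rule off the text rather than the video.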
Implications for the Future of AI
The findings from INPHYRE highlight that current LMMs, despite their impressive capabilities, cannot yet reliably adapt their physical reasoning to novel environments based on visual evidence alone. Their reliance on language and their difficulty applying physical laws flexibly indicate a need to rethink how these models are trained. Future efforts might involve instruction-tuning within simulation environments that provide direct feedback, similar to reinforcement learning from human feedback (RLHF), to better integrate visual understanding with physical reasoning.
For a deeper dive into the research, see the full INPHYRE paper by Sreekumar and Boddeti.


