TLDR: This research introduces PhoPile, a new multimodal dataset for benchmarking AI models on Olympiad-level physics problems using Retrieval-Augmented Generation (RAG). It demonstrates that RAG, which allows models to consult past problems, can significantly improve performance for both large language models (LLMs) and large multimodal models (LMMs). The study also presents an LLM-as-judge evaluation framework and highlights challenges like noisy retrievals and the need for physics-specific retrieval methods.
Foundation models, including large language models (LLMs) and large multimodal models (LMMs), have shown impressive capabilities across many tasks. However, their ability to perform expert-level reasoning, such as solving the complex physics problems found in Olympiad competitions, has remained largely unexplored. This research addresses that gap, drawing inspiration from how students prepare for such competitions: by reviewing past problems to understand concepts and strategies.
The core of this study is the introduction of PhoPile, a novel, high-quality multimodal dataset specifically designed for Olympiad-level physics. Unlike previous datasets, PhoPile incorporates diagrams, graphs, and equations, reflecting the inherently multimodal nature of real-world physics problem-solving. This dataset is structured into two main parts: an evaluation set of 390 problems from 2019–2021 to test current model performance, and a much larger retrieval corpus of 2,662 problems from earlier years, which serves as an external knowledge base for the models.
The researchers investigated the potential of Retrieval-Augmented Generation (RAG) to enhance physics reasoning in these foundation models. RAG works by allowing a model to access and integrate external knowledge sources—in this case, past physics problems and their solutions from the PhoPile retrieval corpus—into its problem-solving process. The RAG pipeline involves a ‘retriever’ that finds the most relevant past problems for a given new question, and a ‘generator’ (the foundation model) that uses this retrieved information to formulate an answer. A ‘reflection’ mechanism, powered by GPT-4, was also incorporated to help the model compare and select the best answer, mitigating potential noise from retrieved examples.
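To make the retriever, generator, and reflection roles concrete, here is a minimal Python sketch of such a pipeline. It is not the authors' implementation: the BM25 scorer is a simplified textbook version, the toy corpus is invented for illustration and is not taken from PhoPile, and `generate` and `choose` are hypothetical callables standing in for the foundation model and the GPT-4 reflection step. Treating reflection as a choice between a plain answer and a retrieval-augmented answer is also an assumption.

```python
import math
import re
from collections import Counter

# Toy corpus for illustration only; these entries are NOT from PhoPile.
TOY_CORPUS = [
    {"id": "pendulum",
     "problem": "A pendulum of length L swings with small amplitude. Find its period.",
     "solution": "T = 2*pi*sqrt(L/g)"},
    {"id": "incline",
     "problem": "A block slides down a frictionless incline of angle theta. Find its acceleration.",
     "solution": "a = g*sin(theta)"},
    {"id": "gas",
     "problem": "An ideal gas expands isothermally. Find the work done.",
     "solution": "W = n*R*T*ln(V2/V1)"},
]


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Retriever:
    """Simplified BM25 ranking over dicts with 'id', 'problem', 'solution'."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.corpus = corpus
        self.k1, self.b = k1, b
        self.docs = [tokenize(item["problem"]) for item in corpus]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        doc_freq = Counter()
        for doc in self.docs:
            doc_freq.update(set(doc))
        n = len(self.docs)
        # The "1 +" inside the log keeps every IDF weight positive.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def _score(self, query_terms, doc):
        tf = Counter(doc)
        norm = self.k1 * (1 - self.b + self.b * len(doc) / self.avg_len)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query_terms
            if t in tf
        )

    def retrieve(self, question, k=2):
        """Return the k past problems that score highest for the question."""
        query_terms = tokenize(question)
        order = sorted(
            range(len(self.docs)),
            key=lambda i: self._score(query_terms, self.docs[i]),
            reverse=True,
        )
        return [self.corpus[i] for i in order[:k]]


def build_prompt(question, examples):
    """Prepend retrieved problems and their solutions to the new question."""
    parts = [
        f"Past problem: {ex['problem']}\nSolution: {ex['solution']}"
        for ex in examples
    ]
    parts.append(f"New problem: {question}\nSolution:")
    return "\n\n".join(parts)


def answer_with_reflection(question, retriever, generate, choose, k=2):
    """Answer with and without retrieval, then let `choose` pick one.

    `generate(prompt)` and `choose(question, candidates)` are hypothetical
    stand-ins for the generator model and the reflection model.
    """
    plain = generate(build_prompt(question, []))
    augmented = generate(build_prompt(question, retriever.retrieve(question, k)))
    return choose(question, [plain, augmented])
```

With this toy corpus, `BM25Retriever(TOY_CORPUS).retrieve("Find the period of a pendulum of length 2 m.", k=1)` returns the pendulum entry, because the rare terms "pendulum", "period", and "length" carry the highest weights. Keeping the unaugmented answer as a candidate gives the reflection step a fallback when the retrieved examples are noisy.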
To accurately evaluate the models’ performance, a new LLM-as-judge evaluation framework was developed. This framework uses GPT-4 to grade candidate solutions against reference answers, assigning scores from 0 to 10. This method accounts for both the correctness of the final answer and the quality of intermediate reasoning steps, which is crucial for complex physics problems. Human evaluations confirmed that GPT-4 provides consistent judgments, making this a scalable and reliable scoring method.
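The scoring step can be sketched as follows. This is an illustrative outline rather than the paper's grading code: it assumes the judge is asked to end its reply with a line such as `Score: 7`, and the pass `threshold` is a hypothetical parameter, since the summary does not state which score counts as a pass.

```python
import re


def parse_score(judge_reply):
    """Extract a 0-10 score from a reply containing 'Score: <number>'.

    The 'Score:' format is an assumption about the judge prompt.
    Returns None when no score is found, so the caller can re-ask the judge.
    """
    match = re.search(r"score\s*:\s*(\d+(?:\.\d+)?)", judge_reply, re.IGNORECASE)
    if match is None:
        return None
    # Clamp to the 0-10 range used by the grading scale.
    return max(0.0, min(10.0, float(match.group(1))))


def pass_rate(scores, threshold):
    """Fraction of problems whose score meets the assumed pass threshold."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return sum(1 for s in scores if s >= threshold) / len(scores)
```

For example, `pass_rate([10, 3, 8, 0], threshold=8)` counts two of four problems as passed. Returning `None` for an unparseable reply, rather than a default score, keeps grading failures from silently lowering or raising a model's result.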
The benchmarking results demonstrated that integrating retrieval with physics corpora can indeed improve model performance. For instance, Gemini-Pro, when combined with the Contriever retrieval method, saw a substantial increase in its pass rate from 17.18% to 30.51%. Similarly, LLaMA-3-70B improved from 10.51% to 19.07% with BM25. The reflection mechanism also yielded noticeable performance improvements by reducing the negative impact of irrelevant retrieved content. Furthermore, fine-tuning open-source models on the retrieval corpus led to significant gains, with some models showing performance increases by factors ranging from 5 to 17.
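To put the quoted pass rates in perspective, it helps to separate the absolute gain (percentage points) from the relative gain. The short Python snippet below does that arithmetic on the two figures reported above; the function and variable names are illustrative only.

```python
def gains(before_pct, after_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = (after_pct / before_pct - 1) * 100
    return absolute, relative


# Pass rates quoted in the paragraph above: (without RAG, with RAG).
REPORTED = {
    "Gemini-Pro + Contriever": (17.18, 30.51),
    "LLaMA-3-70B + BM25": (10.51, 19.07),
}

for name, (before, after) in REPORTED.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

Gemini-Pro gains about 13.3 percentage points (roughly 78% relative), while LLaMA-3-70B gains about 8.6 points (roughly 81% relative), so the smaller model improves slightly more in relative terms despite the smaller absolute jump.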
The study also explored multimodal retrieval, using models like CLIP, ALIGN, and VisualBERT to obtain joint text-image embeddings. Both Gemini-Pro-V and GPT-4V showed improvements with multimodal RAG, highlighting the importance of visual information in physics problems. GPT-4V benefited most from CLIP, achieving a 30.10% pass rate, while Gemini-Pro-V saw gains with VisualBERT.
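Retrieval over joint text-image embeddings can be sketched as a nearest-neighbour search. The Python below is a simplified illustration and not how CLIP, ALIGN, or VisualBERT are actually wired in the study: it assumes the text and image vectors have already been computed by some encoder, and fusing them by normalising and concatenating is an assumption made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def joint_embedding(text_vec, image_vec):
    """Assumed fusion: L2-normalise each modality, then concatenate."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)
    return unit(text_vec) + unit(image_vec)


def rank_by_similarity(query_vec, corpus_vecs, k=3):
    """Return the ids of the k corpus entries most similar to the query."""
    scored = sorted(
        corpus_vecs.items(),
        key=lambda kv: cosine(query_vec, kv[1]),
        reverse=True,
    )
    return [problem_id for problem_id, _ in scored[:k]]
```

Normalising each modality before concatenation gives the problem text and its diagram equal weight in the similarity score, which is one simple way to keep a long text embedding from drowning out the visual signal.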
Despite these advancements, the research identified several challenges. General-purpose retrievers are not always optimal for physics problems, as they might prioritize semantic similarity over conceptual relevance. The format of retrieved examples can sometimes mislead models, causing them to provide guidelines instead of direct answers or to incorrectly use conditions from past problems. These findings underscore the need for domain-specific retrievers and more robust RAG systems.
In conclusion, this work presents PhoPile as a crucial benchmark for evaluating AI’s physics reasoning capabilities with RAG. It provides a comprehensive study of various foundation models and retrievers, demonstrating the potential of RAG to enhance problem-solving while also pointing towards areas for future research, such as developing multimodal cross-referencing and more sophisticated physics-specific retrieval methods.


