TLDR: CFDLLMBench is a new benchmark suite designed to assess Large Language Models’ (LLMs) abilities in Computational Fluid Dynamics (CFD). It evaluates graduate-level CFD knowledge, numerical and physical reasoning, and practical workflow implementation using OpenFOAM. While LLMs show strong knowledge recall, they struggle significantly with complex coding and simulation tasks, achieving only low success rates. The benchmark highlights the need for advanced agentic frameworks and improved spatial reasoning in LLMs for scientific automation.
Large Language Models (LLMs) have shown incredible capabilities in various natural language processing tasks, from writing essays to answering complex questions. However, their potential to automate numerical experiments in highly specialized scientific fields, such as Computational Fluid Dynamics (CFD), has remained largely unexplored. CFD, which involves simulating fluid flow, is a critical and labor-intensive component in many scientific and engineering domains, including aerospace, climate modeling, and robotics.
To address this gap, researchers have introduced CFDLLMBench, a comprehensive benchmark suite designed to rigorously evaluate how well LLMs can handle the complexities of CFD. This benchmark aims to provide a holistic assessment of LLM performance across three crucial competencies: graduate-level CFD knowledge, numerical and physical reasoning, and the practical implementation of CFD workflows.
The Three Pillars of CFDLLMBench
CFDLLMBench is structured into three complementary components, each targeting a specific aspect of CFD expertise:
1. CFDQuery: This component assesses an LLM’s foundational understanding of CFD. It consists of 90 multiple-choice questions curated from graduate-level CFD lecture notes. These questions delve into core concepts of fluid mechanics, linear algebra, and numerical methods, testing the LLM’s ability to recall and understand specialized knowledge.
2. CFDCodeBench: Moving beyond theoretical knowledge, CFDCodeBench evaluates an LLM’s capacity for numerical and physical reasoning. It presents 24 CFD programming tasks that require LLMs to generate correct simulation code from natural language descriptions of physical problems. These problems range from 1D to 2D scenarios, involving both linear and nonlinear Partial Differential Equations (PDEs) commonly encountered in CFD (a sketch of one such task appears after this list).
3. FoamBench: This is the most practical and challenging component. FoamBench focuses on the context-dependent implementation of CFD workflows using OpenFOAM, a widely used open-source CFD software suite. It includes 110 basic and 16 advanced numerical simulation tasks drawn from real-world engineering problems. For these tasks, LLMs must generate all necessary input files, configure the simulation, and execute it correctly to produce physically accurate results (a sample case layout follows below).
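To give a flavor of the CFDCodeBench tasks (item 2), consider a problem in this style: from a plain-language description, write a finite-difference solver for the 1D linear convection equation ∂u/∂t + c ∂u/∂x = 0. The snippet below is a minimal illustrative sketch of such a solver, not an actual benchmark problem or reference solution; the grid size, CFL number, and initial condition are assumed values.

```python
import numpy as np

# Minimal sketch: first-order upwind solver for the 1D linear convection
# equation du/dt + c * du/dx = 0 on [0, 2] with a square-wave initial
# condition. Illustrative of a CFDCodeBench-style task only.

nx, nt = 201, 100            # grid points and time steps (assumed values)
c = 1.0                      # constant wave speed
dx = 2.0 / (nx - 1)
dt = 0.4 * dx / c            # CFL number 0.4 keeps the explicit scheme stable

x = np.linspace(0.0, 2.0, nx)
u = np.ones(nx)
u[(x >= 0.5) & (x <= 1.0)] = 2.0   # square wave between x = 0.5 and x = 1.0

for _ in range(nt):
    # Upwind update: u_i <- u_i - c*dt/dx * (u_i - u_{i-1})
    u[1:] = u[1:] - c * dt / dx * (u[1:] - u[:-1])

print(f"max(u) = {u.max():.3f}, min(u) = {u.min():.3f}")
```

Even a task this small requires the model to pick a stable scheme, respect the CFL condition, and handle boundaries correctly, which is where purely recalled knowledge starts to break down.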
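To make FoamBench’s “generate all necessary input files” requirement (item 3) concrete: an OpenFOAM case is a directory of plain-text dictionaries that must all be mutually consistent. The layout below is illustrative, loosely modeled on the standard lid-driven cavity tutorial shipped with OpenFOAM; it is not a FoamBench task.

```
cavity/                       # illustrative incompressible-flow case
├── 0/                        # initial and boundary fields (U, p)
├── constant/
│   └── transportProperties   # physical properties (e.g. viscosity)
└── system/
    ├── controlDict           # time stepping and output control
    ├── blockMeshDict         # mesh definition
    ├── fvSchemes             # discretization schemes
    └── fvSolution            # linear solvers and tolerances
```

An LLM must produce every one of these files, keep field names, boundary patches, and solver settings consistent across them, and then drive the solver run itself, which helps explain why zero-shot success rates are so low.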
Key Findings: A Gap Between Knowledge and Application
The evaluation of state-of-the-art proprietary and open-source LLMs using CFDLLMBench revealed a consistent pattern. Models performed relatively well on CFDQuery, indicating good recall of CFD knowledge, but their success rates dropped sharply on the more practical, reasoning-intensive tasks.
For instance, the best-performing model achieved only a 14% success rate on CFDCodeBench and 34% on FoamBench Basic, with performance dropping to 25% on the more complex FoamBench Advanced tasks. This highlights a significant challenge: LLMs can store and retrieve information, but they struggle with applying advanced math and physics knowledge to solve difficult problems, selecting suitable numerical methods, and configuring complex simulation software like OpenFOAM.
The study also emphasized the importance of agentic frameworks, which mimic human troubleshooting by incorporating components such as Retrieval-Augmented Generation (RAG) and reviewer modules. Zero-shot prompting for OpenFOAM tasks (without these frameworks) resulted in near-zero performance, underscoring the critical role of these advanced setups in achieving any meaningful success.
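To illustrate what such an agentic framework does, here is a minimal sketch of a generate-run-review loop. Every helper in it (llm, retrieve_docs, run_solver) is a hypothetical placeholder stubbed out for readability; this is not the paper’s actual framework, nor a real OpenFOAM or LLM API.

```python
# Minimal sketch of an agentic generate-run-review loop for OpenFOAM tasks.
# All helpers are hypothetical stubs, not real APIs.

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call returning OpenFOAM case files."""
    return "..."  # stub

def retrieve_docs(error_log: str) -> str:
    """Stand-in RAG step: fetch tutorial/manual snippets matching the error."""
    return "..."  # stub

def run_solver(case_files: str) -> tuple[bool, str]:
    """Stand-in for writing the case to disk, running it, reading the log."""
    return False, "FOAM FATAL ERROR: ..."  # stub

def solve(task: str, max_rounds: int = 3) -> bool:
    """Generate a case, then iterate: run, diagnose failures, repair."""
    case_files = llm(f"Generate a complete OpenFOAM case for: {task}")
    for _ in range(max_rounds):
        ok, log = run_solver(case_files)
        if ok:
            return True  # solver completed; physical accuracy checked separately
        docs = retrieve_docs(log)  # ground the repair in documentation (RAG)
        case_files = llm(          # reviewer step: diagnose and patch the case
            f"Task: {task}\nCase files:\n{case_files}\n"
            f"Solver log:\n{log}\nRelevant docs:\n{docs}\nFix the case files."
        )
    return False
```

The retrieval and reviewer steps matter because a raw solver log is rarely enough: the model needs documentation context to turn an error message into a targeted repair rather than a blind regeneration.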
Challenges and Future Directions
One particular area identified for significant improvement is spatial reasoning. In tasks requiring LLMs to generate new geometries and meshes based on natural language descriptions, such as simulating flow over complex obstacles, models often produced incorrect geometries. This indicates a fundamental gap in current LLMs’ ability to understand and translate spatial configurations into accurate computational models.
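To see why spatial reasoning is the bottleneck, consider OpenFOAM’s blockMeshDict, where geometry is expressed as raw vertex coordinates and block connectivity. The excerpt below is illustrative (a simple 2D channel, not a benchmark case); a single transposed coordinate or mis-ordered vertex list yields an inverted or disconnected mesh.

```
// blockMeshDict excerpt (illustrative): a single 2D channel block.
// Vertex order and block connectivity are implicit spatial constraints.
vertices
(
    (0 0 0)    // 0: inlet, bottom
    (4 0 0)    // 1: outlet, bottom
    (4 1 0)    // 2: outlet, top
    (0 1 0)    // 3: inlet, top
    (0 0 0.1)  // 4-7: same points offset in z for a one-cell-thick 2D mesh
    (4 0 0.1)
    (4 1 0.1)
    (0 1 0.1)
);

blocks
(
    hex (0 1 2 3 4 5 6 7) (80 20 1) simpleGrading (1 1 1)
);
```

Describing “flow over a cylinder” in words is easy; emitting dozens of mutually consistent coordinates and connectivity lists like these is exactly the translation step where current models fail.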
CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation in numerical experiments for complex physical systems. The benchmark’s code and data are openly available, encouraging future research to advance LLM capabilities in scientific computing. This work is a crucial step towards realizing the full potential of LLMs as scientific assistants, capable of automating labor-intensive simulation workflows. You can find the full research paper here: CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics.


