TLDR: MOLERR2FIX is a novel benchmark designed to assess the trustworthiness of Large Language Models (LLMs) in chemistry by evaluating their ability to detect, localize, explain, and correct chemical errors in molecular descriptions. Unlike previous benchmarks, it focuses on fine-grained chemical understanding across four stages. The study reveals significant performance gaps in current LLMs, especially in error explanation and revision, highlighting the need for more robust chemical reasoning capabilities beyond fluent language generation.
Large Language Models (LLMs) have shown incredible potential across many fields, and molecular sciences are no exception. However, a significant challenge remains: these powerful AI tools often produce chemically inaccurate descriptions and struggle to identify or justify their own errors. This raises serious questions about their reliability in critical scientific applications.
To address this, researchers have introduced a new benchmark called MOLERR2FIX. This innovative tool is designed to rigorously evaluate how well LLMs can detect and correct errors in molecular descriptions. Unlike previous benchmarks that focused on generating text from molecules or predicting properties, MOLERR2FIX dives deep into fine-grained chemical understanding.
What is MOLERR2FIX?
MOLERR2FIX challenges LLMs with four key tasks: identifying, localizing, explaining, and revising potential structural and semantic errors in descriptions of molecules. The benchmark comprises 1,193 detailed error instances, each meticulously annotated with the error type, its exact location in the text, an explanation of why it’s wrong, and the correct revision. These tasks are crafted to mirror the complex reasoning and verification processes that chemists use in their daily work.
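To picture what one of those 1,193 instances looks like, here is a minimal sketch of the annotation schema as a Python record. The field names are illustrative, not the dataset's actual keys:

```python
from dataclasses import dataclass

@dataclass
class ErrorInstance:
    """One annotated error in a molecular description (illustrative schema)."""
    smiles: str          # the molecule the description refers to
    description: str     # the LLM-generated caption containing the error
    error_type: str      # one of six expert-defined categories
    error_span: str      # the exact erroneous phrase in the description
    explanation: str     # the chemical principle that is violated
    revision: str        # the expert-corrected phrase

example = ErrorInstance(
    smiles="CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    description="...contains a ketone group...",
    error_type="Functional Group/Substituent",
    error_span="ketone group",
    explanation="The C=O bonded to an O-C is an ester, not a ketone.",
    revision="ester group",
)
```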
The evaluation process is broken down into four stages (sketched in code just after this list):
- Error Detection: Can the LLM simply tell if there’s an error in a given molecular description?
- Error Localization: If an error exists, can the LLM pinpoint the exact phrase or segment of text that is incorrect?
- Error Explanation: Can the LLM explain, in natural language, the chemical principle that has been violated?
- Error Revision: Can the LLM provide a chemically accurate correction for the identified erroneous text?
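Treated as a pipeline, the stages build on one another: a span must be found before it can be explained, and explained before it can be fixed. Here is a minimal sketch, with `ask_llm` as a placeholder for any chat-completion call rather than the paper's actual evaluation harness:

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (an assumption)."""
    raise NotImplementedError

def evaluate_instance(smiles: str, description: str) -> dict:
    """Run the four MOLERR2FIX-style stages on one molecule/description pair."""
    results = {}
    # Stage 1: Error Detection -- a binary judgment.
    results["detected"] = ask_llm(
        f"Molecule (SMILES): {smiles}\nDescription: {description}\n"
        "Does the description contain a chemical error? Answer yes or no."
    )
    # Stage 2: Error Localization -- quote the offending span verbatim.
    results["span"] = ask_llm(
        "Quote the exact phrase in the description that is chemically wrong:\n"
        f"{description}"
    )
    # Stage 3: Error Explanation -- name the violated chemical principle.
    results["explanation"] = ask_llm(
        f"Explain why the phrase '{results['span']}' is wrong for {smiles}."
    )
    # Stage 4: Error Revision -- produce a corrected phrase.
    results["revision"] = ask_llm(
        f"Rewrite the phrase '{results['span']}' so it is accurate for {smiles}."
    )
    return results
```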
How the Benchmark Was Created
The MOLERR2FIX dataset was built using a two-stage process. First, problematic molecular descriptions were generated by various LLMs (like GPT-4o, Claude, and Gemini) from a dataset of molecular structures (SMILES format). These descriptions, while often fluent, frequently contained chemical inaccuracies.
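As an illustration of this first stage, here is a hedged sketch of eliciting a caption from a SMILES string with the OpenAI Python client. The prompt wording and model choice are assumptions, not the authors' exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def caption_molecule(smiles: str, model: str = "gpt-4o") -> str:
    """Ask an LLM to describe a molecule given only its SMILES string."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a chemistry assistant."},
            {"role": "user", "content": f"Describe the molecule with SMILES {smiles} "
                                        "in a short paragraph covering its functional "
                                        "groups, class, and stereochemistry."},
        ],
    )
    return response.choices[0].message.content

# Captions produced this way read fluently but may contain exactly the kinds
# of errors MOLERR2FIX annotates (e.g., a misidentified functional group).
```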
In the second stage, a panel of chemistry experts meticulously analyzed these problematic captions. They localized the errors, classified them into six predefined categories (Functional Group/Substituent, Classification, Derivation, Stereochemistry, Sequence/Composition, and Indexing Errors), explained the chemical reasons for the errors, and provided accurate corrections. This rigorous process ensures the high quality and chemical accuracy of the benchmark.
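Since the taxonomy is fixed, the six categories map naturally onto an enumeration, which is handy for filtering and tallying results. Identifier names are shortened here for illustration:

```python
from enum import Enum

class ErrorType(Enum):
    """The six expert-defined error categories in MOLERR2FIX."""
    FUNCTIONAL_GROUP = "Functional Group/Substituent"
    CLASSIFICATION = "Classification"
    DERIVATION = "Derivation"
    STEREOCHEMISTRY = "Stereochemistry"
    SEQUENCE_COMPOSITION = "Sequence/Composition"
    INDEXING = "Indexing"

# e.g., pulling out all stereochemistry errors from an annotated dataset:
# stereo = [e for e in dataset if e.error_type == ErrorType.STEREOCHEMISTRY.value]
```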
Key Findings and LLM Performance
Evaluations of current state-of-the-art LLMs, including general-purpose models, reasoning-enhanced models, and chemistry-specific models, revealed significant performance gaps. While some models showed moderate success in detecting errors, they often struggled immensely with localizing, explaining, or correcting them accurately.
For instance, Error Detection was found to be the most manageable task for LLMs, with some models achieving reasonable F1 scores. Error Revision, however, proved to be the hardest: models managed only very low BLEU scores against the expert corrections, indicating a severe gap in their ability to generate accurate fixes. Error Localization and Explanation fell somewhere in between, still posing considerable difficulty.
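To make the headline metrics concrete, here is a minimal scoring sketch using scikit-learn for detection F1 and NLTK's smoothed sentence-level BLEU for revisions. The paper's exact scoring configuration may differ:

```python
from sklearn.metrics import f1_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Detection: binary labels (1 = "description contains an error").
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
print("Detection F1:", f1_score(gold, pred))

# Revision: n-gram overlap between a model's fix and the expert correction.
reference = "the molecule contains an ester group".split()
hypothesis = "the molecule contains a ketone group".split()
bleu = sentence_bleu([reference], hypothesis,
                     smoothing_function=SmoothingFunction().method1)
print("Revision BLEU:", bleu)  # little overlap -> low score
```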
Interestingly, domain-specific LLMs, which are fine-tuned for chemistry, did not necessarily outperform general-purpose models in these tasks, especially if they lacked strong instruction-following and reasoning capabilities. This suggests that simply having chemical knowledge isn’t enough; the ability to apply that knowledge in a structured, diagnostic way is crucial.
The study also highlighted common types of errors LLMs make, such as misidentifying functional groups, incorrectly classifying molecules, or making errors related to chemical derivation. LLMs also struggled significantly with tasks requiring precise numerical or spatial reasoning, like counting atoms or correctly identifying the position of substituents (indexing errors).
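These numerical and positional checks are precisely what deterministic cheminformatics tools do well. A short RDKit sketch shows the kind of ground truth an LLM's claims can be verified against:

```python
from rdkit import Chem

# 4-methylphenol (p-cresol): a methyl and a hydroxyl on a benzene ring.
mol = Chem.MolFromSmiles("Cc1ccc(O)cc1")

# Counting atoms -- something LLM-written captions frequently get wrong.
print("Heavy atoms:", mol.GetNumAtoms())
print("Carbons:", sum(1 for a in mol.GetAtoms() if a.GetSymbol() == "C"))

# Indexing: which ring atoms carry substituents, and what are they?
ring_atoms = set(mol.GetRingInfo().AtomRings()[0])
for atom in mol.GetAtoms():
    if atom.GetIdx() in ring_atoms:
        outside = [n.GetSymbol() for n in atom.GetNeighbors()
                   if n.GetIdx() not in ring_atoms]
        if outside:
            print(f"Ring atom {atom.GetIdx()} bears: {outside}")
```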
Why Do LLMs Struggle?
The researchers propose several reasons for these limitations. One major factor is the scarcity of highly specialized domain knowledge in typical LLM training corpora. For example, intricate indexing rules for complex ring systems are rarely detailed in standard texts. Additionally, inferring stereochemistry from SMILES representations is a complex task that non-reasoning models find difficult. The inherent ambiguity of multiple valid chemical names also adds to the challenge.
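To see what "inferring stereochemistry from SMILES" actually entails, here is how RDKit perceives chiral centers deterministically from the molecular graph; an LLM reading the raw string has to replicate this perception implicitly:

```python
from rdkit import Chem

# (S)-alanine, with stereochemistry encoded by the @@ tag in the SMILES.
mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")

# RDKit identifies chiral centers and assigns R/S labels.
centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
print(centers)  # e.g., [(1, 'S')]
```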
The Path Forward
The MOLERR2FIX benchmark, developed by Yuyang Wu and colleagues, clearly maps the current limitations of LLMs in chemical reasoning. It underscores the need for more robust, chemically informed language models that can act as truly trustworthy scientific assistants. The authors advocate for chemistry-centric pretraining architectures, the integration of self-reflection loops for iterative debugging, and the expansion of benchmarks to cover even richer chemistries and error modes. This work is a crucial step towards developing AI that can genuinely reason chemically, rather than just linguistically. You can read the full paper here.
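As a parting illustration of one of those directions, here is a hedged sketch of a self-reflection loop for iterative debugging, with `ask_llm` again standing in for any chat model call:

```python
def ask_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (an assumption)."""
    raise NotImplementedError

def self_reflective_caption(smiles: str, max_rounds: int = 3) -> str:
    """Draft a description, then let the model critique and revise it."""
    draft = ask_llm(f"Describe the molecule with SMILES {smiles}.")
    for _ in range(max_rounds):
        verdict = ask_llm(
            f"Molecule: {smiles}\nDescription: {draft}\n"
            "Is there a chemical error? Answer 'no' or quote the wrong phrase."
        )
        if verdict.strip().lower().startswith("no"):
            break  # the model judges its own draft to be error-free
        draft = ask_llm(
            f"Revise this description of {smiles} to fix the issue "
            f"'{verdict}':\n{draft}"
        )
    return draft
```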


