TLDR: MSCoRe is a new benchmark of 126,696 QA instances across the automotive, pharmaceutical, electronics, and energy sectors, designed to assess LLM agents’ multi-stage collaborative reasoning. Evaluations show that commercial LLMs lead but still struggle with complex tasks, vary in robustness across domains, and can be negatively impacted by few-shot examples. The benchmark’s data quality is validated by experts’ inability to reliably distinguish AI-generated content from human-authored text.
Large Language Models (LLMs) have shown impressive capabilities in answering questions within specific domains. However, their ability to reason and coordinate across multiple, interconnected stages in complex scenarios has remained largely unexplored. Existing benchmarks often focus on isolated tasks, failing to capture the crucial interdependencies required in real-world industrial applications.
To address this critical gap, a new benchmark called MSCoRe has been introduced. MSCoRe, which stands for Multi-Stage Collaborative Reasoning, is designed to evaluate how well LLM agents can handle complex, multi-stage problems without explicit external guidance. This novel benchmark comprises 126,696 domain-specific question-answering instances, covering scenarios in the automotive, pharmaceutical, electronics, and energy sectors.
The MSCoRe dataset was created with a three-phase pipeline: dynamic sampling to ensure broad coverage, iterative question-answer generation using advanced prompt engineering, and multi-level quality assessment. The quality of the resulting data was validated through a “Turing test” in which industry experts misclassified over 85% of AI-generated instances as human-authored, confirming human-level quality.
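As a rough illustration of how such a three-phase pipeline might be wired together, here is a minimal Python sketch; the function names, data fields, and the `llm` and `checkers` callables are hypothetical assumptions, not taken from the MSCoRe paper or code.

```python
import random

def dynamic_sampling(scenarios, per_domain):
    """Phase 1: sample scenarios broadly, balancing coverage across domains."""
    by_domain = {}
    for s in scenarios:
        by_domain.setdefault(s["domain"], []).append(s)
    sampled = []
    for domain_scenarios in by_domain.values():
        sampled.extend(random.sample(domain_scenarios, min(per_domain, len(domain_scenarios))))
    return sampled

def generate_qa(scenario, llm):
    """Phase 2: prompt an LLM to draft a question-answer pair for a scenario."""
    prompt = (
        "Write a multi-stage reasoning question and a reference answer for the "
        f"following scenario: {scenario['description']}"
    )
    return llm(prompt)  # assumed to return {"question": ..., "answer": ...}

def passes_quality_checks(qa, checkers):
    """Phase 3: keep only instances that pass every quality filter."""
    return all(check(qa) for check in checkers)

def build_dataset(scenarios, llm, checkers, per_domain=100):
    dataset = []
    for scenario in dynamic_sampling(scenarios, per_domain):
        qa = generate_qa(scenario, llm)
        if passes_quality_checks(qa, checkers):
            dataset.append(qa)
    return dataset
```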
Tasks within MSCoRe are categorized into three difficulty levels to allow for a fine-grained analysis of model performance. “Easy” tasks focus on single-stage optimization, such as selecting lightweight materials for vehicle parts. “Medium” tasks involve coordination between two or more interconnected stages, like optimizing fuel efficiency through vehicle design and manufacturing. “Hard” tasks demand holistic integration across multiple value chain stages, requiring comprehensive system-level reasoning, such as optimizing electric vehicles from design to recycling.
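To make the task structure concrete, a single benchmark instance with this difficulty labeling could be represented roughly as follows; the field names are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MSCoReInstance:
    """One QA instance; the field names here are assumptions for illustration."""
    domain: str           # e.g. "automotive", "pharmaceutical", "electronics", "energy"
    difficulty: str       # "easy" (single stage), "medium" (2+ stages), "hard" (full chain)
    stages: List[str]     # value-chain stages the question spans
    question: str
    reference_answer: str

example = MSCoReInstance(
    domain="automotive",
    difficulty="medium",
    stages=["design", "manufacturing"],
    question="How can fuel efficiency be improved by coordinating body design "
             "with manufacturing choices?",
    reference_answer="...",
)
```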
A comprehensive evaluation of various state-of-the-art LLM agents was conducted using MSCoRe. The results revealed several key insights. Commercial models, such as GPT-4o, generally performed best across all tasks and scenarios. However, a significant performance gap was observed between simple and complex tasks, indicating that even the most advanced models struggle with full-chain reasoning.
The study also tested the models’ robustness to increasing task complexity. Leading models like GPT-4o and the DeepSeek-R1 series demonstrated high and stable robustness, maintaining a large fraction of their performance even on challenging multi-stage problems. Interestingly, some models showed domain-specific sensitivities; for example, Phi4-14B exhibited brittleness in the automotive domain but was highly robust in the pharmaceutical sector. This suggests that a model’s reasoning stability is not a universal trait but is highly dependent on its familiarity with a specific domain’s knowledge.
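One simple way to quantify this kind of robustness, assuming per-difficulty accuracy scores are available, is the fraction of easy-task performance a model retains on hard tasks; the ratio below is a sketch of the idea, not necessarily the metric used in the paper, and the numbers are hypothetical.

```python
def robustness_ratio(scores):
    """Fraction of easy-task score retained on hard tasks (illustrative metric)."""
    return scores["hard"] / scores["easy"] if scores["easy"] else 0.0

# Hypothetical numbers, for illustration only:
print(robustness_ratio({"easy": 0.82, "hard": 0.71}))  # ~0.87 -> robust to added stages
print(robustness_ratio({"easy": 0.78, "hard": 0.35}))  # ~0.45 -> brittle on multi-stage tasks
```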
Furthermore, the impact of few-shot learning was investigated. For smaller or less capable models like Bloomz-3B and Qwen2.5-7B, a single in-context example provided a noticeable performance boost, especially on hard tasks. Conversely, highly capable models like DeepSeek-R1-14B and GPT-3.5-Turbo showed a consistent degradation in performance when given a one-shot example. This counter-intuitive finding highlights the high prompt sensitivity of advanced models, suggesting that a single example might sometimes constrain their inherent reasoning process rather than enhance it.
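To make the zero-shot versus one-shot comparison concrete, the two prompt variants for the same question might be assembled like this; the prompt wording and example content are illustrative assumptions, not the benchmark’s actual templates.

```python
def build_prompt(question, example=None):
    """Assemble a zero-shot or one-shot prompt (wording is an illustrative assumption)."""
    parts = ["You are an expert assistant for multi-stage industrial reasoning."]
    if example is not None:          # one-shot: prepend a single worked example
        ex_question, ex_answer = example
        parts.append(f"Example question: {ex_question}\nExample answer: {ex_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

task = "How should an electric vehicle be optimized from design through recycling?"
zero_shot = build_prompt(task)
one_shot = build_prompt(
    task,
    example=(
        "Which lightweight material is best suited to a hood panel?",
        "An aluminum alloy, because it reduces mass while keeping stiffness acceptable.",
    ),
)
```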
In conclusion, MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. It serves as a challenging testbed that highlights key limitations of current LLM agents, guiding the development of more robust, adaptive, and practically deployable systems for complex industrial workflows. The code and data for MSCoRe are available for researchers to explore and build upon. You can read the full research paper for more details: MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents.


