
MONACO: Challenging Language Models with Real-World Information Needs

TLDR: MONACO is a new benchmark featuring 1,315 natural and complex questions designed to test Large Language Models (LLMs) on real-world information-seeking tasks that require reasoning across dozens to hundreds of documents. Unlike existing benchmarks, MONACO emphasizes questions that are genuinely time-consuming for humans. Initial evaluations show that frontier LLMs struggle significantly, particularly with recall and hallucinations, even when provided with all relevant information, highlighting a critical need for advancements in LLM reasoning and retrieval capabilities.

Large Language Models, or LLMs, are increasingly becoming the go-to tools for finding information. However, current benchmarks used to test these models often fall short. They rarely feature questions that are both natural, like those a human would genuinely ask, and complex enough to be time-consuming for people to answer.

To bridge this significant gap, researchers have introduced a new benchmark called MONACO. This dataset comprises 1,315 natural and highly complex questions. What makes these questions unique is that they demand dozens, and sometimes even hundreds, of intermediate steps to solve. This level of complexity is far beyond what existing question-answering benchmarks typically offer.

The creation of MONACO involved a unique decomposed annotation pipeline. This method allowed researchers to gather and manually answer these time-consuming, natural questions on a large scale. For instance, a question like “In European countries, are left-wing political parties more likely to be headed by women than right-wing ones?” might require reviewing hundreds of pages and combining facts from over 700 distinct sources. This highlights the sheer breadth of information and reasoning required.
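To make that concrete, here is a rough, hypothetical sketch (in Python) of how such a question might be broken down into simpler intermediate questions. The field names and sub-question wording are illustrative only, not MONACO's actual annotation schema.

```python
# Hypothetical decomposition of the example question above into simpler
# intermediate questions. Field names and wording are illustrative only,
# not MONACO's actual annotation schema.
from dataclasses import dataclass, field

@dataclass
class IntermediateQuestion:
    text: str                                             # a simpler question a worker can answer
    depends_on: list[int] = field(default_factory=list)   # indices of prerequisite steps

question = ("In European countries, are left-wing political parties "
            "more likely to be headed by women than right-wing ones?")

steps = [
    IntermediateQuestion("Which countries are in Europe?"),
    IntermediateQuestion("Which political parties in each country are left-wing?", depends_on=[0]),
    IntermediateQuestion("Which political parties in each country are right-wing?", depends_on=[0]),
    IntermediateQuestion("Who currently leads each of these parties?", depends_on=[1, 2]),
    IntermediateQuestion("What is the gender of each party leader?", depends_on=[3]),
    IntermediateQuestion("Is the share of women leaders higher among left-wing parties?", depends_on=[4]),
]
print(f"{question}\n-> {len(steps)} intermediate steps")
```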

When frontier LLMs were tested on MONACO, their performance was modest, reaching at most a 61.2% F1 score. The models' main weaknesses were low recall, meaning they often missed relevant information, and a tendency to hallucinate, i.e., to generate incorrect facts. These results underscore a critical need for reasoning models that can better handle the complexity and breadth of real-world information-seeking questions. MONACO is designed to be an effective resource for tracking progress in this area.

How MONACO Was Built

The process of building MONACO was meticulous. Instead of using pre-defined templates, annotators were prompted to generate questions that would interest specific “target personas,” such as a history professor or an amateur chef. This approach encouraged the creation of more realistic and challenging questions. A user study confirmed that MONACO questions are perceived as more natural compared to other complex question-answering benchmarks.

Answering these complex questions is non-trivial, often requiring information from many documents. To facilitate this, a question decomposition method was used, breaking down complex questions into multiple, simpler tasks. This distributed approach allowed non-expert workers to answer the simpler intermediate steps, while an execution engine automatically derived follow-up questions and aggregated intermediate answers.
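As a loose illustration of that idea, the sketch below shows how an execution engine could fan out follow-up questions from intermediate answers and aggregate the results. The helper functions and toy answers are invented for demonstration; in MONACO the intermediate questions are answered by human annotators, not a lookup table.

```python
# Toy sketch of an execution engine that derives follow-up questions from
# intermediate answers and aggregates the results. TOY_ANSWERS stands in
# for the human annotators who answer each simple question in MONACO.
TOY_ANSWERS = {
    "Which countries are in Europe?": ["France", "Germany"],
    "Which left-wing parties exist in France?": ["Party A"],
    "Which left-wing parties exist in Germany?": ["Party B"],
}

def ask_worker(question: str) -> list[str]:
    """Return the answers a (simulated) non-expert worker gives to one simple question."""
    return TOY_ANSWERS.get(question, [])

def expand_followups(template: str, answers: list[str]) -> list[str]:
    """Instantiate a follow-up question once per intermediate answer."""
    return [template.format(answer=a) for a in answers]

def execute(root_question: str, followup_templates: list[str]) -> dict[str, list[str]]:
    """Answer the root question, then fan out follow-up questions level by level."""
    results = {root_question: ask_worker(root_question)}
    frontier = results[root_question]
    for template in followup_templates:
        next_frontier = []
        for q in expand_followups(template, frontier):
            results[q] = ask_worker(q)
            next_frontier.extend(results[q])
        frontier = next_frontier
    return results

print(execute("Which countries are in Europe?",
              ["Which left-wing parties exist in {answer}?"]))
```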

The intermediate answers in MONACO are supported by evidence from over 36,000 distinct Wikipedia pages. This evidence comes in various forms: sentences, tables, and lists, emphasizing the benchmark’s multi-modal nature. On average, each question in MONACO requires evidence from 43.3 unique pages and involves 66.5 intermediate questions.


LLM Performance Insights

The evaluation of 15 different LLMs on MONACO revealed several key findings. Even the most advanced models struggled significantly, indicating that reasoning over dozens of documents remains an open challenge. Reasoning-focused LLMs generally performed better than non-reasoning models. While Chain-of-Thought prompting, especially with full reasoning chains in few-shot examples, improved performance for non-reasoning LLMs, the overall scores remained far from perfect.
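For readers unfamiliar with the setup, the snippet below sketches the difference between a plain few-shot prompt and one whose few-shot examples include full reasoning chains (Chain-of-Thought). The example content is invented and not drawn from MONACO.

```python
# Illustrative comparison of a plain few-shot prompt vs. few-shot prompting
# with full reasoning chains (Chain-of-Thought). Example content is invented.
FEWSHOT_EXAMPLES = [
    {
        "question": "Which capital lies further north, Oslo or Helsinki?",
        "reasoning": "Oslo sits at roughly 59.9 N and Helsinki at roughly 60.2 N, so Helsinki is further north.",
        "answer": "Helsinki",
    },
]

def build_prompt(question: str, with_cot: bool) -> str:
    parts = []
    for ex in FEWSHOT_EXAMPLES:
        parts.append(f"Q: {ex['question']}")
        if with_cot:
            parts.append(f"Reasoning: {ex['reasoning']}")  # include the full chain
        parts.append(f"A: {ex['answer']}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

print(build_prompt("Which EU member state has the most official languages?", with_cot=True))
```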

Model performance sharply declined as the number of intermediate answers and evidence documents increased. List questions, which require generating multiple answers, also proved challenging, with models showing high precision but significantly lower recall as the number of expected answers grew.
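The following sketch shows one way list answers could be scored with set-level precision, recall, and F1, reproducing the pattern described above: naming only a few correct items yields high precision but low recall when the expected list is long. MONACO's exact metric may differ; this version uses exact string matching.

```python
# Set-level precision/recall/F1 for list answers (exact string match after
# lowercasing); MONACO's official metric may differ.
def list_f1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred = {p.lower() for p in predicted}
    ref = {g.lower() for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    f1 = 0.0 if true_positives == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Naming only two of five expected answers: perfect precision, poor recall.
print(list_f1(["Norway", "Sweden"],
              ["Norway", "Sweden", "Finland", "Denmark", "Iceland"]))
# (1.0, 0.4, 0.571...)
```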

In an “oracle retrieval” setting, where all relevant evidence was provided to the LLM, performance improved by about 10 points compared to the closed-book setting. However, even with perfect knowledge access, models only reached around 58.7% F1, highlighting that complex reasoning itself, separate from information retrieval, is still a hurdle. Surprisingly, retrieval-augmented generation (RAG) using BM25 retrieval actually hurt LLM performance, demonstrating a lack of “retrieval robustness” where models struggle to ignore irrelevant retrieved documents.
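For context, a minimal BM25-based RAG setup of the kind referred to here can be assembled with the open-source rank_bm25 package. This is an illustrative sketch with toy documents, not necessarily the paper's retrieval stack.

```python
# Minimal BM25-based retrieval-augmented prompt, using the open-source
# rank_bm25 package (pip install rank-bm25). Documents and query are toy data.
from rank_bm25 import BM25Okapi

documents = [
    "The Social Democratic Party of Sweden is currently led by a woman.",
    "Oslo is the capital and most populous city of Norway.",
    "A conservative party in central Europe elected a new male leader in 2021.",
]
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

query = "Which European left-wing parties are led by women?"
top_docs = bm25.get_top_n(query.lower().split(), documents, n=2)

# The retrieved passages are prepended to the question; the article's finding
# is that irrelevant passages pulled in this way can drag performance below
# the closed-book baseline.
prompt = "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```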

MONACO stands as a unique and challenging testbed for evaluating LLMs on broad tasks that span hundreds of documents and demand extensive factual knowledge, information retrieval, and reasoning skills. The research paper, along with the benchmark, codebase, prompts, and model predictions, is publicly available for further exploration and development at the project’s repository.

Nikhil Patel (https://blogs.edgentiq.com) is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
