
MONACO: Challenging Language Models with Real-World Information Needs

TLDR: MONACO is a new benchmark featuring 1,315 natural and complex questions designed to test Large Language Models (LLMs) on real-world information-seeking tasks that require reasoning across dozens to hundreds of documents. Unlike existing benchmarks, MONACO emphasizes questions that are genuinely time-consuming for humans. Initial evaluations show that frontier LLMs struggle significantly, particularly with recall and hallucinations, even when provided with all relevant information, highlighting a critical need for advancements in LLM reasoning and retrieval capabilities.

Large Language Models, or LLMs, are increasingly becoming the go-to tools for finding information. However, current benchmarks used to test these models often fall short. They rarely feature questions that are both natural, like those a human would genuinely ask, and complex enough to be time-consuming for people to answer.

To bridge this significant gap, researchers have introduced a new benchmark called MONACO. This dataset comprises 1,315 natural and highly complex questions. What makes these questions unique is that they demand dozens, and sometimes even hundreds, of intermediate steps to solve. This level of complexity is far beyond what existing question-answering benchmarks typically offer.

The creation of MONACO involved a unique decomposed annotation pipeline. This method allowed researchers to gather and manually answer these time-consuming, natural questions on a large scale. For instance, a question like “In European countries, are left-wing political parties more likely to be headed by women than right-wing ones?” might require reviewing hundreds of pages and combining facts from over 700 distinct sources. This highlights the sheer breadth of information and reasoning required.
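To make that concrete, here is a rough, hypothetical sketch (in Python) of how such a question might be broken down into simpler intermediate questions. The field names and sub-question wording are illustrative only, not MONACO's actual annotation schema.

```python
# Hypothetical decomposition of the example question above into simpler
# intermediate questions. Field names and wording are illustrative only,
# not MONACO's actual annotation schema.
from dataclasses import dataclass, field

@dataclass
class IntermediateQuestion:
    text: str                                             # a simpler question a worker can answer
    depends_on: list[int] = field(default_factory=list)   # indices of prerequisite steps

question = ("In European countries, are left-wing political parties "
            "more likely to be headed by women than right-wing ones?")

steps = [
    IntermediateQuestion("Which countries are in Europe?"),
    IntermediateQuestion("Which political parties in each country are left-wing?", depends_on=[0]),
    IntermediateQuestion("Which political parties in each country are right-wing?", depends_on=[0]),
    IntermediateQuestion("Who currently leads each of these parties?", depends_on=[1, 2]),
    IntermediateQuestion("What is the gender of each party leader?", depends_on=[3]),
    IntermediateQuestion("Is the share of women leaders higher among left-wing parties?", depends_on=[4]),
]
print(f"{question}\n-> {len(steps)} intermediate steps")
```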

When frontier LLMs were tested on MONACO, their performance was modest, reaching at most a 61.2% F1 score. The models' main weaknesses were low recall, meaning they often missed relevant information, and a tendency to hallucinate, i.e., to generate incorrect facts. These results underscore a critical need for reasoning models that can better handle the complexity and breadth of real-world information-seeking questions. MONACO is designed to be an effective resource for tracking progress in this area.

How MONACO Was Built

The process of building MONACO was meticulous. Instead of using pre-defined templates, annotators were prompted to generate questions that would interest specific “target personas,” such as a history professor or an amateur chef. This approach encouraged the creation of more realistic and challenging questions. A user study confirmed that MONACO questions are perceived as more natural compared to other complex question-answering benchmarks.

Answering these complex questions is non-trivial, often requiring information from many documents. To facilitate this, a question decomposition method was used, breaking down complex questions into multiple, simpler tasks. This distributed approach allowed non-expert workers to answer the simpler intermediate steps, while an execution engine automatically derived follow-up questions and aggregated intermediate answers.
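As a loose illustration of that idea, the sketch below shows how an execution engine could fan out follow-up questions from intermediate answers and aggregate the results. The helper functions and toy answers are invented for demonstration; in MONACO the intermediate questions are answered by human annotators, not a lookup table.

```python
# Toy sketch of an execution engine that derives follow-up questions from
# intermediate answers and aggregates the results. TOY_ANSWERS stands in
# for the human annotators who answer each simple question in MONACO.
TOY_ANSWERS = {
    "Which countries are in Europe?": ["France", "Germany"],
    "Which left-wing parties exist in France?": ["Party A"],
    "Which left-wing parties exist in Germany?": ["Party B"],
}

def ask_worker(question: str) -> list[str]:
    """Return the answers a (simulated) non-expert worker gives to one simple question."""
    return TOY_ANSWERS.get(question, [])

def expand_followups(template: str, answers: list[str]) -> list[str]:
    """Instantiate a follow-up question once per intermediate answer."""
    return [template.format(answer=a) for a in answers]

def execute(root_question: str, followup_templates: list[str]) -> dict[str, list[str]]:
    """Answer the root question, then fan out follow-up questions level by level."""
    results = {root_question: ask_worker(root_question)}
    frontier = results[root_question]
    for template in followup_templates:
        next_frontier = []
        for q in expand_followups(template, frontier):
            results[q] = ask_worker(q)
            next_frontier.extend(results[q])
        frontier = next_frontier
    return results

print(execute("Which countries are in Europe?",
              ["Which left-wing parties exist in {answer}?"]))
```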

The intermediate answers in MONACO are supported by evidence from over 36,000 distinct Wikipedia pages. This evidence comes in various forms: sentences, tables, and lists, emphasizing the benchmark’s multi-modal nature. On average, each question in MONACO requires evidence from 43.3 unique pages and involves 66.5 intermediate questions.


LLM Performance Insights

The evaluation of 15 different LLMs on MONACO revealed several key findings. Even the most advanced models struggled significantly, indicating that reasoning over dozens of documents remains an open challenge. Reasoning-focused LLMs generally performed better than non-reasoning models. While Chain-of-Thought prompting, especially with full reasoning chains in few-shot examples, improved performance for non-reasoning LLMs, the overall scores remained far from perfect.
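For readers unfamiliar with the setup, the snippet below sketches the difference between a plain few-shot prompt and one whose few-shot examples include full reasoning chains (Chain-of-Thought). The example content is invented and not drawn from MONACO.

```python
# Illustrative comparison of a plain few-shot prompt vs. few-shot prompting
# with full reasoning chains (Chain-of-Thought). Example content is invented.
FEWSHOT_EXAMPLES = [
    {
        "question": "Which capital lies further north, Oslo or Helsinki?",
        "reasoning": "Oslo sits at roughly 59.9 N and Helsinki at roughly 60.2 N, so Helsinki is further north.",
        "answer": "Helsinki",
    },
]

def build_prompt(question: str, with_cot: bool) -> str:
    parts = []
    for ex in FEWSHOT_EXAMPLES:
        parts.append(f"Q: {ex['question']}")
        if with_cot:
            parts.append(f"Reasoning: {ex['reasoning']}")  # include the full chain
        parts.append(f"A: {ex['answer']}\n")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

print(build_prompt("Which EU member state has the most official languages?", with_cot=True))
```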

Model performance sharply declined as the number of intermediate answers and evidence documents increased. List questions, which require generating multiple answers, also proved challenging, with models showing high precision but significantly lower recall as the number of expected answers grew.
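The following sketch shows one way list answers could be scored with set-level precision, recall, and F1, reproducing the pattern described above: naming only a few correct items yields high precision but low recall when the expected list is long. MONACO's exact metric may differ; this version uses exact string matching.

```python
# Set-level precision/recall/F1 for list answers (exact string match after
# lowercasing); MONACO's official metric may differ.
def list_f1(predicted: list[str], gold: list[str]) -> tuple[float, float, float]:
    pred = {p.lower() for p in predicted}
    ref = {g.lower() for g in gold}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & ref)
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    f1 = 0.0 if true_positives == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Naming only two of five expected answers: perfect precision, poor recall.
print(list_f1(["Norway", "Sweden"],
              ["Norway", "Sweden", "Finland", "Denmark", "Iceland"]))
# (1.0, 0.4, 0.571...)
```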

In an “oracle retrieval” setting, where all relevant evidence was provided to the LLM, performance improved by about 10 points compared to the closed-book setting. However, even with perfect knowledge access, models only reached around 58.7% F1, highlighting that complex reasoning itself, separate from information retrieval, is still a hurdle. Surprisingly, retrieval-augmented generation (RAG) using BM25 retrieval actually hurt LLM performance, demonstrating a lack of “retrieval robustness” where models struggle to ignore irrelevant retrieved documents.
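For context, a minimal BM25-based RAG setup of the kind referred to here can be assembled with the open-source rank_bm25 package. This is an illustrative sketch with toy documents, not necessarily the paper's retrieval stack.

```python
# Minimal BM25-based retrieval-augmented prompt, using the open-source
# rank_bm25 package (pip install rank-bm25). Documents and query are toy data.
from rank_bm25 import BM25Okapi

documents = [
    "The Social Democratic Party of Sweden is currently led by a woman.",
    "Oslo is the capital and most populous city of Norway.",
    "A conservative party in central Europe elected a new male leader in 2021.",
]
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_corpus)

query = "Which European left-wing parties are led by women?"
top_docs = bm25.get_top_n(query.lower().split(), documents, n=2)

# The retrieved passages are prepended to the question; the article's finding
# is that irrelevant passages pulled in this way can drag performance below
# the closed-book baseline.
prompt = "Context:\n" + "\n".join(top_docs) + f"\n\nQuestion: {query}\nAnswer:"
print(prompt)
```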

MONACO stands as a unique and challenging testbed for evaluating LLMs on broad tasks that span hundreds of documents and demand extensive factual knowledge, information retrieval, and reasoning skills. The research paper, along with the benchmark, codebase, prompts, and model predictions, is publicly available for further exploration and development at the project’s repository.

Nikhil Patel (https://blogs.edgentiq.com) is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
