
Deep Learning Continues to Lead in Information Retrieval: Insights from TREC 2021

TLDR: The TREC 2021 Deep Learning Track evaluated advanced information retrieval methods using significantly expanded MS MARCO datasets. The track confirmed the superior performance of deep neural ranking models, especially those utilizing large-scale pretraining, over traditional methods. It also highlighted the growing trend towards single-stage retrieval with deep models. The report discusses the challenges of dataset scale, judgment completeness, and the impact of query length on evaluation, offering insights into future directions for information retrieval research and benchmarking.

The TREC Deep Learning Track, now in its third year, continues to be a pivotal benchmark for ad hoc retrieval methods, particularly in the context of vast datasets. The 2021 edition brought significant updates, leveraging refreshed and substantially expanded versions of the MS MARCO datasets. These datasets are crucial as they provide hundreds of thousands of human-annotated training labels for both passage and document ranking tasks.

A major highlight of the TREC 2021 track was the introduction of the MS MARCO v2 dataset. This new version dramatically increased the scale of the collections, with the document collection growing nearly four times and the passage collection expanding by almost sixteen times. This expansion aimed to provide a more realistic large-data environment for evaluating retrieval systems and to incorporate additional metadata, such as passage-to-document mappings, which can be valuable for ranking.

Key Tasks and Evaluation

Similar to previous years, the 2021 Deep Learning Track featured two primary tasks: document retrieval and passage retrieval. Participants could submit up to three runs for each task, detailing the external data, pre-trained models, and other resources used, as well as the model style. A consistent set of 477 queries was used across both tasks, with a subset selected for judging based on query length (short vs. long queries).

For evaluation, judgments were collected on a four-point scale, ranging from ‘Irrelevant’ (0) to ‘Perfectly relevant’ (3). The document retrieval task included two subtasks: full retrieval, which models an end-to-end scenario from the entire document collection, and top-100 reranking, where participants re-ranked an initial set of 100 documents provided by Pyserini. The passage retrieval task followed a similar structure, with full retrieval from a large passage collection and a top-100 reranking subtask.
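With graded judgments on a 0–3 scale like these, the standard evaluation measure is nDCG, which rewards placing highly relevant results near the top of the ranking. The following is a minimal sketch of nDCG@k over such four-point labels (using the common exponential gain, `2^rel − 1`); it is illustrative only, and the track's official scores come from the NIST evaluation tooling rather than this code.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of graded labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked_rels, k=10):
    """nDCG@k for four-point judgments (0 = Irrelevant .. 3 = Perfectly relevant).

    Normalizes the DCG of the system ranking by the DCG of the ideal
    (descending-label) ordering of the same judged results.
    """
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places a 'Perfectly relevant' result first scores higher than one that buries it, which is exactly the behavior the graded scale is meant to capture.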

The Evolution of MS MARCO Datasets

The MS MARCO dataset originated from a natural language generation task, where crowd workers generated answers to queries based on provided passages. This data was later adapted for ranking tasks, leading to the v1 datasets used in TREC 2019 and 2020. The v1 datasets, while valuable, had some limitations, such as corpus generation based on queries and character set issues.

The MS MARCO v2 dataset, used for the first time in TREC 2021, addressed many of these issues. It started by identifying documents, expanding the collection to 11.9 million documents, and then identifying promising passages within them, resulting in 138 million passages. The v2 data also fixed character encoding and whitespace issues, making it a cleaner and more comprehensive resource for information retrieval research. For more in-depth details, refer to the full research paper.

Performance Trends: Neural vs. Traditional Methods

The track saw participation from 19 groups, submitting a total of 129 runs. A notable trend observed was the continued dominance of deep neural ranking models, particularly those employing large-scale pretraining (categorized as ‘nnlm’). These ‘nnlm’ runs consistently outperformed traditional retrieval methods (‘trad’) across both document and passage ranking tasks. The percentage of ‘nnlm’ submissions significantly increased, while runs without pre-trained models (‘nn’) almost disappeared, indicating a convergence in the neural information retrieval community towards large language models.

While ‘nnlm’ runs showed clear superiority, the paper also explored the performance of single-stage retrieval methods. Surprisingly, these methods performed well, though they still lagged behind multi-stage retrieval pipelines. The analysis also delved into how system performance varied with query length, finding that longer queries might be more discriminative for evaluation, especially for neural systems.
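The multi-stage pipelines referenced above follow a common pattern: a cheap first-stage retriever scans the whole collection, and an expensive (typically neural) reranker reorders only the top candidates. Below is a minimal, schematic sketch of that pattern; the scoring functions are placeholders, not any participant's actual system, and real pipelines use an inverted index or approximate nearest-neighbor search rather than scoring every document.

```python
def rerank_pipeline(query, corpus, first_stage_score, rerank_score, k=100):
    """Two-stage retrieval: a cheap first-stage scorer selects the top-k
    candidates from the full corpus, then a costlier reranker reorders
    just those candidates (mirroring the track's top-100 reranking setup).
    """
    # Stage 1: score everything cheaply and keep the top-k candidates.
    candidates = sorted(corpus,
                        key=lambda doc: first_stage_score(query, doc),
                        reverse=True)[:k]
    # Stage 2: reorder only the candidates with the expensive scorer.
    return sorted(candidates,
                  key=lambda doc: rerank_score(query, doc),
                  reverse=True)
```

A single-stage system, by contrast, applies one (deep) model directly to full retrieval, skipping the reranking pass; the track's finding was that this simpler design performed well but had not yet closed the gap with cascaded pipelines.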


Challenges and Future Directions

The increased size of the v2 datasets, while beneficial for realism, posed challenges for judgment completeness due to budget constraints for NIST assessors. This led to concerns about the reusability of the dataset for benchmarking outside of TREC settings. The paper also discussed the agreement between NIST judgments and the original sparse MS MARCO labels, noting a decrease in agreement over the years, partly attributed to an ‘oldness’ artifact where models learned to favor older documents in the corpus due to the training data’s characteristics.

Looking ahead, the track organizers are considering options for future evaluations, such as focusing on a ‘v1 universe’ for development set evaluations or adjusting training procedures to mitigate the ‘oldness’ bias. The potential for inferring document-level labels from passage-level labels was also explored as a way to create more complete test collections, suggesting a hybrid evaluation dataset combining inferred and actual labels for future tracks.
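One natural way to infer document-level labels from passage-level labels, given the v2 passage-to-document mappings, is to take the maximum judged passage label per document. This is only a plausible heuristic for illustration, not necessarily the exact aggregation the organizers would adopt.

```python
def infer_document_labels(passage_labels, passage_to_doc):
    """Infer a document-level relevance label as the maximum label of any
    judged passage mapped to that document (max-aggregation heuristic).

    passage_labels: dict mapping passage id -> graded label (0..3)
    passage_to_doc: dict mapping passage id -> containing document id
    """
    doc_labels = {}
    for pid, label in passage_labels.items():
        doc = passage_to_doc.get(pid)
        if doc is not None:
            # A document is at least as relevant as its best passage.
            doc_labels[doc] = max(doc_labels.get(doc, 0), label)
    return doc_labels
```

Labels inferred this way could then be combined with actual document judgments to build the hybrid evaluation sets the organizers describe.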

In conclusion, the TREC 2021 Deep Learning Track reinforced the strong performance of pre-trained deep neural models in information retrieval, while also shedding light on the complexities of large-scale dataset creation, judgment completeness, and evaluation methodologies in this rapidly evolving field.

Meera Iyer
https://blogs.edgentiq.com
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
