TLDR: The TREC 2022 Deep Learning track focused on creating a more reusable test collection for passage ranking by introducing new, harder queries, prioritizing passage judgments, and implementing near-duplicate detection. Deep neural models with large-scale pretraining continued to significantly outperform traditional methods, and full-ranking approaches showed substantial gains. The changes successfully addressed reusability and saturation issues observed in previous years.
The TREC Deep Learning track, now in its fourth year, continues to be a crucial benchmark for ad hoc retrieval methods, particularly in environments with vast amounts of data. The 2022 iteration introduced significant changes aimed at enhancing the reusability and quality of test collections, primarily focusing on the passage retrieval task.
A core aspect of the track involves leveraging the MS MARCO datasets, which provide hundreds of thousands of human-annotated training labels for both passage and document ranking. Following a substantial refresh in the previous year, the passage collection grew nearly 16 times and the document collection four times. For 2022, the primary objective was to build a more complete, reusable test collection for passage retrieval. The document ranking task was maintained as a secondary objective, with document-level labels derived from passage-level annotations.
One of the key findings, consistent with previous years, is that deep neural ranking models, especially those employing large-scale pretraining, continue to significantly outperform traditional retrieval methods. Interestingly, the 2022 results also showed some unexpected outcomes: some top-performing systems did not rely on dense retrieval, and single-stage dense retrieval runs were less competitive compared to the previous year.
Key Changes for 2022
The substantial increase in collection sizes in 2021 led to a corresponding rise in relevant results per query, straining the existing judgment budget and raising concerns about test collection reusability and score saturation. To address these issues, three major changes were implemented in 2022:
- New Test Queries: This year's queries were selected from those that did not contribute to the original MS MARCO corpus. Unlike previous evaluations, where Bing's top-10 results for the test queries had been included when the corpus was built, the 2022 queries were processed after the MS MARCO dataset was finalized. This makes the task more realistic, as systems are not guaranteed to have relevant Bing results pre-indexed.
- Focused Judging: NIST judges primarily evaluated the relevance of retrieved results for the passage ranking task. These passage-level labels were then propagated to their source documents for the document ranking task, effectively doubling the passage judgment budget.
- Near-Duplicate Detection: A process was introduced to detect near-duplicate passages. Only one representative passage from each near-duplicate cluster was judged, and its relevance label was applied to all other passages within that cluster. This prevents redundant judging and optimizes resource allocation.
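The near-duplicate workflow above can be sketched in a few lines. This is a minimal illustration, not the track's actual detector (which used a proper near-duplicate method); here, exact matching on whitespace- and case-normalized text stands in for real near-duplicate detection, and all function names are hypothetical:

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace.
    (A stand-in for the track's real near-duplicate detection.)"""
    return " ".join(text.lower().split())

def cluster_near_duplicates(passages: dict[str, str]) -> dict[str, list[str]]:
    """Group passage IDs whose normalized text is identical.
    Returns a map from one representative ID to all cluster members."""
    clusters = defaultdict(list)
    for pid, text in passages.items():
        clusters[normalize(text)].append(pid)
    return {pids[0]: pids for pids in clusters.values()}

def propagate_labels(representative_labels: dict[str, int],
                     clusters: dict[str, list[str]]) -> dict[str, int]:
    """Copy each judged representative's label to every cluster member,
    so only one passage per cluster needs a human judgment."""
    qrels = {}
    for rep, label in representative_labels.items():
        for pid in clusters.get(rep, [rep]):
            qrels[pid] = label
    return qrels
```

The key saving is in `propagate_labels`: a cluster of any size costs only one judgment, which is how the track stretched its judging budget.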
These changes aimed to reduce the number of relevant results per query and overall judgment costs, leading to a more complete and reusable test collection. The organizers expressed greater confidence in the quality of this year’s queries and judgments, particularly in their ability to differentiate between various system performances.
Task Descriptions and Datasets
The 2022 track featured two main tasks: Passage ranking and Document ranking. Both tasks included a full ranking subtask (retrieving from the entire collection) and a top-100 reranking subtask (re-ranking a provided initial list of 100 candidates). Participants could submit multiple runs and were asked to categorize their models (e.g., ‘trad’ for traditional, ‘nn’ for neural without pre-trained models, ‘nnlm’ for neural with pre-trained models).
The MS MARCO v2 dataset was utilized, which significantly expanded upon the v1 dataset. While v1 was passage-centric, v2 is document-native, starting with documents and then identifying promising passages within them. This change also allowed participants to leverage passage-to-document mappings, a practice previously disallowed because of biases introduced by how the v1 corpus was constructed.
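A hedged sketch of how such a mapping might be used, assuming the v2 passage records carry passage- and document-ID fields (field names here are assumptions, and max-pooling passage labels into a document label is one plausible propagation rule, not necessarily the track's exact procedure):

```python
import json
from collections import defaultdict

def load_pid_to_docid(jsonl_path: str) -> dict[str, str]:
    """Build a passage-ID -> document-ID map from a passage JSONL file.
    (Assumes each record carries 'pid' and 'docid' fields; the actual
    field names in the MS MARCO v2 release may differ.)"""
    mapping = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            mapping[rec["pid"]] = rec["docid"]
    return mapping

def derive_doc_qrels(passage_qrels: dict[tuple[str, str], int],
                     pid_to_docid: dict[str, str]) -> dict[tuple[str, str], int]:
    """Propagate passage labels to documents by taking the maximum
    label over a document's judged passages (one plausible rule)."""
    doc_qrels = defaultdict(int)
    for (qid, pid), label in passage_qrels.items():
        key = (qid, pid_to_docid[pid])
        doc_qrels[key] = max(doc_qrels[key], label)
    return dict(doc_qrels)
```

Under this rule, a document inherits the relevance of its best judged passage, which matches the intuition that a document containing a relevant passage is itself relevant.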
Performance Insights
The track saw 14 participating groups submitting a total of 142 runs. A notable trend was the continued dominance of ‘nnlm’ (neural with pre-trained models) runs, which constituted 85% of submissions and dramatically outperformed ‘trad’ (traditional) runs. For passage ranking, the best ‘nnlm’ run showed a 125% improvement in NDCG@10 over the best ‘trad’ run, a significant increase from previous years. Similarly, for document ranking, the gap was 76%.
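The headline metric in these comparisons is NDCG@10. A minimal sketch of how it is computed from graded relevance labels, assuming linear gain and a log2 rank discount (the convention used by trec_eval's ndcg_cut by default):

```python
import math

def dcg_at_k(gains: list[int], k: int = 10) -> float:
    """Discounted cumulative gain: sum of graded labels over the
    top-k results, discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(run_gains: list[int], all_gains: list[int], k: int = 10) -> float:
    """NDCG@k: the run's DCG normalized by the ideal DCG, i.e. the DCG
    of the best possible ordering of all judged labels for the query."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(run_gains, k) / ideal if ideal > 0 else 0.0
```

A run that ranks the most relevant passages first scores 1.0; pushing relevant passages down the list lowers the discounted gain and hence the score.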
The ‘fullrank’ runs also showed substantial improvements over ‘rerank’ runs this year, with a 36% NDCG@10 improvement for passage ranking and a 125% improvement for document ranking. This suggests progress in end-to-end retrieval systems and potentially less focus on optimizing reranking approaches this year.
The impact of near-duplicate detection was analyzed, confirming that while it was helpful in optimizing judging resources, the elimination of separate document judging was crucial for achieving a sufficiently complete and reusable test collection. The 2022 passage collection successfully met the relevance density threshold, indicating that judgments were sufficiently complete to reliably evaluate new systems.
For more detailed information, you can refer to the full research paper available at arXiv:2507.10865.