TLDR: The TREC 2022 Deep Learning track focused on creating a more reusable test collection for passage ranking by introducing new, harder queries, prioritizing passage judgments, and implementing near-duplicate detection. Deep neural models with large-scale pretraining continued to significantly outperform traditional methods, and full-ranking approaches showed substantial gains. The changes successfully addressed reusability and saturation issues observed in previous years.
The TREC Deep Learning track, now in its fourth year, continues to be a crucial benchmark for ad hoc retrieval methods, particularly in environments with vast amounts of data. The 2022 iteration introduced significant changes aimed at enhancing the reusability and quality of test collections, primarily focusing on the passage retrieval task.
A core aspect of the track involves leveraging the MS MARCO datasets, which provide hundreds of thousands of human-annotated training labels for both passage and document ranking. Following a substantial refresh in the previous year, the passage collection grew nearly 16 times and the document collection four times. For 2022, the primary objective was to build a more complete, reusable test collection for passage retrieval. The document ranking task was maintained as a secondary objective, with document-level labels derived from passage-level annotations.
One of the key findings, consistent with previous years, is that deep neural ranking models, especially those employing large-scale pretraining, continue to significantly outperform traditional retrieval methods. Interestingly, the 2022 results also showed some unexpected outcomes: some top-performing systems did not rely on dense retrieval, and single-stage dense retrieval runs were less competitive compared to the previous year.
Key Changes for 2022
The substantial increase in collection sizes in 2021 led to a corresponding rise in relevant results per query, straining the existing judgment budget and raising concerns about test collection reusability and score saturation. To address these issues, three major changes were implemented in 2022:
- New Test Queries: This year's queries were selected from those that did not contribute to the original MS MARCO corpus. Unlike previous evaluations, where Bing's top-10 results for the test queries had been included when the corpus was built, the 2022 queries were processed after the MS MARCO dataset was finalized. This makes the task more realistic, as systems are not guaranteed to have relevant Bing results pre-indexed.
- Focused Judging: NIST judges primarily evaluated the relevance of retrieved results for the passage ranking task. These passage-level labels were then propagated to their source documents for the document ranking task, effectively doubling the passage judgment budget.
- Near-Duplicate Detection: A process was introduced to detect near-duplicate passages. Only one representative passage from each near-duplicate cluster was judged, and its relevance label was applied to all other passages within that cluster. This prevents redundant judging and optimizes resource allocation.
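The near-duplicate workflow above can be sketched in a few lines. This is a minimal illustration, not the track's actual detector (which used a proper near-duplicate method); here, exact matching on whitespace- and case-normalized text stands in for real near-duplicate detection, and all function names are hypothetical:

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Crude normalization: lowercase and collapse whitespace.
    (A stand-in for the track's real near-duplicate detection.)"""
    return " ".join(text.lower().split())

def cluster_near_duplicates(passages: dict[str, str]) -> dict[str, list[str]]:
    """Group passage IDs whose normalized text is identical.
    Returns a map from one representative ID to all cluster members."""
    clusters = defaultdict(list)
    for pid, text in passages.items():
        clusters[normalize(text)].append(pid)
    return {pids[0]: pids for pids in clusters.values()}

def propagate_labels(representative_labels: dict[str, int],
                     clusters: dict[str, list[str]]) -> dict[str, int]:
    """Copy each judged representative's label to every cluster member,
    so only one passage per cluster needs a human judgment."""
    qrels = {}
    for rep, label in representative_labels.items():
        for pid in clusters.get(rep, [rep]):
            qrels[pid] = label
    return qrels
```

The key saving is in `propagate_labels`: a cluster of any size costs only one judgment, which is how the track stretched its judging budget.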
These changes aimed to reduce the number of relevant results per query and overall judgment costs, leading to a more complete and reusable test collection. The organizers expressed greater confidence in the quality of this year’s queries and judgments, particularly in their ability to differentiate between various system performances.
Task Descriptions and Datasets
The 2022 track featured two main tasks: Passage ranking and Document ranking. Both tasks included a full ranking subtask (retrieving from the entire collection) and a top-100 reranking subtask (re-ranking a provided initial list of 100 candidates). Participants could submit multiple runs and were asked to categorize their models (e.g., ‘trad’ for traditional, ‘nn’ for neural without pre-trained models, ‘nnlm’ for neural with pre-trained models).
The MS MARCO v2 dataset was utilized, which significantly expanded upon the v1 dataset. While v1 was passage-centric, v2 is document-native, starting with documents and then identifying promising passages within them. This change also allowed participants to leverage passage-to-document mappings, a practice previously disallowed because of biases introduced by how the v1 corpus was constructed.
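A hedged sketch of how such a mapping might be used, assuming the v2 passage records carry passage- and document-ID fields (field names here are assumptions, and max-pooling passage labels into a document label is one plausible propagation rule, not necessarily the track's exact procedure):

```python
import json
from collections import defaultdict

def load_pid_to_docid(jsonl_path: str) -> dict[str, str]:
    """Build a passage-ID -> document-ID map from a passage JSONL file.
    (Assumes each record carries 'pid' and 'docid' fields; the actual
    field names in the MS MARCO v2 release may differ.)"""
    mapping = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            mapping[rec["pid"]] = rec["docid"]
    return mapping

def derive_doc_qrels(passage_qrels: dict[tuple[str, str], int],
                     pid_to_docid: dict[str, str]) -> dict[tuple[str, str], int]:
    """Propagate passage labels to documents by taking the maximum
    label over a document's judged passages (one plausible rule)."""
    doc_qrels = defaultdict(int)
    for (qid, pid), label in passage_qrels.items():
        key = (qid, pid_to_docid[pid])
        doc_qrels[key] = max(doc_qrels[key], label)
    return dict(doc_qrels)
```

Under this rule, a document inherits the relevance of its best judged passage, which matches the intuition that a document containing a relevant passage is itself relevant.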
Performance Insights
The track saw 14 participating groups submitting a total of 142 runs. A notable trend was the continued dominance of ‘nnlm’ (neural with pre-trained models) runs, which constituted 85% of submissions and dramatically outperformed ‘trad’ (traditional) runs. For passage ranking, the best ‘nnlm’ run showed a 125% improvement in NDCG@10 over the best ‘trad’ run, a significant increase from previous years. Similarly, for document ranking, the gap was 76%.
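The headline metric in these comparisons is NDCG@10. A minimal sketch of how it is computed from graded relevance labels, assuming linear gain and a log2 rank discount (the convention used by trec_eval's ndcg_cut by default):

```python
import math

def dcg_at_k(gains: list[int], k: int = 10) -> float:
    """Discounted cumulative gain: sum of graded labels over the
    top-k results, discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(run_gains: list[int], all_gains: list[int], k: int = 10) -> float:
    """NDCG@k: the run's DCG normalized by the ideal DCG, i.e. the DCG
    of the best possible ordering of all judged labels for the query."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(run_gains, k) / ideal if ideal > 0 else 0.0
```

A run that ranks the most relevant passages first scores 1.0; pushing relevant passages down the list lowers the discounted gain and hence the score.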
The ‘fullrank’ runs also showed substantial improvements over ‘rerank’ runs this year, with a 36% NDCG@10 improvement for passage ranking and a 125% improvement for document ranking. This suggests progress in end-to-end retrieval systems and potentially less focus on optimizing reranking approaches this year.
The impact of near-duplicate detection was analyzed, confirming that while it was helpful in optimizing judging resources, the elimination of separate document judging was crucial for achieving a sufficiently complete and reusable test collection. The 2022 passage collection successfully met the relevance density threshold, indicating that judgments were sufficiently complete to reliably evaluate new systems.
For more detailed information, you can refer to the full research paper available at arXiv:2507.10865.