TLDR: The TREC 2023 Deep Learning Track, in its final year, focused on benchmarking ad hoc retrieval methods using MS MARCO datasets. A key finding was that runs employing Large Language Model (LLM) prompting significantly outperformed the ‘nnlm’ runs that had been the previous best approach. The track also explored synthetic queries generated by T5 and GPT-4, finding that they produce system rankings similar to those from human queries, though human effort was still needed to filter them. The track successfully created another reusable test collection with more challenging queries and refined judgment processes.
The TREC 2023 Deep Learning Track recently concluded its fifth and final year, marking a significant milestone in the benchmarking of ad hoc retrieval methods. This year’s track continued its focus on leveraging the extensive MS MARCO datasets, which provide hundreds of thousands of human-annotated training labels for both passage and document ranking tasks. A key highlight from this year’s findings is the emergence of Large Language Model (LLM) prompting as a superior approach, outperforming the ‘nnlm’ (neural network language model) methods that had previously dominated the track for four years.
The track maintained a design similar to the previous year, aiming to create another consistent test set. This involved using the larger, cleaner, and less-biased v2 passage and document collections. Passage ranking remained the primary task, with document ranking as a secondary task, where labels were inferred from passage judgments. A notable change from earlier years was the use of completely held-out MS MARCO queries for testing, which were not used in corpus construction. This approach resulted in more challenging tests, providing greater room for improvement in retrieval systems.
Exploring New Query Types and Evaluation
Beyond traditional human-generated MS MARCO queries, the TREC 2023 track introduced synthetic queries, generated using a fine-tuned T5 model and a GPT-4 prompt. Human relevance assessments were applied to all query types, including the synthetic ones. Interestingly, evaluations using synthetic queries yielded results similar to those from human queries, with high system-ordering agreement (τ = 0.8487). However, human effort was still necessary to select a usable subset of these synthetic queries. The analysis also found no clear evidence of bias, such as GPT-4-based runs being favored when evaluated on synthetic GPT-4 queries, or T5-based runs on synthetic T5 queries.
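To make the system-ordering agreement concrete, here is a minimal sketch in Python (using SciPy) of how a Kendall's τ correlation between two leaderboards might be computed. The per-run NDCG@10 values are made-up placeholders, not figures from the track.

```python
from scipy.stats import kendalltau

# Hypothetical mean NDCG@10 per run, measured on two different query sets.
# These numbers are illustrative placeholders, not TREC 2023 results.
ndcg_human = {"runA": 0.72, "runB": 0.68, "runC": 0.55, "runD": 0.41}
ndcg_synth = {"runA": 0.70, "runB": 0.69, "runC": 0.52, "runD": 0.45}

runs = sorted(ndcg_human)  # fixed run order so the two score lists align
scores_human = [ndcg_human[r] for r in runs]
scores_synth = [ndcg_synth[r] for r in runs]

# Kendall's tau measures how consistently the two evaluations order the runs;
# a value near 1.0 (the track reports 0.8487) means the leaderboards largely agree.
tau, p_value = kendalltau(scores_human, scores_synth)
print(f"Kendall tau = {tau:.4f} (p = {p_value:.3f})")
```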
Task Breakdown: Passage and Document Ranking
Similar to previous years, the 2023 Deep Learning Track featured two main tasks: passage ranking and document ranking. Participants could submit up to three official runs for each task, along with additional baseline runs. The track emphasized generating reusable test collections, and this year primarily focused on rerunning the track under the same setup and datasets as the previous year. The passage ranking task was the primary focus for constructing a complete and reusable test collection, with document ranking labels inferred from passage-level judgments.
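As a rough illustration of how document labels can be inferred from passage judgments, the sketch below assigns each document the maximum relevance grade of any of its judged passages. The data layout, identifiers, and max-label rule are assumptions for illustration, not the track's exact procedure.

```python
from collections import defaultdict

# passage_qrels[qid][passage_id] = graded relevance label (0-3), toy data
passage_qrels = {
    "q1": {"msmarco_passage_01": 3, "msmarco_passage_02": 1},
}
# Mapping from each passage id to its containing document id (illustrative)
passage_to_doc = {
    "msmarco_passage_01": "msmarco_doc_01",
    "msmarco_passage_02": "msmarco_doc_01",
}

# A document inherits the highest label among its judged passages.
doc_qrels = defaultdict(dict)
for qid, judgments in passage_qrels.items():
    for pid, label in judgments.items():
        did = passage_to_doc[pid]
        doc_qrels[qid][did] = max(doc_qrels[qid].get(did, 0), label)

print(dict(doc_qrels))  # {'q1': {'msmarco_doc_01': 3}}
```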
The query sampling methodology introduced the previous year was continued, selecting harder queries so that runs do not all achieve equally high performance and the evaluation remains discriminative. Bing queries were passed through a classifier to identify those answerable by a short passage and then further filtered by annotators. The key difference this year was that the queries came from a later batch of MS MARCO data, not the initial one million queries used for corpus construction, meaning they were truly held out.
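A simplified sketch of that sampling pipeline is shown below. Here `answerability_score` is a toy stand-in for the track's internal classifier that estimates whether a query can be answered by a short passage, and the threshold is an arbitrary illustrative value.

```python
def answerability_score(query: str) -> float:
    """Placeholder for the internal classifier; a toy heuristic that
    favors short, question-like queries."""
    score = 0.0
    first_word = query.split()[0].lower() if query.split() else ""
    if query.rstrip().endswith("?") or first_word in {
        "what", "who", "when", "where", "why", "how"
    }:
        score += 0.6
    if len(query.split()) <= 10:
        score += 0.3
    return score

candidate_queries = [
    "how long does it take to walk 5 miles",
    "facebook login",
    "what is the boiling point of water at altitude",
]

# Stage 1: keep queries the classifier deems answerable by a short passage.
THRESHOLD = 0.5  # illustrative cutoff, not the track's actual setting
kept = [q for q in candidate_queries if answerability_score(q) >= THRESHOLD]

# Stage 2 (not shown): human annotators review the kept queries, and NIST
# assessors later discard ones that are unreasonable to judge.
print(kept)
```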
The final set of 700 test queries for both tasks comprised human queries, T5-generated queries, and GPT-4 generated queries. NIST assessors played a crucial role in filtering out unreasonable queries or those with too few or too many relevant documents. For instance, out of 147 real queries, 51 were selected, while 13 out of 48 T5-generated queries and 18 out of 49 GPT-4 generated queries were included.
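The "too few or too many relevant documents" criterion can be expressed as a simple filter over pooled judgments, as in the sketch below. The bounds and toy judgments are illustrative assumptions, not the assessors' actual thresholds.

```python
# qrels[qid] maps judged passage ids to graded relevance labels (0-3), toy data.
qrels = {
    "q_real_01": {"p1": 2, "p2": 0, "p3": 3, "p4": 1},
    "q_t5_07":   {"p5": 0, "p6": 0},                    # nothing relevant
    "q_gpt4_12": {f"p{i}": 2 for i in range(7, 120)},   # nearly everything relevant
}

MIN_RELEVANT, MAX_RELEVANT = 1, 100  # illustrative bounds only

def is_usable(judgments, min_rel=MIN_RELEVANT, max_rel=MAX_RELEVANT):
    """Keep a query only if its number of relevant (label >= 1) passages
    falls in a sensible range, so it can discriminate between systems."""
    n_relevant = sum(1 for label in judgments.values() if label >= 1)
    return min_rel <= n_relevant <= max_rel

final_queries = [qid for qid, judgments in qrels.items() if is_usable(judgments)]
print(final_queries)  # ['q_real_01'] under these toy judgments
```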
Dataset Evolution: From v1 to v2
The track continued to utilize the MS MARCO v2 dataset, which represents a significant evolution from its predecessors. The original MS MARCO dataset (2016) was designed for a natural language generation task, processing one million Bing queries and involving crowd workers to generate answers from passages. The v1 ranking datasets (2018) adapted this data for ranking tasks, creating passage and document collections. However, v1 had limitations, such as corpus generation being tied to queries and issues with character sets and whitespace.
The v2 datasets, first used in TREC 2021, aimed to increase scale and introduce a wider variety of documents. Unlike v1, v2 is document-native, starting with 11.9 million documents (2.7 million from v1 sources plus 9.2 million new ones). From these documents, a proprietary algorithm identified promising passages, resulting in 138 million passages in the v2 corpus. A crucial improvement in v2 is that participants are now allowed to use the passage-document mapping, enabling more sophisticated ranking approaches. The new dataset also addressed character encoding and whitespace issues, providing a cleaner foundation for future tasks.
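As one example of what the now-permitted passage-document mapping enables at ranking time, the sketch below aggregates passage-level retrieval scores into a document ranking with a max-passage strategy. The identifiers and scores are invented for illustration.

```python
# Passage-level scores from any retriever, as (passage_id, score) per query.
passage_run = {
    "q1": [("p_101#3", 14.2), ("p_101#7", 12.9), ("p_204#1", 11.5)],
}
# Mapping from passage id to the document it was extracted from (illustrative).
passage_to_doc = {"p_101#3": "doc_101", "p_101#7": "doc_101", "p_204#1": "doc_204"}

# Max-passage aggregation: a document is scored by its best-scoring passage.
doc_run = {}
for qid, scored_passages in passage_run.items():
    doc_scores = {}
    for pid, score in scored_passages:
        did = passage_to_doc[pid]
        doc_scores[did] = max(doc_scores.get(did, float("-inf")), score)
    doc_run[qid] = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

print(doc_run)  # {'q1': [('doc_101', 14.2), ('doc_204', 11.5)]}
```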
Performance Insights: LLMs Take the Lead
Six groups participated in the TREC 2023 Deep Learning Track, submitting 40 runs across passage and document ranking tasks. Runs were categorized into ‘trad’ (no neural representation learning), ‘nn’ (representation learning without pre-trained models), ‘nnlm’ (using pre-trained models like BERT), and ‘prompt’ (using LLMs with prompting). The evaluation, primarily using NDCG@10, NDCG@100, and Average Precision, revealed a clear trend: ‘prompt’ runs significantly outperformed ‘nnlm’ runs, mirroring how ‘nnlm’ runs previously surpassed ‘trad’ methods. This suggests a new performance frontier opened by LLMs in information retrieval.
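For readers unfamiliar with the primary metric, here is a small self-contained sketch of NDCG@10 computed from graded judgments. The ranking and labels are invented, and the exponential gain shown is one common formulation that may not match trec_eval's exact implementation.

```python
import math

def dcg_at_k(labels, k=10):
    """Discounted cumulative gain using the common exponential gain 2^rel - 1."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, all_labels, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels (0-3) of the passages a system returned, in rank order,
# plus the labels of every judged passage for this query (toy data).
system_labels = [3, 0, 2, 1, 0, 0, 0, 0, 0, 0]
judged_labels = [3, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(f"NDCG@10 = {ndcg_at_k(system_labels, judged_labels):.4f}")
```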
While LLM-based prompting often implies few-shot or zero-shot learning, most top-performing ‘prompt’ runs still reported using MS MARCO training data. This indicates that while LLMs are powerful, fine-tuning other components of the ranking stack with MS MARCO data remains beneficial. The paper also notes that expensive models like GPT-4 were likely used in later reranking stages, after initial candidate generation by ‘trad’ or ‘nnlm’ methods, to refine the ranking of a smaller set of results.
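The multi-stage setup described above might look something like the following sketch: a cheap first-stage retriever produces candidates, and an expensive LLM is prompted to rerank only the head of that list. `bm25_search` and `call_llm` are hypothetical placeholders rather than functions from any specific library, and the prompt wording is invented.

```python
# Hypothetical helpers -- stand-ins, not real library calls.
def bm25_search(query: str, k: int) -> list[tuple[str, str]]:
    """First-stage retriever: returns (passage_id, passage_text) candidates."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Send a prompt to some LLM API and return its text response."""
    raise NotImplementedError

def llm_rerank(query: str, top_k: int = 100, rerank_k: int = 20) -> list[str]:
    """Cheap candidate generation, then prompt-based reranking of the head."""
    candidates = bm25_search(query, k=top_k)

    scored = []
    for pid, text in candidates[:rerank_k]:  # only the head reaches the expensive model
        prompt = (
            "On a scale of 0-3, how relevant is this passage to the query?\n"
            f"Query: {query}\nPassage: {text}\nAnswer with a single number."
        )
        try:
            grade = int(call_llm(prompt).strip()[0])
        except (ValueError, IndexError):
            grade = 0
        scored.append((pid, grade))

    reranked = [pid for pid, _ in sorted(scored, key=lambda x: x[1], reverse=True)]
    # Keep the original first-stage order for everything below the rerank cutoff.
    tail = [pid for pid, _ in candidates[rerank_k:]]
    return reranked + tail
```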
The track’s final year successfully built another reusable dataset, incorporating the advancements from the previous year, such as harder queries, focused passage judgments, and passage deduplication. The initial analysis confirms that synthetic queries, particularly those from T5 and GPT-4, can reliably contribute to test collection construction, yielding evaluation outcomes similar to those from real user queries. The continued outperformance of deep learning models with large-scale pretraining over traditional methods, and the significant gains from prompt-based LLM approaches, underscore the evolving landscape of information retrieval research. For more detailed information, you can refer to the full research paper available at arXiv:2507.08890.


