TLDR: Retro* is a novel LLM-based method for reasoning-intensive document retrieval that uses a rubric-based scoring system for interpretable relevance, test-time scaling for accuracy, and a two-stage reinforcement learning strategy for optimization. It achieves state-of-the-art performance on complex benchmarks and offers efficient, parallel processing for real-world applications.
In the evolving landscape of artificial intelligence, Large Language Models (LLMs) are becoming increasingly adept at tackling complex tasks, from software engineering to scientific research. A key component enabling their success is Retrieval-Augmented Generation (RAG), which allows LLMs to access external knowledge. However, a significant challenge arises when the connection between a task and the necessary documents is indirect or implicit, requiring advanced reasoning to identify relevant information.
Addressing this challenge, a new approach called Retro* has been introduced. Developed by Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao, and Defu Lian, Retro* is designed to optimize LLMs for these “reasoning-intensive” document retrieval tasks. The core idea is to move beyond simple keyword matching and enable LLMs to understand deeper, more abstract connections between a query and potential documents.
Understanding Retro*’s Core Innovations
Retro* stands out with two primary design principles:
Rubric-Based Relevance Scoring: Unlike traditional methods that often provide only a relative ranking of documents, Retro* introduces a system where relevance is measured directly and interpretably. It uses a set of predefined “rubrics” or criteria to guide the LLM in reasoning about the relationship between a task (query) and a document. This process generates a fine-grained relevance score, typically an integer between 0 and 100, with clear meanings for different score ranges (e.g., 80-100 for “Highly Relevant”). This allows for a more precise understanding of how useful a document truly is.
Test-Time Scaling via Score Integration: To further enhance accuracy and reliability, Retro* employs a technique called test-time scaling. For each query-document pair, the model generates multiple reasoning paths or “trajectories.” It then integrates the scores from these multiple trajectories to produce a more stable and reliable estimate of the document’s relevance. This is akin to getting several expert opinions and combining them for a more robust judgment.
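The two principles above can be sketched together in a few lines: sample several reasoning trajectories for one query-document pair, average their 0-100 scores, and map the result onto a rubric band. This is a minimal illustration, not the paper's implementation; the band thresholds below 80 and the use of a plain mean are assumptions for illustration.

```python
from statistics import mean

# Hypothetical rubric bands; only the 80-100 "Highly Relevant" range is
# stated in the source, the other thresholds are illustrative assumptions.
RUBRIC_BANDS = [
    (80, "Highly Relevant"),
    (50, "Partially Relevant"),
    (0, "Irrelevant"),
]

def label_for(score: float) -> str:
    """Map a 0-100 relevance score onto a rubric band."""
    for threshold, label in RUBRIC_BANDS:
        if score >= threshold:
            return label
    return "Irrelevant"

def integrate_scores(trajectory_scores: list[int]) -> float:
    """Test-time scaling: combine the scores from multiple sampled
    reasoning trajectories into one more stable estimate (here, a
    simple average as an assumed aggregation rule)."""
    return mean(trajectory_scores)

# Example: five sampled trajectories for one query-document pair.
scores = [85, 90, 78, 88, 82]
final = integrate_scores(scores)
print(final, label_for(final))  # 84.6 Highly Relevant
```

Averaging several independent judgments is the "several expert opinions" intuition from above: a single noisy trajectory matters less once it is pooled with others.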
How Retro* Learns: A Two-Stage Training Strategy
To maximize Retro*’s reasoning capabilities, the researchers developed a novel two-stage training strategy:
1. Supervised Fine-Tuning (SFT): In the initial stage, the model is “warmed up” using supervised fine-tuning. This involves training the model on high-quality data where a powerful “teacher model” has already reasoned about query-document relevance based on the rubrics. This stage equips Retro* with foundational reasoning skills and teaches it to generate concise and well-structured thoughts.
2. Reinforcement Learning (RL): Following SFT, the model undergoes a reinforcement learning stage. Here, a unique “composite reward” system is used to further refine Retro*’s abilities. This reward system has two parts:
- Intra-Document Reward: This reward encourages the model to produce consistent and accurate relevance scores for the same document across multiple attempts, improving the stability of its individual document scoring.
- Inter-Document Reward: This reward incentivizes the model to correctly rank documents by assigning higher scores to more relevant ones compared to less relevant ones. This helps the model discriminate effectively between positive and negative examples.
By combining these two rewards, Retro* is optimized to both accurately score individual documents and correctly rank them within a set of candidates.
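One plausible form of this composite reward can be sketched as follows. The exact reward functions and weighting in the paper are not given here, so the consistency penalty, the pairwise ranking term, and the `alpha` weight are all hypothetical stand-ins that capture the two ideas described above.

```python
from statistics import mean

def intra_document_reward(scores: list[float]) -> float:
    """Consistency term (assumed form): penalize how far each sampled
    score for the SAME document drifts from their mean. Closer to 0
    means more consistent scoring."""
    m = mean(scores)
    return -mean(abs(s - m) for s in scores)

def inter_document_reward(pos_score: float, neg_scores: list[float]) -> float:
    """Ranking term (assumed form): fraction of negative documents that
    the model scored strictly below the positive document."""
    if not neg_scores:
        return 1.0
    return mean(1.0 if pos_score > n else 0.0 for n in neg_scores)

def composite_reward(pos_scores: list[float],
                     neg_scores: list[float],
                     alpha: float = 0.5) -> float:
    """Combine both terms; alpha is an assumed weighting, not from
    the source."""
    intra = intra_document_reward(pos_scores)
    inter = inter_document_reward(mean(pos_scores), neg_scores)
    return alpha * intra + (1 - alpha) * inter

# Example: three attempts at one positive document, three negatives.
r = composite_reward([84, 86, 85], [30, 40, 90])
```

The two terms pull in complementary directions: the intra-document term stabilizes absolute scores, while the inter-document term enforces the relative ordering a retriever ultimately needs.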
Performance and Efficiency
Experiments on the BRIGHT benchmark, which spans 12 datasets across science, mathematics, and programming, show that Retro* achieves state-of-the-art performance, significantly outperforming existing document retrieval methods. The test-time scaling mechanism further boosts accuracy, with a 7B-parameter model using scaling even surpassing a standard 32B model without it.
Furthermore, Retro* demonstrates excellent efficiency thanks to its "pointwise" approach. Unlike "listwise" or "setwise" methods, which must process candidate documents together, Retro* evaluates each query-document pair independently. This allows for massive parallelism: many documents can be scored simultaneously, yielding significantly lower inference times, especially with large candidate pools. This makes Retro* a practical and scalable solution for real-world applications.
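The parallelism advantage of pointwise scoring can be sketched in a few lines. The `score_pair` function below is a toy word-overlap stand-in for the actual LLM relevance call, included only so the sketch runs; the key point is that every candidate is dispatched independently.

```python
from concurrent.futures import ThreadPoolExecutor

def score_pair(query: str, doc: str) -> int:
    """Stand-in for one pointwise LLM relevance call: a toy word-overlap
    score on a 0-100 scale, purely to make the sketch runnable."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return int(100 * len(q & d) / max(len(q), 1))

def score_candidates(query: str, docs: list[str], workers: int = 8) -> list[int]:
    """Because each query-document pair is scored independently, all
    candidates can be evaluated in parallel, unlike listwise methods
    that must see the whole candidate set in one context."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: score_pair(query, d), docs))

# Example: two candidates scored concurrently for one query.
results = score_candidates("binary search tree",
                           ["binary search tree in python", "cooking pasta"])
print(results)  # [100, 0]
```

With a real model endpoint behind `score_pair`, the same structure lets latency stay roughly flat as the candidate pool grows, up to the available worker capacity.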
The research also highlights that Retro*'s rubric-based scoring provides a clear and interpretable measure of relevance, effectively separating highly relevant documents from irrelevant ones, a capability often lacking in other models. For more technical details, refer to the full research paper.