TLDR: A new research paper introduces a reliable evaluation protocol for low-precision retrieval systems, addressing the issue of ‘spurious ties’ that arise when relevance scores are computed with reduced numerical precision. The protocol comprises High-Precision Scoring (HPS), which upcasts only the final scoring step to higher precision to resolve ties efficiently, and Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty. Experiments demonstrate that this combined approach significantly reduces evaluation instability and accurately reflects true model performance, preventing misleading conclusions from conventional tie-oblivious metrics.
In the rapidly evolving world of artificial intelligence, particularly in retrieval systems, there’s a constant push to make models more efficient. One popular approach is to lower the numerical precision of model parameters and computations. These low-precision techniques, such as quantization and compression, significantly boost efficiency and scalability while reducing computational cost. However, a recent research paper highlights a critical challenge that arises when relevance scores between a query and documents are computed at low precision: the emergence of ‘spurious ties’.
The Problem of Spurious Ties
When models operate at lower numerical precision (for example, moving from FP32 to FP16 or BF16), the grid of representable floating-point numbers becomes coarser. Many relevance scores that would be distinct in FP32 get rounded to the same value, creating these ‘spurious ties’. Imagine a scenario where several documents differ in their true relevance to a query, yet, due to low precision, they all receive exactly the same score. Current evaluation systems often break these ties arbitrarily, for instance by document ID, which introduces high variability and makes the evaluation results unreliable. This can lead to misleading conclusions about a model’s true performance: a model that appears superior can turn out to be inferior once ties are properly accounted for.
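To see why, consider a minimal NumPy sketch (the scores below are hypothetical): casting from FP32 to FP16 coarsens the representable grid, so nearby-but-distinct scores collapse into tie groups, and a stable sort then orders the tied documents purely by index.

```python
import numpy as np

# Hypothetical FP32 relevance scores for five candidate documents.
scores_fp32 = np.array([0.84231, 0.84229, 0.84226, 0.79112, 0.79108],
                       dtype=np.float32)

# FP16 resolves only ~3 decimal digits near these values, so nearby-but-
# distinct scores round to the same number, forming two spurious tie groups.
scores_fp16 = scores_fp32.astype(np.float16)
print(scores_fp16)            # e.g. [0.8423 0.8423 0.8423 0.791 0.791]

# A tie-oblivious ranking now depends on document order, not relevance:
# a stable sort falls back to index order inside each tie group.
print(np.argsort(-scores_fp16, kind="stable"))   # [0 1 2 3 4]
```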
Introducing a Reliable Evaluation Protocol
To tackle this significant issue, researchers have proposed a more robust retrieval evaluation protocol. This protocol consists of two key components designed to reduce score variation and provide a more accurate assessment of low-precision retrieval systems:
1. High-Precision Scoring (HPS)
HPS addresses spurious ties with minimal computational overhead. Instead of performing the entire model’s computations in high precision, HPS upcasts only the final scoring step to a higher-precision format such as FP32. The bulk of the model’s operations remain in the efficient low-precision format, preserving latency and memory savings. Because the final score is computed at high precision, HPS breaks up large tie groups, restoring a fine-grained distinction between candidates that would otherwise appear identical. This dramatically reduces the instability caused by ties and recovers a near-deterministic, high-precision-like ordering of results.
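While the paper’s exact implementation isn’t reproduced here, the idea is easy to sketch in PyTorch. In this hypothetical dot-product retriever, the encoder output stays in BF16 and only the final similarity computation is upcast to FP32 (`hps_scores` is an illustrative name, not the authors’ API):

```python
import torch

def hps_scores(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """High-Precision Scoring sketch: upcast only the final scoring step."""
    # The encoder ran in BF16 and keeps its latency/memory savings;
    # only this last dot product is accumulated and stored in FP32.
    return doc_embs.float() @ query_emb.float()

# Hypothetical BF16 embeddings from a low-precision encoder.
query_emb = torch.randn(768).to(torch.bfloat16)
doc_embs = torch.randn(1000, 768).to(torch.bfloat16)

naive = (doc_embs @ query_emb).float()   # scored in BF16: coarse grid, ties
hps = hps_scores(query_emb, doc_embs)    # scored in FP32: ties mostly vanish

print(naive.unique().numel(), "distinct BF16 scores out of 1000")
print(hps.unique().numel(), "distinct HPS scores out of 1000")
```

Because only a single matrix-vector product is upcast, the extra cost is negligible compared with the encoder’s forward pass, which is precisely the trade-off HPS exploits.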
2. Tie-aware Retrieval Metrics (TRM)
TRM complements HPS by providing a more comprehensive way to report evaluation scores. Traditional tie-oblivious methods simply truncate ranked lists based on an arbitrary order, leading to unpredictable results. TRM, on the other hand, reports three quantities (a code sketch combining all three follows the list):
- Expected Score: This metric calculates the average performance value across all possible orderings of tied candidates, effectively removing the randomness introduced by arbitrary tie-breaking.
- Score Range: TRM quantifies the uncertainty due to unresolved internal orderings by reporting the maximum and minimum achievable scores. A smaller range indicates more stable and reliable results.
- Score Bias: This measures the difference between the conventional tie-oblivious metric and the expected score. A large positive bias, for instance, indicates that the tie-oblivious evaluation is overestimating the model’s performance.
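As a concrete illustration, here is a minimal sketch of all three quantities for Precision@k, assuming tied candidates are equally likely to appear in any internal order; the function name and interface are hypothetical, not the paper’s code:

```python
import numpy as np
from itertools import groupby

def tie_aware_precision_at_k(scores, relevant, k):
    """Sketch of TRM-style reporting for Precision@k.

    `scores` and `relevant` are parallel arrays (float scores, bool labels).
    Returns the expected score, the [min, max] range over all orderings of
    tied candidates, and the bias of the tie-oblivious evaluation."""
    order = np.argsort(-scores, kind="stable")     # ties kept in input order
    naive = relevant[order[:k]].sum() / k          # conventional, tie-oblivious

    exp = best = worst = 0.0
    filled = 0
    for _, grp in groupby(order, key=lambda i: scores[i]):
        grp = list(grp)
        if filled >= k:
            break
        slots = min(k - filled, len(grp))          # slots left above the cutoff
        rel = int(relevant[np.array(grp)].sum())   # relevant docs in this group
        exp += slots * rel / len(grp)              # expected count (hypergeometric)
        best += min(slots, rel)                    # relevant docs ordered first
        worst += max(0, slots - (len(grp) - rel))  # relevant docs ordered last
        filled += len(grp)

    exp, best, worst = exp / k, best / k, worst / k
    return {"expected": exp, "range": (worst, best), "bias": naive - exp}

# Hypothetical FP16 scores with a three-way tie straddling the k=3 cutoff.
scores = np.array([0.91, 0.88, 0.88, 0.88, 0.75], dtype=np.float16)
relevant = np.array([True, False, True, False, True])
print(tie_aware_precision_at_k(scores, relevant, k=3))
# expected ~0.556, range (0.333, 0.667), bias ~+0.111
```

In this toy example the tie-oblivious number overstates the expected score by about 0.11, exactly the kind of positive bias the protocol is designed to expose.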
Experimental Validation
The researchers conducted extensive experiments using various models, including Qwen3-Reranker and multilingual-e5-large, across different scoring functions (softmax, sigmoid, pairwise product) and datasets like MIRACLReranking and AskUbuntuDupQuestions. The results were compelling. They observed that evaluating low-precision models with conventional tie-oblivious metrics indeed led to significant uncertainty and misleading outcomes. For example, one model appeared to outperform another in BF16 evaluation, but the tie-aware metric revealed the opposite. Furthermore, tie-oblivious metrics consistently overestimated performance, sometimes by a substantial margin.
However, when HPS was applied, the score ranges shrank dramatically, often by an order of magnitude, bringing stability close to that of full FP32 inference but with negligible extra cost. The combination of HPS and TRM provided a consistent and discriminative framework for evaluating retrieval models in low-precision settings, accurately reflecting true model performance and exposing biases inherent in naive evaluation methods.
Conclusion and Future Outlook
This research demonstrates that addressing tied candidates is crucial for reliable evaluation of low-precision retrieval systems. The proposed High-Precision Scoring and Tie-aware Retrieval Metrics offer a practical and effective solution. This protocol not only mitigates spurious ties across different precision formats but also provides a more dependable alternative to previous naive methods. Ultimately, this enables more stable document retrieval in critical applications like retrieval-augmented generation (RAG), all while preserving the efficiency and memory benefits that low-precision models offer.
For more technical details, you can refer to the full research paper: Reliable Evaluation Protocol for Low-Precision Retrieval.


