TLDR: A new research paper introduces a method for estimating Large Language Model (LLM) errors in pairwise text comparisons without requiring ground truth. The method estimates both a uniform error rate and positional bias, and uses a Copeland ranking to show that rankings built from LLM preferences do not scale well. Experiments with six LLMs and five text types reveal that LLMs are more error-prone with meaningless content. Claude showed the most desirable performance, balancing low error rates with robustness to prompt variations. The method offers more coherent error estimates than existing approaches, highlighting LLM limitations in consistent judgment and sensitivity to prompts.
Large Language Models (LLMs) have become incredibly powerful tools, capable of generating human-like text, answering complex questions, and even assisting with creative tasks. However, like any advanced technology, they are not without their flaws. One significant challenge is their propensity for errors, often referred to as ‘hallucinations,’ where they generate factually incorrect or inconsistent information. This problem becomes particularly tricky to evaluate when there’s no clear ‘ground truth’ to compare their output against.
A recent research paper, titled “Estimating the Error of Large Language Models at Pairwise Text Comparison,” by Tianyi Li from the Department of Decisions, Operations and Technology at CUHK, delves into a novel method for quantifying these errors, specifically when LLMs are asked to compare two pieces of text and express a preference. This study is crucial because it offers a way to understand LLM reliability without needing a perfect, human-verified answer for every comparison.
The Challenge of LLM Comparisons
When an LLM compares two texts, say Text 1 and Text 2, it might indicate a preference. But how accurate is that preference? And what if the order of the texts matters? The paper highlights a phenomenon called ‘positional bias,’ where an LLM might unfairly favor the text placed first in a comparison, regardless of its actual quality. This is similar to biases observed in human decision-making or other machine learning systems.
Another major issue is scalability. Imagine trying to rank a large number of texts by comparing them pairwise. The number of comparisons grows quadratically: n texts require n(n-1)/2 comparisons, so 10 texts need 45 while 100 texts need 4,950. The paper suggests that the reliability of the resulting ranking deteriorates as more objects are introduced: while LLMs can compare a few items well, their ability to produce a consistent ranking from many comparisons becomes poor.
A New Method for Error Estimation
The research proposes a method to measure LLM errors in two main scenarios:
- Uniform Error Rate: This scenario assumes the LLM has a consistent probability of making an error, regardless of which text is presented first. To estimate it, each pair of texts is compared twice, with their order swapped, and the results are combined (see the sketch after this list).
- Binary Positional Bias: This is a more nuanced scenario, acknowledging that the LLM may have different error rates depending on whether the ‘better’ text is placed first or second. To estimate these two distinct error rates, repeated comparisons of the same pair of texts are conducted.
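To make these two scenarios concrete, here is a minimal Python sketch of how such quantities could be computed from repeated LLM judgments. It is an illustration under simplifying assumptions (independent errors across runs, a disagreement rate at or below 50%), not the paper’s exact estimator, and the function names and data format are hypothetical.

```python
import math

def estimate_uniform_error(swapped_pairs):
    """Estimate a single error rate from order-swapped comparisons (no ground truth).

    swapped_pairs: list of (pick_original_order, pick_swapped_order) tuples naming
    which text ('A' or 'B') the LLM preferred in each run, independent of position.
    If every judgment is independently wrong with probability eps, the two runs
    disagree with probability 2 * eps * (1 - eps); invert that to recover eps.
    """
    d = sum(1 for a, b in swapped_pairs if a != b) / len(swapped_pairs)
    d = min(d, 0.5)  # 2*eps*(1-eps) never exceeds 1/2 for eps in [0, 1/2]
    return (1 - math.sqrt(1 - 2 * d)) / 2

def first_position_preference(repeated_runs):
    """Share of repeated runs in which the LLM picked whichever text was shown first.

    repeated_runs: list of booleans, True when the first-position text was chosen.
    A value well above 0.5 signals the kind of positional bias that the paper
    models with two separate error terms.
    """
    return sum(repeated_runs) / len(repeated_runs)

# Example: 100 pairs each judged twice with the order swapped, 18 disagreements
example = [('A', 'A')] * 82 + [('A', 'B')] * 18
print(round(estimate_uniform_error(example), 3))  # ~0.10
```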
The study uses a technique called ‘Copeland counting’ to construct a ranking from the LLM’s pairwise preferences. By analyzing how this ranking deviates from a theoretically perfect one, the researchers can estimate the LLM’s error rates. A key finding from this approach is that the Copeland ranking, when based on LLM preferences, is indeed not scalable; its accuracy decreases as the number of texts to be ranked increases.
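Copeland counting itself is easy to illustrate: each text earns points for the pairwise comparisons it wins, and the texts are ranked by total score. Below is a minimal sketch using one common variant (wins minus losses; the paper’s exact convention may differ), assuming the LLM’s preferences are given as (winner, loser) pairs.

```python
from collections import defaultdict

def copeland_ranking(preferences):
    """Rank items by Copeland score: +1 for each pairwise win, -1 for each loss.

    preferences: iterable of (winner, loser) pairs, one per LLM comparison.
    Returns the items sorted from highest to lowest score.
    """
    score = defaultdict(int)
    for winner, loser in preferences:
        score[winner] += 1
        score[loser] -= 1
    return sorted(score, key=score.get, reverse=True)

# Example with three texts and three LLM judgments
prefs = [("poem_1", "poem_2"), ("poem_1", "poem_3"), ("poem_3", "poem_2")]
print(copeland_ranking(prefs))  # ['poem_1', 'poem_3', 'poem_2']
```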
Experiments and Key Findings
The method was applied to six popular LLMs: ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen. These LLMs were tasked with comparing five different types of text inputs:
- Pseudo-word paragraphs (meaningless text)
- Pseudo paragraphs (random English words)
- Advertising slogans
- Short poems
- Academic abstracts
The results were quite revealing. Unsurprisingly, LLMs were found to be more error-prone when comparing meaningless content (like pseudo-word paragraphs) where there’s no clear ‘better’ option. For meaningful texts, the error rates were generally lower.
Among the tested LLMs, Claude emerged with the most desirable performance, demonstrating a good balance of low error rates and robustness to variations in the prompts given to the LLM. Qwen also performed well in terms of raw error rates but was less consistent when the prompts were slightly changed. Gemini, on the other hand, generally exhibited the highest error rates.
Interestingly, the study found that the two positional bias terms (favoring first or second position) were often quite similar to the uniform error rate, especially for meaningful text types. This suggests that while positional bias exists, its impact might not always be drastically different from a general error rate.
The paper also compared its method against existing techniques like ‘commutativity scores’ and a ‘biased Bradley-Terry model.’ The researchers concluded that their approach provided more coherent and indicative estimates of LLM errors.
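For context, the Bradley-Terry model assigns each text a latent strength and predicts the probability of a preference from the ratio of strengths; a ‘biased’ variant adds an order-effect parameter so the first-presented text gets an extra boost. The sketch below shows one common formulation (a multiplicative position advantage gamma); it is a generic illustration, not necessarily the exact model the paper compares against.

```python
def biased_bradley_terry(strength_first, strength_second, gamma=1.0):
    """Probability that the first-presented text is preferred.

    strength_*: positive latent quality scores for the two texts.
    gamma: position-advantage factor; gamma > 1 inflates the first position's
    chances, and gamma = 1 recovers the standard Bradley-Terry model.
    """
    return gamma * strength_first / (gamma * strength_first + strength_second)

# Two equally strong texts, but a 1.5x advantage for the first position
print(biased_bradley_terry(1.0, 1.0, gamma=1.5))  # 0.6
```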
Implications for the Future
This research provides a valuable, ground-truth-free framework for understanding and quantifying errors in LLM-based pairwise text comparisons. It highlights the inherent limitations of LLMs in tasks requiring consistent judgment across many items and their sensitivity to prompt variations. While Claude showed promising performance in this experiment, the authors caution against overstating any single LLM’s superiority, emphasizing the need for more extensive testing across diverse scenarios and prompt engineering strategies.
Ultimately, understanding these errors is a crucial step towards building more reliable and trustworthy AI systems. For more details, see the full research paper.