TLDR: This research proposes a novel direct-scoring method for evaluating AI-generated text (summaries, dialogue, stories) that uses synthetic, quality-graded examples as anchors for pairwise comparisons. It performs comparably to state-of-the-art pairwise evaluators on benchmarks like SummEval, TopicalChat, and HANNA, while also assigning absolute scores for filtering and sorting, something traditional comparison-based methods cannot do.
Evaluating the quality of text generated by large language models (LLMs) has become a significant challenge as these models advance. Traditional methods often struggle to capture the nuances of human judgment, leading to a demand for more sophisticated evaluation techniques. This research introduces a novel approach that combines the strengths of direct scoring with the effectiveness of pairwise comparisons, offering a more robust way to assess AI-generated content.
Historically, evaluating free-form content like summaries, dialogue, or stories generated by LLMs has relied on metrics based on n-gram overlap or on smaller, pre-trained language models. While useful, these metrics frequently fell short in complex scenarios and did not always align well with human perceptions of quality. More recently, LLMs themselves have been employed as evaluators; comparison-based (pairwise) approaches in particular have shown strong alignment with human judgment.
However, a key limitation of comparison-based approaches is their inability to assign absolute scores to individual pieces of text. This is crucial for applications that require setting thresholds, such as filtering out low-quality content or sorting outputs by quality. The paper, “Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too,” addresses this gap by proposing a direct-scoring method that cleverly integrates pairwise comparisons using synthetically generated examples.
How the New Method Works
The core of this method involves two main steps. First, it creates a set of “synthetic in-context examples” – machine-generated summaries spanning a range of quality levels, from worst to best. This is achieved by prompting an LLM to generate summaries of a target quality along a specific dimension such as consistency, coherence, relevance, or fluency. The process starts by generating the extreme examples (worst and best) and then recursively creates summaries of intermediate quality, as sketched below.
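Here is a minimal sketch of this generation step, assuming a hypothetical `llm(prompt) -> str` helper that wraps an instruction-tuned model; the prompt wording and the number of quality levels are illustrative, not the paper’s exact prompts.

```python
# Sketch: build a worst-to-best ladder of synthetic summaries for one dimension,
# generating the extremes first and then recursively filling in intermediates.
# `llm` is a hypothetical callable that sends a prompt to an LLM and returns text.

def make_prompt(source, dimension, instruction):
    return (
        f"Source document:\n{source}\n\n"
        f"Write a summary whose {dimension} is {instruction}."
    )

def generate_anchor_summaries(llm, source, dimension="coherence", levels=5):
    """Return `levels` synthetic summaries ordered from worst to best quality."""
    # 1. Generate the two extremes first.
    worst = llm(make_prompt(source, dimension, "as poor as possible"))
    best = llm(make_prompt(source, dimension, "as good as possible"))
    anchors = {1: worst, levels: best}

    # 2. Recursively generate summaries of intermediate quality between known anchors.
    def fill(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        prompt = (
            f"Source document:\n{source}\n\n"
            f"Here is a summary with poor {dimension}:\n{anchors[lo]}\n\n"
            f"Here is a summary with good {dimension}:\n{anchors[hi]}\n\n"
            f"Write a summary whose {dimension} falls between the two."
        )
        anchors[mid] = llm(prompt)
        fill(lo, mid)
        fill(mid, hi)

    fill(1, levels)
    return [anchors[i] for i in sorted(anchors)]
```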
Second, at evaluation time, a machine-generated summary is compared against these synthetic examples. Instead of directly predicting an absolute score, the LLM calculates the probability of the machine summary being “Better,” “Worse,” or “Similar” to each synthetic example. These probabilities are then used to compute a weighted average, resulting in an absolute score for the machine-generated text. This approach differs from previous direct-scoring methods that directly predict scores, instead leveraging the LLM’s comparative judgment ability.
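As a rough illustration, the scoring step might look like the sketch below, assuming a hypothetical `label_probs(candidate, anchor)` helper that returns the evaluator LLM’s probabilities for the verdicts “Better,” “Worse,” and “Similar”; the exact weighting scheme in the paper may differ.

```python
# Sketch: convert pairwise verdict probabilities against each synthetic anchor
# into one absolute score via a probability-weighted average.
# `label_probs(candidate, anchor)` is a hypothetical helper returning a dict
# like {"Better": 0.6, "Similar": 0.3, "Worse": 0.1}.

def score_candidate(label_probs, candidate, anchors, anchor_scores, step=1.0):
    """Score `candidate` against quality-graded anchors (e.g. levels 1..5)."""
    total, weight = 0.0, 0.0
    for anchor, anchor_score in zip(anchors, anchor_scores):
        p = label_probs(candidate, anchor)
        # Each verdict implies a score relative to this anchor's quality level.
        expected = (
            p["Better"] * (anchor_score + step)
            + p["Similar"] * anchor_score
            + p["Worse"] * (anchor_score - step)
        )
        total += expected
        weight += p["Better"] + p["Similar"] + p["Worse"]
    return total / weight


# Example usage with five anchors graded from 1 (worst) to 5 (best):
# score = score_candidate(label_probs, machine_summary, anchors, [1, 2, 3, 4, 5])
```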
Key Findings and Performance
The researchers tested their method on three major meta-evaluation benchmarks: SummEval for summarization, TopicalChat for dialogue, and HANNA for story generation. The results demonstrate that this new direct-scoring method performs comparably to state-of-the-art pairwise evaluators in terms of sample-level correlations with human judgment. For instance, it showed the best average performance on SummEval and HANNA, and second-best on TopicalChat.
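For context, sample-level correlation in these benchmarks is typically computed by correlating metric scores with human scores within each source item and then averaging across items; a rough sketch using SciPy’s Kendall’s tau, with illustrative (assumed) data structures, follows.

```python
# Sketch: sample-level correlation = per-item correlation between metric and
# human scores (across systems), averaged over items.
# `metric_scores[item][system]` and `human_scores[item][system]` are assumed
# illustrative structures, not the benchmarks' actual file formats.
from scipy.stats import kendalltau

def sample_level_correlation(metric_scores, human_scores):
    taus = []
    for item, per_system in metric_scores.items():
        systems = list(per_system)
        m = [per_system[s] for s in systems]
        h = [human_scores[item][s] for s in systems]
        tau, _ = kendalltau(m, h)  # Kendall's tau for this single item
        taus.append(tau)
    return sum(taus) / len(taus)
```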
Ablation studies further revealed that the use of probability generation (predicting “Better,” “Worse,” “Similar”) is crucial for the method’s performance. Increasing the number of synthetic examples used for comparison also moderately boosted performance. Furthermore, the study found that using more powerful LLMs as backbones for both generating synthetic examples and making predictions generally led to better evaluation performance, especially for models with strong instruction-following capabilities.
Implications for AI Text Evaluation
This research offers a significant step forward for evaluating natural language generation. By providing a direct-scoring metric that rivals the performance of comparison-based approaches, it opens up new possibilities for use cases requiring absolute scores, such as filtering, sorting, and thresholding AI-generated content. The researchers have also made their synthetic summaries, code, and prompts publicly available to support future work in this area. More details are available in the original research paper.


