
A New Way to Score AI-Generated Text: Combining Direct Assessment with Smart Comparisons

TL;DR: This research proposes a novel direct-scoring method for evaluating AI-generated text (summaries, dialogue, stories) that uses synthetic, quality-graded examples for pairwise comparisons. It achieves performance comparable to state-of-the-art pairwise evaluators on benchmarks like SummEval, TopicalChat, and HANNA, offering the crucial ability to assign absolute scores for filtering and sorting, a limitation of traditional comparison-based methods.

Evaluating the quality of text generated by large language models (LLMs) has become a significant challenge as these models advance. Traditional methods often struggle to capture the nuances of human judgment, leading to a demand for more sophisticated evaluation techniques. This research introduces a novel approach that combines the strengths of direct scoring with the effectiveness of pairwise comparisons, offering a more robust way to assess AI-generated content.

Historically, evaluating free-form content like summaries, dialogue, or stories generated by LLMs has relied on methods that compare n-gram overlap or use smaller, pre-trained language models. While these had their uses, they frequently fell short in complex scenarios and didn’t always align well with human perceptions of quality. More recently, LLMs themselves have been employed as evaluators, particularly comparison-based methods, which have shown strong alignment with human judgment.

However, a key limitation of comparison-based approaches is their inability to assign absolute scores to individual pieces of text. This is crucial for applications that require setting thresholds, such as filtering out low-quality content or sorting outputs by quality. The paper, “Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too,” addresses this gap by proposing a direct-scoring method that cleverly integrates pairwise comparisons using synthetically generated examples.

How the New Method Works

The core of this innovative method involves two main steps. First, it creates a set of “synthetic in-context examples” – essentially, machine-generated summaries of varying quality levels (e.g., from worst to best). This is achieved by prompting an LLM to generate summaries that reflect specific quality dimensions like consistency, coherence, relevance, or fluency. The process starts by generating the extreme examples (worst and best) and then recursively creates intermediate quality summaries.
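The anchor-then-fill generation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `llm` callable, the prompt wording, and the function names are all assumptions. The key idea it demonstrates is generating the two extremes first, then recursively inserting intermediate-quality summaries between adjacent pairs until the scale is full.

```python
from typing import Callable

def build_quality_scale(
    llm: Callable[[str], str],
    source: str,
    dimension: str,
    levels: int = 5,
) -> list[str]:
    """Generate `levels` summaries graded worst-to-best on `dimension`.

    First generates the two extreme anchors, then repeatedly fills in
    midpoints between adjacent pairs until the scale has enough levels.
    Illustrative sketch only; prompts are hypothetical.
    """
    worst = llm(f"Write the worst possible summary of this text "
                f"with respect to {dimension}:\n{source}")
    best = llm(f"Write the best possible summary of this text "
               f"with respect to {dimension}:\n{source}")
    scale = [worst, best]
    while len(scale) < levels:
        need = levels - len(scale)  # how many midpoints we still want
        expanded = [scale[0]]
        for lo, hi in zip(scale, scale[1:]):
            if need > 0:
                mid = llm(f"Write a summary whose {dimension} falls between "
                          f"these two examples.\nWorse: {lo}\nBetter: {hi}\n"
                          f"Text: {source}")
                expanded.append(mid)
                need -= 1
            expanded.append(hi)
        scale = expanded
    return scale
```

With `levels=5`, this makes five LLM calls in total: two for the anchors, one midpoint in the first pass, and two more midpoints in the second.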

Second, at evaluation time, a machine-generated summary is compared against these synthetic examples. Instead of predicting an absolute score outright, the LLM estimates the probability that the machine summary is “Better,” “Worse,” or “Similar” relative to each synthetic example. These probabilities are then combined into a weighted average, yielding an absolute score for the machine-generated text. Unlike previous direct-scoring methods, which predict scores directly, this approach leverages the LLM’s strength at comparative judgment.
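One plausible way to turn the per-anchor comparison probabilities into a single absolute score is sketched below. The function name, the per-anchor scores, and the specific aggregation (shifting each anchor's score by the net "Better minus Worse" probability, then averaging) are illustrative assumptions; the paper's exact weighting may differ.

```python
def score_from_comparisons(
    anchor_scores: list[float],
    probs: list[tuple[float, float, float]],
    step: float = 1.0,
) -> float:
    """Aggregate comparison probabilities into one absolute score.

    `probs[i]` holds (P_better, P_worse, P_similar) for the candidate
    versus anchor i. Each anchor's score is shifted by
    step * (P_better - P_worse), and the shifted scores are averaged.
    Illustrative aggregation, not necessarily the paper's formula.
    """
    implied = [
        s + step * (p_better - p_worse)
        for s, (p_better, p_worse, _p_similar) in zip(anchor_scores, probs)
    ]
    return sum(implied) / len(implied)
```

For example, a candidate judged clearly better than anchors scored 1-3, similar to the anchor scored 4, and worse than the anchor scored 5 ends up with a score a little above 3, as one would expect from a roughly mid-to-high-quality summary.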

Key Findings and Performance

The researchers tested their method on three major meta-evaluation benchmarks: SummEval for summarization, TopicalChat for dialogue, and HANNA for story generation. The results demonstrate that this new direct-scoring method performs comparably to state-of-the-art pairwise evaluators in terms of sample-level correlations with human judgment. For instance, it showed the best average performance on SummEval and HANNA, and second-best on TopicalChat.

Ablation studies further revealed that the use of probability generation (predicting “Better,” “Worse,” “Similar”) is crucial for the method’s performance. Increasing the number of synthetic examples used for comparison also moderately boosted performance. Furthermore, the study found that using more powerful LLMs as backbones for both generating synthetic examples and making predictions generally led to better evaluation performance, especially for models with strong instruction-following capabilities.


Implications for AI Text Evaluation

This research offers a significant step forward for evaluating natural language generation. By providing a direct-scoring metric that rivals the performance of comparison-based approaches, it opens up new possibilities for use cases requiring absolute scores, such as filtering, sorting, and thresholding AI-generated content. The researchers have also made their synthetic summaries, code, and prompts publicly available to support future work in this area. Full details are available in the research paper.

Karthik Mehta (https://blogs.edgentiq.com)
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
