TLDR: This research proposes a novel direct-scoring method for evaluating AI-generated text (summaries, dialogue, stories) that uses synthetic, quality-graded examples as anchors for pairwise comparisons. It performs comparably to state-of-the-art pairwise evaluators on benchmarks like SummEval, TopicalChat, and HANNA, while also assigning absolute scores for filtering and sorting, something traditional comparison-based methods cannot do.
Evaluating the quality of text generated by large language models (LLMs) has become a significant challenge as these models advance. Traditional methods often struggle to capture the nuances of human judgment, leading to a demand for more sophisticated evaluation techniques. This research introduces a novel approach that combines the strengths of direct scoring with the effectiveness of pairwise comparisons, offering a more robust way to assess AI-generated content.
Historically, evaluating free-form content like summaries, dialogue, or stories generated by LLMs has relied on metrics based on n-gram overlap or on smaller, pre-trained language models. While useful, these metrics frequently fell short in complex scenarios and did not always align well with human perceptions of quality. More recently, LLMs themselves have been employed as evaluators; comparison-based (pairwise) approaches in particular have shown strong alignment with human judgment.
However, a key limitation of comparison-based approaches is their inability to assign absolute scores to individual pieces of text. This is crucial for applications that require setting thresholds, such as filtering out low-quality content or sorting outputs by quality. The paper, “Direct-Scoring NLG Evaluators Can Use Pairwise Comparisons Too,” addresses this gap by proposing a direct-scoring method that cleverly integrates pairwise comparisons using synthetically generated examples.
How the New Method Works
The core of this method involves two main steps. First, it creates a set of “synthetic in-context examples” – machine-generated summaries spanning a range of quality levels, from worst to best. This is achieved by prompting an LLM to generate summaries of a target quality along a specific dimension such as consistency, coherence, relevance, or fluency. The process starts by generating the extreme examples (worst and best) and then recursively creates summaries of intermediate quality, as sketched below.
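Here is a minimal sketch of this generation step, assuming a hypothetical `llm(prompt) -> str` helper that wraps an instruction-tuned model; the prompt wording and the number of quality levels are illustrative, not the paper’s exact prompts.

```python
# Sketch: build a worst-to-best ladder of synthetic summaries for one dimension,
# generating the extremes first and then recursively filling in intermediates.
# `llm` is a hypothetical callable that sends a prompt to an LLM and returns text.

def make_prompt(source, dimension, instruction):
    return (
        f"Source document:\n{source}\n\n"
        f"Write a summary whose {dimension} is {instruction}."
    )

def generate_anchor_summaries(llm, source, dimension="coherence", levels=5):
    """Return `levels` synthetic summaries ordered from worst to best quality."""
    # 1. Generate the two extremes first.
    worst = llm(make_prompt(source, dimension, "as poor as possible"))
    best = llm(make_prompt(source, dimension, "as good as possible"))
    anchors = {1: worst, levels: best}

    # 2. Recursively generate summaries of intermediate quality between known anchors.
    def fill(lo, hi):
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        prompt = (
            f"Source document:\n{source}\n\n"
            f"Here is a summary with poor {dimension}:\n{anchors[lo]}\n\n"
            f"Here is a summary with good {dimension}:\n{anchors[hi]}\n\n"
            f"Write a summary whose {dimension} falls between the two."
        )
        anchors[mid] = llm(prompt)
        fill(lo, mid)
        fill(mid, hi)

    fill(1, levels)
    return [anchors[i] for i in sorted(anchors)]
```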
Second, at evaluation time, a machine-generated summary is compared against these synthetic examples. Instead of directly predicting an absolute score, the LLM calculates the probability of the machine summary being “Better,” “Worse,” or “Similar” to each synthetic example. These probabilities are then used to compute a weighted average, resulting in an absolute score for the machine-generated text. This approach differs from previous direct-scoring methods that directly predict scores, instead leveraging the LLM’s comparative judgment ability.
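As a rough illustration, the scoring step might look like the sketch below, assuming a hypothetical `label_probs(candidate, anchor)` helper that returns the evaluator LLM’s probabilities for the verdicts “Better,” “Worse,” and “Similar”; the exact weighting scheme in the paper may differ.

```python
# Sketch: convert pairwise verdict probabilities against each synthetic anchor
# into one absolute score via a probability-weighted average.
# `label_probs(candidate, anchor)` is a hypothetical helper returning a dict
# like {"Better": 0.6, "Similar": 0.3, "Worse": 0.1}.

def score_candidate(label_probs, candidate, anchors, anchor_scores, step=1.0):
    """Score `candidate` against quality-graded anchors (e.g. levels 1..5)."""
    total, weight = 0.0, 0.0
    for anchor, anchor_score in zip(anchors, anchor_scores):
        p = label_probs(candidate, anchor)
        # Each verdict implies a score relative to this anchor's quality level.
        expected = (
            p["Better"] * (anchor_score + step)
            + p["Similar"] * anchor_score
            + p["Worse"] * (anchor_score - step)
        )
        total += expected
        weight += p["Better"] + p["Similar"] + p["Worse"]
    return total / weight


# Example usage with five anchors graded from 1 (worst) to 5 (best):
# score = score_candidate(label_probs, machine_summary, anchors, [1, 2, 3, 4, 5])
```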
Key Findings and Performance
The researchers tested their method on three major meta-evaluation benchmarks: SummEval for summarization, TopicalChat for dialogue, and HANNA for story generation. The results demonstrate that this new direct-scoring method performs comparably to state-of-the-art pairwise evaluators in terms of sample-level correlations with human judgment. For instance, it showed the best average performance on SummEval and HANNA, and second-best on TopicalChat.
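For context, sample-level correlation in these benchmarks is typically computed by correlating metric scores with human scores within each source item and then averaging across items; a rough sketch using SciPy’s Kendall’s tau, with illustrative (assumed) data structures, follows.

```python
# Sketch: sample-level correlation = per-item correlation between metric and
# human scores (across systems), averaged over items.
# `metric_scores[item][system]` and `human_scores[item][system]` are assumed
# illustrative structures, not the benchmarks' actual file formats.
from scipy.stats import kendalltau

def sample_level_correlation(metric_scores, human_scores):
    taus = []
    for item, per_system in metric_scores.items():
        systems = list(per_system)
        m = [per_system[s] for s in systems]
        h = [human_scores[item][s] for s in systems]
        tau, _ = kendalltau(m, h)  # Kendall's tau for this single item
        taus.append(tau)
    return sum(taus) / len(taus)
```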
Ablation studies further revealed that the use of probability generation (predicting “Better,” “Worse,” “Similar”) is crucial for the method’s performance. Increasing the number of synthetic examples used for comparison also moderately boosted performance. Furthermore, the study found that using more powerful LLMs as backbones for both generating synthetic examples and making predictions generally led to better evaluation performance, especially for models with strong instruction-following capabilities.
Implications for AI Text Evaluation
This research offers a significant step forward for evaluating natural language generation. By providing a direct-scoring metric that rivals the performance of comparison-based approaches, it opens up new possibilities for use cases requiring absolute scores, such as filtering, sorting, and thresholding AI-generated content. The researchers have also made their synthetic summaries, code, and prompts publicly available to support future work in this area. More details are available in the original research paper.


