
OpenFActScore: Advancing Open-Source Evaluation for AI Text Factuality

TLDR: OpenFActScore is a new open-source framework for evaluating the factual accuracy of text generated by large language models (LLMs). It builds upon the existing FActScore method but replaces its reliance on commercial, closed-source models with any Hugging Face-compatible models, for both extracting factual claims (Atomic Fact Generation) and verifying them against knowledge sources (Atomic Fact Validation). This promotes transparency, reproducibility, and cost-effectiveness in AI evaluation, and the framework's results correlate highly with those of the original FActScore, with open models Olmo and Gemma recommended for the generation and validation stages, respectively.

The rapid growth in the use of Large Language Models (LLMs) for various daily tasks has highlighted a critical need for robust evaluation methods, especially concerning the factual accuracy of their generated text. While many aspects of LLM performance, such as reasoning and mathematics, can be assessed with standard metrics, evaluating factuality presents a unique challenge.

One prominent framework for assessing factuality is FActScore, which breaks down the evaluation into a two-stage process. First, it involves Atomic Fact Generation (AFG), where individual factual claims are extracted from long-form text. Second, it performs Atomic Fact Validation (AFV), verifying each extracted claim against a trusted knowledge source. The original FActScore, however, relied on proprietary and commercial models like InstructGPT and ChatGPT, which can be costly and limit transparency.
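The two-stage pipeline can be pictured with a rough sketch. The function names and the trivial sentence splitter below are illustrative stand-ins, not OpenFActScore's actual code; in the real framework both stages are performed by LLMs:

```python
# Illustrative sketch of the FActScore two-stage pipeline.
# In the real framework both stages prompt an LLM; these
# stand-ins only show the data flow between the stages.

def generate_atomic_facts(text: str) -> list[str]:
    """AFG stand-in: split long-form text into candidate claims.
    (The real step prompts an LLM to extract atomic facts.)"""
    return [s.strip() for s in text.split(".") if s.strip()]

def validate_fact(fact: str, knowledge: set[str]) -> bool:
    """AFV stand-in: check one claim against a knowledge source.
    (The real step prompts an LLM with retrieved passages.)"""
    return fact in knowledge

text = "Marie Curie won two Nobel Prizes. She was born in Warsaw."
knowledge = {"Marie Curie won two Nobel Prizes", "She was born in Warsaw"}
facts = generate_atomic_facts(text)
verdicts = [validate_fact(f, knowledge) for f in facts]
print(facts, verdicts)
```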

Addressing these limitations, researchers have introduced OpenFActScore, an open-source implementation of the FActScore framework. This new approach allows for the use of any Hugging Face-compatible model for both the Atomic Fact Generation and Atomic Fact Validation stages. This shift promotes greater transparency, reproducibility, and offers a more cost-effective solution for evaluating the factual accuracy of LLM outputs.

OpenFActScore maintains the core methodology of its predecessor. It defines atomic facts as short, indivisible statements containing a single piece of information, allowing for a fine-grained analysis of text. The factuality score is then calculated as the proportion of supported atomic facts over the total number generated. The framework makes key assumptions: that the support of an atomic fact by a knowledge source is undebatable, that all atomic facts hold equal importance, and that information within the knowledge base does not conflict.
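Under those assumptions the score itself is just a ratio; a minimal sketch:

```python
def factscore(verdicts: list[bool]) -> float:
    """FActScore for one generation: the fraction of atomic facts
    supported by the knowledge source. All facts weigh equally,
    per the framework's assumptions."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Hypothetical verdicts for four extracted atomic facts:
print(factscore([True, True, False, True]))  # 0.75
```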

The implementation of OpenFActScore involved refactoring the original FActScore codebase to integrate Hugging Face models, enabling support for chat templates and system prompts. This allows for more generalized and effective prompting strategies for both AFG and AFV tasks.
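In the Hugging Face ecosystem, a chat template turns a list of role-tagged messages into a model-specific prompt via the tokenizer's `apply_chat_template` method. A hedged sketch of how an AFG prompt might be assembled (the instruction wording and function name here are illustrative, not OpenFActScore's actual prompt):

```python
def build_afg_messages(passage: str) -> list[dict]:
    """Assemble chat-template messages for atomic fact generation.
    The system instruction below is an illustrative paraphrase,
    not the framework's actual prompt."""
    system = ("Break the following passage into atomic facts: short, "
              "self-contained statements that each convey exactly one "
              "piece of information.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": passage},
    ]

messages = build_afg_messages("Ada Lovelace was born in 1815 in London.")
# With a Hugging Face tokenizer, these messages would then be rendered
# into the model's own prompt format, e.g.:
#   tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```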

In their evaluation, the creators of OpenFActScore tested several popular open-source LLMs, including Gemma, Qwen, Llama 3.1-Instruct, and Olmo. For Atomic Fact Generation, they measured the semantic similarity of generated atomic facts to human-corrected gold standards using BERTScore-F1. Gemma and Olmo consistently demonstrated strong alignment with human-generated facts, indicating their proficiency in this task.

For Atomic Fact Validation, models were evaluated on their ability to verify atomic facts against retrieved reference documents, using the Error Rate: the difference between the human-annotated FActScore and the machine-estimated one. Gemma and Llama 3.1 showed lower cumulative error rates, suggesting better alignment with human judgments when validating facts. While Olmo performed well in generation, its validation results deviated more from human judgments.
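The per-item error, and one plausible reading of "cumulative error rate" (summing absolute errors across evaluation items; this aggregation detail is an assumption, not stated in the source), can be sketched as:

```python
def error_rate(human_fs: float, machine_fs: float) -> float:
    """Signed difference between the human-annotated and the
    machine-estimated FActScore, as described in the text."""
    return human_fs - machine_fs

def cumulative_error(human: list[float], machine: list[float]) -> float:
    """One plausible aggregation (an assumption here): sum of
    absolute per-item errors; lower means closer to humans."""
    return sum(abs(h - m) for h, m in zip(human, machine))

# Hypothetical scores over three evaluated generations:
print(cumulative_error([0.42, 0.55, 0.61], [0.40, 0.58, 0.60]))
```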

Based on their findings, the researchers propose using Olmo for Atomic Fact Generation and Gemma for Atomic Fact Validation within OpenFActScore. This combination leverages Olmo’s fully open-source nature and strong generation capabilities, paired with Gemma’s reliable validation performance. Despite some differences in absolute scores compared to the original FActScore, OpenFActScore demonstrated a high Pearson correlation of 0.99 with the original framework’s experiments, indicating that the ranking of models based on factuality remains consistent.
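Pearson correlation over per-model scores is what makes the "rankings stay consistent" claim concrete: absolute scores can shift while the ordering is preserved. A self-contained sketch with hypothetical score pairs (the numbers below are invented for illustration, not the paper's data):

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient: covariance of the two score
    lists divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical FActScore estimates for the same four models under the
# original framework (x) and OpenFActScore (y): absolute values differ,
# but the model ordering is identical, so r stays close to 1.
original = [0.30, 0.42, 0.58, 0.71]
open_fs  = [0.25, 0.39, 0.52, 0.66]
print(round(pearson(original, open_fs), 3))
```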


OpenFActScore represents a significant step towards making factuality evaluation more accessible and transparent for the broader AI community. By enabling the use of open-source models, it reduces reliance on proprietary systems and fosters further research into improving the factual accuracy of large language models. You can find more details about this work in the research paper: OpenFActScore: Open-Source Atomic Evaluation of Factuality in Text Generation.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
