TL;DR: A new research paper introduces an LLM-driven pipeline that generates synthetic training data for health fact-checking. The method summarizes documents, extracts atomic facts, builds sentence-fact tables, and creates synthetic text-claim pairs. Combined with the original data, these pairs boost the F1 scores of BERT-based fact-checking models on the PubHealth and SciFact datasets, addressing the scarcity of annotated data in the health domain and even showing potential for detecting LLM hallucinations.
In the critical field of health information, ensuring the accuracy of claims is paramount to public well-being. However, developing reliable fact-checking systems faces a significant hurdle: a scarcity of high-quality, labeled training data. Traditional annotation processes for health-related content demand specialized medical expertise, making them costly and time-consuming. This often leads to models struggling to generalize effectively to medical claims, as existing general-purpose datasets lack the necessary domain-specific knowledge.
A recent research paper, titled “Enhancing Health Fact-Checking with LLM-Generated Synthetic Data,” proposes an innovative solution to this data limitation. Authored by Jingze Zhang, Jiahe Qian, Yiliang Zhou, and Yifan Peng, the study introduces a novel pipeline that leverages the power of large language models (LLMs) to create synthetic training data, thereby augmenting existing datasets and significantly improving the performance of health fact-checkers.
How the Synthetic Data Pipeline Works
The core of this research lies in its four-step synthetic data generation pipeline, designed to create a richer training set for fact-checking models:
1. Document Decomposition: The process begins by generating a concise summary of each original source document. Each summary is then broken down into ‘atomic facts’ – the most basic, indivisible pieces of information – so that every fact is isolated and clearly defined.
2. Sentence-Fact Table Construction: An LLM then builds a structured table that maps each sentence of the original document against each extracted atomic fact, marking whether the sentence supports the fact. These marks establish the entailment relations used in the next step.
3. Synthetic Data Generation: Using the sentence-fact table, synthetic text-claim pairs are generated. A random subset of sentences is selected from the document and combined to form a new text, an atomic fact is chosen as the synthetic claim, and its label (supported or unsupported) is assigned automatically by checking whether any selected sentence supports that fact in the table (see the first sketch after this list).
4. FACT CHECKER Development: Finally, the newly generated synthetic examples are merged with the original, manually annotated data. This augmented dataset is then used to fine-tune a BERT-based fact-checking model, referred to as FACT CHECKER, which learns to classify whether a claim is supported or unsupported by a given document (see the second sketch below).
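To make steps 2 and 3 concrete, here is a minimal Python sketch of the table construction and sampling logic. It assumes a hypothetical `llm_entails(sentence, fact)` helper that wraps an LLM entailment prompt and returns a boolean; the paper's actual prompts and implementation details are not reproduced here.

```python
import random

def build_sentence_fact_table(sentences, atomic_facts, llm_entails):
    """Step 2: for every (sentence, fact) pair, record whether the
    sentence supports the fact. `llm_entails` is a hypothetical
    callable wrapping an LLM entailment prompt."""
    return {
        (i, j): llm_entails(sentence, fact)
        for i, sentence in enumerate(sentences)
        for j, fact in enumerate(atomic_facts)
    }

def generate_synthetic_pair(sentences, atomic_facts, table, proportion=0.1):
    """Step 3: sample a subset of sentences as a synthetic text, pick one
    atomic fact as the claim, and label the pair 'supported' if and only
    if some selected sentence supports the fact in the table."""
    k = max(1, round(proportion * len(sentences)))
    selected = sorted(random.sample(range(len(sentences)), k))
    j = random.randrange(len(atomic_facts))
    text = " ".join(sentences[i] for i in selected)
    supported = any(table[(i, j)] for i in selected)
    return {"text": text,
            "claim": atomic_facts[j],
            "label": "supported" if supported else "unsupported"}
```

Here `proportion` plays the role of the synthetic proportion discussed in the results below.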
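Step 4 then fine-tunes a BERT classifier on the merged data. Below is a minimal sketch using the Hugging Face transformers library; the `bert-base-uncased` checkpoint and the training settings are illustrative assumptions rather than the paper's reported configuration.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = unsupported, 1 = supported

def encode(example):
    # Encode the (text, claim) pair as two segments, the standard BERT
    # setup for sequence-pair classification.
    return tokenizer(example["text"], example["claim"],
                     truncation=True, max_length=512)

# Assuming `train_dataset` holds the merged original + synthetic pairs,
# tokenized with `encode` and carrying integer labels:
# Trainer(model=model,
#         args=TrainingArguments(output_dir="fact_checker",
#                                num_train_epochs=3),
#         train_dataset=train_dataset).train()
```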
Impressive Performance Improvements
The effectiveness of this LLM-driven approach was evaluated on two public benchmark datasets: PubHealth and SciFact. On PubHealth, the pipeline improved the F1 score by up to 0.019; on SciFact, the gain was larger, up to 0.049, compared to models trained solely on the original data.
The study also explored the impact of varying the ‘synthetic proportion’ – the percentage of sentences selected from the original documents when constructing synthetic data. While the optimal proportion varied across data subsets, incorporating synthetic data at a well-tuned proportion consistently outperformed the baselines. For instance, on a 1,500-instance subset of PubHealth, selecting just 10% of sentences yielded the highest F1 score, 0.831, against a baseline of 0.812.
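A straightforward way to tune this hyperparameter is to sweep candidate proportions and score each on a development set. A minimal sketch, where `train_and_eval` is a hypothetical callable that regenerates the synthetic data at a given proportion, fine-tunes the fact checker, and returns its dev-set F1:

```python
from typing import Callable, Sequence

def best_synthetic_proportion(
    proportions: Sequence[float],
    train_and_eval: Callable[[float], float],
) -> float:
    """Return the proportion whose augmented training set yields the
    best development-set F1 under the supplied training routine."""
    return max(proportions, key=train_and_eval)

# e.g. best_synthetic_proportion([0.1, 0.25, 0.5, 0.75, 1.0], train_and_eval)
```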
Detecting AI Hallucinations
Beyond enhancing fact-checking, the FACT CHECKER model also showed promise in a pilot study focused on detecting hallucinations in LLM-generated text summaries. By constructing sentence-fact tables for LLM-generated summaries and comparing them against original documents, the system could identify instances where facts in the summary were not supported by any sentence in the original document, indicating potential hallucinations or inferences not directly present in the source material.
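This check falls directly out of the sentence-fact machinery: treat each atomic fact extracted from the LLM summary as a claim and look for a supporting sentence in the source document. A minimal sketch, reusing the hypothetical `llm_entails` helper from the pipeline example above:

```python
def flag_unsupported_facts(doc_sentences, summary_facts, llm_entails):
    """Return summary facts that no source sentence supports; these are
    candidate hallucinations (or inferences absent from the source)."""
    return [fact for fact in summary_facts
            if not any(llm_entails(sentence, fact)
                       for sentence in doc_sentences)]
```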
Also Read:
- Backprompting: Enhancing AI Guardrails for Health Advice with Synthetic Data
- Advancing Medical AI: A Survey of Reasoning Capabilities in Large Language Models
Looking Ahead
This research underscores the immense potential of LLM-driven synthetic data augmentation in addressing the critical data scarcity issue in health fact-checking. By providing a scalable and efficient method to generate high-quality training examples, this pipeline offers a feasible solution for developing more robust and accurate fact-checking systems, ultimately contributing to a more informed public health landscape.