TLDR: This research introduces RL4HS, a reinforcement learning framework designed to accurately detect hallucinated spans within Large Language Model (LLM) outputs. By using a span-level reward function and a novel Class-Aware Policy Optimization (CAPO) to address reward imbalance, RL4HS significantly outperforms existing methods, including proprietary models and supervised fine-tuning, in identifying unsupported content across various natural language generation tasks. The study demonstrates that in-domain reasoning learned through this framework is crucial for robust hallucination detection, leading to more reliable and factually consistent LLM outputs.
Large Language Models (LLMs) are highly capable, but they often generate content that isn’t supported by facts or the input context. This phenomenon, known as hallucination, poses significant risks in applications where accuracy and reliability are crucial, such as summarization and question answering. Many previous efforts focused only on determining whether a text contains hallucinations (a binary task), but real-world applications often need something more precise: identifying exactly which parts, or ‘spans,’ of the text are hallucinated.
This challenge naturally leads to a key question: can explicit reasoning help in the complex task of detecting these hallucinated spans? A new research paper, Learning to Reason for Hallucination Span Detection, explores this question and introduces a novel framework called RL4HS.
The Power of Reasoning and Reinforcement Learning
The authors, Hsuan Su, Ting-Yao Hu, Hema Swetha Koppula, Kundan Krishna, Hadi Pouransari, Cheng-Yu Hsieh, Cem Koc, Joseph Yitan Cheng, Oncel Tuzel, and Raviteja Vemulapalli, from National Taiwan University and Apple, first evaluated existing pretrained models with and without Chain-of-Thought (CoT) reasoning. They found that CoT reasoning offered limited gains from a single response, but showed significant potential to produce a correct answer when several responses were sampled. This insight motivated their development of RL4HS (Reinforcement Learning for Hallucination Span Detection).
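The multi-sample observation can be illustrated with a best-of-K score: sample several responses, score each one, and keep the best. The sketch below assumes each sampled response has already been given a span-level score between 0 and 1. The function name `best_of_k` and the example scores are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def best_of_k(sample_scores: Sequence[float], k: int) -> float:
    """Return the best score among the first k sampled responses."""
    if k < 1 or k > len(sample_scores):
        raise ValueError("k must be between 1 and the number of samples")
    return max(sample_scores[:k])


# Hypothetical per-sample scores for one input (illustrative values only).
scores = [0.2, 0.0, 0.8, 0.5]

single_sample = best_of_k(scores, 1)  # 0.2: the first sample alone is weak
four_samples = best_of_k(scores, 4)   # 0.8: a good answer exists among four
```

A large gap between the single-sample and multi-sample scores indicates that the model can reach good answers but does not do so reliably, which is the kind of gap reinforcement learning aims to close.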
RL4HS is a reinforcement learning framework designed to encourage reasoning by using a unique span-level reward function. It builds upon an existing method called Group Relative Policy Optimization (GRPO) and introduces a new component: Class-Aware Policy Optimization (CAPO). This framework aims to train LLMs to not just detect, but to reason through the process of identifying unsupported content.
Addressing Reward Imbalance with CAPO
One of the critical challenges in training such models is ‘reward imbalance.’ Under standard GRPO, models can be over-incentivized to predict ‘no hallucination’ because that prediction earns a high reward easily: the model only has to output an empty span list. Detecting actual hallucinations, however, requires precise localization, and small errors can lead to steep drops in reward. This can lead to a phenomenon called ‘reward hacking,’ where the model learns shortcuts that maximize rewards without genuinely improving detection.
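To make the imbalance concrete, here is a minimal sketch of a span-level reward, assuming it is computed as a character-overlap F1 between predicted and gold spans. The function name `span_f1_reward`, the half-open character-offset span format, and the convention that an empty prediction on a hallucination-free example scores 1.0 are assumptions for illustration, not necessarily the paper's exact definition.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end); assumed format


def covered_chars(spans: List[Span]) -> Set[int]:
    """Collect every character index covered by at least one span."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1_reward(predicted: List[Span], gold: List[Span]) -> float:
    """Character-overlap F1 between predicted and gold hallucination spans."""
    pred_chars = covered_chars(predicted)
    gold_chars = covered_chars(gold)
    if not pred_chars and not gold_chars:
        return 1.0  # assumed convention: correctly predicting "no hallucination"
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0  # missed spans, spurious spans, or no overlap at all
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)


# An empty prediction on a clean example earns full reward with no effort.
easy = span_f1_reward([], [])                # 1.0
# A prediction shifted by five characters already loses half the reward.
shifted = span_f1_reward([(5, 15)], [(0, 10)])  # 0.5
```

Under this sketch, the empty prediction is an all-or-nothing bet that pays off on every hallucination-free example, while a hallucination prediction is penalized for any boundary error.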
To counteract this, the researchers developed Class-Aware Policy Optimization (CAPO). CAPO introduces a scaling factor that adjusts the advantage values for samples belonging to the non-hallucination class. By using a smaller scaling factor for non-hallucination predictions, CAPO helps balance the contributions of both hallucination and non-hallucination classes, preventing the model from becoming overly conservative and improving its ability to detect actual hallucinations.
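The scaling idea can be sketched in a few lines. The code below assumes GRPO-style advantages, where each reward in a group of sampled responses is normalized by the group mean and standard deviation, and then multiplies the advantages of non-hallucination samples by a factor `alpha`. The value `alpha=0.5`, the use of a boolean flag to mark the non-hallucination class, and the function names are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def capo_advantages(
    rewards: List[float],
    is_non_hallucination: List[bool],
    alpha: float = 0.5,  # assumed value; any factor below 1 down-weights the class
) -> List[float]:
    """Scale advantages of non-hallucination samples by alpha; leave others unchanged."""
    if len(rewards) != len(is_non_hallucination):
        raise ValueError("rewards and class flags must have the same length")
    base = group_relative_advantages(rewards)
    return [alpha * a if flag else a for a, flag in zip(base, is_non_hallucination)]


# Hypothetical group of four sampled responses for one input.
rewards = [1.0, 1.0, 0.0, 0.5]
flags = [True, True, False, False]  # first two responses fall in the non-hallucination class
scaled = capo_advantages(rewards, flags)
```

In this example the two non-hallucination samples still receive positive advantages, but only half as large as under plain group normalization, so they pull the policy less strongly toward always answering with an empty span list.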
Experimental Success and Key Findings
The RL4HS framework was tested on the RAGTruth benchmark, which includes summarization, question answering, and data-to-text tasks with human-labeled hallucination spans. The results were compelling:
- RL4HS consistently outperformed all baselines, including proprietary models like GPT-4o-mini, GPT-5-mini, GPT-5, and o3, as well as supervised fine-tuning (SFT) and other reasoning models.
- The Class-Aware Policy Optimization (CAPO) successfully mitigated reward hacking, leading to a better balance between precision and recall and higher overall span F1 scores compared to vanilla GRPO.
- The research also highlighted the necessity of ‘in-domain’ reasoning. Models trained specifically for hallucination span detection with span-level rewards (like RL4HS) performed significantly better than general-purpose reasoning models, even when those models were much larger.
- A qualitative case study demonstrated that RL4HS learns to perform systematic consistency checks, mirroring human-designed heuristic pipelines. This suggests that the learned reasoning behavior is genuine and semantically grounded, not just surface-level explanations.
In conclusion, RL4HS represents a significant step forward in making LLMs more reliable by enabling them to precisely identify and reason about hallucinated content. By combining reinforcement learning with span-level rewards and addressing reward imbalance, this framework offers a powerful approach to enhancing the factual consistency of AI-generated text.


