TLDR: This research introduces the Thinking-supervised Reward Model (TRM), a framework designed to give large language models (LLMs) critical thinking abilities. TRM addresses the limitations of existing reinforcement learning methods on complex, open-domain tasks by evaluating each answer sentence for faithfulness to the supporting documents and, through an explicit reasoning step, for factual correctness. Trained with sentence-level faithfulness, reasoning, and correctness signals, TRM substantially improves error detection and raises both the correctness and usefulness of generated answers, moving beyond mere semantic alignment toward genuine assessment of knowledge.
Large language models (LLMs) have shown remarkable abilities in tasks like mathematics and coding, especially when trained with a method called reinforcement learning with verifiable rewards (RLVR). This approach works well because each step in solving these problems can be clearly checked and rewarded. However, when it comes to more complex, real-world tasks such as answering open-ended questions, LLMs face significant hurdles. The main challenge is that verifying the correctness of information in these nuanced situations is incredibly difficult.
Current methods often focus on ‘faithfulness’: how well an answer aligns with the provided supporting documents. Faithfulness matters, but overemphasizing it can lead models to lean too heavily on external sources while neglecting their own internal knowledge and critical judgment. The result can be answers that are faithful to misleading documents yet factually incorrect.
Introducing the Thinking-supervised Reward Model (TRM)
To tackle this, researchers have proposed a new approach called the Thinking-supervised Reward Model (TRM). This model aims to equip LLMs with critical thinking abilities by providing ‘thinking supervision’ at the sentence level. When given a question, an answer, and supporting documents, TRM first evaluates how faithful each sentence in the answer is to the documents. Following this, it performs a reasoning step to assess the factual correctness of each sentence.
By structuring the reward modeling process as a sequence of faithfulness, reasoning, and correctness evaluations, TRM encourages models to critically assess and use both external information and their own internal knowledge. This helps models distinguish between an answer that simply aligns with a document and one that is actually factually accurate.
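To make this pipeline concrete, here is a minimal Python sketch of the per-sentence evaluation flow. The `SentenceJudgment` fields, the prompt wording, and the `trm_generate` call are illustrative stand-ins for the reward model's actual interface, not the paper's own template.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SentenceJudgment:
    """TRM's structured verdict for one answer sentence (hypothetical schema)."""
    sentence: str
    faithful: bool   # Step 1: is the sentence supported by the documents?
    reasoning: str   # Step 2: free-text assessment of factual accuracy
    correct: bool    # Step 3: final factual-correctness verdict

def judge_answer(question: str, answer_sentences: List[str],
                 documents: List[str],
                 trm_generate: Callable[[str], Dict]) -> List[SentenceJudgment]:
    """Run the faithfulness -> reasoning -> correctness pipeline per sentence."""
    judgments = []
    context = " ".join(documents)
    for sent in answer_sentences:
        # Hypothetical prompt; the paper defines its own prompt format.
        out = trm_generate(
            f"Question: {question}\nDocuments: {context}\nSentence: {sent}\n"
            "Step 1 (faithfulness): is the sentence supported by the documents?\n"
            "Step 2 (reasoning): assess its factual accuracy.\n"
            "Step 3 (correctness): is the sentence factually correct?"
        )
        judgments.append(SentenceJudgment(sent, out["faithful"],
                                          out["reasoning"], out["correct"]))
    return judgments
```

Keeping faithfulness, reasoning, and correctness as separate fields is what allows downstream training to supervise each step of the thinking process rather than only the final verdict.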
How TRM is Developed and Trained
The development of TRM involves a two-stage training process. First, the model undergoes supervised fine-tuning (SFT) on a specially curated dataset that explicitly teaches the structured progression from faithfulness, through reasoning, to correctness for each sentence. This sentence-level supervision grounds the model in these concepts and in how to move logically between them.
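As a rough illustration of what such a curated example could look like, the snippet below sketches one SFT record; the field names, labels, and wording are assumptions for exposition, not the paper's actual data schema.

```python
# Hypothetical SFT record: the target teaches the structured
# faithfulness -> reasoning -> correctness progression per sentence.
sft_example = {
    "input": {
        "question": "When did the Berlin Wall fall?",
        "documents": ["... The wall was opened on 9 November 1989 ..."],
        "answer": "The Berlin Wall fell in 1989. It was located in Paris.",
    },
    "target": [
        {
            "sentence": "The Berlin Wall fell in 1989.",
            "faithfulness": "faithful",
            "reasoning": "The documents state the wall was opened on "
                         "9 November 1989, which matches the claim.",
            "correctness": "correct",
        },
        {
            "sentence": "It was located in Paris.",
            "faithfulness": "unfaithful",
            "reasoning": "The documents do not place the wall in Paris, "
                         "and world knowledge places it in Berlin.",
            "correctness": "incorrect",
        },
    ],
}
```

Note how the reasoning for the second sentence draws on world knowledge rather than the documents alone, which is exactly the blend of external evidence and internal knowledge the SFT stage is meant to instill.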
After SFT, a reinforcement learning (RL) phase further sharpens the model's predictions. Unlike conventional reward training that uses only a final correctness score, TRM uses both faithfulness and correctness as sentence-level reward signals. This dual-signal strategy encourages the model not only to reach the right verdicts but to reach them through faithful, understandable reasoning paths. An additional reward is granted for correctly identifying incorrect sentences, which counteracts a class imbalance in the data: correct sentences far outnumber incorrect ones.
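A minimal sketch of how such a dual-signal, imbalance-aware sentence reward could be shaped is shown below; the 0/1 agreement terms and the bonus weight are assumptions, since the paper's exact reward formulation may differ.

```python
def sentence_reward(pred_faithful: bool, gold_faithful: bool,
                    pred_correct: bool, gold_correct: bool,
                    incorrect_bonus: float = 0.5) -> float:
    """Illustrative dual-signal reward for one sentence.

    - Rewards agreement on both the faithfulness and correctness labels.
    - Adds a bonus for correctly flagging an *incorrect* sentence, since
      correct sentences dominate the data and would otherwise swamp
      the learning signal.
    """
    r = 0.0
    if pred_faithful == gold_faithful:
        r += 1.0
    if pred_correct == gold_correct:
        r += 1.0
        if not gold_correct:  # true detection of a rare incorrect sentence
            r += incorrect_bonus
    return r
```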
Impact and Results
Experiments show that TRM significantly improves the identification of incorrect sentences. When TRM supplies the reward during policy optimization (the RL stage that trains the LLM's answer-generating policy), it yields substantial improvements in both the correctness and usefulness of the answers. This is particularly evident in challenging open-domain question-answering tasks where verification is complex.
The research also highlights the importance of the explicit reasoning path within TRM. Variants of the model without this reasoning component performed less effectively, confirming that guiding the model through these thinking steps is crucial for developing robust critical thinking capabilities.
Combining TRM with Preference Models
To further enhance answer quality, TRM is often combined with a ‘preference reward model’ (Prefer). While TRM focuses on factual correctness, the Prefer model captures other aspects of answer quality, such as usefulness. By using both models together, the system can generate answers that are not only factually accurate but also comprehensive and helpful to the user. This joint approach has demonstrated significant gains in both correctness and usefulness across different datasets.
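One simple way to combine the two signals during policy optimization is a weighted sum, sketched below; the blending rule and the `alpha` weight are assumptions for illustration rather than the paper's stated method.

```python
def combined_reward(trm_score: float, prefer_score: float,
                    alpha: float = 0.5) -> float:
    """Blend the factual-correctness signal (TRM) with the preference
    signal (Prefer, capturing usefulness and overall answer quality).

    A weighted sum is one plausible combination rule; `alpha` trades off
    correctness against usefulness and would be tuned in practice.
    """
    return alpha * trm_score + (1.0 - alpha) * prefer_score
```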
This innovative framework represents a step forward in making large language models more reliable and capable in complex, knowledge-intensive tasks by fostering genuine critical thinking. For more details, you can refer to the full research paper: From Faithfulness to Correctness: Generative Reward Models That Think Critically.


