TLDR: A new framework called Dimension-level Reward Model (DRM) is proposed to improve Large Language Models’ (LLMs) multi-step reasoning. Unlike traditional methods that only reward final answers or require complex step-by-step segmentation, DRM evaluates reasoning processes across three interpretable dimensions: Confidence, Relevance, and Coherence. This approach provides dense, generalizable, and interpretable feedback, leading to significant improvements in LLM performance on diverse tasks, even for out-of-distribution scenarios, by directly optimizing the quality of the reasoning process itself.
Large Language Models (LLMs) have become incredibly powerful, but their ability to perform complex, multi-step reasoning remains a significant challenge. Traditionally, methods like Reinforcement Learning with Verifiable Rewards (RLVR) have been used to improve LLMs. RLVR works by giving a reward only if the final answer is correct. However, this approach has limitations: it often overlooks flaws in the reasoning process itself, meaning a model might get the right answer through faulty logic, and it provides sparse feedback, making it hard for the model to learn effectively.
Another approach, Process-level Reward Models (PRMs), tries to address this by giving feedback at each step of the reasoning process. While promising, PRMs often require the reasoning process to be broken down into individual steps, which can be difficult and task-specific, limiting their ability to generalize to new, open-ended tasks. They can also act as ‘black boxes,’ making it hard to understand why a certain score was given.
To overcome these issues, researchers have introduced a new supervision framework called the Dimension-level Reward Model (DRM). This innovative approach bridges the gap between outcome-based and process-level supervision by evaluating the quality of an LLM’s reasoning process along three fundamental, complementary, and easily understandable dimensions:
Confidence
This dimension assesses how certain the model is about its generated reasoning and final answer. It helps ensure that the LLM’s output stays faithful to the question and the supporting information, preventing it from ‘hallucinating’ or drifting away from the core task. For the reasoning trace, confidence is measured as the average token log-probability; for the final answer, the token log-probabilities are summed, which encourages decisive outputs.
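As a rough illustration (not the paper’s exact implementation), the confidence dimension can be sketched directly from per-token log-probabilities returned by the model. The function name and the split into a mean over reasoning tokens and a sum over answer tokens follow the description above; everything else here is an assumption.

```python
def confidence_scores(reasoning_logprobs, answer_logprobs):
    """Sketch of the confidence dimension:
    - reasoning: mean log-probability over the reasoning tokens
    - answer: summed log-probability over the answer tokens (rewards decisive answers)
    """
    reasoning_conf = sum(reasoning_logprobs) / max(len(reasoning_logprobs), 1)
    answer_conf = sum(answer_logprobs)
    return reasoning_conf, answer_conf

# Toy example: log-probs for a short reasoning trace and a one-token answer.
# Higher (less negative) values indicate a more confident model.
r_conf, a_conf = confidence_scores([-0.2, -0.5, -0.1, -0.3], [-0.05])
print(r_conf, a_conf)
```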
Relevance
Relevance evaluates whether the reasoning process is semantically aligned and contextually appropriate with the original question, any provided documents, and the final answer. This dimension ensures that the reasoning stays grounded in the given information and logically leads to the conclusion. It uses techniques like Natural Language Inference (NLI) and semantic similarity to measure these relationships.
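As a purely illustrative sketch of how such a relevance signal could be computed, the snippet below combines an off-the-shelf embedding model with an NLI cross-encoder. The specific checkpoints, the 50/50 blending of similarity and entailment, and the label index are assumptions, not the paper’s implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Illustrative model choices, not prescribed by the paper.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def relevance_score(question, reasoning, answer):
    """Sketch: average embedding similarity (question<->reasoning, reasoning<->answer)
    blended with an entailment probability that the reasoning supports the answer."""
    q_emb, r_emb, a_emb = embedder.encode([question, reasoning, answer], convert_to_tensor=True)
    sim = 0.5 * (util.cos_sim(q_emb, r_emb).item() + util.cos_sim(r_emb, a_emb).item())
    logits = nli.predict([(reasoning, answer)])[0]
    probs = np.exp(logits) / np.exp(logits).sum()
    entail = float(probs[1])  # entailment index is model-specific; check the model card
    return 0.5 * sim + 0.5 * entail

print(relevance_score("What is 2 + 3?", "Adding 2 and 3 gives 5.", "5"))
```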
Coherence
This dimension focuses on the logical consistency, fluency, and overall quality of the reasoning process. It penalizes self-contradictory statements and ensures that the steps flow logically. An external Outcome-level Reward Model (ORM) is used to assess this textual quality and logical consistency.
By combining these three dimensions, DRM provides a dense, reasoning-aware reward signal that is interpretable and doesn’t require task-specific segmentation or ground truth answers for every step. The overall DRM reward is calculated as a weighted sum of these individual dimensional scores, allowing for a nuanced assessment of reasoning quality.
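The combination itself is a simple weighted sum. Below is a minimal sketch; the weights are placeholders, since the paper’s actual weighting and any normalization of the individual scores are not reproduced here.

```python
def drm_reward(confidence, relevance, coherence, weights=(0.3, 0.4, 0.3)):
    """Weighted sum of the three dimension scores (placeholder weights)."""
    w_conf, w_rel, w_coh = weights
    return w_conf * confidence + w_rel * relevance + w_coh * coherence

# Example: per-dimension scores already scaled to a comparable range (assumption).
print(drm_reward(confidence=0.8, relevance=0.9, coherence=0.7))
```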
The effectiveness of DRM has been demonstrated in various experiments. When used in off-policy optimization (like DPO), DRM guides the selection of high-quality reasoning samples for training. In on-policy optimization (like GRPO), it can serve as a standalone reward or be integrated with traditional answer-based rewards. Experimental results show that DRM-supervised training consistently improves LLM performance across a diverse range of open-domain tasks, including mathematics, question answering, code execution, and puzzles. It even shows strong generalization to tasks outside of its training distribution.
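For the off-policy case, one plausible way the DRM reward could drive sample selection is to rank sampled reasoning traces and keep the best and worst as a preference pair for DPO-style training. This is a sketch under that assumption; the helper names and the toy scorer are hypothetical.

```python
def build_preference_pair(prompt, candidates, drm_reward_fn):
    """Rank candidate reasoning traces by DRM reward and return a
    (chosen, rejected) pair suitable for DPO-style preference training."""
    scored = sorted(candidates, key=drm_reward_fn, reverse=True)
    return {"prompt": prompt, "chosen": scored[0], "rejected": scored[-1]}

# Toy usage with a placeholder scoring function standing in for the DRM reward.
pair = build_preference_pair(
    "Why is the sky blue?",
    ["Because of Rayleigh scattering of sunlight.", "Because the ocean reflects onto it."],
    drm_reward_fn=lambda text: len(set(text.split())),
)
print(pair["chosen"], "|", pair["rejected"])
```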
Notably, DRM-supervised models have been shown to outperform models trained with only answer supervision (RLVR) and other existing reasoning-supervision approaches. A significant finding is that DRM effectively reduces instances of ‘correct answers with flawed reasoning,’ where a model arrives at the right answer through incorrect logic. This means DRM not only helps models get more correct answers but also ensures the quality and trustworthiness of the underlying thought process.
Furthermore, combining DRM supervision with RLVR often leads to even greater improvements, suggesting a synergistic effect between optimizing for reasoning quality and final answer correctness. This framework is also architecture-agnostic and data-efficient, achieving broad improvements using a single source of preference data without requiring task-specific fine-tuning.
In conclusion, the Dimension-level Reward Model (DRM) represents a significant step forward in optimizing LLMs. By providing interpretable, multidimensional feedback on the reasoning process itself, DRM enhances LLMs’ generalized reasoning ability, leading to more reliable and understandable AI systems. You can read the full research paper here.


