TLDR: A new lightweight framework, Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning, lets large language models (LLMs) grade their own responses against task-specific rubrics. The method significantly improves reasoning on open-ended tasks such as HealthBench, cuts single-step training time by about 30% and reward-calculation time by roughly 50%, and even strengthens the model's grading ability, enabling Qwen3-32B to surpass stronger baselines such as OpenAI o3 on hard medical reasoning.
Large language models (LLMs) are becoming increasingly vital in real-world applications, especially in complex areas like healthcare. However, evaluating their performance in open-ended reasoning tasks, where responses can vary widely, presents a significant challenge. Traditional reinforcement learning methods often struggle to generate reliable reward signals for these nuanced interactions.
A recent research paper, titled “Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning,” introduces an innovative framework to address this issue. Authored by Zhiling Ye, Yun Yue, Haowen Wang, Xudong Han, Jiadi Jiang, Cheng Wei, Lei Fan, Jiaxin Liang, Shuowen Zhang, Ji Li, Chunxiao Guo, Jian Wang, Peng Wei, and Jinjie Gu from Ant Group, this work proposes a lightweight and efficient training method that allows LLMs to grade their own responses using detailed rubrics.
The Challenge of Open-Ended Reasoning
In many real-world scenarios, users engage with LLMs through multi-turn dialogues, asking open-ended questions that have no single, verifiable correct answer. This is particularly true in healthcare, where accuracy and trustworthiness are paramount. Benchmarks like HealthBench, an open-source, dialogue-based evaluation for medical LLMs, use a detailed rubric-based scoring system to assess model performance. However, relying on human experts or powerful proprietary LLMs as graders is costly and slow, and it can introduce bias.
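To make the rubric idea concrete, here is a minimal sketch of how a rubric-aggregated score of this kind can be computed: points for every criterion a grader marks as met are summed and normalized by the points available. The criteria, point values, and normalization below are illustrative assumptions, not HealthBench's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Advises the user to seek emergency care"
    points: float      # positive for desirable behavior, negative for harmful behavior

def rubric_score(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Aggregate per-criterion judgments into one normalized score.

    Rough approximation: sum the points of the criteria the grader marks as
    met, divide by the total points of the positively weighted criteria, and
    clip to [0, 1]. Details may differ from the benchmark's implementation.
    """
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    achievable = sum(c.points for c in criteria if c.points > 0)
    if achievable == 0:
        return 0.0
    return min(max(earned / achievable, 0.0), 1.0)

# Illustrative usage with made-up criteria
criteria = [
    RubricCriterion("Recommends consulting a clinician for persistent symptoms", 5),
    RubricCriterion("Asks a clarifying follow-up question", 3),
    RubricCriterion("States a specific dosage without sufficient context", -4),
]
print(rubric_score(criteria, met=[True, False, False]))  # -> 0.625
```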
A Self-Improving Approach
The core idea behind this new framework is to leverage the LLM itself as a grader. Instead of an external reward model, the policy model (the LLM being trained) uses task-specific rubrics to evaluate its own generated responses. This “self-rewarding” mechanism creates a virtuous cycle: as the model improves its reasoning, it also becomes a more capable grader, providing higher-quality reward signals for further training.
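The loop can be pictured with a short sketch: the same policy model first answers the question and then, prompted as a grader, scores its own answer against the rubric. The `policy.generate` interface, the grading-prompt wording, and the score parsing below are hypothetical stand-ins; the paper's actual prompts and RL algorithm may differ.

```python
import re

# Hypothetical interface: `policy.generate(prompt) -> str`. The grading-prompt
# wording and score parsing are illustrative, not the paper's exact recipe.

def parse_final_score(grading_output: str) -> float:
    """Pull the last number from the grader's output and clamp it to [0, 1]."""
    numbers = re.findall(r"\d*\.?\d+", grading_output)
    return min(max(float(numbers[-1]), 0.0), 1.0) if numbers else 0.0

def self_rewarding_rollouts(policy, question: str, rubric: list[str], n_samples: int = 4):
    """Sample candidate answers, then let the *same* policy grade each one
    against the task rubric, yielding (response, reward) pairs for the RL update."""
    rollouts = []
    for _ in range(n_samples):
        response = policy.generate(question)

        # The policy switches roles and acts as its own grader.
        grading_prompt = (
            "You are grading an answer to a health question.\n"
            f"Question: {question}\nAnswer: {response}\n"
            "Rubric criteria:\n" + "\n".join(f"- {c}" for c in rubric) + "\n"
            "Judge each criterion, then output a final score between 0 and 1 "
            "on the last line."
        )
        reward = parse_final_score(policy.generate(grading_prompt))
        rollouts.append((response, reward))
    return rollouts
```

The resulting rewards would then feed a standard policy-optimization update; because grading runs on the same model and hardware as generation, no separate reward-model service is needed.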
The researchers observed that training the Qwen3-32B model on just 4,000 samples from the HealthBench Easy subset, using its own rubric-based scores as rewards, enabled it to surpass OpenAI o3 on the more challenging HealthBench Hard subset. This highlights the potential for open-source models to achieve state-of-the-art results without relying on larger, proprietary grading models.
Efficiency and Performance Gains
One of the significant advantages of this self-rewarding approach is its impact on training efficiency. By eliminating the need for a separate, often slow, generative reward model (GRM) inference service, the framework substantially reduces resource consumption. The study reported a 30% reduction in single-step training time and about a 50% reduction in reward calculation time, even when using the same number of GPUs. This makes the training process faster and more resource-efficient.
Beyond efficiency, the method consistently enhances model performance. The model’s response length spontaneously increases during training, and its reasoning capabilities improve. Evaluations showed gains in crucial areas like completeness and context awareness. Interestingly, the model’s grading ability also improved after reinforcement learning training, further reinforcing the self-improving nature of the system.
Dataset Insights and Future Directions
The research also explored the influence of different datasets. While incorporating a small amount of teacher-graded data (responses scored by GPT-4.1) benefited weaker models like Qwen3-8B, it did not provide additional gains for more capable models like Qwen3-32B, suggesting that stronger models' self-grading is already sufficient. Additionally, training with synthetic data proved effective but still lagged behind expert-curated data, underscoring the importance of high-quality evaluation signals.
While the current experiments focused on the medical domain with HealthBench, the authors believe this self-rewarding rubric-based approach holds promise for other open-ended reasoning tasks. Future work could explore broader domains and investigate methods for generating high-quality rubric data using LLMs themselves, potentially matching or exceeding expert-curated data. For more details, you can read the full research paper here.


