TLDR: LaSeR (Reinforcement Learning with Last-Token Self-Rewarding) is a new algorithm that improves Large Language Models’ (LLMs) reasoning and self-assessment abilities. It addresses the inefficiency of previous methods by deriving a ‘last-token self-rewarding score’ from the next-token probability distribution at the final token of a generated solution, requiring minimal extra computation. This score is used to jointly optimize reasoning and self-verification, improving performance on complex tasks and enabling LLMs to accurately evaluate their own solutions.
Large Language Models, or LLMs, have made incredible strides across many fields, but they still grapple with complex reasoning tasks. To tackle this, a method called Reinforcement Learning with Verifiable Rewards (RLVR) has emerged: it trains LLMs by rewarding solutions whose final answers can be automatically checked. However, existing RLVR approaches to verification often suffer from inefficiency. They either require a separate, external model to verify solutions or force the LLM to generate a solution and then verify it in two distinct, sequential steps, significantly slowing down the process.
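To make ‘verifiable rewards’ concrete, here is a minimal sketch of the kind of rule-based reward such pipelines use for math problems. The \boxed{} extraction convention and the exact-match comparison are illustrative assumptions, not the paper’s exact checker:

```python
import re

def verifiable_reward(solution: str, gold_answer: str) -> float:
    """Illustrative RLVR-style reward: 1.0 if the solution's final
    boxed answer matches the reference answer, else 0.0. Real
    pipelines typically use more robust answer extraction and
    normalization than this exact string match."""
    match = re.search(r"\\boxed\{([^}]*)\}", solution)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0
```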
Introducing LaSeR: A Smarter Way to Self-Reward
A new algorithm called LaSeR, which stands for Reinforcement Learning with Last-Token Self-Rewarding, offers an elegant solution to these efficiency problems. LaSeR’s core insight is surprisingly simple yet powerful: the true reasoning reward for a solution can be accurately represented by a ‘last-token self-rewarding score.’ This score is calculated from the model’s predicted probability distribution for a special, pre-specified token at the very end of a generated solution.
The brilliance of LaSeR lies in its efficiency. The researchers showed theoretically that the complex RL objective for self-verification can be simplified: the log-probability that a frozen reference model (a fixed baseline copy of the policy) assigns to the special token stays nearly constant across solutions. This lets LaSeR compute the self-rewarding score from the policy model’s output alone plus a single pre-calculated constant, making the computation essentially free.
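In code, the score boils down to reading one log-probability off the final-position logits. The sketch below assumes an implicit-reward-style parameterization (a scaled log-ratio squashed through a sigmoid); the scaling factor `beta`, the sigmoid, and the function name are illustrative assumptions rather than the paper’s exact formula:

```python
import torch

def last_token_self_reward(logits: torch.Tensor,
                           special_token_id: int,
                           ref_logprob_const: float,
                           beta: float = 1.0) -> torch.Tensor:
    """Hedged sketch: `logits` is the policy's next-token logits at the
    final position of a generated solution (shape [batch, vocab]).
    Following the paper's simplification, the reference model's
    log-probability for the special token is treated as a pre-computed
    constant (`ref_logprob_const`), so only the policy's own output is
    needed."""
    logprobs = torch.log_softmax(logits, dim=-1)    # [batch, vocab]
    policy_logprob = logprobs[:, special_token_id]  # [batch]
    # Scaled log-ratio against the (near-constant) reference term.
    score = beta * (policy_logprob - ref_logprob_const)
    # Map to (0, 1) so the score is comparable to a 0/1 verifier reward.
    return torch.sigmoid(score)
```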
How LaSeR Works
During training, LaSeR augments the standard RLVR process with an additional Mean Squared Error (MSE) loss. This new loss function trains the model to align its calculated last-token self-rewarding score with the actual reasoning reward provided by a verifier. By doing this, LaSeR jointly optimizes both the LLM’s reasoning abilities and its capacity for self-assessment.
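A minimal sketch of how such an auxiliary term might be wired in (the weighting coefficient `lambda_sr` and the function names are assumptions; the paper’s exact loss formulation may differ):

```python
import torch
import torch.nn.functional as F

def laser_aux_loss(self_reward_scores: torch.Tensor,
                   verifier_rewards: torch.Tensor,
                   lambda_sr: float = 0.1) -> torch.Tensor:
    """Hedged sketch of the auxiliary objective: an MSE term that pulls
    each solution's last-token self-rewarding score toward the
    verifier's 0/1 reasoning reward."""
    return lambda_sr * F.mse_loss(self_reward_scores, verifier_rewards)

# During training, this term is simply added to the standard RLVR
# policy loss:
#   total_loss = rlvr_policy_loss + laser_aux_loss(scores, rewards)
```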
What’s truly remarkable is the minimal computational cost. LaSeR derives these self-rewarding scores from the next-token probability distribution already predicted at the last generated token. This means it incurs at most the cost of one additional token of inference, and potentially zero extra steps, making it vastly more efficient than previous methods that required an entirely separate generation pass for verification.
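For instance, with Hugging Face Transformers the per-step distributions are already exposed during generation, so the score can often be read off without any extra forward pass. This is a hedged sketch, not the authors’ code; `model`, `input_ids`, `verify_token_id`, and `ref_const` are placeholders, and it reuses the `last_token_self_reward` helper sketched above:

```python
# Reuse the step logits that `generate` already computed.
outputs = model.generate(
    input_ids,
    max_new_tokens=1024,
    output_scores=True,            # keep per-step logits
    return_dict_in_generate=True,
)
# `outputs.scores[-1]` is the next-token distribution at the final
# generation step ([batch, vocab]). Depending on where the special
# token slot sits, this is free; otherwise one extra forward pass on
# the finished sequence is needed.
final_step_logits = outputs.scores[-1]
scores = last_token_self_reward(
    final_step_logits,
    special_token_id=verify_token_id,  # placeholder token id
    ref_logprob_const=ref_const,       # pre-computed reference constant
)
```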
Benefits and Performance
Experiments on popular LLM families such as LLaMA and Qwen demonstrate LaSeR’s effectiveness. The method not only improves reasoning performance, yielding higher accuracy on complex math problems, but also equips the model with a strong self-rewarding capability: it becomes much better at judging the correctness of its own outputs, achieving high F1 scores on self-verification. In some cases, LaSeR’s self-verification performance even matched that of much larger, dedicated external reward models.
This enhanced self-assessment ability also translates to better performance during inference. By using these optimized self-rewarding scores to rank and weight solutions, LaSeR significantly boosts the model’s inference-time scaling performance, especially when aggregating answers from multiple generated solutions.
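As a concrete illustration of score-weighted aggregation, here is a minimal weighted-majority-voting sketch. Answer extraction and normalization are assumed to happen upstream, and this shows the generic technique rather than necessarily the paper’s exact aggregation rule:

```python
from collections import defaultdict

def weighted_majority_vote(answers: list[str],
                           scores: list[float]) -> str:
    """Hedged sketch of self-reward-weighted voting: each sampled
    solution's final answer is weighted by its self-rewarding score,
    and the answer with the highest total weight wins."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Example: three samples agree on "42" with modest scores, one outlier
# has a high score; the aggregate still favors the consensus answer.
print(weighted_majority_vote(["42", "42", "17", "42"],
                             [0.6, 0.7, 0.9, 0.5]))  # -> "42"
```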
Beyond Math Reasoning
The researchers also explored LaSeR’s generalizability to other reasoning domains. While self-verification accuracy on general reasoning tasks was not as high as on math, the optimized self-rewarding scores still provided a useful signal, improving weighted majority voting results. This suggests promising avenues for future research to unlock LaSeR’s full potential across a broader range of applications.
LaSeR represents a significant step forward in making LLMs more intelligent and efficient. By enabling models to effectively self-verify with minimal overhead, it paves the way for more capable and practical AI systems. You can find the code and models for LaSeR at the project’s GitHub repository.