TLDR: LaSeR (Reinforcement Learning with Last-Token Self-Rewarding) is a new algorithm that improves Large Language Models’ (LLMs) reasoning and self-assessment abilities. It addresses the inefficiency of previous methods by deriving a ‘last-token self-rewarding score’ from the next-token probability distribution at the final token of a generated solution, requiring minimal extra computation. This score is used to jointly optimize reasoning and self-verification, improving performance on complex tasks and enabling LLMs to accurately evaluate their own solutions.
Large Language Models, or LLMs, have made incredible strides across many fields, but they still grapple with complex reasoning tasks. To tackle this, a method called Reinforcement Learning with Verifiable Rewards (RLVR) has emerged: it trains LLMs by rewarding solutions whose final answers can be automatically checked. However, existing RLVR approaches to verification often suffer from inefficiency. They either require a separate, external model to verify solutions or force the LLM to generate a solution and then verify it in two distinct, sequential steps, significantly slowing down the process.
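To make ‘verifiable rewards’ concrete, here is a minimal sketch of the kind of rule-based reward such pipelines use for math problems. The \boxed{} extraction convention and the exact-match comparison are illustrative assumptions, not the paper’s exact checker:

```python
import re

def verifiable_reward(solution: str, gold_answer: str) -> float:
    """Illustrative RLVR-style reward: 1.0 if the solution's final
    boxed answer matches the reference answer, else 0.0. Real
    pipelines typically use more robust answer extraction and
    normalization than this exact string match."""
    match = re.search(r"\\boxed\{([^}]*)\}", solution)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer.strip() else 0.0
```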
Introducing LaSeR: A Smarter Way to Self-Reward
A new algorithm called LaSeR, which stands for Reinforcement Learning with Last-Token Self-Rewarding, offers an elegant solution to these efficiency problems. LaSeR’s core insight is surprisingly simple yet powerful: the true reasoning reward for a solution can be accurately represented by a ‘last-token self-rewarding score.’ This score is calculated from the model’s predicted probability distribution for a special, pre-specified token at the very end of a generated solution.
The brilliance of LaSeR lies in its efficiency. The researchers showed theoretically that the complex RL objective for self-verification can be simplified: the log-probability that a frozen reference model (a fixed baseline copy of the policy) assigns to the special token stays nearly constant across solutions. This lets LaSeR compute the self-rewarding score from the policy model’s output alone plus a single pre-calculated constant, making the computation essentially free.
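In code, the score boils down to reading one log-probability off the final-position logits. The sketch below assumes an implicit-reward-style parameterization (a scaled log-ratio squashed through a sigmoid); the scaling factor `beta`, the sigmoid, and the function name are illustrative assumptions rather than the paper’s exact formula:

```python
import torch

def last_token_self_reward(logits: torch.Tensor,
                           special_token_id: int,
                           ref_logprob_const: float,
                           beta: float = 1.0) -> torch.Tensor:
    """Hedged sketch: `logits` is the policy's next-token logits at the
    final position of a generated solution (shape [batch, vocab]).
    Following the paper's simplification, the reference model's
    log-probability for the special token is treated as a pre-computed
    constant (`ref_logprob_const`), so only the policy's own output is
    needed."""
    logprobs = torch.log_softmax(logits, dim=-1)    # [batch, vocab]
    policy_logprob = logprobs[:, special_token_id]  # [batch]
    # Scaled log-ratio against the (near-constant) reference term.
    score = beta * (policy_logprob - ref_logprob_const)
    # Map to (0, 1) so the score is comparable to a 0/1 verifier reward.
    return torch.sigmoid(score)
```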
How LaSeR Works
During training, LaSeR augments the standard RLVR process with an additional Mean Squared Error (MSE) loss. This new loss function trains the model to align its calculated last-token self-rewarding score with the actual reasoning reward provided by a verifier. By doing this, LaSeR jointly optimizes both the LLM’s reasoning abilities and its capacity for self-assessment.
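A minimal sketch of how such an auxiliary term might be wired in (the weighting coefficient `lambda_sr` and the function names are assumptions; the paper’s exact loss formulation may differ):

```python
import torch
import torch.nn.functional as F

def laser_aux_loss(self_reward_scores: torch.Tensor,
                   verifier_rewards: torch.Tensor,
                   lambda_sr: float = 0.1) -> torch.Tensor:
    """Hedged sketch of the auxiliary objective: an MSE term that pulls
    each solution's last-token self-rewarding score toward the
    verifier's 0/1 reasoning reward."""
    return lambda_sr * F.mse_loss(self_reward_scores, verifier_rewards)

# During training, this term is simply added to the standard RLVR
# policy loss:
#   total_loss = rlvr_policy_loss + laser_aux_loss(scores, rewards)
```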
What’s truly remarkable is the minimal computational cost. LaSeR derives these self-rewarding scores from the next-token probability distribution already predicted at the last generated token. This means it incurs at most the cost of one additional token of inference, and potentially zero extra steps, making it vastly more efficient than previous methods that required an entirely separate generation pass for verification.
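For instance, with Hugging Face Transformers the per-step distributions are already exposed during generation, so the score can often be read off without any extra forward pass. This is a hedged sketch, not the authors’ code; `model`, `input_ids`, `verify_token_id`, and `ref_const` are placeholders, and it reuses the `last_token_self_reward` helper sketched above:

```python
# Reuse the step logits that `generate` already computed.
outputs = model.generate(
    input_ids,
    max_new_tokens=1024,
    output_scores=True,            # keep per-step logits
    return_dict_in_generate=True,
)
# `outputs.scores[-1]` is the next-token distribution at the final
# generation step ([batch, vocab]). Depending on where the special
# token slot sits, this is free; otherwise one extra forward pass on
# the finished sequence is needed.
final_step_logits = outputs.scores[-1]
scores = last_token_self_reward(
    final_step_logits,
    special_token_id=verify_token_id,  # placeholder token id
    ref_logprob_const=ref_const,       # pre-computed reference constant
)
```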
Benefits and Performance
Experiments on popular LLM families such as LLaMA and Qwen demonstrate LaSeR’s effectiveness. The method not only improves reasoning performance, yielding higher accuracy on complex math problems, but also equips the model with a strong self-rewarding capability: it becomes much better at judging the correctness of its own outputs, achieving high F1 scores on self-verification. In some cases, LaSeR’s self-verification performance even matched that of much larger, dedicated external reward models.
This enhanced self-assessment ability also translates to better performance during inference. By using these optimized self-rewarding scores to rank and weight solutions, LaSeR significantly boosts the model’s inference-time scaling performance, especially when aggregating answers from multiple generated solutions.
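As a concrete illustration of score-weighted aggregation, here is a minimal weighted-majority-voting sketch. Answer extraction and normalization are assumed to happen upstream, and this shows the generic technique rather than necessarily the paper’s exact aggregation rule:

```python
from collections import defaultdict

def weighted_majority_vote(answers: list[str],
                           scores: list[float]) -> str:
    """Hedged sketch of self-reward-weighted voting: each sampled
    solution's final answer is weighted by its self-rewarding score,
    and the answer with the highest total weight wins."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)

# Example: three samples agree on "42" with modest scores, one outlier
# has a high score; the aggregate still favors the consensus answer.
print(weighted_majority_vote(["42", "42", "17", "42"],
                             [0.6, 0.7, 0.9, 0.5]))  # -> "42"
```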
Beyond Math Reasoning
The researchers also explored LaSeR’s generalizability to other reasoning domains. While self-verification accuracy on general reasoning tasks was not as high as on math, the optimized self-rewarding scores still provided a useful signal, improving weighted majority voting results. This suggests promising avenues for future research to unlock LaSeR’s full potential across a broader range of applications.
LaSeR represents a significant step forward in making LLMs more intelligent and efficient. By enabling models to effectively self-verify with minimal overhead, it paves the way for more capable and practical AI systems. You can find the code and models for LaSeR at the project’s GitHub repository.