TLDR: This paper introduces a new method for training Large Language Models (LLMs) to generate high-quality, pedagogically sound explanations. It uses a small, efficient encoder-only transformer as a semantic reward model within the GRPO framework. This model calculates the conceptual similarity between generated and reference explanations, providing a dense reward signal. Applied to Italian medical-school entrance exams, this approach significantly improves explanation faithfulness and clarity, outperforming traditional keyword-based or LLM-as-a-judge reward methods.
Large Language Models (LLMs) have shown remarkable abilities in generating text that resembles human writing. However, a significant hurdle remains in aligning their outputs with complex, qualitative objectives, such as ensuring an explanation is pedagogically sound or truly helpful for learning. Traditional methods for guiding LLMs often fall short: using another large LLM to judge responses can be slow and costly, while simpler keyword-based metrics like ROUGE fail to grasp the deeper meaning and structure of a high-quality explanation.
A new research paper, “Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO”, introduces an innovative solution to this challenge. The authors, Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, and Roberto Marras, propose a novel approach to reward shaping within the Group Relative Policy Optimization (GRPO) framework. Their central idea is to employ a small, efficient encoder-only transformer as a semantic reward model. This model generates a rich, semantically meaningful reward signal by calculating the cosine similarity between a generated explanation and a reference explanation provided by an expert. This guides the LLM to produce explanations that are not just factually correct, but also conceptually and structurally aligned with expert reasoning.
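To make the core mechanism concrete, here is a minimal sketch of such a semantic reward in Python. It assumes the sentence-transformers library and a small embedding encoder; the checkpoint name and the rescaling of the cosine score are illustrative choices, not details taken from the paper's code.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any small embedding encoder works here; this checkpoint name is illustrative.
encoder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

def semantic_reward(generated: str, reference: str) -> float:
    """Conceptual-alignment reward: cosine similarity of the two explanation embeddings."""
    # With L2-normalized embeddings, the dot product equals the cosine similarity.
    emb = encoder.encode([generated, reference], normalize_embeddings=True)
    cosine = float(np.dot(emb[0], emb[1]))
    # Rescale from [-1, 1] to [0, 1]; the paper describes an "adjusted" cosine,
    # and this shift is one plausible reading of that adjustment.
    return 0.5 * (cosine + 1.0)
```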
How the System Works
The training process involves three main stages. First, a domain-adaptive continued pre-training (CPT) phase equips the model with specialized knowledge from a relevant corpus, such as textbooks. Second, supervised fine-tuning (SFT) teaches the model the desired output format for questions and explanations. Finally, the reinforcement learning stage uses GRPO, where the new semantic reward shaping is applied.
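The paper does not name a specific training stack for the reinforcement-learning stage, but as one possible wiring, the sketch below shows how a custom semantic reward could be plugged into GRPO using Hugging Face TRL's GRPOTrainer. The use of TRL, the dataset columns, and the checkpoint path are assumptions; `semantic_reward` is the helper from the previous snippet, and the full shaped reward (adding accuracy, formatting, and reasoning terms) is sketched after the signal list below.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Prompt-only dataset; extra columns (here "reference_explanation") are passed
# through to the reward function as keyword arguments by GRPOTrainer.
train_dataset = load_dataset("json", data_files="exam_prompts.jsonl", split="train")

def shaped_reward(completions, reference_explanation, **kwargs):
    # With a standard (non-conversational) prompt dataset, each completion is a string.
    # Only the semantic term is shown here; the full reward also adds accuracy,
    # formatting, and reasoning signals (see the list that follows).
    return [semantic_reward(c, ref) for c, ref in zip(completions, reference_explanation)]

trainer = GRPOTrainer(
    model="path/to/sft-checkpoint",   # the model after CPT and SFT
    reward_funcs=shaped_reward,
    args=GRPOConfig(output_dir="grpo-semantic-reward", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()
```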
The total reward for a generated explanation is a combination of four distinct signals (a minimal sketch of how they might be combined follows this list):
- Semantic Similarity: This is the core component, measuring the conceptual alignment between the generated and ground-truth explanations. It uses a pre-trained 600M-parameter encoder-only transformer (Qwen3-0.6B) to create dense vector embeddings. The reward is derived from the adjusted cosine similarity of these embeddings.
- Factual Accuracy: A binary reward is given if the model’s final answer exactly matches the correct ground-truth answer.
- Structural Predictability: A rule-based reward ensures the output correctly uses the required formatting tags, like `<spiegazione>` and `<risposta>`.
- Reasoning Process: A reward is given for including a non-empty “chain-of-thought” block within the designated `<think>` tags, encouraging the model to externalize its reasoning.
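A minimal sketch of how these four signals could be combined is shown below. It reuses the `semantic_reward` helper from the earlier snippet; the tag names follow the article, while the equal weighting and the regex-based parsing are illustrative assumptions rather than the paper's implementation.

```python
import re

def _extract(text: str, tag: str):
    """Return the content of the first <tag>...</tag> block, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1) if match else None

def total_reward(completion: str, reference_explanation: str, gold_answer: str) -> float:
    # 1. Semantic similarity between the generated and reference explanations.
    explanation = _extract(completion, "spiegazione")
    r_semantic = semantic_reward(explanation, reference_explanation) if explanation else 0.0

    # 2. Factual accuracy: binary reward for an exact match on the final answer.
    answer = _extract(completion, "risposta")
    r_accuracy = 1.0 if answer is not None and answer.strip() == gold_answer.strip() else 0.0

    # 3. Structural predictability: both required tags are present.
    r_structure = 1.0 if explanation is not None and answer is not None else 0.0

    # 4. Reasoning process: a non-empty <think> block.
    think = _extract(completion, "think")
    r_reasoning = 1.0 if think and think.strip() else 0.0

    # Equal weights are a placeholder; the paper may weight the terms differently.
    return r_semantic + r_accuracy + r_structure + r_reasoning
```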
Real-World Application and Impact
The researchers applied this method to train a model for the Italian medical-school entrance examinations, a domain that demands both specialized knowledge and clear, didactic rationales. The results were compelling. GRPO with the proposed semantic reward significantly improved the faithfulness and clarity of explanations compared to a strong SFT baseline. In evaluations, including those judged by external LLMs, the semantic GRPO variant achieved the highest Elo ratings and reasoning accuracy, outperforming approaches that relied on ROUGE or even LLM-as-a-judge models.
Interestingly, combining the semantic reward with ROUGE metrics slightly reduced the overall performance, suggesting that lexical-overlap pressure can dilute the intended semantic alignment. Furthermore, LLM-as-a-judge rewards were found to be less competitive and more variable, highlighting the instability and lower average quality compared to the lightweight encoder-based rewards.
Key Takeaways
This work underscores several important points:
- The semantic encoder reward is the primary driver of explanation quality.
- Mixing lexical rewards with semantic ones can be counter-productive for this task.
- Continued pre-training provides a valuable foundation, but supervised fine-tuning alone doesn’t match the gains achieved with reinforcement learning guided by a semantic reward.
- Lightweight encoder models offer a more stable and effective alternative to LLM-as-a-judge rewards for nuanced reward shaping.
By leveraging small, specialized models, this research demonstrates a practical and efficient way to guide larger language models towards generating high-quality, pedagogically sound explanations, opening new avenues for improving AI tutors and educational tools.


