TLDR: A new reinforcement learning framework, Critique-Post-Edit, significantly improves large language model personalization. It utilizes a Personalized Generative Reward Model (GRM) to provide multi-dimensional scores and textual critiques, helping models resist reward hacking. The framework also incorporates a Critique-Post-Edit mechanism, allowing the policy model to revise its own outputs based on these critiques for more targeted learning. This approach has shown substantial performance gains, with personalized Qwen2.5-14B models surpassing GPT-4.1 in personalization benchmarks.
Large language models (LLMs) are becoming increasingly sophisticated, moving beyond general assistance to personalized agents. However, truly tailoring these models to individual user preferences remains a significant challenge. Traditional methods like supervised fine-tuning (SFT) quickly hit performance limits, and standard reinforcement learning from human feedback (RLHF) often struggles with the subtleties of personalization. Scalar-based reward models, in particular, are prone to ‘reward hacking,’ where models generate verbose or superficially personalized responses to game the system rather than genuinely understand user needs.
A new research paper, Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning, introduces a robust reinforcement learning framework called Critique-Post-Edit to overcome these limitations. Authored by Chenghao Zhu, Meiling Tao, Dongyi Ding, Tiannan Wang, Yuchen Eleanor Jiang, and Wangchunshu Zhou, this framework aims to enable more faithful and controllable personalization in LLMs.
The Critique-Post-Edit Framework
The framework has two key components:
- Personalized Generative Reward Model (GRM): Unlike traditional reward models that provide a single score, the GRM offers multi-dimensional scores (for helpfulness, personalization, and naturalness) along with detailed textual critiques. This rich feedback helps resist reward hacking by explaining what needs improvement and why.
- Critique-Post-Edit Mechanism: This is where the policy model, after generating an initial response, revises its own output based on the specific critiques provided by the GRM. This iterative refinement process leads to more targeted and efficient learning.
The process works by having the policy model generate an initial response. The GRM then evaluates this response, providing both a scalar reward and textual feedback. This feedback is then used to prompt the policy model to create an edited, improved response. Both the original and edited responses are then used in the training process, providing a diverse and targeted learning signal. This approach is particularly well-suited for personalization, as there isn’t a single ‘golden’ answer for a user query; multiple nuanced responses can effectively reflect user preferences.
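The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Feedback`, `Sample`, `critique_post_edit_step`, `toy_generate`, and `toy_judge` are hypothetical, the scalar reward is assumed to be a plain mean of the three dimensions, the edit prompt template is invented, and re-scoring the edited response with the GRM is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Feedback:
    """GRM output: per-dimension scores plus a textual critique."""
    scores: Dict[str, float]  # helpfulness, personalization, naturalness
    critique: str

    def reward(self) -> float:
        # Assumption: the scalar reward is the plain mean of the dimensions.
        return sum(self.scores.values()) / len(self.scores)


@dataclass
class Sample:
    prompt: str
    response: str
    reward: float
    edited: bool


def critique_post_edit_step(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], Feedback],
) -> List[Sample]:
    """Collect the original response and its critique-guided edit."""
    first = generate(prompt)
    feedback = judge(prompt, first)
    # Hypothetical edit prompt; the paper's actual template may differ.
    edit_prompt = (
        f"{prompt}\n\nPrevious answer:\n{first}\n\n"
        f"Critique:\n{feedback.critique}\n\nRevise the answer."
    )
    second = generate(edit_prompt)
    # Assumption: the edited response is scored by the same GRM.
    feedback_edit = judge(prompt, second)
    return [
        Sample(prompt, first, feedback.reward(), edited=False),
        Sample(prompt, second, feedback_edit.reward(), edited=True),
    ]


# Toy stand-ins for the policy model and the GRM, for illustration only.
def toy_generate(text: str) -> str:
    return "revised answer" if "Critique:" in text else "draft answer"


def toy_judge(prompt: str, response: str) -> Feedback:
    score = 4.0 if response.startswith("revised") else 2.0
    dims = ("helpfulness", "personalization", "naturalness")
    return Feedback({d: score for d in dims}, "Mention the user's stated hobby.")


samples = critique_post_edit_step("Suggest a weekend plan.", toy_generate, toy_judge)
```

Both entries in `samples` would then be handed to the policy update, so the model sees its first attempt alongside the critique-guided revision.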
Impressive Performance Gains
The researchers conducted extensive evaluations on several personalization benchmarks, including PersonaFeedback, AlpacaEval, and PersonaMem, using a rigorous length-controlled evaluation protocol to ensure fair comparisons. The results demonstrate substantial improvements over standard PPO (Proximal Policy Optimization) training:
- The personalized Qwen2.5-7B model achieved an average win-rate improvement of 11% over a strong PPO baseline.
- The personalized Qwen2.5-14B model performed even better, not only matching this improvement but also surpassing the performance of GPT-4.1.
These gains were consistent across both specific and general questions, highlighting the framework’s robustness in various personalization scenarios. The 7B model also significantly outperformed GPT-4o-mini, which suggests the approach is effective at both model sizes.
Insights from Ablation Studies
An ablation study isolated the contributions of the GRM and the Critique-Post-Edit mechanism. Replacing the GRM with a traditional Bradley-Terry (BT) reward model led to a significant drop in performance and produced excessively long responses, confirming the severity of reward hacking and length bias. The GRM alone effectively mitigated length bias, but combining the GRM with critique-based editing yielded the best results, indicating that both components are needed for robust reward signals and targeted policy learning.
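For context on the baseline in that ablation, a Bradley-Terry reward model is conventionally trained on pairs of responses with a pairwise logistic loss over a single scalar score per response. The sketch below shows that standard formulation only; the function names are illustrative, and it is not taken from the paper's code.

```python
import math


def bt_preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_preference_probability(r_chosen, r_rejected))
```

Because the model emits one number with no explanation attached, any feature that correlates with preferred responses in the training pairs, such as length, can raise the score. That is consistent with the length bias reported for the BT variant.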
When the authors explored different strategies for sampling edited responses, random sampling outperformed reward-based methods. This suggests that incorporating negative samples and maintaining a balanced selection of rollouts is crucial, especially when the policy model is already initialized from a personalized SFT model.
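The difference between the two selection styles can be sketched as follows. This is an illustration under assumptions: the `Rollout` tuple layout, the function names, and the fixed seed are hypothetical, and the paper's reward-based variants may be defined differently from the simple top-k rule shown here.

```python
import random
from typing import List, Tuple

Rollout = Tuple[str, float]  # (response text, GRM reward)


def select_random(edited: List[Rollout], k: int, seed: int = 0) -> List[Rollout]:
    """Pick k edited responses uniformly at random, ignoring reward."""
    rng = random.Random(seed)
    return rng.sample(edited, min(k, len(edited)))


def select_top_reward(edited: List[Rollout], k: int) -> List[Rollout]:
    """Pick the k edited responses with the highest reward."""
    return sorted(edited, key=lambda r: r[1], reverse=True)[:k]


edited_pool = [("edit a", 1.5), ("edit b", 4.0), ("edit c", 2.5), ("edit d", 3.0)]
top_two = select_top_reward(edited_pool, 2)
random_two = select_random(edited_pool, 2)
```

Top-reward selection never lets a low-scoring edit into the batch, while random selection can, which is one way to read the remark about negative samples.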
A Practical Path Forward
This research presents a practical and effective method for developing faithfully personalized and controllable large language models. By combining generative reward modeling with structured, edit-based feedback, the Critique-Post-Edit framework offers a promising direction for scaling personalization to broader benchmarks and exploring even richer feedback modalities in the future.


