TLDR: RLMR (Reinforcement Learning with Mixed Rewards) is a new framework that helps large language models balance subjective writing quality (like emotional expression) with objective constraints (like word limits) in creative writing. It dynamically penalizes constraint-violating responses during training, yielding consistent gains in both writing quality and instruction following and outperforming previous methods.
Large language models (LLMs) are increasingly used for creative writing, from poetry to commercial copywriting. However, creative writing presents a unique challenge: balancing subjective qualities like literariness and emotional expression with objective requirements such as word limits and specific formats. Traditional reinforcement learning methods often struggle with this dual nature, either focusing too much on one aspect or using fixed reward systems that can’t adapt to different writing situations.
Existing approaches fall short in two main ways. Single reward strategies, which might only score writing quality, fail to ensure that the generated text also follows all the rules. On the other hand, methods that combine multiple rewards often use a fixed weighting system. This means they can’t dynamically adjust how much importance is given to subjective quality versus objective constraints based on how well the model is actually performing in a given scenario.
Introducing RLMR: A Dynamic Approach to Creative Writing
To overcome these limitations, researchers have proposed Reinforcement Learning with Mixed Rewards (RLMR). The framework uses a dynamic mixed-reward system that combines feedback from two specialized models: a writing reward model and a constraint verification model. The core idea behind RLMR is to dynamically adjust the weight of the constraint-following reward based on the writing quality within a group of sampled responses, ensuring that any response that violates a constraint ends up with a negative advantage during training. This is the crucial step that teaches the model to produce creative content that is both high-quality and compliant.
How RLMR Works
The RLMR framework integrates two key components, sketched in code after the list:
- Writing Reward Model: This model evaluates the subjective quality of creative writing outputs, considering aspects like literary expression, emotional depth, originality, and narrative coherence. It’s trained on human-annotated preferences to capture what makes creative writing truly good.
- Constraint Verification Model: This model acts as a strict checker, identifying any violations of objective task requirements, such as word counts, formatting rules, or specific content restrictions. It provides a binary pass/fail signal for each response.
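To make the division of labor concrete, here is a minimal Python sketch of how these two signals might be exposed to the trainer. The function names, the `Judgment` container, and the word-count check are illustrative stand-ins for the paper's learned models, not their actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Combined feedback for one sampled response (illustrative types)."""
    writing_score: float  # subjective quality in [0, 1] from the writing reward model
    constraint_ok: bool   # binary pass/fail from the constraint verification model


def writing_reward(prompt: str, response: str) -> float:
    """Placeholder for the learned writing reward model.

    In RLMR this model is trained on human-annotated preferences;
    here we return a dummy score so the sketch runs end to end.
    """
    return 0.5


def verify_constraints(response: str, max_words: int | None = None) -> bool:
    """Placeholder checker: a simple word-limit rule stands in for the
    paper's constraint verification model."""
    if max_words is not None and len(response.split()) > max_words:
        return False
    return True


def judge(prompt: str, response: str, max_words: int | None = None) -> Judgment:
    """Score one response with both reward signals."""
    return Judgment(
        writing_score=writing_reward(prompt, response),
        constraint_ok=verify_constraints(response, max_words=max_words),
    )
```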
The real innovation lies in RLMR’s Dynamic Reward Adjustment Strategy. Unlike fixed-weight systems, RLMR modifies the original rewards before the model learns from them. If a generated response violates a constraint, a penalty is applied to its reward. This penalty is calculated dynamically to ensure that all constraint-violating samples receive a negative advantage, effectively teaching the model to avoid such errors. This mechanism ensures that the model prioritizes learning from high-quality, compliant responses while actively suppressing the generation of texts that fail to meet the rules.
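The paper describes this adjustment at a high level, so the following is a hedged sketch of one way it could be implemented in a GRPO-style setup (building on the `Judgment` type above), where advantages are computed by mean-centering rewards within each sampled group. The specific penalty rule, pushing violators just below the lowest compliant reward, and the `margin` value are assumptions for illustration.

```python
def adjust_rewards(group: list[Judgment], margin: float = 0.05) -> list[float]:
    """Dynamic reward adjustment for one group of sampled responses (sketch).

    Violators are re-scored below the lowest compliant writing score, so
    after mean-centering (GRPO-style) every violator's advantage is negative
    whenever at least one sample in the group passes the constraint check.
    """
    rewards = [j.writing_score for j in group]
    compliant = [r for r, j in zip(rewards, group) if j.constraint_ok]
    # If nothing in the group passes, there is no compliant floor to anchor
    # the penalty; this sketch simply falls back to the group minimum.
    floor = min(compliant) if compliant else min(rewards)
    adjusted = [
        r if j.constraint_ok else floor - margin
        for r, j in zip(rewards, group)
    ]
    mean = sum(adjusted) / len(adjusted)
    return [a - mean for a in adjusted]  # group-relative advantages


# Toy group: the second response violates a constraint despite a high
# writing score, so it receives a negative advantage.
demo = [Judgment(0.9, True), Judgment(0.8, False), Judgment(0.6, True)]
print(adjust_rewards(demo))  # -> [~0.22, ~-0.13, ~-0.08]
```

Note that the penalty is relative to the group's own quality distribution, which is what makes the weighting "dynamic": a violator with a high raw writing score is still pushed below its compliant peers rather than being penalized by a fixed amount.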
Demonstrated Effectiveness
The effectiveness of RLMR was rigorously tested across various large language models, including different scales of Qwen and DeepSeek models. Evaluations were conducted using both automated benchmarks and human assessments, focusing on writing quality and instruction following. The results were compelling:
- RLMR consistently improved both instruction following (e.g., IFEval accuracy from 83.36% to 86.65% on Qwen2.5-32B) and writing quality (achieving a 72.75% win rate in manual expert pairwise evaluations on the WriteEval benchmark).
- It significantly outperformed baseline methods, including those using only writing rewards, only verification signals, or fixed-weight linear combinations of rewards.
- Human evaluations confirmed a strong preference for RLMR-generated content, indicating higher satisfaction and usability in creative writing tasks.
Furthermore, an analysis of training dynamics showed that RLMR successfully prevents “reward hacking” – a common issue where models learn to exploit the reward system without genuinely improving. For instance, models trained with only writing rewards often generated excessively long responses to achieve higher scores, but failed to follow length constraints. RLMR, however, maintained balanced optimization, producing high-quality content while adhering to specified lengths.
A Step Forward for Creative AI
RLMR represents a significant advancement in optimizing large language models for creative writing. By intelligently balancing subjective creative quality with objective constraint adherence through a dynamic reward adjustment mechanism, it provides a robust and efficient solution for multi-dimensional creative writing tasks. This approach not only enhances the performance of LLMs in generating creative content but also ensures that the outputs are practical and usable, meeting all specified requirements. For more details, you can refer to the original research paper: RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing.