TLDR: TinyRM is a new family of small, efficient reward models built on bidirectional masked language models (MLMs). By combining FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing, these models perform comparably to much larger models on reasoning and safety tasks. This substantially reduces the computational cost of AI alignment and suggests that lightweight bidirectional architectures can be efficient, scalable alternatives for preference modeling.
Aligning large language models (LLMs) with human preferences depends on a crucial component: the reward model (RM). Reward models guide LLMs toward helpful and harmless outputs through a process called reinforcement learning from human feedback (RLHF). Traditionally, they have been built from very large decoder-based language models, often with billions of parameters. While effective, models of this size are computationally expensive, especially as they are increasingly used in real-time applications such as guiding AI agents or filtering data.
A new research paper, titled “Tiny Reward Models,” introduces an innovative solution to this challenge. The paper, authored by Sarah Pan, presents TinyRM, a family of much smaller, more efficient reward models. These models are based on bidirectional masked language models (MLMs) and can have as few as 400 million parameters. Remarkably, TinyRM models are shown to perform comparably to models over 175 times larger on complex tasks like reasoning and safety preference modeling.
The core innovation behind TinyRM lies in its unique combination of training strategies. The researchers employed three key techniques:
FLAN-style Prompting
Instead of conventional classification methods, TinyRM uses a FLAN-style prompting approach. This method reformulates the reward modeling task as a “cloze question,” where the model predicts a masked token based on an instruction and two options (a chosen and a rejected response). This instruction-following format has been shown to be highly effective in eliciting strong language capabilities from models, even smaller ones.
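The cloze formulation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual template: the prompt wording, the `[MASK]` token, the option labels "A" and "B", and the helper names `build_cloze_prompt` and `preference_from_mask_logits` are all assumptions made for the example, and the logits are placeholders rather than real model output.

```python
import math

MASK_TOKEN = "[MASK]"  # assumed mask token; the real one depends on the tokenizer


def build_cloze_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Format a preference pair as a cloze question with one masked answer slot."""
    # Hypothetical template: the paper's exact wording is not reproduced here.
    return (
        f"Instruction: {instruction}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Question: Which response is better? Answer: {MASK_TOKEN}"
    )


def preference_from_mask_logits(logit_a: float, logit_b: float) -> dict:
    """Convert the logits of the option tokens 'A' and 'B' at the mask into probabilities."""
    # Softmax restricted to the two option tokens, shifted for numerical stability.
    top = max(logit_a, logit_b)
    exp_a = math.exp(logit_a - top)
    exp_b = math.exp(logit_b - top)
    total = exp_a + exp_b
    return {"A": exp_a / total, "B": exp_b / total}


prompt = build_cloze_prompt(
    "Explain why the sky is blue.",
    "Shorter wavelengths of sunlight scatter more in the atmosphere.",
    "Because the ocean reflects onto it.",
)
# Placeholder logits standing in for a masked language model's output at the mask position.
probs = preference_from_mask_logits(logit_a=2.0, logit_b=0.5)
preferred = max(probs, key=probs.get)
print(prompt)
print("Preferred option:", preferred)
```

In a real pipeline the two logits would be read from the model's vocabulary scores at the masked position, and the option order would typically be shuffled so the model cannot learn a positional shortcut.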
Directional Low-Rank Adaptation (DoRA)
DoRA is a parameter-efficient finetuning method (the original DoRA paper expands the name as Weight-Decomposed Low-Rank Adaptation). It decomposes each pretrained weight matrix into a magnitude component and a direction component, then applies LoRA-style low-rank updates to the direction while the magnitude is learned separately. For reasoning tasks, DoRA provided significant performance improvements, suggesting that lightweight tuning methods can unlock latent reasoning abilities in smaller models.
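To make the magnitude and direction split concrete, here is a minimal pure-Python sketch of how a DoRA-style weight can be recomposed as W' = m * (W0 + B A) / ||W0 + B A||, with the norm taken per column. The matrices are toy values chosen for illustration, and the helper names (`matmul`, `column_norms`, `dora_weight`) are hypothetical; a real implementation would use a tensor library and train `m`, `B`, and `A` by gradient descent.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """L2 norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def dora_weight(w0, magnitude, lora_b, lora_a):
    """Recompose magnitude * (W0 + B @ A) / ||W0 + B @ A||, column by column."""
    delta = matmul(lora_b, lora_a)  # low-rank update applied to the direction
    directed = [
        [w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)
    ]
    norms = column_norms(directed)
    return [[m * v / n for v, m, n in zip(row, magnitude, norms)] for row in directed]


# Toy 2x2 "pretrained" weight; the values are illustrative only.
w0 = [[3.0, 0.0], [4.0, 2.0]]
magnitude = column_norms(w0)  # the magnitude vector starts as the column norms of W0
lora_a = [[0.1, -0.2]]        # rank-1 factor A (1 x 2)
lora_b_init = [[0.0], [0.0]]  # B starts at zero, as in LoRA

# With B = 0 the recomposed weight equals the pretrained weight.
at_init = dora_weight(w0, magnitude, lora_b_init, lora_a)

# After a hypothetical update to B, the direction changes but each column
# keeps the norm stored in `magnitude`.
lora_b = [[1.0], [-1.0]]
updated = dora_weight(w0, magnitude, lora_b, lora_a)
print(at_init)
print(column_norms(updated))
```

The design point this illustrates is that the low-rank factors only steer the direction of each column, while the scale is controlled by the separate magnitude vector.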
Layer Freezing
To further enhance efficiency and focus the finetuning process, the lower layers of the model are frozen. This preserves the general language representations learned during pre-training, allowing the task-specific upper layers to be more effectively tuned for preference modeling. This strategy helps in achieving strong performance with fewer resources.
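As a rough sketch of how such freezing might be wired up, the snippet below decides which parameters stay trainable based on their names. The `encoder.layer.<i>.` naming scheme, the choice to freeze the embeddings as well, the cutoff of two frozen layers, and the helper name `is_trainable` are all assumptions for illustration; the source does not specify them.

```python
import re

# Assumed naming scheme: transformer block i owns parameters named "encoder.layer.<i>. ...".
# Real checkpoints may use a different prefix, so treat this pattern as hypothetical.
LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name: str, num_frozen_layers: int) -> bool:
    """Return False for embeddings and for the lowest `num_frozen_layers` blocks."""
    if param_name.startswith("embeddings."):
        return False  # assumption: embeddings sit below every block and are frozen too
    match = LAYER_PATTERN.search(param_name)
    if match is None:
        return True  # anything outside the block stack, e.g. the prediction head
    return int(match.group(1)) >= num_frozen_layers


# Illustrative parameter names for a 4-block model.
param_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.0.attention.weight",
    "encoder.layer.1.attention.weight",
    "encoder.layer.2.attention.weight",
    "encoder.layer.3.attention.weight",
    "head.decoder.weight",
]

trainable = [name for name in param_names if is_trainable(name, num_frozen_layers=2)]
print(trainable)
```

With PyTorch, the same decision could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, k)` for each parameter.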
The TinyRM models were evaluated on RewardBench, a benchmark that assesses reward models across different categories: Chat, Reasoning, and Safety. The results were particularly impressive in the Reasoning and Safety domains. For instance, the large specialist TinyRM (400 million parameters) was competitive with a 70-billion-parameter model in the Reasoning task. This indicates that even small models, when trained with domain-specific strategies, can exhibit strong performance in areas requiring complex understanding and decision-making.
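For readers unfamiliar with how pairwise preference benchmarks are typically scored, the sketch below computes per-category accuracy, counting a pair as correct when the reward model scores the chosen response above the rejected one. This assumes the usual pairwise scoring convention; the scores are made-up placeholders, not results from the paper, and `accuracy_by_category` is a hypothetical helper name.

```python
from collections import defaultdict


def accuracy_by_category(examples):
    """Per-category share of pairs where the chosen response outscores the rejected one.

    `examples` is an iterable of (category, chosen_score, rejected_score) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, chosen_score, rejected_score in examples:
        total[category] += 1
        if chosen_score > rejected_score:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up reward scores for illustration only; these are not results from the paper.
toy_examples = [
    ("Chat", 0.2, 0.7),
    ("Chat", 0.9, 0.4),
    ("Reasoning", 1.3, -0.5),
    ("Reasoning", 0.8, 0.1),
    ("Safety", 0.6, 0.2),
    ("Safety", -0.1, 0.3),
]

results = accuracy_by_category(toy_examples)
for category, accuracy in results.items():
    print(f"{category}: {accuracy:.2f}")
```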
However, the paper also highlights some limitations. TinyRM models faced challenges in the Chat domain, which involves open-ended conversational preferences. The researchers hypothesize this might be due to the extensive conversational finetuning that larger, decoder-based LLMs typically receive, and a scarcity of high-quality, open-source conversational preference data for training smaller models. Despite this, preliminary improvements were observed by performing supervised finetuning on conversational data.
The implications of TinyRM are significant. By demonstrating that effective reward models can be built at a fraction of the computational cost, this research opens doors for more accessible and deployable preference learning systems. It suggests that focusing on eliciting existing capabilities through smart tuning strategies, rather than simply scaling up model size, can be a powerful approach. This work paves the way for more efficient AI alignment techniques, making advanced AI more sustainable and widely applicable.


