TLDR: A new research paper argues that AI reward models, used for training large language models (LLMs), and evaluation metrics, used for assessing LLM performance, are essentially performing the same task but have evolved separately. This separation has led to duplicated challenges and missed opportunities. The paper advocates for a closer collaboration between these two fields, demonstrating how insights and techniques from one can significantly benefit the other, particularly in areas like preventing reward hacking and improving data quality. It also highlights the importance of shared methodologies and terminology while cautioning against a complete merger that could stifle innovation.
A recent preprint titled “Reward Models are Metrics in a Trench Coat” by Sebastian Gehrmann from Bloomberg sheds light on a critical, yet often overlooked, connection in the world of large language models (LLMs). The paper argues that two seemingly distinct research areas – reward models and evaluation metrics – are fundamentally similar and would greatly benefit from closer collaboration.
Reinforcement learning (RL) plays a significant role in refining LLMs, helping them align with desired behaviors and adapt to various tasks. A key component of this process is the reward model, which assesses the quality of an LLM’s output to generate signals for training. This function, the paper points out, is strikingly similar to what evaluation metrics do: monitor an AI model’s performance.
Despite these strong parallels, the research paper highlights that these two fields have largely developed in isolation. This separation has led to redundant terminology and repeated challenges. Common issues faced by both include susceptibility to spurious correlations (where models optimize for unintended patterns), the risk of reward hacking (where models exploit flaws in the reward system), difficulties in improving data quality, and challenges in meta-evaluation (evaluating the evaluators themselves).
The Case for Collaboration
Gehrmann’s position paper advocates for a closer alignment between reward models and evaluation metrics to overcome these shared problems. To support this argument, the paper presents evidence showing how traditional metrics can sometimes outperform reward models on specialized tasks. It also provides an extensive survey of both fields, identifying multiple research areas where collaboration could lead to significant improvements. These areas include better methods for eliciting preferences, more effective ways to avoid spurious correlations and reward hacking, and meta-evaluation techniques that are aware of calibration issues.
Historically, the development of reward models has been influenced by the broader adoption of RL in deep learning. Early RL approaches for language generation struggled with sparse reward signals due to the vast complexity of language. This led to methods that used evaluation metrics as reward functions, aiming to optimize directly for human-preferred outcomes during training. Initially, these reward functions relied on lexical overlap metrics like BLEU and ROUGE, whose known limitations could lead to reward hacking. The introduction of semantic similarity metrics helped mitigate some of these issues, paving the way for more robust reward models.
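To make the reward hacking risk of lexical overlap concrete, here is a minimal Python sketch. It is not taken from the paper: the function name `unigram_precision` and the example sentences are illustrative assumptions, and the score is a simplified unigram precision, not full BLEU or ROUGE. It shows how a naive overlap reward can rank a degenerate output above a sensible one, and how count clipping (the idea behind BLEU's modified precision) removes that particular exploit.

```python
from collections import Counter


def unigram_precision(hypothesis, reference, clip=True):
    """Share of hypothesis tokens that also appear in the reference.

    With clip=True, each token is credited at most as often as it occurs
    in the reference. With clip=False, every occurrence is credited.
    """
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for token, count in Counter(hyp_tokens).items():
        if clip:
            matched += min(count, ref_counts[token])
        elif token in ref_counts:
            matched += count
    return matched / len(hyp_tokens)


# Illustrative data (assumption, not from the paper).
reference = "the cat sat on the mat"
sensible = "a cat sat on the mat"
degenerate = "the the the the the the"

# Unclipped overlap rewards the degenerate output with a perfect score.
print(unigram_precision(degenerate, reference, clip=False))  # 1.0
print(unigram_precision(sensible, reference, clip=False))
# Clipping restores the intended ordering.
print(unigram_precision(degenerate, reference, clip=True))
print(unigram_precision(sensible, reference, clip=True))
```

A policy trained against the unclipped score would have an incentive to emit frequent reference words repeatedly, which is the kind of exploit that semantic similarity metrics and learned reward models were meant to reduce.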
The concept of Reinforcement Learning from Human Feedback (RLHF), popularized by work in game playing and robotics, further emphasized the role of dedicated reward models to capture human preferences. However, the paper notes a continued disconnect, with reward model benchmarks and evaluation metric benchmarks often existing in parallel without meaningful interaction.
Quantifying the Divide
The paper quantifies this separation through a citation analysis. It observes that while research on reward modeling and “LLM-as-a-judge” (using LLMs for evaluation) has rapidly grown, the number of papers specifically mentioning “evaluation metrics” has declined. The analysis of citation graphs reveals that inter-field citations are rare, accounting for less than 10% of all cited papers. Reward model research, in particular, tends to cite other reward model papers and focuses heavily on machine learning venues, rather than drawing from the broader NLP and computer vision communities where evaluation metrics are more prevalent.
Two experiments further illustrate the potential benefits of cross-field interaction. In one experiment, a three-year-old machine translation evaluation metric, CometKiwi, performed comparably to or even outperformed much larger, newer LLM-based reward models on a challenging translation benchmark. This suggests that “sophisticated mechanisms” needed in reward models might already exist within the metrics field. In another experiment, LLM-as-a-judge models underperformed dedicated metrics on a factuality evaluation benchmark for summarization, indicating that for specialized tasks, dedicated metrics can still be superior.
Similarities and Differences
While both fields aim to align with human preferences, they are not identical. Evaluation metrics often focus on narrow, clearly defined quality aspects, emphasizing transparency and reproducibility. Reward models, especially those used for training LLMs, need to assess a broader range of preferences, including safety and refusal of undesired requests, making them more application-specific and less transferable. However, the paper notes a convergence in “aspect-aligned reward models” that score rubrics, similar to how dedicated metrics assess fine-grained criteria.
Data collection methodologies and optimization targets also present shared challenges and opportunities. Both fields grapple with collecting high-quality human feedback from diverse and expert raters. The choice between pairwise preferences and continuous scores for evaluation also impacts model development and benchmarking practices. Both can benefit from advances in model compression techniques.
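The relationship between continuous scores and pairwise preferences can be sketched in a few lines of Python. This example assumes the Bradley-Terry formulation, a common choice in reward modeling that the summary above does not name explicitly; the function names, response IDs, and scores are hypothetical. It converts scalar scores into a preference probability and measures how often a scorer agrees with human pairwise judgments.

```python
import math


def preference_probability(score_a, score_b):
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_accuracy(scores, preferences):
    """Fraction of (winner, loser) pairs that the scores order correctly."""
    if not preferences:
        return 0.0
    correct = sum(
        1 for winner, loser in preferences if scores[winner] > scores[loser]
    )
    return correct / len(preferences)


# Hypothetical scalar scores from a metric or reward model.
scores = {"resp_1": 2.0, "resp_2": 0.5, "resp_3": -1.0}

# Hypothetical human judgments as (preferred, rejected) pairs.
# The last pair disagrees with the scores on purpose.
preferences = [
    ("resp_1", "resp_2"),
    ("resp_2", "resp_3"),
    ("resp_3", "resp_1"),
]

print(preference_probability(scores["resp_1"], scores["resp_2"]))
print(pairwise_accuracy(scores, preferences))
```

Any scorer that outputs a scalar can be benchmarked on pairwise data this way, which is one reason the same model can be evaluated on both metric and reward model benchmarks.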
Identifying and debugging reward hacking is another shared concern. Reward hacking, where models exploit flaws to maximize scores without achieving the intended behavior, is not unique to reward models; it’s a known issue in many classification models. Both fields can learn from diagnostic datasets, distractor generation, and model interpretability to address spurious correlations.
Meta-evaluation, the process of evaluating evaluators, is an area where the fields have diverged significantly. The paper suggests that reward model benchmarking could benefit from adopting practices from metrics, such as reporting segment-level performance and assessing score calibration, to better reflect downstream model performance.
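As a rough illustration of the two meta-evaluation practices mentioned above, the following Python sketch computes a segment-level rank correlation (Kendall's tau-a) and a simple expected calibration error. The function names, the binning scheme, and the toy inputs are assumptions made for this example and are not the paper's protocol.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Segment-level Kendall's tau-a between metric and human scores."""
    n = len(metric_scores)
    n_pairs = n * (n - 1) // 2
    if n_pairs == 0:
        return 0.0
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    # Tied pairs count toward the denominator but not the numerator.
    return (concordant - discordant) / n_pairs


def expected_calibration_error(probs, outcomes, n_bins=5):
    """Weighted gap between mean predicted probability and observed rate.

    probs are predicted probabilities in [0, 1]; outcomes are 0 or 1.
    """
    if not probs:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        mean_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_conf - mean_rate)
    return ece


# Toy inputs (assumption): per-segment scores and preference predictions.
print(kendall_tau_a([0.9, 0.4, 0.7, 0.1], [5, 2, 3, 1]))
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A scorer can rank segments well and still be poorly calibrated, so reporting both numbers gives a fuller picture than a single pairwise accuracy figure.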
Recommendations for the Future
The paper concludes with several recommendations. It emphasizes that both fields need high-quality training and meta-evaluation data, grounded in clear definitions and relevant sociotechnical contexts. Identifying and mitigating spurious correlations is crucial for both. New methods for modeling human preferences should be evaluated across both metrics and reward model benchmarks to provide a comprehensive understanding of their performance.
However, the paper cautions against a complete collapse of the two fields into a monoculture. Citing Goodhart’s law – “when a measure becomes a target, it ceases to be a good measure” – it argues that over-reliance on a single benchmark can lead to overfitting and a lack of genuine technological advancement. Instead, the recommendation is for the fields to share insights into methodologies and terminologies, fostering a symbiotic relationship where improvements in one can inform the other, without losing their distinct focuses. For more details, see the full paper.


