
Scaling AI Evaluation: Salesforce AI Introduces Foundational Automatic Reasoning Evaluators (FARE)

TLDR: Salesforce AI Research has developed FARE, a new family of 8B and 20B parameter AI evaluators. Trained on a massive 2.5 million sample dataset across five evaluation tasks and multiple reasoning domains using a simple iterative rejection-sampling finetuning approach, FARE-8B and FARE-20B demonstrate superior performance over existing specialized and larger evaluators. They prove highly effective in real-world applications like inference-time reranking, reinforcement learning verification, and domain-specific finetuning, setting a new standard for open-source evaluation.

Salesforce AI Research has unveiled a new family of automatic evaluators called Foundational Automatic Reasoning Evaluators (FARE), marking a significant advancement in how large language models (LLMs) are assessed. This research, detailed in their preprint titled “Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains,” addresses the growing need for scalable and versatile evaluation methods in the rapidly evolving field of AI. The paper was authored by Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, and Shafiq Joty.

The core challenge FARE aims to tackle is the increasing demand for efficient and accurate evaluation of LLM outputs, both during training and at test time. While previous efforts have often focused on new methodologies like reinforcement learning (RL) for training evaluators, the Salesforce AI team emphasized a data-driven approach, curating an extensive dataset of 2.5 million samples.

A New Approach to Data and Training

The FARE project distinguishes itself by focusing on large-scale, multi-task, and multi-domain data. This massive dataset spans five unique evaluation tasks: pairwise comparisons, step-level evaluation, reference-free verification, reference-based verification, and single rating. Crucially, it covers diverse domains with a strong emphasis on reasoning, including math, code, tool-use evaluation, and natural language reasoning. This comprehensive data collection ensures that FARE evaluators are well-rounded and capable of handling a wide array of scenarios.
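
To make the multi-task, multi-domain structure concrete, here is a minimal sketch of how such evaluation samples could be represented. The schema, field names, and example records below are illustrative assumptions, not the authors' actual data format.

```python
# Hypothetical schema for multi-task evaluation samples; illustrative only.
from dataclasses import dataclass

@dataclass
class EvalSample:
    task: str                    # e.g. "pairwise", "step-level", "reference-free",
                                 #      "reference-based", "single-rating"
    domain: str                  # e.g. "math", "code", "tool-use", "nl-reasoning"
    prompt: str                  # original problem or instruction
    responses: list[str]         # candidate response(s) to be judged
    reference: str | None = None # ground-truth answer, if the task uses one
    label: str = ""              # target judgment used as SFT supervision

samples = [
    EvalSample(task="pairwise", domain="math",
               prompt="Solve 3x + 5 = 20.",
               responses=["x = 5", "x = 7"], label="A"),
    EvalSample(task="reference-based", domain="code",
               prompt="Write a function that reverses a string.",
               responses=["def rev(s): return s[::-1]"],
               reference="def rev(s): return s[::-1]", label="correct"),
]
```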

For training, the researchers employed a straightforward yet effective iterative rejection-sampling supervised finetuning (SFT) approach. This method, known as RS-SFT, offers a stable and computationally efficient way to train evaluators at scale, avoiding common issues like distribution shifts that can arise with other training paradigms. The process involves sampling multiple candidate evaluations from the model being trained, keeping only those whose judgments agree with the ground truth, finetuning on these accepted responses, and then repeating the cycle.
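
The loop below is a minimal sketch of iterative rejection-sampling SFT under these assumptions. The generate(), finetune(), and judgment-parsing helpers are placeholders standing in for a real LLM training stack; they are not the authors' code, and the "Final judgment:" output format is a hypothetical convention.

```python
import re

def extract_judgment(evaluation_text: str) -> str:
    """Pull the final verdict out of a generated evaluation (illustrative format)."""
    match = re.search(r"Final judgment:\s*(\w+)", evaluation_text)
    return match.group(1) if match else ""

def rs_sft(model, dataset, num_rounds=3, samples_per_prompt=8):
    """One model, several rounds: sample evaluations, keep correct ones, finetune."""
    for _ in range(num_rounds):
        accepted = []
        for example in dataset:
            # Sample several candidate evaluations from the current model.
            candidates = [model.generate(example.prompt)
                          for _ in range(samples_per_prompt)]
            # Rejection step: keep only candidates whose verdict matches ground truth.
            correct = [c for c in candidates
                       if extract_judgment(c) == example.label]
            if correct:
                accepted.append((example.prompt, correct[0]))
        # Supervised finetuning on the accepted (prompt, evaluation) pairs.
        model = model.finetune(accepted)
    return model
```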

Unprecedented Performance Across Benchmarks

The FARE family includes two models: FARE-8B and FARE-20B (with 3.6B active parameters). These models have demonstrated remarkable performance, challenging and often surpassing larger, more specialized evaluators. FARE-8B, despite its smaller size, competes effectively with larger RL-trained evaluators. FARE-20B, on the other hand, sets a new benchmark for open-source evaluators, outperforming specialized models with over 70 billion parameters.

Beyond static benchmarks, FARE was rigorously evaluated in real-world applications:

  • Inference-time Reranking: When used as a reranker during inference, FARE-20B achieved near-oracle performance on complex math problems (MATH benchmark), significantly boosting the quality of generated responses; a best-of-N reranking sketch follows this list.
  • RL Training Verification: As verifiers in reinforcement learning training, FARE models improved the performance of downstream RL-trained models by up to 14.1% compared to traditional string-matching verifiers. This highlights their ability to provide more nuanced and effective feedback for model improvement.
  • Domain-Specific Finetuning: FARE-Code, a version continually finetuned from a FARE initialization, outperformed gpt-oss-20B by 65% in evaluating test-case quality for code generation, demonstrating FARE's adaptability to specific domains with minimal additional training.
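
As referenced above, best-of-N reranking with an evaluator at inference time can be sketched as follows. The evaluator.score() call is a stand-in for prompting FARE to rate or compare candidates; it is an assumed interface, not a documented API.

```python
def rerank_best_of_n(generator, evaluator, problem: str, n: int = 8) -> str:
    """Generate n candidate solutions and return the one the evaluator rates highest."""
    candidates = [generator.generate(problem) for _ in range(n)]
    scores = [evaluator.score(problem, candidate) for candidate in candidates]
    # Pick the candidate with the highest evaluator score (ties broken by order).
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```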

The design philosophy behind FARE emphasizes efficiency and precision. The models are built for low-latency evaluation, crucial for tasks like inference-time reranking. They also utilize compact “thinking” chains-of-thought (CoTs) and are designed to avoid generating reference answers themselves, which can sometimes degrade performance if the generated reference is incorrect.

The data curation process combined existing high-quality datasets with synthetically generated data. Synthetic data was created through programmatic error injection (especially for tool-use scenarios) and a “generate-then-grade” strategy, where responses from various generator models were graded against verifiable ground-truth answers. This blend ensured a diverse and challenging training set.
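
A rough sketch of the "generate-then-grade" idea is shown below: sample responses from several generator models, grade each against a verifiable ground-truth answer, and keep the graded pairs as evaluator training data. The check_against_ground_truth callable is a placeholder for whatever domain-appropriate verifier is available (string match, unit tests, symbolic checker); all names here are assumptions.

```python
def generate_then_grade(prompts, generators, check_against_ground_truth):
    """Build synthetic evaluator training data from graded generator outputs."""
    dataset = []
    for prompt, ground_truth in prompts:
        for generator in generators:
            response = generator.generate(prompt)
            # The verifier's verdict becomes the supervision label for the evaluator.
            is_correct = check_against_ground_truth(response, ground_truth)
            dataset.append({
                "prompt": prompt,
                "response": response,
                "label": "correct" if is_correct else "incorrect",
            })
    return dataset
```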

In conclusion, FARE represents a significant leap forward in automatic evaluation for LLMs. By combining a vast, multi-task, multi-domain dataset with a stable and efficient training methodology, Salesforce AI Research has created a family of evaluators that are not only high-performing but also versatile and adaptable to various real-world AI development and deployment scenarios. For more details, see the full research paper.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
