
Scaling AI Evaluation: Salesforce AI Introduces Foundational Automatic Reasoning Evaluators (FARE)

TLDR: Salesforce AI Research has developed FARE, a new family of 8B and 20B parameter AI evaluators. Trained on a massive 2.5 million sample dataset across five evaluation tasks and multiple reasoning domains using a simple iterative rejection-sampling finetuning approach, FARE-8B and FARE-20B demonstrate superior performance over existing specialized and larger evaluators. They prove highly effective in real-world applications like inference-time reranking, reinforcement learning verification, and domain-specific finetuning, setting a new standard for open-source evaluation.

Salesforce AI Research has unveiled a new family of automatic evaluators called Foundational Automatic Reasoning Evaluators (FARE), marking a significant advancement in how large language models (LLMs) are assessed. This research, detailed in their preprint titled “Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains,” addresses the growing need for scalable and versatile evaluation methods in the rapidly evolving field of AI. The paper was authored by Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, and Shafiq Joty.

The core challenge FARE aims to tackle is the increasing demand for efficient and accurate evaluation of LLM outputs, both during training and at test time. While previous efforts have often focused on new methodologies like reinforcement learning (RL) for training evaluators, the Salesforce AI team emphasized a data-driven approach, curating an extensive dataset of 2.5 million samples.

A New Approach to Data and Training

The FARE project distinguishes itself by focusing on large-scale, multi-task, and multi-domain data. This massive dataset spans five unique evaluation tasks: pairwise comparisons, step-level evaluation, reference-free verification, reference-based verification, and single rating. Crucially, it covers diverse domains with a strong emphasis on reasoning, including math, code, tool-use evaluation, and natural language reasoning. This comprehensive data collection ensures that FARE evaluators are well-rounded and capable of handling a wide array of scenarios.
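
To make the multi-task, multi-domain structure concrete, here is a minimal sketch of how such evaluation samples could be represented. The schema, field names, and example records below are illustrative assumptions, not the authors' actual data format.

```python
# Hypothetical schema for multi-task evaluation samples; illustrative only.
from dataclasses import dataclass

@dataclass
class EvalSample:
    task: str                    # e.g. "pairwise", "step-level", "reference-free",
                                 #      "reference-based", "single-rating"
    domain: str                  # e.g. "math", "code", "tool-use", "nl-reasoning"
    prompt: str                  # original problem or instruction
    responses: list[str]         # candidate response(s) to be judged
    reference: str | None = None # ground-truth answer, if the task uses one
    label: str = ""              # target judgment used as SFT supervision

samples = [
    EvalSample(task="pairwise", domain="math",
               prompt="Solve 3x + 5 = 20.",
               responses=["x = 5", "x = 7"], label="A"),
    EvalSample(task="reference-based", domain="code",
               prompt="Write a function that reverses a string.",
               responses=["def rev(s): return s[::-1]"],
               reference="def rev(s): return s[::-1]", label="correct"),
]
```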

For training, the researchers employed a straightforward yet effective iterative rejection-sampling supervised finetuning (SFT) approach. This method, known as RS-SFT, offers a stable and computationally efficient way to train evaluators at scale, avoiding common issues like distribution shifts that can arise with other training paradigms. The process involves sampling multiple candidate evaluations from the model being trained, keeping only those whose judgments agree with the ground truth, finetuning on these accepted responses, and then repeating the cycle.
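
The loop below is a minimal sketch of iterative rejection-sampling SFT under these assumptions. The generate(), finetune(), and judgment-parsing helpers are placeholders standing in for a real LLM training stack; they are not the authors' code, and the "Final judgment:" output format is a hypothetical convention.

```python
import re

def extract_judgment(evaluation_text: str) -> str:
    """Pull the final verdict out of a generated evaluation (illustrative format)."""
    match = re.search(r"Final judgment:\s*(\w+)", evaluation_text)
    return match.group(1) if match else ""

def rs_sft(model, dataset, num_rounds=3, samples_per_prompt=8):
    """One model, several rounds: sample evaluations, keep correct ones, finetune."""
    for _ in range(num_rounds):
        accepted = []
        for example in dataset:
            # Sample several candidate evaluations from the current model.
            candidates = [model.generate(example.prompt)
                          for _ in range(samples_per_prompt)]
            # Rejection step: keep only candidates whose verdict matches ground truth.
            correct = [c for c in candidates
                       if extract_judgment(c) == example.label]
            if correct:
                accepted.append((example.prompt, correct[0]))
        # Supervised finetuning on the accepted (prompt, evaluation) pairs.
        model = model.finetune(accepted)
    return model
```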

Unprecedented Performance Across Benchmarks

The FARE family includes two models: FARE-8B and FARE-20B (with 3.6B active parameters). These models have demonstrated remarkable performance, challenging and often surpassing larger, more specialized evaluators. FARE-8B, despite its smaller size, competes effectively with larger RL-trained evaluators. FARE-20B, on the other hand, sets a new benchmark for open-source evaluators, outperforming specialized models with over 70 billion parameters.

Beyond static benchmarks, FARE was rigorously evaluated in real-world applications:

  • Inference-time Reranking: When used as a reranker during inference, FARE-20B achieved near-oracle performance on complex math problems (MATH benchmark), significantly boosting the quality of generated responses; a best-of-N reranking sketch follows this list.
  • RL Training Verification: As verifiers in reinforcement learning training, FARE models improved the performance of downstream RL-trained models by up to 14.1% compared to traditional string-matching verifiers. This highlights their ability to provide more nuanced and effective feedback for model improvement.
  • Domain-Specific Finetuning: FARE-Code, a version continually finetuned from a FARE initialization, outperformed gpt-oss-20B by 65% in evaluating test-case quality for code generation, demonstrating FARE's adaptability to specific domains with minimal additional training.
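
As referenced above, best-of-N reranking with an evaluator at inference time can be sketched as follows. The evaluator.score() call is a stand-in for prompting FARE to rate or compare candidates; it is an assumed interface, not a documented API.

```python
def rerank_best_of_n(generator, evaluator, problem: str, n: int = 8) -> str:
    """Generate n candidate solutions and return the one the evaluator rates highest."""
    candidates = [generator.generate(problem) for _ in range(n)]
    scores = [evaluator.score(problem, candidate) for candidate in candidates]
    # Pick the candidate with the highest evaluator score (ties broken by order).
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```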

The design philosophy behind FARE emphasizes efficiency and precision. The models are built for low-latency evaluation, crucial for tasks like inference-time reranking. They also utilize compact “thinking” chains-of-thought (CoTs) and are designed to avoid generating reference answers themselves, which can sometimes degrade performance if the generated reference is incorrect.

The data curation process combined existing high-quality datasets with synthetically generated data. Synthetic data was created through programmatic error injection (especially for tool-use scenarios) and a “generate-then-grade” strategy, where responses from various generator models were graded against verifiable ground-truth answers. This blend ensured a diverse and challenging training set.
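
A rough sketch of the "generate-then-grade" idea is shown below: sample responses from several generator models, grade each against a verifiable ground-truth answer, and keep the graded pairs as evaluator training data. The check_against_ground_truth callable is a placeholder for whatever domain-appropriate verifier is available (string match, unit tests, symbolic checker); all names here are assumptions.

```python
def generate_then_grade(prompts, generators, check_against_ground_truth):
    """Build synthetic evaluator training data from graded generator outputs."""
    dataset = []
    for prompt, ground_truth in prompts:
        for generator in generators:
            response = generator.generate(prompt)
            # The verifier's verdict becomes the supervision label for the evaluator.
            is_correct = check_against_ground_truth(response, ground_truth)
            dataset.append({
                "prompt": prompt,
                "response": response,
                "label": "correct" if is_correct else "incorrect",
            })
    return dataset
```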

In conclusion, FARE represents a significant leap forward in automatic evaluation for LLMs. By combining a vast, multi-task, multi-domain dataset with a stable and efficient training methodology, Salesforce AI Research has created a family of evaluators that are not only high-performing but also versatile and adaptable to various real-world AI development and deployment scenarios. For more details, see the full research paper.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
