GradeSQL: Enhancing Text-to-SQL Performance with Outcome Reward Models

TLDR: GradeSQL introduces a framework for training Outcome Reward Models (ORMs) to rank SQL queries generated by Large Language Models (LLMs). By scoring candidates on semantic correctness rather than surface-level heuristics, GradeSQL improves Text-to-SQL execution accuracy on the BIRD and Spider benchmarks, outperforming execution-based Best-of-N and Majority Voting. The research shows that ORMs train consistently across LLM families, benefit from dataset balancing, and remain effective with larger generator models, making ORMs a more reliable way to select accurate SQL.

Large Language Models (LLMs) have significantly advanced the field of Text-to-SQL, which involves translating natural language questions into SQL queries. This progress has made databases more accessible to a wider range of users. However, LLMs still face challenges when dealing with complex queries that require a precise understanding of user intent and the database structure.

To address these limitations, researchers often use test-time strategies like Best-of-N (BoN) and Majority Voting (Maj). These methods assume that an LLM can generate a correct answer but might need multiple attempts. BoN selects one candidate using a simple heuristic, typically a query that is syntactically valid or executes without error, while Maj chooses the most frequently generated one. The issue with these approaches is their reliance on surface-level heuristics, which do not guarantee semantic correctness, that is, that the query truly reflects the user’s intention.
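To make the weakness of these heuristics concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table, the question, and the candidate queries are invented for illustration. Two details are assumptions of this sketch rather than something the article specifies: the BoN heuristic is implemented as "first candidate that executes", and the vote is taken over execution results rather than over query strings.

```python
import sqlite3

def run_query(conn, sql):
    """Return the result rows as an order-insensitive key, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))

def first_executable(conn, candidates):
    """Heuristic BoN (assumed variant): first candidate that runs without error."""
    for sql in candidates:
        if run_query(conn, sql) is not None:
            return sql
    return None

def majority_vote(conn, candidates):
    """Maj (assumed variant): group by execution result, pick the largest group."""
    groups = {}
    for sql in candidates:
        key = run_query(conn, sql)
        if key is not None:
            groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]

# Toy database and question: "How many employees earn more than 60?"
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp(name TEXT, salary INTEGER);
INSERT INTO emp VALUES ('Ann', 50), ('Bob', 70), ('Cy', 90);
""")
candidates = [
    "SELEC COUNT(*) FROM emp",                      # syntax error
    "SELECT COUNT(*) FROM emp",                     # runs, but ignores the filter
    "SELECT COUNT(*) FROM emp WHERE salary >= 50",  # runs, wrong threshold
    "SELECT COUNT(*) FROM emp WHERE salary > 60",   # matches the question
]

bon_choice = first_executable(conn, candidates)
maj_choice = majority_vote(conn, candidates)
```

In this toy pool both heuristics settle on the unfiltered count: it is the first query that runs, and it shares its result with another wrong candidate, so it also wins the vote. The only query that matches the question is never selected.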

A new research paper, titled “GradeSQL: Outcome Reward Models for Ranking SQL Queries from Large Language Models” by Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, and Tommaso Di Noia, introduces a novel framework to tackle this problem. The paper explores the use of Outcome Reward Models (ORMs) as a more effective heuristic for BoN. ORMs assign utility scores to generated outputs based on their semantic correctness, aligning model predictions more closely with user intent.
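Using an ORM as the BoN heuristic amounts to replacing the selection rule with a learned score. The sketch below shows only that selection step. The `Scorer` signature, the `orm_best_of_n` helper, and the hard-coded `TOY_SCORES` are hypothetical stand-ins, not the paper's interface; a real ORM would be a fine-tuned LLM that returns a score for each (question, schema, SQL) triple.

```python
from typing import Callable, Sequence

# Hypothetical scorer signature: (question, schema, sql) -> utility score.
Scorer = Callable[[str, str, str], float]

def orm_best_of_n(question: str, schema: str,
                  candidates: Sequence[str], score: Scorer) -> str:
    """Return the candidate with the highest ORM score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda sql: score(question, schema, sql))

# Stand-in for a trained ORM: fixed, made-up scores for illustration only.
TOY_SCORES = {
    "SELECT COUNT(*) FROM emp": 0.12,
    "SELECT COUNT(*) FROM emp WHERE salary > 60": 0.91,
}

def toy_score(question: str, schema: str, sql: str) -> float:
    return TOY_SCORES.get(sql, 0.0)

best = orm_best_of_n(
    "How many employees earn more than 60?",
    "emp(name TEXT, salary INTEGER)",
    list(TOY_SCORES),
    toy_score,
)
```

Because the selection rule is just an argmax over scores, the quality of the result depends entirely on how well the scorer tracks semantic correctness, which is what the training procedure is meant to provide.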

The GradeSQL Framework

The GradeSQL framework consists of three main stages designed to train ORMs for the Text-to-SQL task:

1. Candidate Generation: For a given natural language question and database schema, a powerful LLM (the generator) produces a diverse set of N candidate SQL queries. This pool includes both correct and incorrect but syntactically valid queries.
2. Data Labeling: Each candidate query is then labeled as correct or incorrect. A query is deemed correct if its execution result matches that of the gold-standard query. Incorrect queries either return different results or cause execution errors. This process creates a dataset of positive and negative examples.
3. Supervised Fine-Tuning (SFT): A separate LLM is fine-tuned using this labeled dataset to act as the ORM. The ORM learns to score candidate SQL queries based on their alignment with the original user intent and database schema. At inference, this trained ORM functions as a post-generation re-ranking module, selecting the most semantically faithful query from the candidate pool.
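The data labeling stage can be sketched with the standard `sqlite3` module. This is an illustrative sketch, not the authors' released code: the helper names, the toy table, and the queries are invented, and comparing results as order-insensitive row lists is an assumption (a stricter comparison would be needed for questions where row order matters).

```python
import sqlite3

def execute(conn, sql):
    """Return the rows in a canonical order, or None if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def label_candidates(conn, gold_sql, candidates):
    """Label each candidate 1 if its result matches the gold result, else 0."""
    gold = execute(conn, gold_sql)
    if gold is None:
        raise ValueError("gold query failed to execute")
    # A failed candidate yields None, which never equals the gold rows.
    return [(sql, int(execute(conn, sql) == gold)) for sql in candidates]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp(name TEXT, salary INTEGER);
INSERT INTO emp VALUES ('Ann', 50), ('Bob', 70), ('Cy', 90);
""")

gold_sql = "SELECT COUNT(*) FROM emp WHERE salary > 60"
candidates = [
    "SELECT COUNT(*) FROM emp WHERE salary >= 70",  # different text, same result
    "SELECT COUNT(*) FROM emp",                     # runs, different result
    "SELECT COUNT(*) FROM employees",               # execution error: no such table
]
labeled = label_candidates(conn, gold_sql, candidates)
labels = [label for _, label in labeled]
```

The first candidate is labeled correct even though its text differs from the gold query, which is the point of labeling by execution result: the ORM's training signal reflects what a query returns, not how it is written.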

The researchers evaluated their ORMs on two widely used Text-to-SQL benchmarks: BIRD and Spider. They fine-tuned various open-source LLMs, including models from the Qwen2, Granite3, and Llama3 families, to serve as ORMs. ORMs consistently outperformed execution-based BoN (ex-BoN) and Majority Voting (Maj) in execution accuracy. On the BIRD benchmark, ORMs improved by +4.33% over ex-BoN and +2.91% over Maj. On the Spider benchmark, the gains were +2.10% over ex-BoN and +0.93% over Maj.


Key Findings and Insights

The study revealed several important insights:

ORMs are consistently trainable across different LLM families and scales, showing comparable accuracies with minimal variance.
Dataset balancing (ensuring an equal split of correct and incorrect labels during training) not only reduces data requirements and training time but can also yield competitive, and sometimes superior, performance.
ORMs demonstrate greater robustness on complex queries compared to traditional methods and benefit more substantially from an increased number of candidate queries.
The effectiveness of ORMs is not diminished by scaling the underlying generator model to larger parameter sizes; they continue to provide robust gains.
Prompt design for ORM training is crucial, with a “SQL-only” prompt (exposing the verifier directly to SQL structures) proving most effective, especially for moderate and challenging queries.
Scaling the ORM verifier beyond 7B parameters yields only marginal and inconsistent improvements, suggesting that smaller to medium-sized verifiers are sufficient for practical deployment.
Autoregressive fine-tuning consistently delivers superior results compared to Binary Cross-Entropy (BCE) training, establishing it as the most effective approach for aligning ORMs with Text-to-SQL verification.
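The dataset balancing finding can be illustrated with a short sketch. It assumes balancing is done by downsampling the larger class to the size of the smaller one, which fits the reported reduction in data and training time but is not spelled out here; the `balance` helper and the example data are hypothetical.

```python
import random

def balance(examples, seed=0):
    """Downsample to an equal number of correct (1) and incorrect (0) examples."""
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    k = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

# Toy labeled pool: five correct candidates and two incorrect ones.
examples = [(f"query_{i}", 1) for i in range(5)] + [(f"query_{i}", 0) for i in range(5, 7)]
balanced = balance(examples)
```

With five correct and two incorrect examples, the balanced set keeps two of each, so the ORM sees both labels equally often during fine-tuning.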

In conclusion, GradeSQL offers a principled and semantically aligned mechanism for candidate selection in Text-to-SQL, moving beyond surface-level heuristics. The researchers have publicly released all code, datasets, and trained models to support reproducibility and encourage further research in this area.
