TLDR: This paper introduces a new framework that combines Group Relative Policy Optimization (GRPO) with a multilingual contrastive reward signal to improve Text-to-SQL systems across different languages. By focusing on semantic alignment in addition to execution correctness, the method enables a smaller Llama-3 3B model to achieve higher execution accuracy than a larger Llama-3 8B zero-shot model, and significantly boosts semantic accuracy, especially in non-English languages, using only 3,000 training examples.
A new research paper introduces a framework for improving the accuracy and semantic fidelity of Text-to-SQL systems, especially across multiple languages. Titled “Bridging the Semantic Gap: Contrastive Rewards for Multilingual Text-to-SQL,” the work addresses a weakness of current Text-to-SQL methods: the semantic alignment between a user’s natural language query and the generated SQL query, particularly beyond English. Existing systems lose about 6 percentage points of performance on average when handling non-English languages. This research aims to close that gap by ensuring that the generated SQL not only executes correctly but also reflects the user’s original intent.
The core of the proposed solution combines Group Relative Policy Optimization (GRPO) with a novel multilingual contrastive reward signal. GRPO is a reinforcement learning algorithm for fine-tuning language models in a stable and memory-efficient way. Rather than training a separate value model, as PPO does, it samples a group of candidate outputs for each prompt and scores each one relative to the rest of the group. The researchers adapted GRPO, which was originally developed for mathematical reasoning tasks, to the Text-to-SQL domain.
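To make the “group relative” idea concrete, here is a minimal sketch of how GRPO-style advantages are commonly computed: each reward is normalized against the mean and standard deviation of the rewards for the same prompt. The function name, the `eps` value, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    Hypothetical helper: the name and eps are assumptions for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four SQL candidates sampled for one question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group average get a positive advantage and are reinforced; those below average are discouraged. No separate value network is needed to supply the baseline.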
The key addition is a contrastive reward signal. Instead of relying solely on whether a generated SQL query executes without errors (a binary “correct” or “incorrect” signal), this reward provides continuous feedback on how closely the generated SQL’s meaning aligns with the user’s natural language query. The semantic reward is computed using a specially trained multilingual contrastive encoder, built upon XLM-RoBERTa-base. This encoder creates embeddings (numerical representations) for both the input question and the gold-standard SQL, then calculates their cosine similarity. A higher similarity score indicates better semantic alignment, guiding the model to capture the user’s intent more accurately.
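The cosine-similarity step can be sketched in a few lines. The vectors below are toy three-dimensional placeholders standing in for encoder outputs (real XLM-RoBERTa-base embeddings are much larger), and the function and variable names are assumptions made for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as "no alignment"
    return dot / (norm_u * norm_v)


# Toy vectors standing in for encoder embeddings (not real model outputs).
question_emb = [0.2, 0.7, 0.1]
aligned_sql_emb = [0.25, 0.65, 0.05]
unrelated_sql_emb = [0.9, -0.1, 0.3]

semantic_reward_aligned = cosine_similarity(question_emb, aligned_sql_emb)
semantic_reward_unrelated = cosine_similarity(question_emb, unrelated_sql_emb)
```

Because the score varies continuously between -1 and 1, two queries that both execute can still receive different rewards, which is what a binary execution signal cannot provide.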
The framework integrates several feedback signals during training: an execution reward (binary, for correct results), a syntax reward (for executable queries), a schema-matching reward (for correct table and column usage), and the semantic reward. By combining these, the model learns to produce SQL that is not only syntactically valid and executable but also faithful to the user’s meaning across different languages.
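One simple way to combine the four signals is a weighted sum, sketched below. The summary does not say how the paper weights or combines them, so the weights, the class and function names, and the choice to score schema matching as a fraction are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    execution: float  # 1.0 if the results match the reference, else 0.0
    syntax: float     # 1.0 if the query executes at all, else 0.0
    schema: float     # assumed here: share of correct tables/columns in [0, 1]
    semantic: float   # cosine similarity from the contrastive encoder


# Hypothetical weights; the actual values are not given in this summary.
WEIGHTS = {"execution": 1.0, "syntax": 0.5, "schema": 0.5, "semantic": 1.0}


def total_reward(signals, weights=WEIGHTS):
    """Weighted sum of the four feedback signals."""
    return (
        weights["execution"] * signals.execution
        + weights["syntax"] * signals.syntax
        + weights["schema"] * signals.schema
        + weights["semantic"] * signals.semantic
    )


# Two made-up candidates that both execute correctly but differ in meaning.
faithful = RewardSignals(execution=1.0, syntax=1.0, schema=1.0, semantic=0.9)
loose = RewardSignals(execution=1.0, syntax=1.0, schema=1.0, semantic=0.4)
```

With these made-up numbers, `total_reward(faithful)` exceeds `total_reward(loose)` even though both candidates pass the execution check, so the semantic term is what separates them.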
Experiments were conducted on the MultiSpider dataset, which includes parallel queries in seven languages: Vietnamese, Spanish, Japanese, German, English, Chinese, and French. The researchers fine-tuned a Llama-3 3B model using their approach (L3B-GRPO-C). With the contrastive reward, the fine-tuned model achieved an average execution accuracy of 88.86% and a semantic accuracy of 59.14%. That is an improvement of +27.43 percentage points in execution accuracy and +39.71 percentage points in semantic accuracy over the zero-shot Llama-3 3B baseline, which implies baseline scores of about 61.43% and 19.43% respectively.
The fine-tuned Llama-3 3B model (L3B-GRPO-C) also outperformed a much larger zero-shot Llama-3 8B model in average execution accuracy (88.86% vs. 81.43%, a margin of 7.43 percentage points). While the 8B model still held a lead in semantic accuracy, the 3B model significantly narrowed the gap. This demonstrates that targeted fine-tuning with semantic awareness can enable smaller, more resource-efficient models to match or exceed much larger models used zero-shot. The method achieved these gains with only 3,000 reinforcement learning training examples, highlighting its sample efficiency.
A qualitative example from Vietnamese queries illustrates the effect of the contrastive reward. In one instance, a model without the contrastive reward generated SQL that executed successfully but had subtle semantic inaccuracies (e.g., using `>=` instead of `>` and not counting distinct movies). The model trained with the contrastive reward correctly captured these nuances, producing SQL that matched the user’s intent, even though both queries yielded the same results on a specific test database state. This shows that execution accuracy alone can be misleading, and that the semantic reward captures a deeper level of correctness.
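The effect is easy to reproduce with a toy database. The sketch below is not the paper’s Vietnamese example. The `ratings` table, its rows, and both queries are invented here to show how a `>=` versus `>` slip and a missing `DISTINCT` can go unnoticed on one database state and surface on another.

```python
import sqlite3

# Hypothetical question: "How many distinct movies received more than 3 stars?"
INTENDED = "SELECT COUNT(DISTINCT movie) FROM ratings WHERE stars > 3"
LOOSE = "SELECT COUNT(movie) FROM ratings WHERE stars >= 3"


def run_both(rows):
    """Run both queries on a fresh in-memory database holding `rows`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ratings (reviewer TEXT, movie TEXT, stars INTEGER)")
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    intended = conn.execute(INTENDED).fetchone()[0]
    loose = conn.execute(LOOSE).fetchone()[0]
    conn.close()
    return intended, loose


# State A: no movie is rated twice and no rating is exactly 3.
state_a = [("ann", "Alien", 5), ("bob", "Heat", 4), ("cat", "Up", 2)]
# State B: adds a second rating for "Alien" and a rating of exactly 3.
state_b = state_a + [("dan", "Alien", 4), ("eve", "Up", 3)]

result_a = run_both(state_a)  # (2, 2): the queries agree
result_b = run_both(state_b)  # (2, 4): the queries disagree
```

An execution-based check that only sees state A would mark the loose query as correct, which is the blind spot the semantic reward is meant to cover.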
Ablation studies further confirmed the importance of the contrastive reward and the choice of the XLM-RoBERTa encoder. Removing the contrastive reward or using a less capable encoder significantly reduced the semantic accuracy gains. This research paves the way for more accessible and resource-efficient high-quality multilingual Text-to-SQL systems, enabling users worldwide to interact with databases in their native languages with greater precision.


