TLDR: Thinkquel is a new AI model that converts natural language into dbt (Data Build Tool) code for data transformations. It uses a unique synthetic data pipeline (TS-SQL) to generate high-quality training data and a novel reinforcement learning objective (TS-GRPO) that optimizes planning and code generation separately. This approach leads to significantly improved accuracy and stability, making it easier for users to query complex databases without specialized SQL expertise.
In the rapidly evolving landscape of artificial intelligence, the ability to translate natural language into executable code is a significant frontier. A new research paper introduces “Thinkquel,” a novel model designed to tackle the complex challenge of converting natural language requests into production-ready data transformations, specifically using dbt (Data Build Tool).
The paper, titled “Thinkquel: A Model Dedicated to Text-to-dbt Using Synthetic Data and a Span-Aware Objective,” highlights the inherent difficulties in this task. Unlike more forgiving programming languages, SQL and dbt demand extreme precision in schema linking, adherence to specific SQL dialects, and accurate query-level semantics. Even minor errors can lead to complete execution failure. Furthermore, creating high-quality, execution-validated training data is both expensive and scarce, making it difficult for large language models (LLMs) to learn effectively.
Thinkquel addresses these challenges through two primary innovations: a scalable, diverse synthetic data generation pipeline called TS-SQL, and a specialized training objective known as Token–Sequence GRPO (TS–GRPO).
The TS-SQL Data Pipeline: A Foundation of Quality Data
To overcome the scarcity of training data, the researchers developed TS-SQL, a rigorous pipeline that programmatically generates, refines, executes, and curates pairs of natural-language requests and dbt models. The pipeline starts by creating millions of dbt model configurations, systematically varying structural parameters and SQL features. These models are then executed against target databases, and any that raise errors or time out are filtered out.
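A rough sketch of that generate-execute-filter loop might look like the following; all names here, including the injected `run_sql` executor and the specific structural knobs, are hypothetical illustrations rather than the paper's actual implementation:

```python
import random

# Hypothetical structural knobs; the paper's real parameter space covers
# more axes (CTE depth, join count, SQL feature mix, etc.).
CTE_DEPTHS = [1, 2, 3, 4]
JOIN_COUNTS = [0, 1, 2, 3]
SQL_FEATURES = ["window_fn", "group_by", "having", "case_when", "subquery"]

def generate_configs(n):
    """Programmatically enumerate dbt model configurations."""
    for _ in range(n):
        yield {
            "cte_depth": random.choice(CTE_DEPTHS),
            "joins": random.choice(JOIN_COUNTS),
            "features": random.sample(SQL_FEATURES, k=random.randint(1, 3)),
        }

def execute_and_filter(models, run_sql, timeout_s=30):
    """Keep only models that execute cleanly within the timeout."""
    valid = []
    for model in models:
        try:
            rows = run_sql(model["compiled_sql"], timeout=timeout_s)
        except Exception:
            continue  # execution error: drop the model
        if rows is not None:  # assume run_sql returns None on timeout
            valid.append({**model, "rows": rows})
    return valid
```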
A crucial step is semantic refinement. Freshly generated models often carry generic placeholders such as ‘CTE1’ or ‘col1’; an advanced LLM (Qwen3-Coder-480B) rewrites these into meaningful identifiers that reflect each model’s logic. After refinement, models are re-executed to confirm they remain valid. Natural-language questions are then generated for each validated dbt model, varying in description style and syntax requirements.
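In code, this refine-then-revalidate step could be sketched as follows, with `llm_rename` and `run_sql` as hypothetical helpers standing in for the LLM call and the warehouse executor:

```python
def refine_and_revalidate(model, llm_rename, run_sql):
    """Replace placeholder names (CTE1, col1, ...) with meaningful
    identifiers, then re-execute to confirm the refined model is still valid.
    """
    refined_sql = llm_rename(
        model["compiled_sql"],
        instruction="Rename generic CTEs/columns to meaningful identifiers; "
                    "do not change the query's semantics.",
    )
    try:
        rows = run_sql(refined_sql, timeout=30)
    except Exception:
        return None  # refinement broke the model; discard it
    return {**model, "compiled_sql": refined_sql, "rows": rows}
```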
Quality control is paramount. Anthropic’s claude-sonnet-4-20250514 is employed to evaluate each question-model pair for clarity, semantic alignment, and technical correctness. Only pairs scoring 9/10 or higher pass the final filtering, ensuring a high standard of training data. This curated dataset is then partitioned, with examples yielding non-empty results feeding into reinforcement learning, and others supporting supervised fine-tuning.
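A sketch of this judging-and-partitioning step, with `llm_judge` as a hypothetical wrapper around the scoring model:

```python
def curate(pairs, llm_judge, threshold=9):
    """Score each (question, dbt model) pair on clarity, semantic alignment,
    and technical correctness; keep only high scorers; then partition by
    whether execution produced rows.
    """
    rl_split, sft_split = [], []
    for pair in pairs:
        score = llm_judge(pair["question"], pair["compiled_sql"])  # 1..10
        if score < threshold:
            continue
        # Non-empty results give a usable execution-match signal for RL.
        (rl_split if pair["rows"] else sft_split).append(pair)
    return rl_split, sft_split
```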
Thinkquel’s Training Methodology: Planning for Precision
Thinkquel’s training incorporates a unique “plan-before-SQL” approach. Instead of generating verbose, free-form thoughts, the model is trained to first produce a concise, structured plan. This plan explicitly lists source tables and columns in a YAML-like format before generating the final dbt code. This structured planning significantly improves schema grounding, reduces hallucinations, and makes the planning process verifiable, allowing for objective rewards based on plan quality.
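As an illustration, a plan-then-code output might look like the snippet below. The exact plan schema, the toy dbt model, and the adherence check are our own assumptions, not taken from the paper; the point is that a structured plan makes adherence mechanically checkable:

```python
import re

# Illustrative output format: a YAML-like plan, then the dbt model.
PLAN = """
tables: [orders, customers]
columns: [orders.amount, customers.region]
"""

DBT_MODEL = """
select c.region, sum(o.amount) as total_amount
from {{ ref('orders') }} o
join {{ ref('customers') }} c on o.customer_id = c.customer_id
group by c.region
"""

def plan_tables(plan: str) -> set[str]:
    """Parse the planned source tables out of the YAML-like plan."""
    match = re.search(r"tables:\s*\[([^\]]*)\]", plan)
    return {t.strip() for t in match.group(1).split(",")} if match else set()

def referenced_tables(sql: str) -> set[str]:
    """Tables pulled in via dbt's ref() macro."""
    return set(re.findall(r"ref\('([^']+)'\)", sql))

# Because the plan is structured, plan adherence is objectively verifiable,
# which is what enables reward signals based on plan quality.
assert referenced_tables(DBT_MODEL) <= plan_tables(PLAN)
```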
Training proceeds in two stages of supervised fine-tuning (SFT). The first stage establishes baseline text-to-dbt capability; the second refines it with plan-augmented instances mixed with general instruction-following data, helping the model retain broad conversational ability.
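Illustratively, that curriculum might be expressed as a configuration like the one below; the mixture names mirror the description above, but the proportions are invented for the sketch:

```python
# Hypothetical two-stage SFT curriculum configuration.
SFT_STAGES = [
    {
        "stage": 1,
        "goal": "base text-to-dbt capability",
        "mixture": {"ts_sql_pairs": 1.0},
    },
    {
        "stage": 2,
        "goal": "plan-augmented generation + retained general ability",
        "mixture": {"plan_augmented_ts_sql": 0.7, "general_instructions": 0.3},
    },
]
```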
Reinforcement Learning (RL) further enhances Thinkquel. The RL signal is a composite of multiple rewards, designed to align the model’s behavior with execution-grounded correctness and encourage good planning. These rewards include checks for correct formatting, accurate schema linking (tables and columns), adherence of the generated dbt to the plan, successful execution, and exact result matching against gold standards. These rewards are strategically split: SQL-span rewards focus on execution and result matching, while Plan-span rewards focus on format and schema linking.
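A hedged sketch of how such a span-split composite reward could be computed; the verifier bundle `checks`, the `run_sql` executor, and the equal weighting of terms are all assumptions for illustration (the paper's exact reward weighting and the span assignment of the plan-adherence term are not spelled out here):

```python
def composite_rewards(plan, dbt_sql, checks, run_sql, gold_rows):
    """Compute span-split rewards. `checks` bundles hypothetical verifiers
    (format, schema linking, plan adherence); `run_sql` executes the
    compiled model against the warehouse.
    """
    # Plan-span rewards: local, structural signals.
    plan_reward = checks.format(plan) + checks.schema_links(plan)

    # SQL-span rewards: sequence-level, execution-grounded signals.
    try:
        rows = run_sql(dbt_sql, timeout=30)
        executed = 1.0
    except Exception:
        rows, executed = None, 0.0
    sql_reward = (
        checks.plan_adherence(dbt_sql, plan)  # placement in this span is our guess
        + executed                            # runs without error
        + float(rows == gold_rows)            # exact-result match vs gold
    )
    return plan_reward, sql_reward
```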
Token–Sequence GRPO (TS–GRPO): Bridging the Granularity Gap
The core of Thinkquel’s advanced training lies in TS–GRPO, a novel span-aware reinforcement learning objective. Traditional methods suffer from a “granularity mismatch”: the most informative feedback (execution success) arrives at the sequence level, while policy updates are applied token by token, which leads to instability.
TS–GRPO addresses this by treating the model’s output as two distinct spans: a reasoning span (the plan) and an answer span (the dbt code). It computes separate advantages for rewards associated with each span. Crucially, it applies a sequence-level, length-normalized importance ratio only to the dbt code span, ensuring that the unit of credit assignment matches the sequence-level nature of SQL-related rewards. For the reasoning span, it retains token-level importance ratios, which are more suitable for local, structural signals like schema linking. This dual approach, along with support for asymmetric clipping (tighter for SQL, looser for plans), significantly reduces variance and prevents credit leakage between the plan and the code, leading to more stable and efficient learning.
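A minimal PyTorch-style sketch of this span-split objective, under our reading of the description above: `adv_plan` and `adv_sql` would be group-relative advantages computed separately from plan-span and SQL-span rewards, and the tensor names and clipping widths are illustrative assumptions, not the paper's values:

```python
import torch

def ts_grpo_loss(logp_new, logp_old, plan_mask, sql_mask, adv_plan, adv_sql,
                 eps_plan=0.4, eps_sql=0.2):
    """Span-aware surrogate loss. Shapes: log-probs and 0/1 float masks are
    (batch, seq_len); per-span group-relative advantages are (batch,).
    eps values encode the asymmetric clipping (looser plan, tighter SQL).
    """
    log_ratio = logp_new - logp_old

    # Plan span: token-level importance ratios suit local, structural rewards.
    r_tok = torch.exp(log_ratio)
    plan_obj = torch.minimum(
        r_tok * adv_plan[:, None],
        torch.clamp(r_tok, 1 - eps_plan, 1 + eps_plan) * adv_plan[:, None],
    )
    plan_loss = -(plan_obj * plan_mask).sum() / plan_mask.sum().clamp(min=1)

    # SQL span: one length-normalized sequence-level ratio per sample, so the
    # unit of credit assignment matches the sequence-level execution reward.
    sql_len = sql_mask.sum(dim=1).clamp(min=1)
    r_seq = torch.exp((log_ratio * sql_mask).sum(dim=1) / sql_len)
    sql_obj = torch.minimum(
        r_seq * adv_sql,
        torch.clamp(r_seq, 1 - eps_sql, 1 + eps_sql) * adv_sql,
    )
    sql_loss = -sql_obj.mean()

    return plan_loss + sql_loss
```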
Impressive Performance
Thinkquel demonstrates impressive results across various benchmarks. On the Spider dataset, TS–GRPO showed faster and steadier convergence of execution-match rewards compared to existing methods like GSPO and GRPO. On the 500-example TS–SQL test set, Thinkquel (32B) achieved 93.2% execution success and 61.8% exact-result match, a substantial improvement over the base model. Even on the out-of-domain BIRD-dbt dataset, Thinkquel maintained strong performance, reaching 73.5% match at 92.9% execution, proving its robustness and portability.
The researchers note that while the two-stage SFT curriculum provides the initial significant boost in capability, TS–GRPO plays a vital role in tightening execution-aligned optimization and closing the remaining performance gap. The explicit planning mechanism also provides measurable benefits in schema grounding and reduces error propagation.
Also Read:
- Generating Tailored Data for Text-to-SQL Systems: Introducing SING-SQL
- DATAMIND: A New Recipe for High-Performing Open-Source Data Analysis AI Agents
Looking Ahead
While Thinkquel represents a significant leap forward, the authors acknowledge areas for future improvement. Residual failures often stem from schema reference errors, indicating a need for even more robust schema linking. Future work will focus on wider dataset coverage, more realistic question styles, integrating reinforcement learning with tool use (e.g., schema inspection), and extending evaluation across multiple data warehouses to enhance cross-warehouse portability. You can read the full research paper here.