TLDR: LLMSQL is a systematically revised and cleaned version of the WikiSQL dataset, designed to be a more reliable benchmark for evaluating Large Language Models (LLMs) on Text-to-SQL tasks. The original WikiSQL suffered from data type mismatches, case-sensitivity inconsistencies, and a non-intuitive SQL format. LLMSQL addresses these through automated and manual corrections, providing clean natural language questions and standard SQL queries. Evaluations show that LLMs perform well on the cleaned dataset, with even smaller models reaching high execution accuracy after fine-tuning, which makes LLMSQL a useful resource for work on natural language database interfaces.
The ability to convert natural language questions into structured database queries, known as Text-to-SQL, is a crucial technology for allowing everyday users to interact with complex databases without needing to learn programming languages like SQL. For a long time, the WikiSQL dataset was a cornerstone for research in this area, providing over 80,000 pairs of questions and their corresponding SQL queries.
However, despite its widespread use, WikiSQL has faced significant challenges that have limited its effectiveness for modern research, especially with the rise of large language models (LLMs). These issues include inconsistencies in case sensitivity, mismatches in data types, syntax errors in the provided SQL, and questions that, even with correct queries, yield no answers. These problems often led to misleading performance metrics for models and poor generalization in real-world applications.
A new research paper introduces LLMSQL, a comprehensive overhaul of the WikiSQL dataset designed for the LLM era. The goal is not just to update an old resource but to create a reliable, LLM-ready benchmark that provides clean natural language questions and full SQL queries in plain text. This makes it much easier for modern Text-to-SQL models to generate queries and for researchers to evaluate them.
The researchers systematically analyzed WikiSQL to pinpoint the core issues affecting query execution accuracy. They then implemented both automated and manual methods to clean and re-annotate the dataset. This involved addressing incomplete information in tables by manually adding missing column names, resolving datatype conflicts by converting values to their correct types (e.g., numbers stored as strings to actual numbers), and removing duplicate tables and questions to ensure genuine diversity.
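As a rough illustration of two of these cleaning steps, the sketch below converts numbers stored as strings into real numbers and drops repeated questions. It is not the authors' pipeline: the function names, the `table_id` and `question` field names, the comma-stripping rule, and the whitespace and case normalization used to detect duplicates are all assumptions made for this example.

```python
import re

# Patterns for plain integers and decimals (assumption: no exponents or units).
_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+)")


def coerce_value(value):
    """Turn a numeric string such as '1,234' or '3.5' into int/float.

    Non-strings and strings that are not purely numeric are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    # Assumption: commas are thousands separators and can be stripped.
    text = value.strip().replace(",", "")
    if _INT.fullmatch(text):
        return int(text)
    if _FLOAT.fullmatch(text):
        return float(text)
    return value


def drop_duplicate_questions(examples):
    """Keep the first occurrence of each (table, normalized question) pair."""
    seen = set()
    unique = []
    for example in examples:
        normalized = " ".join(example["question"].lower().split())
        key = (example["table_id"], normalized)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


cleaned_row = [coerce_value(v) for v in ["1,234", "3.5", "Paris, France"]]
unique_examples = drop_duplicate_questions([
    {"table_id": "t1", "question": "Who won in 1999?"},
    {"table_id": "t1", "question": "who won  in 1999?"},
    {"table_id": "t2", "question": "Who won in 1999?"},
])
```

A real cleaning pass would need more care than this, for example with values that carry units or with locales that use a comma as the decimal separator.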
A significant effort was made to tackle the problem of empty query results. Nearly half of the original WikiSQL queries returned no results, with a large portion of these due to simple case sensitivity mismatches in string literals. The LLMSQL cleaning process automatically adjusted the case of SQL string literals to match the natural language questions and table values, drastically reducing these empty results. Furthermore, the non-intuitive, numeric-placeholder SQL format of WikiSQL was replaced with standard, human-readable SQL queries, making the dataset compatible with any standard SQL database.
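The two fixes described here can be sketched in a few lines of Python with the standard `sqlite3` module. The example renders an index-based, WikiSQL-style query as plain SQL, shows that a lowercase string literal returns nothing under SQLite's default case-sensitive comparison, and then replaces the literal with the spelling actually stored in the table. The operator lists follow the ordering commonly documented for WikiSQL; the table, the helper names `render_sql` and `fix_literal_case`, and the matching strategy are assumptions for illustration, not the paper's implementation.

```python
import sqlite3

# Operator orderings as commonly documented for WikiSQL (treat as an assumption).
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]


def render_sql(query, columns, table="t"):
    """Render an index-based query dict as a plain SQL string."""
    column = '"' + columns[query["sel"]] + '"'
    agg = AGG_OPS[query["agg"]]
    select = f"{agg}({column})" if agg else column
    conditions = []
    for col_idx, op_idx, value in query["conds"]:
        if isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Quote strings and escape embedded single quotes.
            literal = "'" + str(value).replace("'", "''") + "'"
        conditions.append(f'"{columns[col_idx]}" {COND_OPS[op_idx]} {literal}')
    sql = f'SELECT {select} FROM "{table}"'
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


def fix_literal_case(conn, table, column, literal):
    """Return the stored spelling of `literal` if a case-insensitive match exists."""
    row = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" = ? COLLATE NOCASE LIMIT 1',
        (literal,),
    ).fetchone()
    return row[0] if row else literal


# Hypothetical table and query, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "t" ("Player" TEXT, "Team" TEXT)')
conn.executemany(
    'INSERT INTO "t" VALUES (?, ?)',
    [("Ann Lee", "Toronto"), ("Bo Chen", "Boston")],
)
columns = ["Player", "Team"]
query = {"sel": 0, "agg": 0, "conds": [[1, 0, "toronto"]]}

sql_before = render_sql(query, columns)
rows_before = conn.execute(sql_before).fetchall()  # no match: 'toronto' != 'Toronto'

query["conds"][0][2] = fix_literal_case(conn, "t", "Team", "toronto")
sql_after = render_sql(query, columns)
rows_after = conn.execute(sql_after).fetchall()  # matches the stored 'Toronto'
```

Note that SQLite's built-in `NOCASE` collation only folds ASCII letters, so a production version would need a different comparison for non-ASCII values.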
While LLMSQL resolves many critical issues, the authors acknowledge some remaining challenges, such as occasional aggregation mismatches, where the operator in the provided SQL does not match the operation the question asks for (e.g., SUM versus COUNT). These semantic problems are harder to fix automatically and often require contextual understanding.
To demonstrate the impact of these improvements, the researchers evaluated several large language models, including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1, and others, on the new LLMSQL benchmark. They tested models in 0-shot, 1-shot, and 5-shot settings, where models receive no, one, or five examples, respectively, to guide their query generation. The evaluation used execution accuracy, meaning a query was considered correct if it produced the exact same results as the ground truth query when run against an SQLite database.
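Execution accuracy can be sketched as follows: run the predicted and the ground-truth query against the same SQLite database and count the prediction as correct when both return the same rows. This is a minimal sketch rather than the paper's evaluation code. Comparing rows as an unordered multiset and treating any query that fails to execute as wrong are assumptions, and the table and queries are invented for the example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute `sql` and return its rows as a multiset, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except (sqlite3.Error, sqlite3.Warning):
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    correct = 0
    for predicted, gold in pairs:
        gold_rows = run_query(conn, gold)
        predicted_rows = run_query(conn, predicted)
        # A prediction that does not execute never counts as correct.
        if predicted_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Hypothetical table and queries, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)", [("a", 10), ("b", 20), ("c", 20)]
)

pairs = [
    # Same rows in a different order: counted as correct.
    ("SELECT name FROM scores WHERE points = 20 ORDER BY name DESC",
     "SELECT name FROM scores WHERE points = 20"),
    # Wrong aggregation: different result.
    ("SELECT COUNT(name) FROM scores", "SELECT SUM(points) FROM scores"),
    # Syntax error in the prediction.
    ("SELEC name FROM scores", "SELECT name FROM scores"),
    # Different SQL text, same result.
    ("SELECT MAX(points) FROM scores",
     "SELECT points FROM scores WHERE name = 'b'"),
]
accuracy = execution_accuracy(conn, pairs)
```

Because the metric compares results rather than query text, two differently written queries can both be counted as correct, as the last pair shows.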
The findings showed that model performance generally improved with more examples (from 0-shot to 5-shot), highlighting the benefit of in-context learning for LLMs. Larger models like DeepSeek R1 and OpenAI o4-mini achieved high accuracies, often above 85%, even in 0-shot settings, suggesting they can follow the task instructions without examples. Fine-tuning experiments further showed that even relatively small models could exceed 90% execution accuracy when trained specifically on the LLMSQL dataset.
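To make the difference between the 0-shot, 1-shot, and 5-shot settings concrete, here is a minimal sketch of how such a prompt could be assembled. The prompt wording, the `build_prompt` helper, and the example questions are hypothetical; the article does not describe the prompt template used in the paper.

```python
def build_prompt(question, schema, examples, k):
    """Build a k-shot Text-to-SQL prompt from at most k worked examples."""
    parts = [
        "Translate the question into a SQL query for the table below.",
        f"Table schema: {schema}",
    ]
    # Each demonstration pairs a question with its reference SQL.
    for example_question, example_sql in examples[:k]:
        parts.append(f"Question: {example_question}\nSQL: {example_sql}")
    # The target question is left open for the model to complete.
    parts.append(f"Question: {question}\nSQL:")
    return "\n\n".join(parts)


# Hypothetical demonstrations, invented for this example.
examples = [
    ("How many players are listed?", 'SELECT COUNT("Player") FROM "t"'),
    ("Which team does Ann Lee play for?",
     'SELECT "Team" FROM "t" WHERE "Player" = \'Ann Lee\''),
]
schema = '"t" ("Player" TEXT, "Team" TEXT)'

zero_shot = build_prompt("Who plays for Boston?", schema, examples, k=0)
one_shot = build_prompt("Who plays for Boston?", schema, examples, k=1)
```

With `k=0` the model sees only the instruction, the schema, and the question; each additional example adds one question and SQL pair ahead of the target question.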
The authors emphasize that LLMSQL, despite focusing on single-table queries, remains a highly relevant benchmark. Real-world SQL usage often involves simpler query patterns, making mastery of these fundamental clauses critical for practical deployment. LLMSQL is presented as a complementary resource to other benchmarks like Spider (for multi-table queries) and BIRD (for noisy annotations), providing a large-scale, single-table benchmark with validated annotations.
The research paper, available at arxiv.org/pdf/2510.02350, concludes by outlining future work, including adding more questions to tables, introducing JOIN queries for increased complexity, incorporating new data types like dates, and expanding multilingual support. These enhancements aim to further develop LLMSQL into a comprehensive and practical resource for advancing natural language interfaces to databases.


