TLDR: The FINCH paper introduces a large-scale financial Text-to-SQL dataset (FINCH) with 292 tables and 75,725 natural language–SQL pairs. It also proposes a finance-oriented evaluation metric (FINCH Score) to better assess model performance in financial contexts. The paper benchmarks various language models, showing that domain-specific fine-tuning can outperform larger general-purpose models, and highlights persistent challenges in schema grounding and compositional reasoning.
The ability to translate everyday language into complex database queries, known as Text-to-SQL, has been a long-standing goal in natural language processing. While the field has made considerable progress, applying it to the financial sector presents unique challenges: financial databases are often complex, they rely on specialized terminology, and query errors can have serious consequences. A critical gap has persisted due to the lack of a large-scale, dedicated financial dataset to drive research in this domain.
To address this, researchers have introduced a new curated financial dataset called FINCH. This comprehensive dataset includes 292 tables and an impressive 75,725 pairs of natural language questions and their corresponding SQL queries. This resource is designed to facilitate both the fine-tuning of models and their rigorous evaluation within the financial context.
Building on this new dataset, the researchers benchmarked various reasoning models and language models of different sizes. This systematic analysis sheds light on the strengths and limitations of these models when performing Text-to-SQL tasks in finance. A key finding was that domain-specific fine-tuning can enable smaller models to achieve performance comparable to, or even surpass, much larger general-purpose language models.
Furthermore, the paper proposes a new evaluation metric specifically tailored for finance: the FINCH Score. This metric is designed to capture nuances often overlooked by existing measures, offering a more accurate assessment of model performance in financial applications. Traditional metrics like Exact Matching and Execution Accuracy can be overly strict, penalizing minor differences that hold no financial significance. The FINCH Score integrates component matching and execution accuracy with weighted scoring and tolerance thresholds. This means it gives more importance to critical SQL clauses like WHERE, JOIN, GROUP BY, HAVING, and AGG, which encode essential business rules and compliance filters. It also allows for small, financially immaterial deviations in results, aligning with the principle of materiality in accounting and regulatory standards.
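To make the idea concrete, here is a minimal sketch of a weighted, tolerance-aware score in the spirit of the FINCH Score. The clause weights, the blending factor `alpha`, and the 0.5% materiality tolerance are illustrative assumptions, not the paper's actual values.

```python
# Illustrative clause weights: clauses encoding business rules and
# compliance filters (WHERE, JOIN, GROUP BY, HAVING, AGG) count more.
# These numbers are assumptions for the sketch, not the paper's.
CLAUSE_WEIGHTS = {
    "SELECT": 1.0,
    "FROM": 1.0,
    "WHERE": 2.0,
    "JOIN": 2.0,
    "GROUP BY": 2.0,
    "HAVING": 2.0,
    "AGG": 2.0,
}

def component_score(pred_clauses, gold_clauses):
    """Weighted fraction of gold clauses reproduced exactly by the prediction."""
    total = sum(CLAUSE_WEIGHTS[c] for c in gold_clauses)
    if total == 0:
        return 1.0
    matched = sum(
        CLAUSE_WEIGHTS[c]
        for c, text in gold_clauses.items()
        if pred_clauses.get(c) == text
    )
    return matched / total

def execution_score(pred_value, gold_value, rel_tol=0.005):
    """Execution accuracy with a materiality tolerance on numeric results:
    deviations within rel_tol of the gold value still count as correct."""
    if gold_value == 0:
        return 1.0 if pred_value == 0 else 0.0
    return 1.0 if abs(pred_value - gold_value) / abs(gold_value) <= rel_tol else 0.0

def finch_like_score(pred_clauses, gold_clauses, pred_value, gold_value, alpha=0.5):
    """Blend component matching and tolerant execution accuracy."""
    return (alpha * component_score(pred_clauses, gold_clauses)
            + (1 - alpha) * execution_score(pred_value, gold_value))
```

Under such a scheme, a query that gets the heavily weighted WHERE filter wrong is penalized more than one that only mislabels a SELECT column, while a numerically immaterial difference in the result does not count as a failure.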
The FINCH dataset was constructed by consolidating and refining existing resources such as BIRD, Spider, BULL, and BookSQL. A careful screening process ensured that only databases relevant to financial contexts, such as retail sales, card transactions, banking services, loans, insurance, and e-commerce, were retained. This meticulous curation involved validating every SQL query against its associated SQLite database, uncovering and correcting numerous anomalies that would have otherwise compromised the dataset’s reliability.
The final FINCH dataset comprises 33 databases, 292 tables, 2,233 columns, and 177 relations, totaling 75,725 natural language–SQL pairs. It spans a range of query difficulties, from easy to hard, and makes extensive use of the complex SQL operations crucial for financial reasoning.
The benchmarking experiments evaluated large-scale language models (like Qwen3-235B-A22B and GPT-OSS-120B), medium and small-scale open-source models (Qwen3-8B and GPT-OSS-20B), and reasoning-centric models (Phi-4-mini-reasoning and Arctic-Text2SQL-R1-7B). The results indicated that GPT-OSS-120B achieved the strongest overall performance. However, Arctic-Text2SQL-R1-7B, despite its smaller size, emerged as the third-best performer, underscoring the significant benefits of domain-specific fine-tuning.
Analysis of error distribution across SQL clauses revealed that most errors occurred in SELECT, FROM, and WHERE clauses, highlighting persistent challenges in accurately mapping natural language queries to complex database schemas. Models also showed a sharp decline in accuracy when moving from easy to medium and hard queries, indicating limitations in compositional reasoning and multi-table joins.
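A clause-level error analysis like the one described can be approximated with a simple keyword split, comparing predicted and gold SQL clause by clause. This is an illustrative sketch, not the paper's tooling; a real implementation would use a proper SQL parser.

```python
import re
from collections import Counter

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]
_pattern = re.compile(
    r"\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY)\b", re.IGNORECASE
)

def split_clauses(sql):
    """Map each top-level clause keyword to its whitespace-normalized text."""
    parts = _pattern.split(sql)
    clauses = {}
    for i in range(1, len(parts) - 1, 2):
        clauses[parts[i].upper()] = " ".join(parts[i + 1].split()).strip()
    return clauses

def clause_errors(predictions, golds):
    """Count, per clause, how often the prediction disagrees with the gold SQL."""
    errors = Counter()
    for pred, gold in zip(predictions, golds):
        p, g = split_clauses(pred), split_clauses(gold)
        for clause in CLAUSES:
            if p.get(clause) != g.get(clause):
                errors[clause] += 1
    return errors
```

Aggregating such counts over a benchmark is what reveals where models stumble; per the paper's findings, the SELECT, FROM, and WHERE clauses dominate the error distribution.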
In conclusion, FINCH represents a significant step forward in financial Text-to-SQL research. It provides a robust, large-scale, finance-specific benchmark and a tailored evaluation metric that better reflects the practical requirements of the financial domain. This work lays a foundation for developing more reliable and accurate systems to support financial analysts, auditors, and policymakers in critical decision-making. For more details, you can refer to the research paper.


