TLDR: The FINCH paper introduces a large-scale financial Text-to-SQL dataset (FINCH) with 292 tables and 75,725 natural language–SQL pairs. It also proposes a finance-oriented evaluation metric (FINCH Score) to better assess model performance in financial contexts. The paper benchmarks various language models, showing that domain-specific fine-tuning can outperform larger general-purpose models, and highlights persistent challenges in schema grounding and compositional reasoning.
The ability to translate everyday language into complex database queries, known as Text-to-SQL, has been a long-standing goal in natural language processing. While the field has made considerable progress, applying it to the financial sector presents unique challenges: financial databases are often complex, they rely on specialized terminology, and query errors can have serious consequences. A critical gap has persisted due to the lack of a large-scale, dedicated financial dataset to drive research in this domain.
To address this, researchers have introduced a new curated financial dataset called FINCH. This comprehensive dataset includes 292 tables and an impressive 75,725 pairs of natural language questions and their corresponding SQL queries. This resource is designed to facilitate both the fine-tuning of models and their rigorous evaluation within the financial context.
Building on this new dataset, the researchers benchmarked various reasoning models and language models of different sizes. This systematic analysis sheds light on the strengths and limitations of these models when performing Text-to-SQL tasks in finance. A key finding was that domain-specific fine-tuning can enable smaller models to achieve performance comparable to, or even surpass, much larger general-purpose language models.
Furthermore, the paper proposes a new evaluation metric specifically tailored for finance: the FINCH Score. This metric is designed to capture nuances often overlooked by existing measures, offering a more accurate assessment of model performance in financial applications. Traditional metrics like Exact Matching and Execution Accuracy can be overly strict, penalizing minor differences that hold no financial significance. The FINCH Score integrates component matching and execution accuracy with weighted scoring and tolerance thresholds. This means it gives more importance to critical SQL clauses like WHERE, JOIN, GROUP BY, HAVING, and AGG, which encode essential business rules and compliance filters. It also allows for small, financially immaterial deviations in results, aligning with the principle of materiality in accounting and regulatory standards.
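To make the idea concrete, here is a minimal sketch of a weighted, tolerance-aware score in the spirit of the FINCH Score. The clause weights, the blending factor `alpha`, and the 0.5% materiality tolerance are illustrative assumptions, not the paper's actual values.

```python
# Illustrative clause weights: clauses encoding business rules and
# compliance filters (WHERE, JOIN, GROUP BY, HAVING, AGG) count more.
# These numbers are assumptions for the sketch, not the paper's.
CLAUSE_WEIGHTS = {
    "SELECT": 1.0,
    "FROM": 1.0,
    "WHERE": 2.0,
    "JOIN": 2.0,
    "GROUP BY": 2.0,
    "HAVING": 2.0,
    "AGG": 2.0,
}

def component_score(pred_clauses, gold_clauses):
    """Weighted fraction of gold clauses reproduced exactly by the prediction."""
    total = sum(CLAUSE_WEIGHTS[c] for c in gold_clauses)
    if total == 0:
        return 1.0
    matched = sum(
        CLAUSE_WEIGHTS[c]
        for c, text in gold_clauses.items()
        if pred_clauses.get(c) == text
    )
    return matched / total

def execution_score(pred_value, gold_value, rel_tol=0.005):
    """Execution accuracy with a materiality tolerance on numeric results:
    deviations within rel_tol of the gold value still count as correct."""
    if gold_value == 0:
        return 1.0 if pred_value == 0 else 0.0
    return 1.0 if abs(pred_value - gold_value) / abs(gold_value) <= rel_tol else 0.0

def finch_like_score(pred_clauses, gold_clauses, pred_value, gold_value, alpha=0.5):
    """Blend component matching and tolerant execution accuracy."""
    return (alpha * component_score(pred_clauses, gold_clauses)
            + (1 - alpha) * execution_score(pred_value, gold_value))
```

Under such a scheme, a query that gets the heavily weighted WHERE filter wrong is penalized more than one that only mislabels a SELECT column, while a numerically immaterial difference in the result does not count as a failure.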
The FINCH dataset was constructed by consolidating and refining existing resources such as BIRD, Spider, BULL, and BookSQL. A careful screening process ensured that only databases relevant to financial contexts, such as retail sales, card transactions, banking services, loans, insurance, and e-commerce, were retained. This meticulous curation involved validating every SQL query against its associated SQLite database, uncovering and correcting numerous anomalies that would have otherwise compromised the dataset’s reliability.
The final FINCH dataset comprises 33 databases, 292 tables, 2,233 columns, and 177 relations, totaling 75,725 natural language–SQL pairs. It spans a range of query difficulties, from easy to hard, and makes extensive use of the complex SQL operations crucial for financial reasoning.
The benchmarking experiments evaluated large-scale language models (like Qwen3-235B-A22B and GPT-OSS-120B), medium and small-scale open-source models (Qwen3-8B and GPT-OSS-20B), and reasoning-centric models (Phi-4-mini-reasoning and Arctic-Text2SQL-R1-7B). The results indicated that GPT-OSS-120B achieved the strongest overall performance. However, Arctic-Text2SQL-R1-7B, despite its smaller size, emerged as the third-best performer, underscoring the significant benefits of domain-specific fine-tuning.
Analysis of error distribution across SQL clauses revealed that most errors occurred in SELECT, FROM, and WHERE clauses, highlighting persistent challenges in accurately mapping natural language queries to complex database schemas. Models also showed a sharp decline in accuracy when moving from easy to medium and hard queries, indicating limitations in compositional reasoning and multi-table joins.
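A clause-level error analysis like the one described can be approximated with a simple keyword split, comparing predicted and gold SQL clause by clause. This is an illustrative sketch, not the paper's tooling; a real implementation would use a proper SQL parser.

```python
import re
from collections import Counter

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]
_pattern = re.compile(
    r"\b(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY)\b", re.IGNORECASE
)

def split_clauses(sql):
    """Map each top-level clause keyword to its whitespace-normalized text."""
    parts = _pattern.split(sql)
    clauses = {}
    for i in range(1, len(parts) - 1, 2):
        clauses[parts[i].upper()] = " ".join(parts[i + 1].split()).strip()
    return clauses

def clause_errors(predictions, golds):
    """Count, per clause, how often the prediction disagrees with the gold SQL."""
    errors = Counter()
    for pred, gold in zip(predictions, golds):
        p, g = split_clauses(pred), split_clauses(gold)
        for clause in CLAUSES:
            if p.get(clause) != g.get(clause):
                errors[clause] += 1
    return errors
```

Aggregating such counts over a benchmark is what reveals where models stumble; per the paper's findings, the SELECT, FROM, and WHERE clauses dominate the error distribution.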
In conclusion, FINCH represents a significant step forward in financial Text-to-SQL research. It provides a robust, large-scale, finance-specific benchmark and a tailored evaluation metric that better reflects the practical requirements of the financial domain. This work lays a foundation for developing more reliable and accurate systems to support financial analysts, auditors, and policymakers in critical decision-making. For more details, you can refer to the research paper.


