TLDR: The Falcon benchmark is a new, comprehensive dataset designed to evaluate Text-to-SQL models for Chinese enterprise environments. It features 600 Chinese questions across 28 databases, focusing on complex multi-table reasoning and supporting enterprise-specific SQL dialects like MaxCompute/Hive. Falcon addresses critical challenges such as schema linking in large, denormalized databases and the nuanced translation of colloquial Chinese into precise SQL. Initial evaluations show that even state-of-the-art large language models achieve less than 50% accuracy, highlighting significant areas for improvement in handling real-world enterprise data and Chinese linguistic complexities.
In the rapidly evolving landscape of data analytics, making structured data accessible to non-technical users is a significant challenge. Text-to-SQL, the task of translating human-language questions into executable SQL commands, is crucial for democratizing data exploration and accelerating decision-making within organizations. However, real-world enterprise environments introduce unique complexities that often stump even the most advanced AI models.
A new research paper introduces “Falcon,” a groundbreaking benchmark designed to rigorously evaluate Chinese Text-to-SQL systems specifically for enterprise-grade applications. Developed by researchers from Ant Group, Falcon aims to bridge the gap between academic benchmarks and the practical demands of production workloads.
Addressing Enterprise Realities and Chinese Nuances
Existing Text-to-SQL benchmarks have driven steady progress, but they often fall short in capturing the specific challenges of enterprise settings, especially when dealing with the Chinese language. Enterprise databases are typically vast, with hundreds of tables, denormalized fields, ambiguous column names, and domain-specific synonyms. Analysts often use concise, colloquial Chinese queries, incorporating business jargon and implicit meanings, which are difficult for models to translate into precise SQL operators and predicates.
Falcon directly tackles these issues. It features 600 Chinese questions spread across 28 diverse databases, with a significant portion (77%) requiring complex multi-table reasoning, and over half involving joins across more than four tables. The benchmark is meticulously annotated with both SQL-computation features and Chinese-specific semantics, allowing for detailed error analysis and model diagnostics.
A key innovation of Falcon is its support for enterprise-compatible SQL dialects like MaxCompute/Hive, which are widely used in large organizations. This ensures that models are evaluated on their ability to generate SQL that is not only semantically correct but also executable in real-world systems.
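Dialect compatibility means a model's output must actually parse and run on the target engine, not just on a generic SQL grammar. One concrete difference is identifier quoting: Hive and MaxCompute use backticks rather than ANSI double quotes. As a hedged illustration (this is not Falcon's actual tooling, and the function name and rules are assumptions), a minimal normalization pass might rewrite ANSI-quoted identifiers to backticks before attempting execution:

```python
def ansi_to_hive_identifiers(sql: str) -> str:
    """Illustrative normalization: rewrite ANSI double-quoted identifiers
    ("user id") to Hive/MaxCompute-style backticks (`user id`).
    Single-quoted string literals are left untouched. A real pipeline
    would use a full SQL parser; this sketch ignores escaped quotes."""
    out = []
    in_string = False  # inside a single-quoted string literal?
    for ch in sql:
        if ch == "'":
            in_string = not in_string
            out.append(ch)
        elif ch == '"' and not in_string:
            out.append('`')  # identifier quote -> backtick
        else:
            out.append(ch)
    return ''.join(out)
```

Normalizations like this only cover surface syntax; functions, date handling, and other semantics still differ per dialect, which is why Falcon evaluates against the enterprise dialect directly.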
A Hybrid Approach to Data Sourcing and Robust Evaluation
To ensure both breadth and realism, Falcon employs a hybrid data sourcing strategy. It integrates 500 questions from curated public Kaggle datasets with 100 enterprise-inspired synthetic cases derived from anonymized Ant Group query patterns. This blend provides a diverse range of domains (finance, e-commerce, retail, etc.) while also capturing authentic business scenarios and linguistic phenomena prevalent in enterprise queries.
The evaluation system for Falcon is particularly robust. It includes a schema-aware SQL comparator that can handle common syntactic variations (like column aliases or projection ordering) while strictly verifying semantic equivalence. This content-based evaluation method ensures that models are judged on the correctness of their results, not just the exact textual match of the SQL query.
Current Performance and Future Directions
The initial evaluation of 13 state-of-the-art large language models, including DeepSeek, on the Falcon benchmark revealed significant challenges. The best-performing model achieved an Exact Result Accuracy of only 45.2%, with the weakest reaching 20.2%. No model managed to exceed 50% accuracy, indicating substantial room for improvement.
Analysis of the results showed that accuracy sharply declines as the complexity of table joins increases. Queries involving four or more tables saw accuracy drop to just 21.43%, highlighting schema linking and multi-hop key propagation as major bottlenecks. This pattern is characteristic of enterprise schemas, where wide joins expose numerous semantically similar columns and create long dependency chains that challenge model reasoning.
The Falcon benchmark, detailed in the research paper available at arXiv:2510.24762, provides a critical tool for advancing Chinese Text-to-SQL capabilities. The findings underscore the need for models to develop better relation-aware schema encoding, lightweight preprocessing for Chinese linguistic complexities (ellipsis, coreference, business shorthand), and dialect-aware constrained decoding to truly excel in real-world enterprise applications.


