TLDR: A new research paper introduces ‘siwarex,’ a declarative system that significantly improves how Large Language Models (LLMs) handle natural language queries over diverse data sources, including both databases and APIs. By treating APIs as User Defined Functions within SQL, the system unifies data access and leverages SQL’s optimization capabilities. Experiments on new benchmarks show this declarative method outperforms imperative and agent-based approaches in accuracy and robustness for heterogeneous data environments.
Industrial users increasingly want to ask a question in natural language and get an answer drawn from several structured data sources, such as spreadsheets, databases, and APIs. Large Language Models (LLMs) have made strides in translating natural language into executable code for a database or an API, but they often fall short in heterogeneous data environments, where a single answer requires combining information from different types of sources.
A recent research paper, “Declarative Techniques for NL Queries over Heterogeneous Data,” by Elham Khabiri, Jeffrey O. Kephart, Fenno F. Heath III, Srideepika Jayaraman, Fateh A. Tipu, Yingjie Li, Dhruv Shah, Achille Fokoue, and Anu Bhamidipaty, addresses this challenge. The authors introduce a declarative approach designed to handle data heterogeneity significantly better than existing LLM-based agentic or imperative code generation systems.
The Challenge of Diverse Data Sources
Current LLM-based applications often struggle because they conflate a user’s intent with the complex planning required to execute queries across different data types. Imagine asking, “Which Xylem pumps at Bedford have experienced anomalous temperatures today?” This requires both a database call to find pumps by manufacturer and location, and an API call to check temperature anomalies. Existing agent-based architectures, like ReAct, can orchestrate such tasks but tend to be brittle, expensive, and difficult to scale in real-world production settings.
Introducing a Declarative Solution: siwarex
The researchers propose a more practical architecture called siwarex, which cleanly separates the user’s intent from the execution planning. This system leverages SQL as a declarative language to express user intent and uses User Defined Functions (UDFs) to invoke APIs directly from within SQL queries. By doing so, APIs are treated on the same footing as database tables, allowing the system to utilize decades of research in SQL query optimization for efficient orchestration and aggregation across both databases and APIs.
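To make the idea of calling an API from inside SQL concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the siwarex implementation: the table layout, the function name has_temp_anomaly, and the dictionary standing in for a remote anomaly API are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical stand-in for a remote anomaly-detection API. A real UDF
# would issue an HTTP request here instead of reading a dictionary.
FAKE_ANOMALY_API = {"P1": True, "P2": False, "P3": True}


def has_temp_anomaly(pump_id):
    """Return 1 if the simulated API reports an anomaly for this pump, else 0."""
    return int(FAKE_ANOMALY_API.get(pump_id, False))


def find_anomalous_pumps(manufacturer, site):
    """Answer the pump question with one SQL query that mixes a table and a UDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pumps (id TEXT, manufacturer TEXT, site TEXT)")
    conn.executemany(
        "INSERT INTO pumps VALUES (?, ?, ?)",
        [
            ("P1", "Xylem", "Bedford"),
            ("P2", "Xylem", "Bedford"),
            ("P3", "Grundfos", "Bedford"),
            ("P4", "Xylem", "Austin"),
        ],
    )
    # Register the Python function so SQL can call it like a built-in.
    conn.create_function("has_temp_anomaly", 1, has_temp_anomaly)
    rows = conn.execute(
        "SELECT id FROM pumps "
        "WHERE manufacturer = ? AND site = ? AND has_temp_anomaly(id) = 1 "
        "ORDER BY id",
        (manufacturer, site),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(find_anomalous_pumps("Xylem", "Bedford"))
```

The query states only what is wanted. The order in which the table filter and the API-backed function are evaluated is left to the SQL engine, which is the separation of intent from planning that the paper argues for.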
The siwarex framework relies on two key schemas:
- Abstract Schema: Provides a global view of data source properties and relationships, agnostic to whether the source is a database table or an API.
- API Mapping Schema: Contains details needed to invoke an API, such as URL, method (POST, GET), and input/output parameters.
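As a rough illustration of how the two schemas could be represented, the sketch below models them as Python dataclasses. The article does not give the paper's actual schema format, so every class name, field name, and the placeholder URL here is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AbstractTable:
    """One relation in the abstract schema, regardless of what backs it."""
    name: str
    columns: Dict[str, str]  # column name -> SQL type


@dataclass
class ApiMapping:
    """Invocation details for a relation that is actually served by an API."""
    table: str
    url: str
    method: str
    input_params: List[str]
    output_params: List[str]


def render_for_llm(tables):
    """Render every relation as CREATE TABLE text, hiding whether it is an API."""
    lines = []
    for table in tables:
        cols = ", ".join(f"{col} {typ}" for col, typ in table.columns.items())
        lines.append(f"CREATE TABLE {table.name} ({cols});")
    return "\n".join(lines)


# Hypothetical example: one real table and one virtual table backed by an API.
ABSTRACT_SCHEMA = [
    AbstractTable("pumps", {"id": "TEXT", "manufacturer": "TEXT", "site": "TEXT"}),
    AbstractTable("temperature_anomaly", {"pump_id": "TEXT", "anomalous": "INTEGER"}),
]

API_MAPPINGS = {
    "temperature_anomaly": ApiMapping(
        table="temperature_anomaly",
        url="https://example.invalid/anomaly",  # placeholder, not a real endpoint
        method="GET",
        input_params=["pump_id"],
        output_params=["anomalous"],
    ),
}

print(render_for_llm(ABSTRACT_SCHEMA))
```

In this sketch the Text-to-SQL model would see only the rendered CREATE TABLE text, while the mapping entries are held back for the later rewriting step.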
These schemas allow a standard Text-to-SQL module to generate SQL queries, even for virtual tables representing APIs. A rule-based Query Rewriter then transforms these queries into executable SQL by replacing virtual tables with their corresponding UDFs, ensuring proper argument passing.
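The rewriting step can be pictured with a toy example. The sketch below applies a single regex rule that turns a subquery over a virtual table into a UDF call. It is an assumption-laden illustration, not the paper's Query Rewriter: the virtual table temperature_anomaly, the UDF name has_temp_anomaly, and the one supported query shape are all hypothetical, and a production rewriter would work on a parsed query rather than on text.

```python
import re

# Hypothetical mapping: virtual table -> key column, output column, UDF name.
VIRTUAL_TABLES = {
    "temperature_anomaly": {
        "key": "pump_id",
        "output": "anomalous",
        "udf": "has_temp_anomaly",
    },
}


def rewrite(sql):
    """Rewrite `<col> IN (SELECT <key> FROM <virtual> WHERE <output> = <value>)`
    into `<udf>(<col>) = <value>`. Queries without that shape are returned as-is.
    """
    for table, m in VIRTUAL_TABLES.items():
        pattern = re.compile(
            rf"(\w+(?:\.\w+)?)\s+IN\s*\(\s*SELECT\s+{m['key']}\s+FROM\s+{table}"
            rf"\s+WHERE\s+{m['output']}\s*=\s*(\w+)\s*\)",
            re.IGNORECASE,
        )
        # \1 is the outer column passed as the UDF argument, \2 the compared value.
        sql = pattern.sub(rf"{m['udf']}(\1) = \2", sql)
    return sql


generated = (
    "SELECT id FROM pumps WHERE manufacturer = 'Xylem' AND "
    "id IN (SELECT pump_id FROM temperature_anomaly WHERE anomalous = 1)"
)
print(rewrite(generated))
```

The point of the rule is argument passing: the column that the generated query compared against the virtual table's key becomes the argument of the UDF.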
Benchmarking the Approaches
To rigorously test their declarative approach against imperative code generation and agent-based systems, the authors created two new benchmarks, extending the popular Spider dataset:
- Benchmark I: Replaces a fraction of real Spider database tables with equivalent API calls, requiring systems to combine database and API interactions.
- Benchmark II: Introduces 16 scalar APIs for lexical, numeric, or geospatial operations, transforming existing Spider questions to require interleaving database operations with compositions of these APIs.
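For a sense of what interleaving database operations with scalar APIs might look like, here is a small sketch using Python's sqlite3 module. The two functions, km_to_miles and initials, and the routes table are invented for illustration. They are not taken from the 16 APIs in Benchmark II, whose names the article does not list.

```python
import sqlite3


def km_to_miles(km):
    """Hypothetical numeric scalar API: convert kilometres to miles."""
    return round(km * 0.621371, 1)


def initials(name):
    """Hypothetical lexical scalar API: upper-case initials of each word."""
    return "".join(word[0].upper() for word in name.split())


def longest_route_per_airline():
    """Apply a scalar function to a SQL aggregate and another to the group key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE routes (airline TEXT, distance_km REAL)")
    conn.executemany(
        "INSERT INTO routes VALUES (?, ?)",
        [("blue sky air", 1000.0), ("north star", 250.0), ("blue sky air", 400.0)],
    )
    conn.create_function("km_to_miles", 1, km_to_miles)
    conn.create_function("initials", 1, initials)
    rows = conn.execute(
        "SELECT initials(airline) AS code, km_to_miles(MAX(distance_km)) AS miles "
        "FROM routes GROUP BY airline ORDER BY code"
    ).fetchall()
    conn.close()
    return rows


print(longest_route_per_airline())
```

Here the database performs the grouping and the MAX aggregate, and the scalar functions wrap its results inside the same query.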
Key Findings
Experiments on these new benchmarks demonstrated that the declarative approach significantly outperforms both imperative and agent-based methods, especially when dealing with a mixture of database and API calls. The agentic method, for instance, struggled with sequencing multiple API calls, routing questions to the correct tools, and often hallucinated or improperly bound inputs. The imperative approach, while strong with pure database queries, became less accurate as the mix of APIs and databases became more even, due to the increased complexity of generated Python code.
The declarative approach, by presenting a unified relational view to the LLM, allows it to leverage its Text-to-SQL capabilities more effectively, with the Query Rewriter handling the complexities of API invocation. This separation of concerns leads to more robust and accurate results.
Looking Ahead
This research marks a significant step towards making natural language queries over heterogeneous data sources practical in industrial settings. The authors have released their augmented benchmarks to the research community, encouraging further advancements in this crucial area. Future work aims to address limitations such as evaluating execution performance on larger datasets and incorporating APIs that produce vector or table outputs, further extending the applicability of their declarative framework.


