TLDR: BIRD-INTERACT is a new benchmark designed to evaluate large language models (LLMs) in multi-turn, interactive text-to-SQL scenarios, reflecting real-world database applications. It features a comprehensive environment with a function-driven user simulator, two evaluation settings (protocol-guided and agentic), and challenging tasks covering full CRUD operations with ambiguities and follow-up questions. Initial results show that even advanced LLMs struggle significantly, highlighting the need for improved interactive communication and strategic problem-solving capabilities.
Large language models, or LLMs, have shown impressive capabilities in converting natural language into SQL queries. However, most evaluations focus on single, perfectly formed questions. In the real world, interacting with databases often involves a back-and-forth conversation, where users might ask ambiguous questions, make mistakes, or change their minds. This dynamic interaction is crucial for practical database assistants, but existing benchmarks haven’t fully captured this complexity.
A new benchmark called BIRD-INTERACT has been introduced to address this gap. It aims to evaluate how well LLMs can handle these multi-turn, interactive scenarios, bringing a much-needed dose of realism to text-to-SQL evaluation. The researchers behind BIRD-INTERACT recognized that current benchmarks either treat conversation history as static information or limit evaluations to simple ‘read-only’ operations, which doesn’t reflect the full spectrum of challenges faced in real-world database applications.
Key Features of BIRD-INTERACT
BIRD-INTERACT introduces several innovative features to create a more realistic evaluation environment:
- **Comprehensive Interaction Environment**: Each database is paired with a hierarchical knowledge base, metadata files, and a unique function-driven user simulator. This setup allows LLMs to ask for clarifications, retrieve information, and even recover from execution errors without needing human supervision.
- **Two Evaluation Settings**: The benchmark offers two distinct ways to test LLMs. The ‘c-Interact’ setting follows a predefined conversational protocol, testing the model’s ability to stick to a structured dialogue. The ‘a-Interact’ setting is more open-ended, allowing the LLM to act as an autonomous agent, deciding when to query the user simulator or explore the database environment on its own.
- **Challenging Task Suite**: The tasks cover the full range of database operations, including Create, Read, Update, and Delete (CRUD), for both business intelligence and operational use cases. Each task includes ambiguous initial questions and follow-up sub-tasks that require dynamic interaction to resolve. The benchmark comes in two sizes: a ‘FULL’ set with 600 tasks for a comprehensive overview, and a ‘LITE’ set with 300 tasks for detailed behavioral analysis and faster development.
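To make the ‘a-Interact’ setting concrete, here is a minimal sketch of what an agentic interaction loop could look like: the model spends turns clarifying an ambiguous request with the user simulator before committing to a final SQL query. All names and the keyword-based "ambiguity detection" below are invented for illustration and are not BIRD-INTERACT's actual API.

```python
from dataclasses import dataclass

@dataclass
class UserSimulator:
    """Toy stand-in for a function-driven user simulator (hypothetical)."""
    clarifications: dict  # canned answers keyed by question

    def ask(self, question: str) -> str:
        return self.clarifications.get(question, "No further information.")

def agent_loop(ambiguous_request: str, simulator: UserSimulator, max_turns: int = 3):
    """Each turn, the agent chooses between asking the user and submitting SQL."""
    transcript = []
    sql = None
    for _ in range(max_turns):
        already_asked = any(q == "How recent is 'recent'?" for q, _ in transcript)
        if "recent" in ambiguous_request and not already_asked:
            # Ambiguity detected: spend a turn clarifying instead of guessing.
            answer = simulator.ask("How recent is 'recent'?")
            transcript.append(("How recent is 'recent'?", answer))
        else:
            # Enough information gathered: commit to a final query.
            days = 30 if any("30 days" in a for _, a in transcript) else 7
            sql = f"SELECT * FROM orders WHERE created_at >= now() - interval '{days} days';"
            break
    return sql, transcript
```

Under this toy setup, `agent_loop("show recent orders", UserSimulator({"How recent is 'recent'?": "The last 30 days."}))` asks one clarification and then produces a 30-day query; the design point is that clarification turns are a budgeted resource the agent must spend strategically.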
Initial Findings and Challenges
The initial results from BIRD-INTERACT highlight the significant difficulty of these interactive tasks for even the most advanced LLMs. For instance, GPT-5, a leading model, only completed 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. This indicates a substantial gap between current LLM capabilities and the strategic interaction skills required for effective human-AI collaboration in database querying.
Further analysis revealed interesting insights:
- **Communication is Key**: Experiments showed that improving an LLM’s communication effectiveness significantly boosts its performance, even for models whose core SQL generation ability is already strong.
- **Interaction Test-time Scaling**: Performance generally improves as models are given more opportunities to interact and clarify, suggesting that effective interaction leads to valuable information gains.
- **Agent Behavior**: In the agentic setting, models often favored direct code execution and asking the user over systematically exploring the database schema or knowledge base. This suggests a bias towards trial-and-error rather than strategic information gathering.
A More Reliable User Simulator
A critical component of BIRD-INTERACT is its novel two-stage function-driven user simulator. This simulator is designed to be more robust and reliable than previous versions, which sometimes leaked ground-truth information or deviated from task requirements. By mapping clarification requests to constrained symbolic actions before generating responses, the simulator ensures predictable and controllable behavior, aligning more closely with actual human-AI interaction patterns.
This benchmark represents a significant step forward in evaluating LLMs for real-world text-to-SQL applications, emphasizing the importance of dynamic interaction and problem-solving. For more in-depth information, you can read the full research paper here.


