
Collaborative AI Agents Boost Text-to-SQL Performance in Open-Source Models

TLDR: This research paper introduces BAPPA, a benchmark for evaluating multi-agent LLM pipelines for Text-to-SQL generation. It explores three novel pipelines—Multi-Agent Discussion, Planner-Coder, and Coder-Aggregator—demonstrating that collaborative AI agents can significantly improve SQL generation accuracy and reliability, especially for smaller and open-source language models, making database interaction more accessible. The Planner-Coder pipeline, in particular, showed substantial gains, with models like Gemma 3 27B IT achieving up to 56.4% execution accuracy on the BIRD dataset when guided by strong planners.

Accessing information stored in databases often requires specialized knowledge of SQL, limiting its use for many. Text-to-SQL systems aim to bridge this gap by allowing users to query databases using natural language. However, current Large Language Models (LLMs) frequently struggle with this task, especially when dealing with large database schemas and complex reasoning requirements. Much of the prior research has focused on complex, often impractical pipelines using very large, proprietary models, leaving smaller, more efficient open-source models largely unexplored.
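To make the task concrete, here is a minimal illustration of what a Text-to-SQL system does: map a natural-language question to an executable SQL query. The table, column names, and question below are illustrative, not from the paper; execution accuracy (the metric reported throughout) compares the result of the generated query against that of a gold reference query.

```python
import sqlite3

# Toy database standing in for a real schema (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'Engineering', 95000),
        ('Bob', 'Sales', 60000),
        ('Carol', 'Engineering', 105000);
""")

# A natural-language question a user might ask:
question = "What is the average salary in the Engineering department?"

# The SQL a Text-to-SQL system is expected to generate for that question:
generated_sql = "SELECT AVG(salary) FROM employees WHERE department = 'Engineering'"

# Executing the generated query; a benchmark would compare this result
# against the result of a gold (reference) query.
result = conn.execute(generated_sql).fetchone()[0]
print(result)  # 100000.0
```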

A new research paper titled “BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation” by Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, and Amin Ahsan Ali, delves into the potential of multi-agent LLM pipelines to enhance Text-to-SQL generation. This work systematically benchmarks the performance of various small to large open-source models within these collaborative frameworks. You can find the full paper here: BAPPA Research Paper.

Addressing Key Challenges

The researchers identified two main challenges: the underexplored potential of multi-agent LLM pipelines for direct Text-to-SQL generation, and the lack of systematic benchmarking for recent open-source LLMs across different scales. Existing Text-to-SQL systems often rely on highly specialized, fine-tuned models for subtasks, while the power of LLMs collaborating through role specialization and critique has been largely overlooked for SQL generation.

To tackle these issues, the paper proposes three innovative multi-agent LLM pipelines:

1. Multi-Agent Discussion Pipeline

In this setup, three agents, each with a distinct persona (Simple, Technical, Thinker), engage in an iterative discussion. They critique and refine SQL queries generated by each other across multiple rounds. A central ‘Judge’ agent then synthesizes the final SQL query based on the consensus. This collaborative critique mechanism helps to improve the robustness of the generated SQL and mitigate common errors seen in single-shot generation.
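The control flow of that discussion can be sketched as follows. The agents here are hard-coded stubs rather than LLM calls, and the consensus rule (majority adoption) is a simplifying assumption; the point is only the structure: persona drafts, critique rounds, then a Judge synthesizing the final query.

```python
from collections import Counter

PERSONAS = ["Simple", "Technical", "Thinker"]

def agent_draft(persona, question):
    # Stub: each persona proposes an initial candidate SQL query.
    drafts = {
        "Simple":    "SELECT name FROM employees",
        "Technical": "SELECT name FROM employees WHERE salary > 90000",
        "Thinker":   "SELECT name FROM employees WHERE salary > 90000",
    }
    return drafts[persona]

def agent_revise(persona, question, all_drafts):
    # Stub critique step: each agent sees the others' drafts and here
    # simply adopts the majority answer; a real agent would critique
    # and rewrite via an LLM call.
    majority, _ = Counter(all_drafts.values()).most_common(1)[0]
    return majority

def judge(question, final_drafts):
    # The Judge agent synthesizes the consensus query.
    majority, _ = Counter(final_drafts.values()).most_common(1)[0]
    return majority

def discussion_pipeline(question, rounds=2):
    drafts = {p: agent_draft(p, question) for p in PERSONAS}
    for _ in range(rounds):
        drafts = {p: agent_revise(p, question, drafts) for p in PERSONAS}
    return judge(question, drafts)

final_sql = discussion_pipeline("Which employees earn more than 90000?")
print(final_sql)
```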

2. Planner-Coder Pipeline

This pipeline separates the reasoning and execution phases. A ‘Planner Agent,’ typically a thinking model, first analyzes the database schema and user query to generate a structured, step-by-step plan for constructing the SQL. Subsequently, a ‘Coder Agent’ takes this plan and the schema to synthesize the final SQL query. This design allows for transparent reasoning and can significantly guide the code generation process, especially for complex queries.
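A minimal sketch of that two-stage split, with both agents stubbed (the plan and SQL are hard-coded for illustration, not produced by a model): the Planner emits a step-by-step plan, and the Coder conditions on that plan plus the schema to emit the final query.

```python
def planner_agent(schema, question):
    # Stub planner: a reasoning-oriented model would derive this plan
    # from the schema and question; here it is fixed for illustration.
    return [
        "1. Identify the target table: employees.",
        "2. Filter rows where department = 'Engineering'.",
        "3. Aggregate with AVG over the salary column.",
    ]

def coder_agent(schema, question, plan):
    # Stub coder: a real Coder Agent would generate SQL conditioned on
    # the plan text, making the reasoning behind the query transparent.
    return "SELECT AVG(salary) FROM employees WHERE department = 'Engineering'"

def planner_coder_pipeline(schema, question):
    plan = planner_agent(schema, question)
    sql = coder_agent(schema, question, plan)
    return plan, sql

plan, sql = planner_coder_pipeline(
    schema="employees(name, department, salary)",
    question="What is the average salary in Engineering?",
)
print(sql)
```

Because the plan is an explicit intermediate artifact, it can also be inspected or swapped out, which is what allows plans from stronger planner models to guide weaker coders.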

3. Coder-Aggregator Pipeline

Here, multiple ‘Coder Agents’ independently generate candidate SQL queries, each accompanied by its reasoning trace. A single ‘Aggregator Agent’ then evaluates and integrates these diverse outputs to select or synthesize the best final query. This approach leverages multiple perspectives during inference, enhancing factual consistency and execution accuracy through a form of self-critique and consensus among the generated candidates.
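As a rough sketch of that aggregation step: below, the candidate queries and traces are hard-coded, and consensus is taken over execution results (a self-consistency heuristic standing in for the paper's Aggregator Agent, which is itself an LLM reasoning over the candidates and their traces).

```python
import sqlite3
from collections import Counter

# Toy database with illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Alice", 95000), ("Bob", 60000)])

# (sql, reasoning_trace) pairs from independent Coder Agents;
# the third candidate is deliberately invalid.
candidates = [
    ("SELECT name FROM employees WHERE salary > 90000", "filter on salary"),
    ("SELECT name FROM employees WHERE salary > 90000", "threshold comparison"),
    ("SELECT nam FROM employees", "typo: invalid column"),
]

def run(sql):
    # Execute a candidate; invalid SQL yields None instead of crashing.
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def aggregate(candidates):
    # Pick the candidate whose execution result matches the majority.
    results = {sql: run(sql) for sql, _trace in candidates}
    valid = [r for r in results.values() if r is not None]
    winner, _ = Counter(valid).most_common(1)[0]
    for sql, _trace in candidates:
        if results[sql] == winner:
            return sql

chosen = aggregate(candidates)
print(chosen)
```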

Extensive Benchmarking and Key Findings

The research conducted an extensive evaluation across 24 open-source LLMs, ranging from 4B to 34B parameters, including models like Qwen2, Gemma 3, CodeLLaMA, DeepSeek, and StarCoder. Experiments were performed on the challenging BIRD Mini-Dev and Spider Dev datasets.

The findings revealed several important insights:

  • Even in zero-shot settings (without specific training examples), models like Gemma 3 (27B IT) and Qwen2.5-Coder (14B Instruct) demonstrated strong performance, often outperforming proprietary models like GPT-4 Turbo on the BIRD dataset. This highlights the rapid advancements in open instruction-tuned systems.
  • The Multi-Agent Discussion pipeline showed stable, albeit modest, gains across dialogue rounds. It particularly benefited mid-scale models, with Qwen2.5-Coder-14B-Instruct seeing a substantial improvement in Execution Accuracy on BIRD.
  • The Planner-Coder pipeline proved to be a significant enhancer, especially for weaker coding models. Leveraging reasoning-oriented plans consistently improved SQL generation accuracy. The highest results were achieved when combining plans from multiple strong planner agents, with Gemma 3 27B IT reaching an impressive 56.4% Execution Accuracy on BIRD.
  • The Coder-Aggregator pipeline consistently improved SQL accuracy through ensemble reasoning, particularly for smaller and mid-scale coder sets. Aggregators like QwQ-32B and DeepSeek-R1-Distill-Qwen-14B demonstrated strong performance in consolidating diverse SQL candidates.

Conclusion

The BAPPA research underscores the immense potential of collaborative, multi-agent pipelines in enhancing Text-to-SQL generation. By decomposing SQL generation into specialized agentic subtasks like planning, critique, and aggregation, even smaller and mid-sized open-source language models can achieve performance comparable to or even surpassing much larger, proprietary systems. This work lays a strong foundation for developing more efficient, accessible, and reliable Text-to-SQL systems for real-world deployment.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
