
Collaborative AI Agents Boost Text-to-SQL Performance in Open-Source Models

TLDR: This research paper introduces BAPPA, a benchmark for evaluating multi-agent LLM pipelines for Text-to-SQL generation. It explores three novel pipelines—Multi-Agent Discussion, Planner-Coder, and Coder-Aggregator—demonstrating that collaborative AI agents can significantly improve SQL generation accuracy and reliability, especially for smaller and open-source language models, making database interaction more accessible. The Planner-Coder pipeline, in particular, showed substantial gains, with models like Gemma 3 27B IT achieving up to 56.4% execution accuracy on the BIRD dataset when guided by strong planners.

Accessing information stored in databases often requires specialized knowledge of SQL, limiting its use for many. Text-to-SQL systems aim to bridge this gap by allowing users to query databases using natural language. However, current Large Language Models (LLMs) frequently struggle with this task, especially when dealing with large database schemas and complex reasoning requirements. Much of the prior research has focused on complex, often impractical pipelines using very large, proprietary models, leaving smaller, more efficient open-source models largely unexplored.
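To make the task concrete, here is a minimal illustration of what a Text-to-SQL system does: map a natural-language question to an executable SQL query. The table, column names, and question below are illustrative, not from the paper; execution accuracy (the metric reported throughout) compares the result of the generated query against that of a gold reference query.

```python
import sqlite3

# Toy database standing in for a real schema (all names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'Engineering', 95000),
        ('Bob', 'Sales', 60000),
        ('Carol', 'Engineering', 105000);
""")

# A natural-language question a user might ask:
question = "What is the average salary in the Engineering department?"

# The SQL a Text-to-SQL system is expected to generate for that question:
generated_sql = "SELECT AVG(salary) FROM employees WHERE department = 'Engineering'"

# Executing the generated query; a benchmark would compare this result
# against the result of a gold (reference) query.
result = conn.execute(generated_sql).fetchone()[0]
print(result)  # 100000.0
```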

A new research paper titled “BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation” by Fahim Ahmed, Md Mubtasim Ahasan, Jahir Sadik Monon, Muntasir Wahed, M Ashraful Amin, A K M Mahbubur Rahman, and Amin Ahsan Ali, delves into the potential of multi-agent LLM pipelines to enhance Text-to-SQL generation. This work systematically benchmarks the performance of various small to large open-source models within these collaborative frameworks. You can find the full paper here: BAPPA Research Paper.

Addressing Key Challenges

The researchers identified two main challenges: the underexplored potential of multi-agent LLM pipelines for direct Text-to-SQL generation, and the lack of systematic benchmarking for recent open-source LLMs across different scales. Existing Text-to-SQL systems often rely on highly specialized, fine-tuned models for subtasks, while the power of LLMs collaborating through role specialization and critique has been largely overlooked for SQL generation.

To tackle these issues, the paper proposes three innovative multi-agent LLM pipelines:

1. Multi-Agent Discussion Pipeline

In this setup, three agents, each with a distinct persona (Simple, Technical, Thinker), engage in an iterative discussion. They critique and refine SQL queries generated by each other across multiple rounds. A central ‘Judge’ agent then synthesizes the final SQL query based on the consensus. This collaborative critique mechanism helps to improve the robustness of the generated SQL and mitigate common errors seen in single-shot generation.
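The control flow of that discussion can be sketched as follows. The agents here are hard-coded stubs rather than LLM calls, and the consensus rule (majority adoption) is a simplifying assumption; the point is only the structure: persona drafts, critique rounds, then a Judge synthesizing the final query.

```python
from collections import Counter

PERSONAS = ["Simple", "Technical", "Thinker"]

def agent_draft(persona, question):
    # Stub: each persona proposes an initial candidate SQL query.
    drafts = {
        "Simple":    "SELECT name FROM employees",
        "Technical": "SELECT name FROM employees WHERE salary > 90000",
        "Thinker":   "SELECT name FROM employees WHERE salary > 90000",
    }
    return drafts[persona]

def agent_revise(persona, question, all_drafts):
    # Stub critique step: each agent sees the others' drafts and here
    # simply adopts the majority answer; a real agent would critique
    # and rewrite via an LLM call.
    majority, _ = Counter(all_drafts.values()).most_common(1)[0]
    return majority

def judge(question, final_drafts):
    # The Judge agent synthesizes the consensus query.
    majority, _ = Counter(final_drafts.values()).most_common(1)[0]
    return majority

def discussion_pipeline(question, rounds=2):
    drafts = {p: agent_draft(p, question) for p in PERSONAS}
    for _ in range(rounds):
        drafts = {p: agent_revise(p, question, drafts) for p in PERSONAS}
    return judge(question, drafts)

final_sql = discussion_pipeline("Which employees earn more than 90000?")
print(final_sql)
```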

2. Planner-Coder Pipeline

This pipeline separates the reasoning and execution phases. A ‘Planner Agent,’ typically a thinking model, first analyzes the database schema and user query to generate a structured, step-by-step plan for constructing the SQL. Subsequently, a ‘Coder Agent’ takes this plan and the schema to synthesize the final SQL query. This design allows for transparent reasoning and can significantly guide the code generation process, especially for complex queries.
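A minimal sketch of that two-stage split, with both agents stubbed (the plan and SQL are hard-coded for illustration, not produced by a model): the Planner emits a step-by-step plan, and the Coder conditions on that plan plus the schema to emit the final query.

```python
def planner_agent(schema, question):
    # Stub planner: a reasoning-oriented model would derive this plan
    # from the schema and question; here it is fixed for illustration.
    return [
        "1. Identify the target table: employees.",
        "2. Filter rows where department = 'Engineering'.",
        "3. Aggregate with AVG over the salary column.",
    ]

def coder_agent(schema, question, plan):
    # Stub coder: a real Coder Agent would generate SQL conditioned on
    # the plan text, making the reasoning behind the query transparent.
    return "SELECT AVG(salary) FROM employees WHERE department = 'Engineering'"

def planner_coder_pipeline(schema, question):
    plan = planner_agent(schema, question)
    sql = coder_agent(schema, question, plan)
    return plan, sql

plan, sql = planner_coder_pipeline(
    schema="employees(name, department, salary)",
    question="What is the average salary in Engineering?",
)
print(sql)
```

Because the plan is an explicit intermediate artifact, it can also be inspected or swapped out, which is what allows plans from stronger planner models to guide weaker coders.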

3. Coder-Aggregator Pipeline

Here, multiple ‘Coder Agents’ independently generate candidate SQL queries, each accompanied by its reasoning trace. A single ‘Aggregator Agent’ then evaluates and integrates these diverse outputs to select or synthesize the best final query. This approach leverages multiple perspectives during inference, enhancing factual consistency and execution accuracy through a form of self-critique and consensus among the generated candidates.
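As a rough sketch of that aggregation step: below, the candidate queries and traces are hard-coded, and consensus is taken over execution results (a self-consistency heuristic standing in for the paper's Aggregator Agent, which is itself an LLM reasoning over the candidates and their traces).

```python
import sqlite3
from collections import Counter

# Toy database with illustrative names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Alice", 95000), ("Bob", 60000)])

# (sql, reasoning_trace) pairs from independent Coder Agents;
# the third candidate is deliberately invalid.
candidates = [
    ("SELECT name FROM employees WHERE salary > 90000", "filter on salary"),
    ("SELECT name FROM employees WHERE salary > 90000", "threshold comparison"),
    ("SELECT nam FROM employees", "typo: invalid column"),
]

def run(sql):
    # Execute a candidate; invalid SQL yields None instead of crashing.
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def aggregate(candidates):
    # Pick the candidate whose execution result matches the majority.
    results = {sql: run(sql) for sql, _trace in candidates}
    valid = [r for r in results.values() if r is not None]
    winner, _ = Counter(valid).most_common(1)[0]
    for sql, _trace in candidates:
        if results[sql] == winner:
            return sql

chosen = aggregate(candidates)
print(chosen)
```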

Extensive Benchmarking and Key Findings

The research conducted an extensive evaluation across 24 open-source LLMs, ranging from 4B to 34B parameters, including models like Qwen2, Gemma 3, CodeLLaMA, DeepSeek, and StarCoder. Experiments were performed on the challenging BIRD Mini-Dev and Spider Dev datasets.

The findings revealed several important insights:

  • Even in zero-shot settings (without specific training examples), models like Gemma 3 (27B IT) and Qwen2.5-Coder (14B Instruct) demonstrated strong performance, often outperforming proprietary models like GPT-4 Turbo on the BIRD dataset. This highlights the rapid advancements in open instruction-tuned systems.
  • The Multi-Agent Discussion pipeline showed stable, albeit modest, gains across dialogue rounds. It particularly benefited mid-scale models, with Qwen2.5-Coder-14B-Instruct seeing a substantial improvement in Execution Accuracy on BIRD.
  • The Planner-Coder pipeline proved to be a significant enhancer, especially for weaker coding models. Leveraging reasoning-oriented plans consistently improved SQL generation accuracy. The highest results were achieved when combining plans from multiple strong planner agents, with Gemma 3 27B IT reaching an impressive 56.4% Execution Accuracy on BIRD.
  • The Coder-Aggregator pipeline consistently improved SQL accuracy through ensemble reasoning, particularly for smaller and mid-scale coder sets. Aggregators like QwQ-32B and DeepSeek-R1-Distill-Qwen-14B demonstrated strong performance in consolidating diverse SQL candidates.

Conclusion

The BAPPA research underscores the immense potential of collaborative, multi-agent pipelines in enhancing Text-to-SQL generation. By decomposing SQL generation into specialized agentic subtasks like planning, critique, and aggregation, even smaller and mid-sized open-source language models can achieve performance comparable to or even surpassing much larger, proprietary systems. This work lays a strong foundation for developing more efficient, accessible, and reliable Text-to-SQL systems for real-world deployment.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
