TOUCAN: A New Frontier in Training Language Model Agents with Real-World Tool Data

TLDR: TOUCAN is the largest open-source tool-agentic dataset to date, with 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocol (MCP) servers. It addresses the shortage of high-quality training data for LLM agents by providing diverse, realistic multi-tool and multi-turn interactions backed by real tool execution. Models fine-tuned on TOUCAN outperform larger closed-source counterparts on benchmarks such as BFCL V3 and lead their parameter class on MCP-Universe.

Large Language Models (LLMs) are increasingly deployed as agents that call external tools to automate complex tasks across many domains. A persistent obstacle for the open-source community has been the scarcity of high-quality, permissively licensed training data built for these tool-agentic LLMs. Existing datasets often lack diversity, realism, and complexity, particularly for interactions that involve multiple tools and multiple conversational turns.

To address this gap, a new research paper introduces TOUCAN, the largest publicly available tool-agentic dataset to date. It comprises 1.5 million trajectories synthesized from nearly 500 real-world Model Context Protocol (MCP) servers. Unlike previous efforts that relied on simulated or limited toolsets, TOUCAN uses authentic MCP environments exposing more than 2,000 tools to generate tasks that are diverse, realistic, and challenging. Every task involves actual tool execution, and the scenarios range from parallel and multi-step tool calls to multi-turn conversations.
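To make the three scenario types concrete, here is a minimal Python sketch that labels a toy trajectory as parallel, multi-step, or multi-turn. The `Turn` record, the `describe` helper, and the tool names are hypothetical illustrations, not TOUCAN's actual schema, and the definitions used (for example, "multi-step" meaning more than one assistant turn that issues tool calls) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One message in a trajectory (hypothetical record, not TOUCAN's schema)."""
    role: str  # "user", "assistant", or "tool"
    tool_calls: list = field(default_factory=list)  # tool names requested in this turn


def describe(trajectory):
    """Label a trajectory using the assumed definitions from the lead-in."""
    calling_turns = [t for t in trajectory if t.role == "assistant" and t.tool_calls]
    user_turns = [t for t in trajectory if t.role == "user"]
    return {
        # Parallel: a single assistant turn requests several tools at once.
        "parallel": any(len(t.tool_calls) > 1 for t in calling_turns),
        # Multi-step: tool calls are spread over more than one assistant turn.
        "multi_step": len(calling_turns) > 1,
        # Multi-turn: the user speaks more than once.
        "multi_turn": len(user_turns) > 1,
    }


# Toy trajectory with made-up tool names.
example = [
    Turn("user"),
    Turn("assistant", ["get_forecast", "get_time"]),  # two calls in one turn
    Turn("tool"),
    Turn("tool"),
    Turn("assistant", ["book_flight"]),  # a later, dependent call
    Turn("tool"),
    Turn("assistant"),
    Turn("user"),  # follow-up question
    Turn("assistant"),
]
labels = describe(example)
```

A single trajectory can carry all three labels at once, which is the kind of compound interaction the dataset is described as covering.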

The creation of TOUCAN follows a meticulous five-stage pipeline. It begins with the onboarding of high-quality MCP servers, followed by the synthesis of diverse tool-use queries using five distinct LLMs. These tasks then undergo a rigorous model-based quality filtering process to ensure their relevance and difficulty. Subsequently, agentic trajectories are generated using three teacher models and two agentic frameworks. The final stage involves rule-based and LLM-based post-filtering to guarantee high-quality outputs, including verification of tool execution and response accuracy.
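The five stages above can be read as a chain of filters and generators. The Python sketch below shows only that control flow, under stated assumptions: `run_pipeline`, its parameter names, the `healthy` flag, and the `min_score` threshold are all hypothetical, and the LLM-backed steps are passed in as plain callables so the example runs without any model. It is not the authors' implementation.

```python
def run_pipeline(servers, synthesize, score_task, generate_trajectory,
                 check_trajectory, min_score=3):
    """Toy five-stage flow; every name and threshold here is an assumption."""
    # Stage 1: onboard only servers that pass a (hypothetical) health check.
    onboarded = [s for s in servers if s["healthy"]]
    # Stage 2: synthesize tool-use tasks from each onboarded server.
    tasks = [task for s in onboarded for task in synthesize(s)]
    # Stage 3: model-based quality filtering, stood in for by a score function.
    kept = [t for t in tasks if score_task(t) >= min_score]
    # Stage 4: generate one agentic trajectory per surviving task.
    trajectories = [generate_trajectory(t) for t in kept]
    # Stage 5: post-filter trajectories with rule-based or LLM-based checks.
    return [tr for tr in trajectories if check_trajectory(tr)]


# Stub components so the sketch runs end to end (made-up servers and tools).
servers = [
    {"name": "weather", "healthy": True, "tools": ["get_forecast"]},
    {"name": "broken", "healthy": False, "tools": ["noop"]},
]
result = run_pipeline(
    servers,
    synthesize=lambda s: [{"query": f"Use {tool}", "tool": tool} for tool in s["tools"]],
    score_task=lambda t: 5,
    generate_trajectory=lambda t: {"task": t, "tool_results": ["ok"]},
    check_trajectory=lambda tr: all(r != "error" for r in tr["tool_results"]),
)
```

Passing the model-backed steps in as callables keeps the stage order visible and makes it easy to swap a stub for a real generator or judge.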

To further enhance data diversity and simulate real-world interactions, TOUCAN incorporates three extension mechanisms. These include generating queries that are unsolvable with the given toolset to train models to reject irrelevant requests, persona-based diversification to create varied task versions with new contexts and constraints, and a multi-turn self-simulation pipeline to generate realistic dialogues with follow-up questions.
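Two of these mechanisms are easy to picture as small data transformations. The Python sketch below is an illustration under assumptions, not the paper's procedure: `make_irrelevant_example`, `persona_variants`, the `required_tools` field, and the `"refuse"` label are hypothetical, and a real persona rewrite would use an LLM rather than a string prefix.

```python
def make_irrelevant_example(task, all_tools):
    """Pair a query with a toolset that lacks every tool it needs.

    Assumes each task lists its required tools; the expected behavior
    is then to decline rather than call an unrelated tool.
    """
    needed = set(task["required_tools"])
    offered = [tool for tool in all_tools if tool not in needed]
    return {"query": task["query"], "tools": offered, "expected": "refuse"}


def persona_variants(task, personas):
    """Create one variant of the task per persona (string prefix as a stand-in)."""
    return [{**task, "query": f"As {p}: {task['query']}"} for p in personas]


# Made-up task and tool names for illustration.
task = {"query": "What is the weather in Paris?", "required_tools": ["get_forecast"]}
irrelevant = make_irrelevant_example(task, ["get_forecast", "search_papers"])
variants = persona_variants(task, ["a pilot", "a farmer"])
```

The multi-turn mechanism is omitted here because it depends on a model simulating the user's follow-up questions.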

The paper's experiments evaluate how much TOUCAN improves agentic capability. On BFCL V3, models fine-tuned on TOUCAN outperform larger closed-source counterparts in function-calling accuracy across both single-turn and multi-turn scenarios. The same models show substantial improvements on τ-Bench and τ2-Bench, with gains in tool selection, execution fidelity, and multi-turn reasoning. On the MCP-Universe benchmark, TOUCAN-tuned models achieve state-of-the-art performance within their parameter class, consistently outperforming leading models of comparable size.

In short, TOUCAN gives the open-source community a large, diverse dataset derived from real-world tool interactions, which it can use to train more capable and reliable LLM agents. For more detail, refer to the full research paper.
