
Empowering Open-Source LLMs for Complex Tool Use with GOAT

TLDR: GOAT is a novel training framework that automatically generates synthetic datasets for goal-oriented API execution tasks from API documentation, eliminating the need for human annotation. It enables fine-tuning of large language models (LLMs) so they can decompose high-level objectives into interdependent API calls with correct planning and execution. Experiments show that GOAT-trained open-source agents achieve state-of-the-art performance on goal-oriented benchmarks, often outperforming closed-source models. The work also introduces GOATBench as a new evaluation benchmark.

Large language models, or LLMs, have made incredible strides in understanding and generating human-like text. More recently, they have begun to act as interactive agents that use external tools, such as APIs, to respond to user requests. However, a significant challenge remains: these LLM agents often struggle with “goal-oriented” queries. These are complex requests that require the agent to break a high-level objective into many smaller, interconnected steps, plan the correct sequence of actions, and execute them through a series of API calls.
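
To make this concrete, here is a minimal runnable sketch of what a goal-oriented query looks like once decomposed into interdependent API calls. The query, the API names, and their outputs are all hypothetical, invented purely for illustration:

```python
# Hypothetical goal-oriented query:
#   "Find a highly rated Italian restaurant near my hotel and book a table for two."
# Stub functions stand in for real tool endpoints; the point is the data dependencies.

def get_hotel_booking(user_id):
    return {"hotel": "Grand Plaza", "coordinates": (40.75, -73.99)}

def search_restaurants(cuisine, near, sort_by):
    return [{"id": "r42", "name": "Trattoria Roma", "rating": 4.8}]

def book_table(restaurant_id, party_size):
    return {"status": "confirmed", "restaurant_id": restaurant_id}

# The agent must plan and execute these in order: each call consumes a prior output.
booking = get_hotel_booking(user_id="u123")                               # step 1
nearby = search_restaurants("italian", booking["coordinates"], "rating")  # step 2 uses step 1
confirmation = book_table(nearby[0]["id"], party_size=2)                  # step 3 uses step 2
print(confirmation)  # {'status': 'confirmed', 'restaurant_id': 'r42'}
```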

The main hurdle for improving LLM agents in this area is the scarcity of training data. Creating datasets that accurately capture the intricate dependencies between API calls usually demands extensive human annotation, which is both time-consuming and expensive. While powerful proprietary models like GPT-4 show strong reasoning abilities, smaller, open-source models often fall short when faced with complex tool-use scenarios.

To address this critical gap, researchers have introduced a new training framework called GOAT (Goal-Oriented Agent with Tools). This innovative framework allows for the fine-tuning of LLM agents without the need for human annotation. GOAT automatically builds synthetic datasets for goal-oriented API execution tasks directly from existing API documentation. This process equips models with the ability to reason about interdependent API calls and generate coherent responses.

How GOAT Works

GOAT’s approach is quite clever. It starts with API documentation, which is usually readily available for any set of target APIs, and uses it to construct a detailed API dependency graph. This graph maps out all the ways the output of one API can serve as an input to another. To ensure accuracy, the initial graph undergoes a rigorous three-step filtering process (a minimal code sketch of the pipeline follows the list):

  • Embedding Similarity: Unlikely connections are quickly pruned by comparing the semantic descriptions of API outputs and inputs.
  • LLM Filtering: A large language model then semantically evaluates the remaining connections, determining if an output can meaningfully populate an input.
  • API Call Execution: Finally, the most crucial step involves actually executing API calls with plausible arguments and verifying if the output of one call can successfully be used as input for the next. This grounds the dependencies in real-world execution.
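
Here is a minimal sketch of how such a filtering pipeline could be wired together. The helper names, the similarity threshold, and the choice of a sentence-transformers embedding model are all assumptions made for illustration; the paper’s actual implementation may differ:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; the paper's actual choice may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def ask_llm(prompt):
    """Placeholder for a real chat-completion call; returns a canned
    answer so the sketch runs end to end."""
    return "yes"

def embedding_filter(output_desc, input_desc, threshold=0.5):
    """Step 1: prune unlikely edges by comparing field descriptions."""
    emb = model.encode([output_desc, input_desc])
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def llm_filter(output_desc, input_desc):
    """Step 2: ask an LLM whether the output can meaningfully fill the input."""
    prompt = (f"Can a value described as '{output_desc}' be used as the "
              f"argument described as '{input_desc}'? Answer yes or no.")
    return ask_llm(prompt).strip().lower().startswith("yes")

def execution_filter(api_a, api_b, plausible_args, feed):
    """Step 3: ground the edge in real execution: call api_a, thread its
    output into api_b via `feed`, and keep the edge only if both succeed."""
    try:
        out = api_a(**plausible_args)
        api_b(**feed(out))
        return True
    except Exception:
        return False

def keep_edge(api_a, api_b, output_desc, input_desc, plausible_args, feed):
    """An edge survives only if it passes all three filters in sequence."""
    return (embedding_filter(output_desc, input_desc)
            and llm_filter(output_desc, input_desc)
            and execution_filter(api_a, api_b, plausible_args, feed))
```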

Once a reliable dependency graph is established, GOAT generates goal-oriented task samples. It extracts connected sequences of API calls from the graph, instantiates and executes them, and then creates natural language “sub-queries” for each step. Crucially, it then generates a high-level user query that encapsulates the overall task and a final natural language response that interprets the API outputs in context. This “call-first” strategy is a key innovation, as it’s much easier for an LLM to summarize executed API calls into a user query than to infer complex API calls from a high-level query, providing more reliable training signals.
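
The call-first loop might look roughly like the toy example below. Everything here (the graph, the stub APIs, the canned summarizer) is a stand-in invented for this sketch, not the paper’s actual code:

```python
# Two stub APIs where the first call's output feeds the second call's input.
def get_user(uid):
    return {"uid": uid, "city": "Berlin"}

def get_weather(city):
    return {"city": city, "forecast": "sunny", "temp_c": 21}

# A tiny dependency graph encoding that relationship.
GRAPH = {
    "get_user": {"fn": get_user, "args": {"uid": "u1"}, "next": "get_weather"},
    "get_weather": {"fn": get_weather, "feed": lambda prev: {"city": prev["city"]}, "next": None},
}

def summarize(prompt):
    # Placeholder for an LLM call; returns canned text so the sketch runs.
    return f"[LLM summary of: {prompt[:60]}...]"

def generate_sample(start="get_user"):
    trace, prev, node = [], None, start
    while node:                                          # walk a connected path
        spec = GRAPH[node]
        args = spec.get("args") or spec["feed"](prev)    # thread outputs into inputs
        prev = spec["fn"](**args)                        # execute the call for real
        sub_query = summarize(f"user request for {node}({args})")  # per-step sub-query
        trace.append({"api": node, "args": args, "output": prev, "sub_query": sub_query})
        node = spec["next"]
    # Call-first: only after execution does the LLM write the high-level
    # query and the final response, grounded in the executed trace.
    query = summarize(f"one high-level request covering {[t['sub_query'] for t in trace]}")
    response = summarize(f"answer the request using {trace}")
    return {"query": query, "steps": trace, "response": response}

print(generate_sample())
```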

Introducing GOATBench

Beyond the training framework, the GOAT team also introduced GOATBench, a new evaluation benchmark specifically designed for goal-oriented API execution tasks. This benchmark, built using GOAT’s data generation pipeline and human verification, helps assess how well agents can handle tasks requiring planning and invoking sequences of interconnected APIs. It includes tasks categorized as “Single Tool” (multiple APIs from the same tool) and “Inter Tool” (APIs across different tools), testing various aspects of an agent’s reasoning capabilities.
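
For intuition, a GOATBench-style “Inter Tool” task might look roughly like the record below. This schema is a guess made for illustration; the article does not specify the benchmark’s actual format:

```python
# A guessed, simplified record shape for a GOATBench-style task.
example_task = {
    "category": "Inter Tool",  # the gold API sequence spans different tools
    "query": ("Get tomorrow's weather for the city hosting the next F1 race "
              "and convert the temperature to Fahrenheit."),
    "gold_api_sequence": [
        {"tool": "motorsport", "api": "get_next_race", "args": {}},
        {"tool": "weather", "api": "get_forecast",
         "args": {"city": "<output of get_next_race.city>", "days": 1}},
        {"tool": "units", "api": "celsius_to_fahrenheit",
         "args": {"value": "<output of get_forecast.temp_c>"}},
    ],
    "reference_response": "Tomorrow in <city> it will be about <temp_f> °F.",
}
```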

Impressive Results

Extensive experiments on GOATBench and other existing goal-oriented benchmarks like RestBench and API-Bank have shown remarkable results. Agents trained with GOAT achieve state-of-the-art performance among open-source models. In some instances, they even surpass certain closed-source models known for their strong reasoning abilities. The framework consistently boosts performance across different LLM backbones, including Qwen2-7B, Llama3-8B, and Llama3-70B, demonstrating its robustness and general applicability. Even when the same LLM is used for both data generation and fine-tuning, significant gains are observed, validating the effectiveness of GOAT’s call-first generation strategy.

These findings highlight GOAT as a practical and scalable solution for developing robust open-source LLM agents capable of complex reasoning and effective tool use. For more in-depth information, you can read the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
