
Empowering Open-Source LLMs for Complex Tool Use with GOAT

TLDR: GOAT is a novel training framework that automatically generates synthetic datasets for goal-oriented API execution tasks from API documentation, eliminating the need for human annotation. It enables fine-tuning of large language models (LLMs) so they can decompose high-level objectives into interdependent API calls with correct planning and execution. Experiments show that GOAT-trained open-source agents achieve state-of-the-art performance on goal-oriented benchmarks, often outperforming closed-source models. The work also introduces GOATBench as a new evaluation benchmark.

Large language models, or LLMs, have made incredible strides in understanding and generating human-like text. More recently, they have begun to act as interactive agents that use external tools, such as APIs, to respond to user requests. However, a significant challenge remains: these LLM agents often struggle with “goal-oriented” queries. These are complex requests that require the agent to break a high-level objective into many smaller, interconnected steps, plan the correct sequence of actions, and execute them through a series of API calls.
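
To make this concrete, here is a minimal runnable sketch of what a goal-oriented query looks like once decomposed into interdependent API calls. The query, the API names, and their outputs are all hypothetical, invented purely for illustration:

```python
# Hypothetical goal-oriented query:
#   "Find a highly rated Italian restaurant near my hotel and book a table for two."
# Stub functions stand in for real tool endpoints; the point is the data dependencies.

def get_hotel_booking(user_id):
    return {"hotel": "Grand Plaza", "coordinates": (40.75, -73.99)}

def search_restaurants(cuisine, near, sort_by):
    return [{"id": "r42", "name": "Trattoria Roma", "rating": 4.8}]

def book_table(restaurant_id, party_size):
    return {"status": "confirmed", "restaurant_id": restaurant_id}

# The agent must plan and execute these in order: each call consumes a prior output.
booking = get_hotel_booking(user_id="u123")                               # step 1
nearby = search_restaurants("italian", booking["coordinates"], "rating")  # step 2 uses step 1
confirmation = book_table(nearby[0]["id"], party_size=2)                  # step 3 uses step 2
print(confirmation)  # {'status': 'confirmed', 'restaurant_id': 'r42'}
```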

The main hurdle for improving LLM agents in this area is the scarcity of training data. Creating datasets that accurately capture the intricate dependencies between API calls usually demands extensive human annotation, which is both time-consuming and expensive. While powerful proprietary models like GPT-4 show strong reasoning abilities, smaller, open-source models often fall short when faced with complex tool-use scenarios.

To address this critical gap, researchers have introduced a new training framework called GOAT (Goal-Oriented Agent with Tools). This innovative framework allows for the fine-tuning of LLM agents without the need for human annotation. GOAT automatically builds synthetic datasets for goal-oriented API execution tasks directly from existing API documentation. This process equips models with the ability to reason about interdependent API calls and generate coherent responses.

How GOAT Works

GOAT’s approach is quite clever. It starts with API documentation, which is usually readily available for any set of target APIs, and uses it to construct a detailed API dependency graph. This graph maps out all the ways the output of one API can serve as an input to another. To ensure accuracy, the initial graph undergoes a rigorous three-step filtering process (a minimal code sketch of the pipeline follows the list):

  • Embedding Similarity: Unlikely connections are quickly pruned by comparing the semantic descriptions of API outputs and inputs.
  • LLM Filtering: A large language model then semantically evaluates the remaining connections, determining if an output can meaningfully populate an input.
  • API Call Execution: Finally, the most crucial step involves actually executing API calls with plausible arguments and verifying if the output of one call can successfully be used as input for the next. This grounds the dependencies in real-world execution.
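
Here is a minimal sketch of how such a filtering pipeline could be wired together. The helper names, the similarity threshold, and the choice of a sentence-transformers embedding model are all assumptions made for illustration; the paper’s actual implementation may differ:

```python
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; the paper's actual choice may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def ask_llm(prompt):
    """Placeholder for a real chat-completion call; returns a canned
    answer so the sketch runs end to end."""
    return "yes"

def embedding_filter(output_desc, input_desc, threshold=0.5):
    """Step 1: prune unlikely edges by comparing field descriptions."""
    emb = model.encode([output_desc, input_desc])
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def llm_filter(output_desc, input_desc):
    """Step 2: ask an LLM whether the output can meaningfully fill the input."""
    prompt = (f"Can a value described as '{output_desc}' be used as the "
              f"argument described as '{input_desc}'? Answer yes or no.")
    return ask_llm(prompt).strip().lower().startswith("yes")

def execution_filter(api_a, api_b, plausible_args, feed):
    """Step 3: ground the edge in real execution: call api_a, thread its
    output into api_b via `feed`, and keep the edge only if both succeed."""
    try:
        out = api_a(**plausible_args)
        api_b(**feed(out))
        return True
    except Exception:
        return False

def keep_edge(api_a, api_b, output_desc, input_desc, plausible_args, feed):
    """An edge survives only if it passes all three filters in sequence."""
    return (embedding_filter(output_desc, input_desc)
            and llm_filter(output_desc, input_desc)
            and execution_filter(api_a, api_b, plausible_args, feed))
```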

Once a reliable dependency graph is established, GOAT generates goal-oriented task samples. It extracts connected sequences of API calls from the graph, instantiates and executes them, and then creates natural language “sub-queries” for each step. Crucially, it then generates a high-level user query that encapsulates the overall task and a final natural language response that interprets the API outputs in context. This “call-first” strategy is a key innovation, as it’s much easier for an LLM to summarize executed API calls into a user query than to infer complex API calls from a high-level query, providing more reliable training signals.
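
The call-first loop might look roughly like the toy example below. Everything here (the graph, the stub APIs, the canned summarizer) is a stand-in invented for this sketch, not the paper’s actual code:

```python
# Two stub APIs where the first call's output feeds the second call's input.
def get_user(uid):
    return {"uid": uid, "city": "Berlin"}

def get_weather(city):
    return {"city": city, "forecast": "sunny", "temp_c": 21}

# A tiny dependency graph encoding that relationship.
GRAPH = {
    "get_user": {"fn": get_user, "args": {"uid": "u1"}, "next": "get_weather"},
    "get_weather": {"fn": get_weather, "feed": lambda prev: {"city": prev["city"]}, "next": None},
}

def summarize(prompt):
    # Placeholder for an LLM call; returns canned text so the sketch runs.
    return f"[LLM summary of: {prompt[:60]}...]"

def generate_sample(start="get_user"):
    trace, prev, node = [], None, start
    while node:                                          # walk a connected path
        spec = GRAPH[node]
        args = spec.get("args") or spec["feed"](prev)    # thread outputs into inputs
        prev = spec["fn"](**args)                        # execute the call for real
        sub_query = summarize(f"user request for {node}({args})")  # per-step sub-query
        trace.append({"api": node, "args": args, "output": prev, "sub_query": sub_query})
        node = spec["next"]
    # Call-first: only after execution does the LLM write the high-level
    # query and the final response, grounded in the executed trace.
    query = summarize(f"one high-level request covering {[t['sub_query'] for t in trace]}")
    response = summarize(f"answer the request using {trace}")
    return {"query": query, "steps": trace, "response": response}

print(generate_sample())
```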

Introducing GOATBench

Beyond the training framework, the GOAT team also introduced GOATBench, a new evaluation benchmark specifically designed for goal-oriented API execution tasks. This benchmark, built using GOAT’s data generation pipeline and human verification, helps assess how well agents can handle tasks requiring planning and invoking sequences of interconnected APIs. It includes tasks categorized as “Single Tool” (multiple APIs from the same tool) and “Inter Tool” (APIs across different tools), testing various aspects of an agent’s reasoning capabilities.
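
For intuition, a GOATBench-style “Inter Tool” task might look roughly like the record below. This schema is a guess made for illustration; the article does not specify the benchmark’s actual format:

```python
# A guessed, simplified record shape for a GOATBench-style task.
example_task = {
    "category": "Inter Tool",  # the gold API sequence spans different tools
    "query": ("Get tomorrow's weather for the city hosting the next F1 race "
              "and convert the temperature to Fahrenheit."),
    "gold_api_sequence": [
        {"tool": "motorsport", "api": "get_next_race", "args": {}},
        {"tool": "weather", "api": "get_forecast",
         "args": {"city": "<output of get_next_race.city>", "days": 1}},
        {"tool": "units", "api": "celsius_to_fahrenheit",
         "args": {"value": "<output of get_forecast.temp_c>"}},
    ],
    "reference_response": "Tomorrow in <city> it will be about <temp_f> °F.",
}
```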

Impressive Results

Extensive experiments on GOATBench and other existing goal-oriented benchmarks like RestBench and API-Bank have shown remarkable results. Agents trained with GOAT achieve state-of-the-art performance among open-source models. In some instances, they even surpass certain closed-source models known for their strong reasoning abilities. The framework consistently boosts performance across different LLM backbones, including Qwen2-7B, Llama3-8B, and Llama3-70B, demonstrating its robustness and general applicability. Even when the same LLM is used for both data generation and fine-tuning, significant gains are observed, validating the effectiveness of GOAT’s call-first generation strategy.

These findings highlight GOAT as a practical and scalable solution for developing robust open-source LLM agents capable of complex reasoning and effective tool use. For more in-depth information, you can read the full research paper here.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
