TL;DR: This research introduces a framework that uses structured reasoning templates and a specialized dataset (ToolGT) to improve how large language models (LLMs) make function calls and explain their decisions. By guiding LLMs through deliberate, step-by-step instructions, the method reduces errors, enhances robustness, and increases the interpretability of tool-using AI agents, outperforming traditional Chain-of-Thought approaches on multiple benchmarks. The study demonstrates that both structured prompting and fine-tuning with these templates significantly boost accuracy and transparency across different models.
Large Language Models (LLMs) have shown impressive reasoning and tool-use abilities, but they often struggle when interacting with real-world tools. Common issues include selecting the wrong tool, using incorrect parameters, or misunderstanding what a user wants. These problems often arise because LLMs don’t fully grasp user goals or tool documentation.
While Chain-of-Thought (CoT) prompting has been effective for general reasoning, this research found that free-form CoT isn’t always sufficient for tasks that involve structured function calling, and can even be counterproductive.
To tackle these challenges, a new framework has been introduced that uses structured reasoning templates. These templates guide LLMs through a more deliberate, step-by-step process for generating function calls. The experimental results indicate that this method significantly reduces errors in tool use, showing a 3–12% relative improvement over existing strong methods across different model types and approaches. Furthermore, this framework makes tool-using AI agents more robust, easier to understand, and more transparent, paving the way for more reliable AI assistants in practical applications.
The Core Problem: LLMs as ‘Black Boxes’
Despite their advanced capabilities, current LLMs frequently make mistakes in function calls. These errors can include incorrect parameterization, poor tool selection, or misinterpreting user intent, sometimes even leading to ‘hallucinations’ where the model invents information. Such failures are critical in real-world applications where accuracy is paramount for safety and trust.
Another significant issue is the ‘black-box’ nature of many LLMs when generating function calls. They often provide no explanation for why they chose a particular function, what parameter values they selected, or what the expected outcome is. This lack of explainability makes debugging difficult and hinders human oversight, which is crucial in sensitive areas like healthcare and finance. Without clear reasoning, it’s hard for people to verify if the tool usage is appropriate, increasing the risk of serious errors.
A New Approach: Structured Guidance
The researchers propose a template-based reasoning framework for function calling that structures the LLM’s thought process according to task demands and tool specifications. This template systematically guides models through critical sub-tasks, mimicking how humans solve problems. Initial experiments showed that fixed structured templates improved accuracy, but still had limitations in formatting, logical consistency, and functional correctness.
To overcome these limitations, a pipeline called ToolGT was developed. This pipeline constructs a synthetic fine-tuning dataset that systematically encodes reasoning patterns using structured templates. The dataset is specifically designed to teach models to maintain correct formatting, execute step-by-step analytical reasoning, and produce outputs that precisely align with API specifications, directly addressing the weaknesses observed in earlier template-prompting methods.
Key Contributions
The work makes two main contributions:
- Template-Based Reasoning: An explicit prompting template guides LLMs through essential stages of function calling, including understanding the tool, extracting parameters, converting implicit values, and meeting other task-specific requirements.
- Structured Reasoning Dataset: An approach for building the Guided-Template structured reasoning dataset (ToolGT) that effectively trains models to improve accuracy and transparency across various tasks and model architectures.
The researchers argue that providing LLMs with curriculum-style reasoning templates leads to more reliable and generalizable tool use. Instead of relying solely on unconstrained Chain-of-Thought reasoning, adaptive and context-specific structures help models better align with user intent, execute accurate function calls, and provide interpretable justifications.
How It Works: Methodology
The framework has two main parts: prompting strategies, and fine-tuning strategies built on a new data-construction method.
The function calling task is extended to include a structured reasoning chain, which provides an interpretable, step-by-step justification for identifying, selecting, examining, and parameterizing functions. This enhances transparency and reliability.
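To make this concrete, here is a minimal sketch of what such a reasoning-annotated function call could look like as a data structure. The field and stage names below are illustrative assumptions, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    """One stage of the structured reasoning chain (stage names are illustrative)."""
    stage: str      # e.g. "identify_functions", "extract_parameters"
    rationale: str  # the model's written justification for this stage

@dataclass
class FunctionCallOutput:
    """Extended task output: the final call plus the chain that justifies it."""
    reasoning_chain: list[ReasoningStep] = field(default_factory=list)
    function_name: str = ""
    arguments: dict = field(default_factory=dict)

# A traceable call rather than a bare, unexplained one:
output = FunctionCallOutput(
    reasoning_chain=[
        ReasoningStep("identify_functions", "User asks for weather; get_weather matches."),
        ReasoningStep("extract_parameters", "City 'Paris' is stated; units default to metric."),
    ],
    function_name="get_weather",
    arguments={"city": "Paris", "units": "metric"},
)
```

Pairing every call with its chain is what lets a human reviewer audit the model's choices after the fact.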
For prompting, a structured methodology guides LLMs through clearly defined reasoning steps. Unlike simple CoT, this method uses a structured template to enforce discrete reasoning stages. The template includes steps like identifying functions, deciding on relevancy, examining documentation, extracting and validating parameters, converting parameter types, drafting the function, and revalidating the call.
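As a hedged illustration of how such discrete stages might be enforced in practice, the sketch below assembles a prompt that walks the model through the steps named above. The exact wording of the paper's template is not reproduced here; the stage list and helper function are assumptions for demonstration:

```python
# Hedged sketch of a stage-enforcing prompt; the paper's actual template wording may differ.
TEMPLATE_STAGES = [
    "1. Identify candidate functions relevant to the user request.",
    "2. Decide whether a function call is actually needed (relevancy check).",
    "3. Examine the documentation of the chosen function.",
    "4. Extract parameter values from the request and validate them.",
    "5. Convert implicit or mistyped values to the documented parameter types.",
    "6. Draft the function call.",
    "7. Revalidate the drafted call against the API specification.",
]

def build_structured_prompt(user_request: str, tool_docs: str) -> str:
    """Assemble a prompt that forces the model through each stage in order."""
    stages = "\n".join(TEMPLATE_STAGES)
    return (
        f"Available tools:\n{tool_docs}\n\n"
        f"User request: {user_request}\n\n"
        "Work through the following stages explicitly, labeling each one, "
        "before emitting the final function call:\n"
        f"{stages}"
    )
```

Unlike free-form CoT, the model cannot skip a stage without it being visible in the output, which is what makes failures easier to localize.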
For fine-tuning, a high-quality Guided-Template dataset (ToolGT) is constructed. This process involves using an existing tool-oriented dataset (ToolACE), converting multi-turn dialogues into single-turn samples, and then using advanced LLMs (like GPT-4o-mini) to generate step-by-step reasoning chains guided by the template. These reasoning chains are then validated through a two-stage verification process: manual checks (Exact Match and Abstract Syntax Tree) and LLM-based verification to ensure high data quality.
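The AST check in particular is straightforward to sketch. Assuming the function calls are represented as Python call expressions (an assumption for illustration; the paper's exact verification code is not shown), a structural comparison might look like this:

```python
import ast

def calls_match(generated: str, reference: str) -> bool:
    """Check whether two call strings invoke the same function with the same
    arguments at the AST level, ignoring surface formatting. A simplified
    stand-in for the Exact Match + AST verification stage described above."""
    try:
        gen = ast.parse(generated, mode="eval").body
        ref = ast.parse(reference, mode="eval").body
    except SyntaxError:
        return False  # a malformed call fails verification outright
    if not (isinstance(gen, ast.Call) and isinstance(ref, ast.Call)):
        return False
    # ast.dump ignores whitespace/formatting but is sensitive to keyword order.
    return ast.dump(gen) == ast.dump(ref)

# Formatting differences are tolerated; structural differences are not:
assert calls_match("get_weather(city='Paris', units='metric')",
                   "get_weather( city = 'Paris' , units = 'metric' )")
assert not calls_match("get_weather(city='Paris')", "get_weather(city='Rome')")
```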
Performance and Insights
Experiments were conducted on standard benchmarks like BFCLv2 and Nexus, using a variety of closed- and open-source models. The results consistently showed that template-based prompting often led to better performance than both ‘No Thought’ (direct function calls) and traditional CoT approaches.
For instance, models like GPT-4o-FC and LLaMA-3-70B-Instruct achieved their best performance with template prompting. Even for models like Qwen-2.5-14B-Instruct, where adding reasoning steps sometimes lowered performance relative to direct function calling, template prompting still performed significantly better than CoT, maintaining interpretability without sacrificing accuracy.
Interestingly, smaller models like Mistral-7B-Instruct-v0.3 initially struggled with template prompting due to difficulties in following structured formats. However, after template-based fine-tuning, these models showed significant improvements, highlighting that dedicated training is crucial for models to effectively utilize structured reasoning.
The research also explored the impact of template complexity, finding that a detailed template generally achieved the highest accuracy, though simpler templates could sometimes be better for specific subtasks. A limitation identified was that the current training datasets might not adequately cover complex, nested function-call scenarios, which can lead to performance degradation in such cases after fine-tuning.
This work lays a strong foundation for future research in structured reasoning and advanced tool integration for the next generation of LLM agents. For more technical details, you can refer to the full research paper here.


