Enhancing LLM Agent Reliability: A Fine-Grained Approach to Function Calling

TLDR: ToolPRM is a new framework that improves how large language models (LLMs) perform structured function calls. It uses a “process reward model” to evaluate each small step within a function call, rather than just the final outcome. This fine-grained supervision, combined with a specialized beam search strategy (“explore more but retain less”), leads to more accurate and reliable function calls, especially benefiting smaller LLMs and making them more suitable for on-device applications.

Large language models (LLMs) are becoming increasingly powerful as autonomous agents, capable of interacting with their environment through a mechanism called function calling. This allows them to use external tools, retrieve information, and perform actions, bridging their linguistic reasoning with real-world operations.

A key technique to boost LLM performance is “inference scaling,” which involves allocating more computational resources during the inference process. While this has been widely explored for generating unstructured text, its application to structured outputs, like the precise format required for function calls, has been largely overlooked. This gap is significant because the correctness and structure of function call outputs directly impact the reliability of the entire system.

Introducing ToolPRM: A New Approach to Function Calling

To address this gap, researchers have introduced ToolPRM, an inference scaling framework that pairs a fine-grained beam search with a process reward model of the same name. The reward model evaluates the internal steps of each individual function call. Unlike previous methods that treat an entire function call as a single, indivisible unit, ToolPRM breaks the call down into smaller, semantically meaningful steps.

These fine-grained steps include selecting the correct function name, identifying relevant parameters, and assigning the appropriate values to those parameters. By evaluating each of these sub-tasks, ToolPRM provides much more detailed feedback, allowing for better supervision of the function calling inference process.
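The decomposition described above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the call format (a dict with "name" and "arguments"), the step labels, and the helper decompose_call are all assumptions made for the example.

```python
import json

def decompose_call(call: dict) -> list[tuple[str, str]]:
    """Split one function call into (step_type, content) pairs.

    Hypothetical helper: the call layout and step names are assumptions
    for illustration, not the format used by ToolPRM.
    """
    # Step 1: choosing the function name.
    steps = [("function_name", call["name"])]
    # Then one pair of steps per argument: pick the parameter, fill its value.
    for param, value in call.get("arguments", {}).items():
        steps.append(("parameter_name", param))
        steps.append(("parameter_value", json.dumps(value)))
    return steps

example_call = {
    "name": "get_weather",
    "arguments": {"city": "Paris", "unit": "celsius"},
}
steps = decompose_call(example_call)
for step in steps:
    print(step)
```

Each pair is a point at which a step-level reward model could score the partial call, instead of waiting for the complete call.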

Training ToolPRM with Fine-Grained Supervision

To train ToolPRM, the team created the first fine-grained intra-call process supervision dataset. This dataset was automatically annotated using “function-masking techniques,” where function names and parameter identifiers are replaced with random strings. This encourages the model to understand the context and descriptions rather than just memorizing tool names, making it more robust and generalizable.
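A minimal sketch of function masking, assuming a simple tool schema: the schema layout, the helper names random_name and mask_tool, and the alias length are hypothetical choices for this example, not details taken from the paper.

```python
import random
import string

def random_name(rng: random.Random, length: int = 8) -> str:
    """Return a random lowercase identifier."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def mask_tool(tool: dict, call: dict, rng: random.Random):
    """Rename the function and its parameters, consistently in the tool
    schema and in a reference call to that tool.

    Assumes the call targets this tool and only uses declared parameters.
    Descriptions and argument values are left untouched, so the model has
    to rely on them rather than on memorized names.
    """
    fn_alias = random_name(rng)
    param_alias = {p: random_name(rng) for p in tool["parameters"]}
    masked_tool = {
        "name": fn_alias,
        "description": tool["description"],
        "parameters": {param_alias[p]: d for p, d in tool["parameters"].items()},
    }
    masked_call = {
        "name": fn_alias,
        "arguments": {param_alias[p]: v for p, v in call["arguments"].items()},
    }
    return masked_tool, masked_call

tool = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {"city": "Name of the city", "unit": "Temperature unit"},
}
call = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}

masked_tool, masked_call = mask_tool(tool, call, random.Random(0))
print(masked_tool)
print(masked_call)
```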

The annotation process assigns binary labels (correct or incorrect) to each step, such as whether the function name was chosen correctly, if a parameter-value pair was filled in accurately, or if the entire function call sequence was correct. This hierarchical step-level feedback helps the reward model learn and perform better.
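The labeling scheme can be illustrated with a short sketch that compares a predicted call against a reference call. The function label_steps and its rules are assumptions for the example; in particular, treating every later step as incorrect once the function name is wrong is a simplification chosen here, not a rule confirmed by the source.

```python
def label_steps(predicted: dict, reference: dict) -> list[tuple[str, int]]:
    """Assign a binary label (1 = correct, 0 = incorrect) to each step.

    Hypothetical labeling rules, for illustration only.
    """
    labels = []
    # Step-level label: was the right function chosen?
    name_ok = predicted.get("name") == reference["name"]
    labels.append(("function_name", int(name_ok)))

    # Step-level labels: is each filled parameter-value pair correct?
    ref_args = reference["arguments"]
    pred_args = predicted.get("arguments", {})
    for param, value in pred_args.items():
        pair_ok = name_ok and param in ref_args and ref_args[param] == value
        labels.append((f"param:{param}", int(pair_ok)))

    # Call-level label: is the whole call correct?
    call_ok = name_ok and pred_args == ref_args
    labels.append(("full_call", int(call_ok)))
    return labels

reference = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
predicted = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}

labels = label_steps(predicted, reference)
print(labels)
```

In this example the function name and the city are labeled correct, while the unit and the full call are labeled incorrect, which shows how step-level labels localize the error that an outcome-only label would merely flag.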

The “Explore More But Retain Less” Principle

A crucial insight from this research is a new principle for applying inference scaling to structured outputs: “explore more but retain less.” In unstructured tasks like mathematical reasoning, errors can often be corrected later. Therefore, maintaining a diverse set of candidate reasoning paths (a larger number of active “beams” in a beam search) can be beneficial.

However, for structured outputs like function calls, an early mistake (e.g., a wrong function name) can invalidate the entire subsequent trajectory, making it unrecoverable. In such cases, retaining many incorrect candidates wastes computational resources. ToolPRM’s approach is to generate more candidate continuations at each decision point (explore more) while keeping only a small number of active beams after each step (retain less), pruning unpromising or incorrect candidates based on its step-wise scores. This focuses computational resources on valid, high-quality structured outputs.
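The trade-off can be seen in a toy step-level beam search with two separate knobs: how many continuations are proposed per beam and how many beams are kept after scoring. Everything here is a stand-in: toy_propose plays the role of the policy model, toy_score plays the role of a step-wise reward model, and the option lists are invented for the example.

```python
from typing import Callable, Sequence

def step_beam_search(
    propose: Callable[[tuple, int], Sequence[str]],
    score: Callable[[tuple], float],
    num_steps: int,
    expand_per_beam: int,
    keep: int,
) -> list[tuple]:
    """Step-level beam search with separate 'explore' and 'retain' settings."""
    beams = [()]
    for _ in range(num_steps):
        # Explore: every active beam proposes several continuations.
        candidates = [
            beam + (step,)
            for beam in beams
            for step in propose(beam, expand_per_beam)
        ]
        # Retain: keep only the highest-scoring partial calls.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:keep]
    return beams

# Toy policy: a fixed ranking of options per step; the correct option is
# deliberately ranked last so that narrow exploration misses it.
OPTIONS = [
    ["get_time", "get_news", "get_weather"],  # step 0: function name
    ["unit", "date", "city"],                 # step 1: parameter name
    ["London", "Paris"],                      # step 2: parameter value
]
TARGET = ("get_weather", "city", "Paris")

def toy_propose(prefix: tuple, n: int) -> Sequence[str]:
    return OPTIONS[len(prefix)][:n]

def toy_score(prefix: tuple) -> float:
    # Stand-in for a step-wise reward model: fraction of correct steps.
    return sum(a == b for a, b in zip(prefix, TARGET)) / len(prefix)

narrow = step_beam_search(toy_propose, toy_score, 3, expand_per_beam=2, keep=2)
wide = step_beam_search(toy_propose, toy_score, 3, expand_per_beam=3, keep=1)
print("explore 2, keep 2:", narrow)
print("explore 3, keep 1:", wide)
```

In this toy setting, proposing three options and keeping one beam recovers the target call, while proposing two and keeping two never sees the correct function name and spends later steps extending beams that cannot be repaired. Real behavior depends on the quality of the policy and of the reward model.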

Experimental Validation and Impact

Extensive experiments demonstrated that ToolPRM significantly outperforms coarse-grained and outcome-based reward models in terms of predictive accuracy. When integrated with inference scaling techniques, ToolPRM substantially improves the performance of backbone models across various function calling tasks and benchmarks.

Notably, ToolPRM provides a larger performance boost for smaller policy models. For example, a 1.5B model augmented with ToolPRM can achieve performance comparable to a baseline 3B model, and a 3B model with ToolPRM can match a baseline 7B model. This makes ToolPRM particularly valuable for on-device inference, where computational resources are limited and function calling is frequently used.

This research advances our understanding and practical application of scalable, trustworthy, and fine-grained reasoning mechanisms for LLM agents performing structured function calls. For more details, see the full research paper.
