TLDR: IFEval-FC is a new benchmark that evaluates how well Large Language Models (LLMs) follow precise formatting instructions embedded in function calling parameters. Unlike existing benchmarks, which focus on argument correctness, IFEval-FC embeds verifiable formatting instructions in the parameter descriptions of JSON schemas. Results show that even state-of-the-art LLMs like GPT-5 and Claude Opus 4.1 frequently fail to adhere to these basic formatting rules, highlighting a critical limitation for real-world AI agent applications. The benchmark is algorithmic, reproducible, and publicly available.
Large Language Models (LLMs) are becoming the brains behind AI agents, enabling them to interact with various tools and systems through a process called function calling. This capability is crucial for AI agents to perform complex tasks in the real world, from scheduling appointments to managing data. However, a new research paper highlights a significant challenge: even the most advanced LLMs often struggle to follow precise formatting instructions embedded within function parameters.
The paper, titled “Instruction-Following Evaluation in Function Calling for Large Language Models” and authored by Nikolai Skripko, introduces a new benchmark called IFEval-FC. It rigorously tests how well LLMs adhere to specific formatting rules, such as requiring a value to be enclosed in double quotes or a date to follow ISO format. Existing benchmarks focus on whether an LLM identifies the correct function to call and whether its arguments are generally correct, but they often overlook these subtle yet critical formatting details.
The Problem: Overlooked Formatting Details
Imagine an AI agent needing to book a flight. The function for booking might require the passenger’s name to start with a capital letter or a date to be in a specific “YYYY-MM-DD” format. If the LLM generating the function call fails to follow these seemingly simple rules, the entire process can break down, leading to errors and inefficiencies in real-world applications. The researchers found that these “simple” format instructions are frequently misinterpreted or ignored by LLMs, causing downstream failures in agent workflows.
Introducing IFEval-FC: A New Standard for Evaluation
Inspired by the IFEval benchmark for general instruction following, IFEval-FC specifically targets function calling. Its core innovation lies in embedding “verifiable instructions” directly into the description fields of parameters within JSON schemas. For example, a parameter description might state, “a value must not contain punctuation” or “must be lowercase.” These instructions are designed to be objectively verifiable through automated checks, ensuring the evaluation is unbiased, repeatable, and scalable.
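To make this concrete, here is a minimal sketch of what such a schema might look like. The function name, parameters, and description wording below are hypothetical illustrations, not taken from the benchmark itself:

```python
# Hypothetical function schema with a verifiable instruction embedded
# in a parameter's description (illustrative; not from the benchmark).
book_flight_schema = {
    "name": "book_flight",
    "description": "Book a flight for a passenger.",
    "parameters": {
        "type": "object",
        "properties": {
            "passenger_name": {
                "type": "string",
                # The verifiable instruction: checkable by a simple string test.
                "description": (
                    "Full name of the passenger. The value must be "
                    "entirely lowercase and contain no punctuation."
                ),
            },
            "departure_date": {
                "type": "string",
                "description": "Departure date in YYYY-MM-DD format.",
            },
        },
        "required": ["passenger_name", "departure_date"],
    },
}
```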
The benchmark includes 750 test cases. Each case involves a function with a specific format instruction for one of its input parameters and a corresponding user query. The evaluation process is entirely algorithmic, removing any reliance on human judgment or other LLMs for scoring.
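Because every instruction is objectively checkable, scoring reduces to running a deterministic verifier over each generated function call. The sketch below shows one plausible way such a check could work; the verifier name and test-case structure are assumptions for illustration, not the benchmark’s actual code:

```python
import string

def check_lowercase_no_punct(value: str) -> bool:
    """Verify the 'lowercase, no punctuation' instruction algorithmically."""
    return value == value.lower() and not any(
        c in string.punctuation for c in value
    )

# Each test case pairs a user query with the verifier for its instruction.
test_cases = [
    {
        "query": "Book a flight for John O'Hara on March 3rd",
        "param": "passenger_name",
        "verify": check_lowercase_no_punct,
    },
]

def score(model_call_args: dict, case: dict) -> bool:
    """A case passes only if the generated argument satisfies its verifier."""
    return case["verify"](model_call_args.get(case["param"], ""))
```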
How the Benchmark Was Built
The researchers developed IFEval-FC by identifying 19 distinct types of verifiable instructions, categorized into seven major groups (a few are sketched as code below):

- Keywords: keyword presence and letter frequency
- Language restrictions: responses in a specific script, such as Cyrillic or Greek
- Length constraints: word or sentence count
- Detectable content: postscripts, placeholders
- Detectable formats: JSON, Python list, spacing, title format
- Case requirements: all uppercase/lowercase, number of capitalized words
- Start/end phrases: quotations, specific endings, and punctuation such as comma frequency
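Several of these categories translate directly into simple string predicates. The following sketch shows plausible verifiers for three of them; the exact checks used by the benchmark may differ:

```python
import json

def is_valid_json(value: str) -> bool:
    """Detectable format: the value must parse as JSON."""
    try:
        json.loads(value)
        return True
    except json.JSONDecodeError:
        return False

def is_all_uppercase(value: str) -> bool:
    """Case requirement: every letter in the value must be uppercase."""
    return value == value.upper()

def word_count_at_most(value: str, limit: int) -> bool:
    """Length constraint: the value may contain at most `limit` words."""
    return len(value.split()) <= limit
```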
To create the dataset, some functions were adapted from existing benchmarks such as BFCL, while others were synthetically generated with GPT-5. A key requirement was that each function include a “free-form parameter” to which any format constraint could be applied. User queries were likewise generated with GPT-5 and designed to be natural and conversational while, importantly, never pre-satisfying the format. This ensures the LLM’s task is to transform the value into the required format, not simply pass it through.
A notable observation during development was that some models, particularly Anthropic’s Claude Opus 4.1, showed a higher refusal rate when uncertain about calling a function. To ensure fair evaluation, a strict system message was implemented, explicitly instructing models to “ALWAYS CALL A FUNCTION” and “NEVER ASK A USER TO SPECIFY OR CLARIFY ANYTHING.”
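In practice, that constraint can be expressed as an ordinary system message in the chat request. A minimal sketch follows: only the two capitalized directives are quoted from the paper, while the surrounding wording and the user query are illustrative assumptions:

```python
# Hedged sketch: the capitalized directives are quoted from the paper;
# the rest of the message wording is illustrative.
messages = [
    {
        "role": "system",
        "content": (
            "You are a function-calling assistant. "
            "ALWAYS CALL A FUNCTION. "
            "NEVER ASK A USER TO SPECIFY OR CLARIFY ANYTHING."
        ),
    },
    {"role": "user", "content": "Book a flight for John O'Hara on March 3rd."},
]
```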
Key Findings and Future Directions
The results from IFEval-FC indicate that while newer models perform better than their predecessors, no evaluated model achieved above 80% accuracy. Precisely following format instructions in function calling therefore remains a significant challenge for LLMs, even though the task is trivial for humans. State-of-the-art proprietary models such as GPT-5 and Claude Opus 4.1 frequently failed to adhere to basic formatting rules, underscoring a limitation for practical AI agent systems.
The researchers plan to expand the benchmark by increasing the number of available functions and by introducing more challenging scenarios in which models must first select the correct function from multiple options. Future work also includes multilingual support, drawing inspiration from M-IFEval, to assess cross-lingual instruction-following capabilities. The complete codebase and data for IFEval-FC are publicly available for further research and development.