TLDR: IFEval-FC is a new benchmark that evaluates how well Large Language Models (LLMs) follow precise formatting instructions embedded in function calling parameters. Unlike existing benchmarks, which focus on argument correctness, IFEval-FC embeds verifiable formatting instructions in the parameter descriptions of JSON schemas. Results show that even state-of-the-art LLMs like GPT-5 and Claude Opus 4.1 frequently fail to adhere to these basic formatting rules, highlighting a critical limitation for real-world AI agent applications. The benchmark is algorithmic, reproducible, and publicly available.
Large Language Models (LLMs) are becoming the brains behind AI agents, enabling them to interact with various tools and systems through a process called function calling. This capability is crucial for AI agents to perform complex tasks in the real world, from scheduling appointments to managing data. However, a new research paper highlights a significant challenge: even the most advanced LLMs often struggle to follow precise formatting instructions embedded within function parameters.
The paper, titled “Instruction-Following Evaluation in Function Calling for Large Language Models” and authored by Nikolai Skripko, introduces a new benchmark called IFEval-FC. It rigorously tests how well LLMs adhere to specific formatting rules, such as requiring a value to be enclosed in double quotes or a date to follow ISO format. Existing benchmarks focus on whether an LLM identifies the correct function to call and whether its arguments are generally correct, but they often overlook these subtle yet critical formatting details.
The Problem: Overlooked Formatting Details
Imagine an AI agent needing to book a flight. The function for booking might require the passenger’s name to start with a capital letter or a date to be in a specific “YYYY-MM-DD” format. If the LLM generating the function call fails to follow these seemingly simple rules, the entire process can break down, leading to errors and inefficiencies in real-world applications. The researchers found that these “simple” format instructions are frequently misinterpreted or ignored by LLMs, causing downstream failures in agent workflows.
Introducing IFEval-FC: A New Standard for Evaluation
Inspired by the IFEval benchmark for general instruction following, IFEval-FC specifically targets function calling. Its core innovation lies in embedding “verifiable instructions” directly into the description fields of parameters within JSON schemas. For example, a parameter description might state, “a value must not contain punctuation” or “must be lowercase.” These instructions are designed to be objectively verifiable through automated checks, ensuring the evaluation is unbiased, repeatable, and scalable.
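To make this concrete, here is a minimal sketch of what such a schema might look like. The function name, parameters, and description wording below are hypothetical illustrations, not taken from the benchmark itself:

```python
# Hypothetical function schema with a verifiable instruction embedded
# in a parameter's description (illustrative; not from the benchmark).
book_flight_schema = {
    "name": "book_flight",
    "description": "Book a flight for a passenger.",
    "parameters": {
        "type": "object",
        "properties": {
            "passenger_name": {
                "type": "string",
                # The verifiable instruction: checkable by a simple string test.
                "description": (
                    "Full name of the passenger. The value must be "
                    "entirely lowercase and contain no punctuation."
                ),
            },
            "departure_date": {
                "type": "string",
                "description": "Departure date in YYYY-MM-DD format.",
            },
        },
        "required": ["passenger_name", "departure_date"],
    },
}
```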
The benchmark includes 750 test cases. Each case involves a function with a specific format instruction for one of its input parameters and a corresponding user query. The evaluation process is entirely algorithmic, removing any reliance on human judgment or other LLMs for scoring.
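Because every instruction is objectively checkable, scoring reduces to running a deterministic verifier over each generated function call. The sketch below shows one plausible way such a check could work; the verifier name and test-case structure are assumptions for illustration, not the benchmark’s actual code:

```python
import string

def check_lowercase_no_punct(value: str) -> bool:
    """Verify the 'lowercase, no punctuation' instruction algorithmically."""
    return value == value.lower() and not any(
        c in string.punctuation for c in value
    )

# Each test case pairs a user query with the verifier for its instruction.
test_cases = [
    {
        "query": "Book a flight for John O'Hara on March 3rd",
        "param": "passenger_name",
        "verify": check_lowercase_no_punct,
    },
]

def score(model_call_args: dict, case: dict) -> bool:
    """A case passes only if the generated argument satisfies its verifier."""
    return case["verify"](model_call_args.get(case["param"], ""))
```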
How the Benchmark Was Built
The researchers developed IFEval-FC by identifying 19 distinct types of verifiable instructions, categorized into seven major groups (a few are sketched as code below):

- Keywords: keyword presence and letter frequency
- Language restrictions: responses in a specific script, such as Cyrillic or Greek
- Length constraints: word or sentence count
- Detectable content: postscripts, placeholders
- Detectable formats: JSON, Python list, spacing, title format
- Case requirements: all uppercase/lowercase, number of capitalized words
- Start/end phrases: quotations, specific endings, and punctuation such as comma frequency
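Several of these categories translate directly into simple string predicates. The following sketch shows plausible verifiers for three of them; the exact checks used by the benchmark may differ:

```python
import json

def is_valid_json(value: str) -> bool:
    """Detectable format: the value must parse as JSON."""
    try:
        json.loads(value)
        return True
    except json.JSONDecodeError:
        return False

def is_all_uppercase(value: str) -> bool:
    """Case requirement: every letter in the value must be uppercase."""
    return value == value.upper()

def word_count_at_most(value: str, limit: int) -> bool:
    """Length constraint: the value may contain at most `limit` words."""
    return len(value.split()) <= limit
```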
To create the dataset, some functions were adapted from existing benchmarks such as BFCL, while others were synthetically generated with GPT-5. A key requirement was that each function include a “free-form parameter” to which any format constraint could be applied. User queries were likewise generated with GPT-5 and designed to be natural and conversational while, importantly, never pre-satisfying the format. This ensures the LLM’s task is to transform the value into the required format, not simply pass it through.
A notable observation during development was that some models, particularly Anthropic’s Claude Opus 4.1, showed a higher refusal rate when uncertain about calling a function. To ensure fair evaluation, a strict system message was implemented, explicitly instructing models to “ALWAYS CALL A FUNCTION” and “NEVER ASK A USER TO SPECIFY OR CLARIFY ANYTHING.”
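In practice, that constraint can be expressed as an ordinary system message in the chat request. A minimal sketch follows: only the two capitalized directives are quoted from the paper, while the surrounding wording and the user query are illustrative assumptions:

```python
# Hedged sketch: the capitalized directives are quoted from the paper;
# the rest of the message wording is illustrative.
messages = [
    {
        "role": "system",
        "content": (
            "You are a function-calling assistant. "
            "ALWAYS CALL A FUNCTION. "
            "NEVER ASK A USER TO SPECIFY OR CLARIFY ANYTHING."
        ),
    },
    {"role": "user", "content": "Book a flight for John O'Hara on March 3rd."},
]
```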
Key Findings and Future Directions
The results from IFEval-FC indicate that while newer models perform better than their predecessors, no evaluated model achieved above 80% accuracy. Precisely following format instructions in function calling therefore remains a significant challenge for LLMs, even though the task is trivial for humans. State-of-the-art proprietary models such as GPT-5 and Claude Opus 4.1 frequently failed to adhere to basic formatting rules, underscoring a limitation for practical AI agent systems.
The researchers plan to expand the benchmark by increasing the number of available functions and by introducing more challenging scenarios in which models must first select the correct function from multiple options. Future work also includes multilingual support, drawing inspiration from M-IFEval, to assess cross-lingual instruction-following capabilities. The complete codebase and data for IFEval-FC are publicly available for further research and development.