Assessing Large Language Models' Proficiency in Sequential API Calls

TLDR: StateGen is an automated framework that generates diverse, executable coding tasks involving sequential API interactions. It uses state machines, energy-based sampling, and control-flow injection to create programs, which are then translated into natural language tasks by LLM agents. This framework is used to build StateEval, a new benchmark of 120 verified test cases across three scenarios (Session Service, Tensor Operation, ElevenLabs MCP). Evaluations show that StateGen effectively creates challenging tasks, and current LLMs, especially open-source ones, have significant room for improvement in handling complex, interdependent API calls.

Large Language Models (LLMs) have significantly expanded their capabilities by integrating external tools through APIs, enabling them to tackle complex real-world tasks. However, a major challenge lies in effectively testing and evaluating how well these LLMs use tools, especially when multiple API calls need to happen in a specific sequence. Existing evaluation methods often rely on manually created test cases, which are difficult to scale and frequently miss the intricate interactions that occur in real-world sequential API usage.

To address this gap, researchers introduced StateGen, an automated framework that generates diverse coding tasks involving sequential API interactions. StateGen combines three techniques: state-machine-based API constraint solving and validation, energy-based sampling to diversify the generated tasks, and control-flow injection to produce more realistic executable programs. The generated programs are then translated into natural language task descriptions through collaboration between two LLM agents.

Using StateGen, the researchers constructed StateEval, a benchmark of 120 verified test cases. The cases span three representative scenarios: a Session Service (mimicking RESTful API calls), Tensor Operations (involving complex data manipulation in deep learning frameworks like PyTorch), and ElevenLabs MCP (demonstrating LLM tool calling for speech processing). Experimental results indicate that StateGen generates challenging and realistic API-oriented tasks, and they highlight where current LLMs that incorporate APIs can improve.

Why StateGen and StateEval are Important

The paper emphasizes that in real-world software development, tasks often require LLMs to analyze requirements, understand API functionalities, and then orchestrate multiple APIs in the correct sequence with appropriate inputs. This multi-step process demands advanced reasoning, management, planning, and tool-calling abilities from LLMs. Traditional benchmarks often fall short by focusing on general, small-scale coding tasks or simple API calls without interdependencies. StateEval, built with StateGen, aims to fill this void by providing a systematic and scalable way to assess LLMs’ ability to handle complex instructions and generate stateful programs.

How StateGen Works Under the Hood

StateGen employs a “reverse-generation” strategy. It starts by creating valid, executable sequences of API calls (called ‘traces’). These traces form the backbone for constructing more intricate executable programs by adding control flow structures like ‘if-else’ branches, which are common in real-world code. A crucial part of StateGen is its TraceGenerator, which maintains a state schema to track all relevant program states and ensures that each generated API sequence is valid and executable. To maximize diversity, it uses an energy-based sampling strategy, prioritizing the exploration of less frequent API transition pairs.
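The trace-generation idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy API model (`login`, `open`, `write`, `close`, `logout`), the state fields, and the energy function `exp(-count / temperature)` are hypothetical and are not taken from the paper. The sketch only shows the general pattern of tracking state to keep each call valid while down-weighting API transition pairs that have already been sampled often.

```python
import math
import random
from collections import Counter

# Hypothetical toy API model: name -> (precondition, effect) over a state dict.
# None of these names come from the paper; they only illustrate the pattern.
API_MODEL = {
    "login": (lambda s: not s["logged_in"],
              lambda s: {**s, "logged_in": True}),
    "open": (lambda s: s["logged_in"] and s["open"] < 2,
             lambda s: {**s, "open": s["open"] + 1}),
    "write": (lambda s: s["open"] > 0,
              lambda s: s),
    "close": (lambda s: s["open"] > 0,
              lambda s: {**s, "open": s["open"] - 1}),
    "logout": (lambda s: s["logged_in"] and s["open"] == 0,
               lambda s: {**s, "logged_in": False}),
}

INITIAL_STATE = {"logged_in": False, "open": 0}


def valid_calls(state):
    """Return the API names whose precondition holds in the given state."""
    return [name for name, (pre, _) in API_MODEL.items() if pre(state)]


def sample_trace(length, pair_counts, rng, temperature=1.0):
    """Sample one valid trace, favoring rarely seen (previous, next) pairs.

    Assumed energy: the number of times a pair has been sampled so far.
    Lower energy means a higher sampling weight.
    """
    state, trace, prev = dict(INITIAL_STATE), [], "<start>"
    for _ in range(length):
        candidates = valid_calls(state)
        if not candidates:
            break
        weights = [math.exp(-pair_counts[(prev, c)] / temperature)
                   for c in candidates]
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        pair_counts[(prev, choice)] += 1
        state = API_MODEL[choice][1](state)  # apply the call's effect
        trace.append(choice)
        prev = choice
    return trace


def is_valid(trace):
    """Replay a trace from the initial state and check every precondition."""
    state = dict(INITIAL_STATE)
    for name in trace:
        pre, effect = API_MODEL[name]
        if not pre(state):
            return False
        state = effect(state)
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter()  # shared across traces so diversity accumulates
    for _ in range(3):
        print(sample_trace(6, counts, rng))
```

Sharing `pair_counts` across calls is the design choice that matters here: each new trace is steered away from transitions that earlier traces already covered.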

The generated programs are then transformed into natural language instructions using a multi-agent system. A ‘generator agent’ creates initial descriptions, and an ‘evaluator agent’ provides feedback to refine these descriptions, ensuring they are unambiguous, natural, and non-redundant. This iterative negotiation process helps produce high-quality test inputs for the LLMs under evaluation. Finally, to obtain accurate test oracles, StateGen executes the generated programs in a local environment, recording state transitions and variable values as ground truth for evaluation.
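The oracle step described above, running a generated program locally and recording what happened, might look roughly like the following sketch. The `SessionService` class, its methods, and the sample program are hypothetical stand-ins invented for this example; the paper's actual environment and recording format are not described in this article.

```python
import copy


class SessionService:
    """Hypothetical stand-in for an API under test; logs every state change."""

    def __init__(self):
        self.sessions = {}
        self.log = []  # list of (call description, state snapshot) pairs

    def _record(self, call):
        self.log.append((call, copy.deepcopy(self.sessions)))

    def create(self, sid):
        self.sessions[sid] = {}
        self._record(f"create({sid!r})")

    def put(self, sid, key, value):
        self.sessions[sid][key] = value
        self._record(f"put({sid!r}, {key!r}, {value!r})")

    def get(self, sid, key):
        value = self.sessions[sid][key]
        self._record(f"get({sid!r}, {key!r})")
        return value


def build_oracle(program_source):
    """Execute a program against a fresh service and capture ground truth."""
    service = SessionService()
    namespace = {"api": service}
    exec(program_source, namespace)
    # Keep only variables the program itself defined.
    variables = {k: v for k, v in namespace.items()
                 if k not in ("api", "__builtins__")}
    return {
        "transitions": service.log,
        "variables": variables,
        "final_state": service.sessions,
    }


# Illustrative generated program with one injected if-branch.
PROGRAM = """
api.create("s1")
api.put("s1", "user", "alice")
name = api.get("s1", "user")
if name == "alice":
    api.put("s1", "role", "admin")
"""

if __name__ == "__main__":
    oracle = build_oracle(PROGRAM)
    for call, snapshot in oracle["transitions"]:
        print(call, "->", snapshot)
```

A model's answer to the same task could then be run through the same function and compared against the recorded transitions and variable values.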

Key Findings from the Evaluation

The study evaluated StateGen’s effectiveness, compared StateEval with existing benchmarks, and assessed the performance of various LLMs. StateGen generated more diverse sequential API calls than random baselines or LLM-only generation approaches, achieving higher coverage and faster convergence. This indicates that StateGen can produce a broader range of local sequential structures, leading to more diverse test cases.

Compared with other popular benchmarks such as HumanEval, DS-1000, and BFCL, StateEval has significantly longer instructions and reference code, a higher average number of function calls, and substantially greater Path Depth and Binding Count. These metrics indicate that StateEval constructs meaningful dependencies across API calls, resulting in more interdependent program structures. That makes it a more challenging and realistic benchmark for evaluating how well LLMs understand complex instructions and produce multi-API calls with rich interdependencies.

In terms of LLM performance, closed-source models like GPT-4.1 and Gemini-2.5-Flash generally outperformed open-source models such as Qwen2.5-Coder and Llama-4-Scout. GPT-4.1 achieved the highest pass@1 rate at 56%. Interestingly, LLM performance varied significantly across tasks; for instance, GPT-4.1 achieved a 78% pass rate on Tensor Operation but only 22% on Session Service. This disparity is hypothesized to be due to the greater availability of training data and examples for tensor-related operations. All models performed poorly on Session Service, likely due to the scarcity of online resources and the complex data manipulations required.
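For readers unfamiliar with the metric, pass@1 is the expected fraction of tasks solved when one generated program per task is checked. The sketch below uses the widely cited unbiased pass@k estimator introduced with HumanEval; whether the paper computes pass@1 exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per task (n = 1), pass@1 reduces to the plain fraction of tasks whose program passes, so a reported rate of 56% means roughly 56 of every 100 tasks were solved on the first attempt.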

An analysis of errors revealed that execution errors (programs crashing during runtime) and result errors (programs running but producing incorrect outputs) were the most prevalent. Syntax errors were less common. Execution errors were frequent in Session Service (e.g., accessing non-existent data) and Tensor Operation (e.g., incompatible tensor shapes). Result errors were common in ElevenLabs MCP, where incorrect API usage could lead to subtle state transition errors detected only at the final result check.
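The three error categories can be distinguished mechanically. The following is a minimal sketch under stated assumptions: the `classify` function, the `result` variable convention, and the category labels are hypothetical and only illustrate the distinction drawn above, not the paper's evaluation harness.

```python
def classify(source, expected, result_var="result"):
    """Classify a candidate program as syntax, execution, or result error.

    Assumes (hypothetically) that the program stores its answer in a
    variable named by result_var and that expected is the oracle value.
    """
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"  # the program does not parse

    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return "execution_error"  # the program crashes at runtime

    if namespace.get(result_var) != expected:
        return "result_error"  # the program runs but the output is wrong
    return "pass"
```

For example, `classify("d = {}\nresult = d['missing']", 3)` falls into the execution-error category because the lookup raises `KeyError`, which mirrors the "accessing non-existent data" failures reported for Session Service.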

Looking Ahead

While StateGen and StateEval represent a significant step forward, the researchers acknowledge limitations. Currently, incorporating new API scenarios requires manual modeling of API documentation, which can be labor-intensive. Future work could explore using LLM-enabled middleware to automate this process. Additionally, more in-depth analysis is needed, including evaluating a broader range of LLMs and prompting techniques, and investigating methods to enhance the correctness and robustness of LLM-generated multi-function calls.

This research, detailed in the paper “Evaluating LLMs on Sequential API Call Through Automated Test Generation”, provides a foundation for future work on LLM systems with more reliable and effective tool integration.
