TLDR: StateGen is an automated framework that generates diverse, executable coding tasks involving sequential API interactions. It uses state machines, energy-based sampling, and control-flow injection to create programs, which are then translated into natural language tasks by LLM agents. This framework is used to build StateEval, a new benchmark of 120 verified test cases across three scenarios (Session Service, Tensor Operation, ElevenLabs MCP). Evaluations show that StateGen effectively creates challenging tasks, and current LLMs, especially open-source ones, have significant room for improvement in handling complex, interdependent API calls.
Large Language Models (LLMs) have significantly expanded their capabilities by integrating external tools through APIs, enabling them to tackle complex real-world tasks. However, a major challenge lies in effectively testing and evaluating how well these LLMs use tools, especially when multiple API calls need to happen in a specific sequence. Existing evaluation methods often rely on manually created test cases, which are difficult to scale and frequently miss the intricate interactions that occur in real-world sequential API usage.
To address this gap, researchers have introduced StateGen, an automated framework that generates diverse coding tasks involving sequential API interactions. StateGen combines three techniques: state-machine-based API constraint solving and validation, energy-based sampling to diversify the generated tasks, and control-flow injection to produce more realistic and complex executable programs. The generated programs are then translated into natural language task descriptions through a collaborative process involving two LLM agents.
Using StateGen, the researchers constructed StateEval, a new benchmark of 120 verified test cases. These cases span three representative scenarios: a Session Service (mimicking RESTful API calls), Tensor Operations (involving complex data manipulation in deep learning frameworks like PyTorch), and ElevenLabs MCP (demonstrating LLM tool calling for speech processing). Experimental results indicate that StateGen generates challenging and realistic API-oriented tasks, and they highlight areas where current LLMs that incorporate APIs can improve.
Why StateGen and StateEval are Important
The paper emphasizes that in real-world software development, tasks often require LLMs to analyze requirements, understand API functionalities, and then orchestrate multiple APIs in the correct sequence with appropriate inputs. This multi-step process demands advanced reasoning, management, planning, and tool-calling abilities from LLMs. Traditional benchmarks often fall short by focusing on general, small-scale coding tasks or simple API calls without interdependencies. StateEval, built with StateGen, aims to fill this void by providing a systematic and scalable way to assess LLMs’ ability to handle complex instructions and generate stateful programs.
How StateGen Works Under the Hood
StateGen employs a “reverse-generation” strategy. It starts by creating valid, executable sequences of API calls (called ‘traces’). These traces form the backbone for constructing more intricate executable programs by adding control flow structures like ‘if-else’ branches, which are common in real-world code. A crucial part of StateGen is its TraceGenerator, which maintains a state schema to track all relevant program states and ensures that each generated API sequence is valid and executable. To maximize diversity, it uses an energy-based sampling strategy, prioritizing the exploration of less frequent API transition pairs.
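The trace-generation idea above can be illustrated with a toy example. The following is a minimal sketch, not StateGen's actual implementation: the four API names, the single-integer state schema, and the energy function (a count of how often each transition pair has already been sampled) are all assumptions made for illustration. A precondition check keeps every trace executable, and pairs that have been seen more often receive lower sampling weight.

```python
import math
import random
from collections import Counter

# Hypothetical API model: name -> (precondition on state, effect on state).
# The state schema is a single integer: the number of live sessions.
APIS = {
    "create": (lambda s: True, lambda s: s + 1),
    "read": (lambda s: s > 0, lambda s: s),
    "update": (lambda s: s > 0, lambda s: s),
    "delete": (lambda s: s > 0, lambda s: s - 1),
}


def generate_trace(length, pair_counts, rng, temperature=1.0):
    """Sample one valid API trace, favoring rarely seen transition pairs."""
    state, prev, trace = 0, "<start>", []
    for _ in range(length):
        # Only APIs whose precondition holds in the current state are candidates.
        valid = [name for name, (pre, _) in APIS.items() if pre(state)]
        # Assumed energy: how often the pair (prev, candidate) was sampled so far.
        energies = [pair_counts[(prev, name)] for name in valid]
        lowest = min(energies)  # shift energies so the largest weight is exactly 1
        weights = [math.exp(-(e - lowest) / temperature) for e in energies]
        nxt = rng.choices(valid, weights=weights, k=1)[0]
        pair_counts[(prev, nxt)] += 1
        state = APIS[nxt][1](state)  # apply the effect to the tracked state
        trace.append(nxt)
        prev = nxt
    return trace


def is_valid(trace):
    """Replay a trace from the initial state and check every precondition."""
    state = 0
    for name in trace:
        pre, effect = APIS[name]
        if not pre(state):
            return False
        state = effect(state)
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter()
    for _ in range(3):
        print(generate_trace(6, counts, rng))
```

Sharing `pair_counts` across calls is what pushes later traces toward transitions that earlier traces did not exercise. Control-flow injection, such as wrapping part of a trace in an if-else branch, would be a separate step applied to the resulting traces.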
The generated programs are then transformed into natural language instructions using a multi-agent system. A ‘generator agent’ creates initial descriptions, and an ‘evaluator agent’ provides feedback to refine these descriptions, ensuring they are unambiguous, natural, and non-redundant. This iterative negotiation process helps produce high-quality test inputs for the LLMs under evaluation. Finally, to obtain accurate test oracles, StateGen executes the generated programs in a local environment, recording state transitions and variable values as ground truth for evaluation.
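The oracle step can be sketched as follows. This is an illustrative assumption about how executing a reference program yields ground truth, not the paper's code: the helper `run_and_record`, the tracked variable names, and the sample program are all hypothetical.

```python
def run_and_record(program_src, tracked):
    """Execute program source in a fresh namespace and snapshot tracked variables."""
    namespace = {}
    exec(program_src, namespace)  # run the generated reference program locally
    return {name: namespace.get(name) for name in tracked}


# A hypothetical generated program with an injected if-branch.
REFERENCE = """
sessions = {}
sessions["a"] = {"user": "alice"}
if "a" in sessions:
    sessions["a"]["user"] = "bob"
final_user = sessions["a"]["user"]
"""

oracle = run_and_record(REFERENCE, ["sessions", "final_user"])
```

The recorded values then serve as the expected final state against which a model's solution to the same task is compared.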
Key Findings from the Evaluation
The study evaluated StateGen’s effectiveness, compared StateEval with existing benchmarks, and assessed the performance of various LLMs. StateGen generated more diverse sequential API calls than random baselines or LLM-only generation approaches, achieving higher coverage and faster convergence. This indicates that StateGen can produce a broader range of local sequential structures, leading to more diverse test cases.
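One plausible way to quantify diversity of local sequential structure is the fraction of possible API transition pairs that a set of traces exercises. The sketch below assumes that definition. The article does not specify the paper's exact coverage metric, so `pair_coverage` is a hypothetical helper.

```python
from itertools import product


def pair_coverage(traces, api_names):
    """Fraction of all ordered API pairs that appear consecutively in some trace."""
    possible = set(product(api_names, repeat=2))
    seen = {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}
    return len(seen & possible) / len(possible)


if __name__ == "__main__":
    traces = [["create", "read", "delete"], ["create", "create", "read"]]
    print(pair_coverage(traces, ["create", "read", "delete"]))
```

Some pairs may be unreachable under the API constraints, so in practice the denominator might be restricted to pairs that the state machine permits.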
Compared with other popular benchmarks such as HumanEval, DS-1000, and BFCL, StateEval has significantly longer instructions and reference code, a higher average number of function calls, and substantially greater Path Depth and Binding Count. These metrics indicate that StateEval constructs meaningful dependencies across API calls, resulting in more interdependent program structures. That makes it a more challenging and realistic benchmark for evaluating how well LLMs understand complex instructions and produce multi-API calls with rich interdependencies.
In terms of LLM performance, closed-source models like GPT-4.1 and Gemini-2.5-Flash generally outperformed open-source models such as Qwen2.5-Coder and Llama-4-Scout. GPT-4.1 achieved the highest pass@1 rate at 56%. Interestingly, LLM performance varied significantly across tasks; for instance, GPT-4.1 achieved a 78% pass rate on Tensor Operation but only 22% on Session Service. This disparity is hypothesized to be due to the greater availability of training data and examples for tensor-related operations. All models performed poorly on Session Service, likely due to the scarcity of online resources and the complex data manipulations required.
An analysis of errors revealed that execution errors (programs crashing during runtime) and result errors (programs running but producing incorrect outputs) were the most prevalent. Syntax errors were less common. Execution errors were frequent in Session Service (e.g., accessing non-existent data) and Tensor Operation (e.g., incompatible tensor shapes). Result errors were common in ElevenLabs MCP, where incorrect API usage could lead to subtle state transition errors detected only at the final result check.
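The three error categories can be told apart mechanically. The following is a minimal sketch under assumed conventions (candidate programs as source strings, an oracle as a dictionary of expected variable values); `classify` is a hypothetical helper, not the paper's evaluation harness.

```python
def classify(candidate_src, oracle):
    """Label a candidate program as syntax_error, execution_error, result_error, or pass."""
    try:
        code = compile(candidate_src, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"  # the program cannot even be parsed
    namespace = {}
    try:
        exec(code, namespace)
    except Exception:
        return "execution_error"  # the program crashed at runtime
    for name, expected in oracle.items():
        if name not in namespace or namespace[name] != expected:
            return "result_error"  # the program ran but ended in the wrong state
    return "pass"


if __name__ == "__main__":
    oracle = {"x": 1}
    print(classify("x = (", oracle))
    print(classify("d = {}\nx = d['k']", oracle))
    print(classify("x = 2", oracle))
    print(classify("x = 1", oracle))
```

The second example mirrors the Session Service failure mode described above: accessing data that does not exist raises an exception at runtime rather than at parse time.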
Looking Ahead
While StateGen and StateEval represent a significant step forward, the researchers acknowledge limitations. Currently, incorporating new API scenarios requires manual modeling of API documentation, which can be labor-intensive. Future work could explore using LLM-enabled middleware to automate this process. Additionally, more in-depth analysis is needed, including evaluating a broader range of LLMs and prompting techniques, and investigating methods to enhance the correctness and robustness of LLM-generated multi-function calls.
This research, detailed in the paper “Evaluating LLMs on Sequential API Call Through Automated Test Generation”, provides a foundation for future work toward LLM systems with more reliable and effective tool integration.


