Automating REST API Tests with Language Models and Test Specifications

TLDR: A new research paper introduces RestTSLLM, an approach that combines Test Specification Language (TSL) with Large Language Models (LLMs) to automate the generation of integration tests for REST APIs. By using prompt engineering and an intermediate TSL step, LLMs are guided to create test scenarios from OpenAPI specifications and convert them into executable tests. An evaluation of various LLMs, including Claude 3.5 Sonnet, Deepseek R1, and Qwen 2.5 32b, demonstrated their effectiveness in generating high-quality tests with strong success rates, coverage, and mutation scores. Claude 3.5 Sonnet emerged as the top performer, highlighting the significant potential of LLMs in streamlining and enhancing REST API testing processes.

Testing plays a critical role in ensuring the quality and reliability of software systems. However, effectively testing REST APIs, which are widely used for communication between different services, presents significant challenges. The complexity of distributed systems, the vast number of possible scenarios, and limited time for test design often lead to incomplete testing, undetected failures, and high manual effort.

To address these persistent issues, researchers have introduced RestTSLLM, an innovative approach that combines Test Specification Language (TSL) with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. This method specifically targets two core challenges: creating comprehensive test scenarios and defining appropriate input data.

The RestTSLLM approach integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs. It works by first instructing the LLM to act as an experienced developer and tester, capable of understanding REST API specifications. Then, through a ‘few-shot’ and ‘decomposed prompting’ technique, the LLM is shown examples of how to convert an OpenAPI specification into structured test cases using TSL, and subsequently how to translate those TSL cases into executable integration tests, for instance, using .NET with xUnit.

The use of TSL as an intermediate step is crucial. It simplifies the problem for the LLM by allowing it to focus solely on understanding business rules and defining test scenarios in a human-readable, declarative format, without being burdened by code structure or syntax. Once these scenarios are clear in TSL, a second prompt guides the LLM to convert them into functional test code.

An extensive evaluation was conducted on eight prominent LLMs: Claude 3.5 Sonnet (Anthropic), Deepseek R1 (Deepseek), Qwen 2.5 32b (Alibaba), Sabiá 3 (Maritaca), LLaMA 3.2 90b (Meta), GPT 4o (OpenAI), Gemini 1.5 Pro (Google), and Mistral Large (Mistral). These models were tested against six open-source REST API projects. The evaluation focused on key metrics such as success rate, test coverage (specifically branch coverage), and mutation score, which assesses how well tests detect small changes in the system’s logic. A calculated score, using the TOPSIS technique, combined these metrics to determine overall performance.

The results were highly promising. All evaluated LLMs demonstrated effectiveness in generating integration tests that reflected the intended business logic and context, producing compilable code with high readability and adherence to test patterns. The average success rates across all models were above 95.5%, indicating that the generated tests were largely functional and stable.

Also Read:

Top Performing Models

Among the models, Claude 3.5 Sonnet emerged as the top performer, achieving the highest average calculated score and ranking first in all individual metrics. It was notably the only model that produced no failed tests during the evaluation. Deepseek R1, Qwen 2.5 32b, and Sabiá 3 also delivered strong results, closely following Claude 3.5 Sonnet in performance. Even models with lower average scores, such as Mistral Large, Gemini 1.5 Pro, GPT 4o, and LLaMA 3.2 90b, still showed solid performance, particularly in success rate and often in coverage or mutation score.

The study also highlighted the cost-effectiveness of using LLMs for test generation. The total cost of processing each project with any LLM remained very low, with several models delivering competitive results for less than $0.09 per execution, making this approach feasible even for budget-constrained environments.

While the approach showed significant potential, the researchers also identified areas for future improvement. These include addressing limitations related to LLM selection, the complexity of target projects, dependency on OpenAPI specifications, and the inherent subjectivity in qualitative analysis. Future work aims to expand the generalizability of the method to more complex architectures and different technologies, enhance automation with error correction techniques, and explore multilingual performance.

In conclusion, the RestTSLLM approach demonstrates that combining Test Specification Language with Large Language Models offers a viable and effective strategy for automating the generation of integration tests for REST APIs. This method not only streamlines the testing process but also enhances the quality and coverage of tests, marking a significant step forward in software testing automation. For more details, you can refer to the full research paper: Combining TSL and LLM to Automate REST API Testing: A Comparative Study.

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Financial Sector Fortifies Against Surging AI-Powered Scams

Deloitte’s 2025 Outlook: Navigating Escalating AI Challenges in Human Capital

Salesforce Study Reveals Data Quality is Pivotal for Employee Trust in AI Adoption

Top Executives Sidestep Company AI Guidelines, Fueling Shadow AI Risks

Intel’s Evolving IP Strategy: A Calculated Shift Towards Core AI Innovation

Generative AI Prompts Increased Workforce Surveillance in Indian IT Sector

Automating REST API Tests with Language Models and Test Specifications

Top Performing Models

Gen AI News and Updates

Vesl AI Recognized for AI Infrastructure Innovation with ASOCIO Digital Summit Award

AT&T Unleashes Agentic AI Across Business Operations for Enhanced Efficiency and Innovation

Oracle Unveils ‘Ask Oracle’ Chatbot for Personalized Redwood Experience, Powered by Advanced Select AI

Boosting Business Efficiency: A New AI and Big Data Model for Process Optimization

AlphaCast: A New Approach to Time Series Prediction Through Human-AI Collaboration

New Graph Neural Networks Improve Reasoning in Assumption-Based Argumentation

Enhancing AI Reasoning: How Recursive Refinement and Multi-Agent Systems Improve Language Model Performance

ARGUS: A Proactive Framework for Enhancing Autonomous Driving Safety

Generative AI Powers Next-Gen Autonomous Emergency Response

OR-R1: Advancing Automated Optimization with Smart, Data-Efficient AI

Enhancing GUI Agents with Memory: A New Framework for History-Aware Reasoning

ProBench: A Deeper Look into How We Evaluate AI Agents for Mobile Apps

Enhancing Large Language Model Reasoning with Concise Outputs

Ensuring Trust in Autonomous AI: A Two-Layered Monitoring Approach for Agentic Systems

MedFuse: A Multiplicative Approach to Understanding Irregular Clinical Time Series Data

HyperD: A New Framework for More Accurate and Robust Traffic Predictions

Beyond Training: Researchers Propose ‘Model Raising’ for AI with Intrinsic Values

Bridging the Divide: Why AI Needs a Qualitative Revolution

Language Models Enhance Safety Certificate Synthesis for Dynamic Systems

Subscribe to get the latest news and updates