TLDR: This research paper explores the use of LLM agents for executing natural language (NL) test cases in GUI applications. It addresses key challenges like test case unsoundness (false failures due to ambiguity or unpredictable agent behavior) and execution inconsistency. The authors propose an algorithm with guardrail mechanisms and specialized agents to verify test steps, along with measures for LLM capabilities and execution consistency. Experiments with eight LLMs show that Meta Llama 3.1 70B demonstrates strong performance and high consistency, while highlighting areas for improvement in smaller models and tooling.
Testing graphical user interface (GUI) applications is a crucial but often time-consuming and costly process. Traditionally, this involves writing detailed, executable test scripts, which are difficult to maintain as applications evolve. A promising new approach leverages natural language (NL) test cases, which are easier to develop and understand. Recent advancements in large language models (LLMs) have opened the door to directly executing these NL test cases using LLM agents.
However, this approach brings its own challenges. Natural language test cases can be ‘unsound,’ meaning that executing them may flag a correctly working application as faulty. This can happen because the instructions are ambiguous or because the LLM agent behaves unpredictably. Furthermore, running the same NL test case several times may produce different results, which undermines the reliability of the testing process.
To tackle these issues, researchers have proposed a novel algorithm designed to execute NL test cases with built-in ‘guardrail mechanisms.’ This algorithm uses specialized agents that dynamically verify each step of a test. It introduces internal actions like ‘readiness’ and ‘observe.’ The ‘readiness’ action checks if the GUI is prepared for the next step, ensuring all necessary elements are present. The ‘observe’ action verifies if the previous action was successful and if the GUI updated as expected. These internal checks help to ensure that every navigation action is both feasible and correctly executed.
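The guarded execution loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `run_test_case`, `StepResult`, and the three callables are hypothetical, and the stub agents stand in for what would be LLM calls against a real browser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    step: str
    verdict: str  # "pass", "not_ready", or "not_observed"


def run_test_case(
    steps: List[str],
    readiness: Callable[[str], bool],
    act: Callable[[str], None],
    observe: Callable[[str], bool],
) -> List[StepResult]:
    """Run NL steps in order, stopping at the first failed guardrail."""
    results: List[StepResult] = []
    for step in steps:
        # Guardrail 1 ("readiness"): is the GUI prepared for this step?
        if not readiness(step):
            results.append(StepResult(step, "not_ready"))
            break
        act(step)
        # Guardrail 2 ("observe"): did the action take effect as expected?
        if not observe(step):
            results.append(StepResult(step, "not_observed"))
            break
        results.append(StepResult(step, "pass"))
    return results


# Stub agents over a toy in-memory "GUI" (assumption: a real setup
# would query an LLM agent and a live page instead).
visible = {"login button"}
clicked: List[str] = []


def stub_readiness(step: str) -> bool:
    # Ready only if the step mentions an element that is currently visible.
    return any(element in step for element in visible)


def stub_act(step: str) -> None:
    clicked.append(step)


def stub_observe(step: str) -> bool:
    return step in clicked


demo = run_test_case(
    ["click the login button", "click the logout link"],
    stub_readiness, stub_act, stub_observe,
)
for result in demo:
    print(result.step, "->", result.verdict)
```

In this toy run the second step is rejected by the readiness check before any action is attempted, which is the point of the guardrail: an infeasible navigation is reported rather than silently executed.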
The framework also defines specific measures to evaluate how well LLMs perform in test execution and to quantify the consistency of these executions. A key concept introduced is ‘weak unsoundness,’ which helps characterize situations where NL test case execution remains acceptable, even with minor uncertainties, aligning with industrial quality standards like Six Sigma.
Evaluating LLM Agent Performance
An extensive experimental evaluation was conducted using eight publicly available LLMs, ranging in size from 3 billion to 70 billion parameters. These models were tested on various web applications using four different natural language test suites. The experiments aimed to answer two main questions: how effectively can LLM agents perform navigation, readiness checks, and assertions, and how well do the estimated consistency measures align with observed execution consistency?
The results showed significant variability in performance across the different LLMs and tasks. Notably, Meta Llama 3.1 70B emerged as the top performer, demonstrating acceptable capabilities in NL test case execution with high consistency. This model achieved a mean accuracy of at least 98% in every test step category, with a low standard deviation, indicating reliable performance.
Other models, such as Qwen 2.5 7B, DeepSeek R1, and Mistral Devstral 24B, showed mixed capabilities, often performing well in some areas (like readiness checks) but struggling with others (like navigation or assertion evaluation). Smaller models generally faced more difficulties, frequently failing to execute navigation actions correctly. Common reasons for these failures included limited context lengths (the amount of text an LLM can process at once), difficulty extracting page content accurately, and certain LLMs misinterpreting ambiguous natural language instructions.
Consistency in Testing
Regarding execution consistency, the proposed measure proved accurate for LLMs with moderate to strong capabilities in executing NL test cases. For instance, Llama 3.1 70B and Qwen 3 14B showed a low mean relative error (MRE) of 2%, indicating that the estimated consistency closely matched the observed consistency. However, for less capable LLMs like Mistral Nemo 12B, the consistency measure was less accurate, often underestimating the observed consistency. This suggests that while the measure is valuable, more fine-grained metrics might be needed for models with limited abilities.
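As a rough illustration of how a mean relative error between estimated and observed consistency could be computed, here is a small sketch. It assumes the common definition of MRE (absolute difference divided by the observed value, averaged over cases); the paper may define it differently, and the input values below are made up for the example rather than taken from the experiments.

```python
from typing import Sequence


def mean_relative_error(
    estimated: Sequence[float], observed: Sequence[float]
) -> float:
    """Average of |estimated - observed| / |observed| over all pairs."""
    if len(estimated) != len(observed) or len(estimated) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for est, obs in zip(estimated, observed):
        if obs == 0:
            raise ValueError("observed values must be non-zero")
        total += abs(est - obs) / abs(obs)
    return total / len(estimated)


# Hypothetical consistency values for two test cases (not from the paper).
estimated_consistency = [0.98, 0.90]
observed_consistency = [1.00, 0.90]

mre = mean_relative_error(estimated_consistency, observed_consistency)
print(f"MRE = {mre:.2%}")
```

A small MRE means the estimate tracks the observed consistency closely. A measure that systematically underestimates, as reported for the less capable models, would yield a larger value.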
This research highlights both the immense potential and current limitations of using LLM agents for GUI testing. While models like Llama 3.1 70B show promising results, further improvements are needed in tooling, prompt design, and potentially specialized LLM training to enhance their capabilities. The work also opens avenues for future research, such as generating NL test cases from higher-level scenarios, fine-tuning smaller LLMs for specific tasks, and incorporating screenshots to better detect navigation issues.
For more detailed information, you can refer to the full research paper: On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language.


