TLDR: A research paper explores how multimodal Large Language Models (LLMs) can automate usability evaluation by analyzing software interface screenshots and generating ranked lists of usability issues. The study found that LLMs perform better with structured evaluation methods like cognitive walkthroughs compared to abstract heuristics. While not a full substitute for human experts, LLMs show potential in assisting and accelerating the identification and prioritization of critical usability improvements, especially for organizations with limited resources.
Ensuring that software applications are easy and intuitive to use is crucial for their success. This quality, known as usability, directly impacts how users interact with a system and can influence everything from customer satisfaction to revenue. Traditionally, evaluating usability involves methods like usability testing and expert inspections, which are effective but often require significant resources and specialized knowledge, making them less accessible for smaller organizations.
A recent research paper, Towards Recommending Usability Improvements with Multimodal Large Language Models, explores a promising new approach: using multimodal Large Language Models (LLMs) to automate parts of the usability evaluation process. Unlike traditional LLMs that only process text, multimodal LLMs can analyze various forms of input, including text, images, and even structural aspects of software interfaces. This capability opens doors for more efficient and cost-effective usability assessments.
The core idea presented in the paper is to frame usability evaluation as a recommendation task. In this setup, multimodal LLMs analyze software interfaces and generate ranked lists of potential usability issues, prioritized by their severity. This dynamic generation of recommendations is a departure from traditional recommender systems that rely on fixed catalogs of items.
How the LLM-Powered Evaluation Works
The process involves providing the LLM with an evaluation context. This context typically includes a general description of the application, a persona definition (describing the intended user), specific criteria to guide the evaluation, and screenshots of the application interface. The LLM then assesses the usability based on these inputs, focusing on one evaluation criterion at a time, such as a specific Nielsen heuristic or a cognitive walkthrough question. For Nielsen heuristics, the LLM assigns a school-grade rating (1-4 for passed, 5 for failed), while for cognitive walkthroughs, it makes a binary decision (passed/failed). Based on these ratings, the usability issues are ranked by severity, and the LLM provides explanations for its assessments, much like human experts would.
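As a rough illustration of the rating and ranking step for Nielsen heuristics, the following sketch orders assessments by their school grade, worst first. It is not the paper's implementation: the `Assessment` class, the `build_prompt` and `rank_by_severity` helpers, the prompt wording, and the sample grades are all assumptions made for this example, and the call to the LLM itself is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Assessment:
    """One LLM judgement for a single criterion (hypothetical structure)."""
    criterion: str    # e.g. the name of a Nielsen heuristic
    grade: int        # school grade: 1-4 = passed, 5 = failed
    explanation: str  # the model's justification for the grade


def build_prompt(app_description: str, persona: str, criterion: str) -> str:
    """Assemble the textual part of the evaluation context.

    Screenshots would be attached separately as image inputs.
    The wording here is illustrative, not taken from the paper.
    """
    return (
        f"Application: {app_description}\n"
        f"Persona: {persona}\n"
        f"Criterion: {criterion}\n"
        "Rate the interface on a school-grade scale (1-4 passed, 5 failed) "
        "and explain your rating."
    )


def rank_by_severity(assessments: List[Assessment]) -> List[Assessment]:
    """Order assessments from worst grade to best; ties keep input order."""
    return sorted(assessments, key=lambda a: a.grade, reverse=True)


# Made-up sample grades, purely for illustration.
sample = [
    Assessment("Visibility of system status", 2, "Progress is mostly visible."),
    Assessment("Error prevention", 5, "No confirmation before deleting."),
    Assessment("Consistency and standards", 4, "Button labels vary by page."),
]
ranked = rank_by_severity(sample)
for a in ranked:
    status = "failed" if a.grade == 5 else "passed"
    print(f"{a.grade} ({status}): {a.criterion}")
```

Evaluating one criterion per prompt, as described above, keeps each rating and its explanation tied to a single heuristic or walkthrough question, which makes the resulting list straightforward to sort.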
The Study and Its Findings
To validate this approach, the researchers conducted a proof-of-concept study using the KnowledgeCheckR application, a learning platform. They defined two personas: a teacher creating a quiz and a student taking one. Two experienced usability experts independently evaluated the application using the same guidelines and rating schemes as the LLM-based approach. Their assessments, including screenshots, served as the ground truth for comparison.
Six different LLMs, including general-purpose and reasoning-optimized variants from OpenAI and Google, were evaluated. The study measured agreement with expert assessments using Cohen’s Kappa and assessed the predictive accuracy of recommended usability issues using Hit rate@k and Accuracy@k.
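For readers unfamiliar with these metrics, here is a minimal sketch of Cohen's Kappa and a hit-rate-at-k check. The kappa function follows the standard textbook definition. The hit-rate function assumes that a "hit" means at least one of the top k recommended issues also appears in the expert assessment; the paper's exact formulas for Hit rate@k and Accuracy@k are not reproduced in this summary, so treat that part as an assumption. The function names and sample labels are invented for this example.

```python
from collections import Counter
from typing import Hashable, Sequence, Set


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters need the same non-zero number of ratings.")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def hit_rate_at_k(recommended: Sequence[str], expert_issues: Set[str], k: int) -> float:
    """1.0 if any of the top-k recommendations is in the expert set, else 0.0."""
    return 1.0 if any(issue in expert_issues for issue in recommended[:k]) else 0.0


# Made-up ratings for four walkthrough questions.
llm = ["passed", "failed", "passed", "passed"]
expert = ["passed", "failed", "failed", "passed"]
kappa = cohens_kappa(llm, expert)

# Made-up ranked recommendations and expert findings.
top_issues = ["issue A", "issue B", "issue C"]
expert_found = {"issue C"}
print(kappa, hit_rate_at_k(top_issues, expert_found, 2), hit_rate_at_k(top_issues, expert_found, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two raters can match on most items and still receive a modest score.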
The findings revealed several key insights:
- LLMs showed better alignment with more structured evaluations, such as cognitive walkthroughs, which use explicit questions and binary ratings. They struggled more with the abstract nature of Nielsen heuristics and the school-grade rating scale.
- While direct agreement on severity ratings was often low, LLMs demonstrated potential in prioritizing critical issues. From k = 3 upward (that is, within the top 3 recommendations), at least one recommended issue overlapped with the expert assessments, especially for cognitive walkthroughs. From k = 5 upward, every model showed at least one overlap.
- Qualitative analysis indicated that LLMs could identify and describe relevant usability issues similar to experts. However, a current limitation is their inability to fully capture dynamic application interactions from static screenshots, sometimes missing issues related to user input or temporal changes.
Looking Ahead
Despite these encouraging results, the study acknowledges limitations, including its scope (two experts, one application) and the influence of specific LLMs and prompt designs. Future work aims to enhance LLMs’ understanding of usability principles through advanced prompting strategies and, crucially, to incorporate video screen recordings as evaluation context to better capture the dynamic and interactive nature of user interfaces.
In conclusion, while multimodal LLMs are not yet a complete replacement for human experts, they show significant promise in assisting and accelerating usability evaluation. They can be particularly valuable for teams with limited resources or expertise, helping to identify and prioritize critical usability issues more efficiently.


