AI-Powered Insights: Automating Usability Evaluation with Multimodal Language Models

TLDR: A research paper explores how multimodal Large Language Models (LLMs) can automate usability evaluation by analyzing software interface screenshots and generating ranked lists of usability issues. The study found that LLMs perform better with structured evaluation methods like cognitive walkthroughs compared to abstract heuristics. While not a full substitute for human experts, LLMs show potential in assisting and accelerating the identification and prioritization of critical usability improvements, especially for organizations with limited resources.

Ensuring that software applications are easy and intuitive to use is crucial for their success. This quality, known as usability, directly impacts how users interact with a system and can influence everything from customer satisfaction to revenue. Traditionally, evaluating usability involves methods like usability testing and expert inspections, which are effective but often require significant resources and specialized knowledge, making them less accessible for smaller organizations.

A recent research paper, "Towards Recommending Usability Improvements with Multimodal Large Language Models," explores a promising new approach: using multimodal Large Language Models (LLMs) to automate parts of the usability evaluation process. Unlike text-only LLMs, multimodal LLMs can analyze several forms of input, including text, images, and even structural aspects of software interfaces. This capability opens the door to more efficient and cost-effective usability assessments.

The core idea presented in the paper is to frame usability evaluation as a recommendation task. In this setup, multimodal LLMs analyze software interfaces and generate ranked lists of potential usability issues, prioritized by their severity. This dynamic generation of recommendations is a departure from traditional recommender systems that rely on fixed catalogs of items.

How the LLM-Powered Evaluation Works

The process involves providing the LLM with an evaluation context. This context typically includes a general description of the application, a persona definition (describing the intended user), specific criteria to guide the evaluation, and screenshots of the application interface. The LLM then assesses the usability based on these inputs, focusing on one evaluation criterion at a time, such as a specific Nielsen heuristic or a cognitive walkthrough question. For Nielsen heuristics, the LLM assigns a school-grade rating (1-4 for passed, 5 for failed), while for cognitive walkthroughs, it makes a binary decision (passed/failed). Based on these ratings, the usability issues are ranked by severity, and the LLM provides explanations for its assessments, much like human experts would.
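To make the rating and ranking step concrete, here is a minimal Python sketch for the Nielsen-heuristic case. It is an illustration only, not the paper's implementation: the `Assessment` class, the `rank_by_severity` helper, and the sample grades and explanations are hypothetical, and it assumes that a worse school grade means a more severe issue.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """One LLM judgment for a single evaluation criterion (hypothetical structure)."""
    criterion: str    # e.g. the name of a Nielsen heuristic
    grade: int        # school grade: 1-4 count as passed, 5 as failed
    explanation: str  # the model's justification for the grade

    @property
    def passed(self) -> bool:
        return 1 <= self.grade <= 4


def rank_by_severity(assessments):
    """Order assessments from most to least severe.

    Assumption: a higher (worse) grade means a more severe issue.
    sorted() is stable, so ties keep their input order.
    """
    return sorted(assessments, key=lambda a: a.grade, reverse=True)


# Made-up sample data for illustration.
assessments = [
    Assessment("Visibility of system status", 2, "Progress is mostly visible."),
    Assessment("Error prevention", 5, "Nothing stops an empty quiz from being saved."),
    Assessment("Consistency and standards", 4, "Button labels vary between screens."),
]

ranked = rank_by_severity(assessments)
for a in ranked:
    status = "passed" if a.passed else "failed"
    print(f"{a.grade} ({status}): {a.criterion}")
```

For cognitive walkthroughs, the same structure would hold a binary passed/failed value instead of a grade.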

The Study and Its Findings

To validate this approach, the researchers conducted a proof-of-concept study using the KnowledgeCheckR application, a learning platform. They defined two personas: a teacher creating a quiz and a student taking one. Two experienced usability experts independently evaluated the application using the same guidelines and rating schemes as the LLM-based approach. Their assessments, including screenshots, served as the ground truth for comparison.

Six different LLMs, including general-purpose and reasoning-optimized variants from OpenAI and Google, were evaluated. The study measured agreement with expert assessments using Cohen’s Kappa and assessed the predictive accuracy of recommended usability issues using Hit rate@k and Accuracy@k.
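For readers unfamiliar with Cohen's Kappa, the sketch below shows how agreement between an expert and an LLM could be computed on binary cognitive-walkthrough ratings. The function follows the standard definition (observed agreement corrected for chance agreement); the sample ratings are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and this sketch chooses to return 1.0.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings for six walkthrough questions.
expert = ["pass", "fail", "pass", "pass", "fail", "pass"]
llm = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(expert, llm)
print(round(kappa, 3))  # 0.667
```

A value of 1 means perfect agreement, while 0 means agreement no better than chance.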

The findings revealed several key insights:

  • LLMs showed better alignment with more structured evaluations, such as cognitive walkthroughs, which use explicit questions and binary ratings. They struggled more with the abstract nature of Nielsen heuristics and the school-grade rating scale.
  • While direct agreement on severity ratings was often low, LLMs demonstrated potential in prioritizing critical issues. From k = 3 onward (that is, when considering at least the top 3 recommendations), the recommended issues included at least one that overlapped with the expert assessments, especially for cognitive walkthroughs. For k ≥ 5, all models showed at least one overlap.
  • Qualitative analysis indicated that LLMs could identify and describe relevant usability issues similar to experts. However, a current limitation is their inability to fully capture dynamic application interactions from static screenshots, sometimes missing issues related to user input or temporal changes.
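The overlap check behind the top-k finding above can be sketched in a few lines of Python. This is a simplified reading of a hit-at-k measure, and the paper's exact definitions of Hit rate@k and Accuracy@k may differ. The function names and issue labels are hypothetical, and the sketch assumes that LLM and expert issues have already been mapped to shared identifiers.

```python
def overlap_at_k(recommended, expert_issues, k):
    """Count how many of the top-k recommended issues the experts also reported."""
    return sum(1 for issue in recommended[:k] if issue in expert_issues)


def hit_at_k(recommended, expert_issues, k):
    """True if at least one of the top-k recommendations matches an expert issue."""
    return overlap_at_k(recommended, expert_issues, k) > 0


# Invented example: issues ranked by the LLM, most severe first.
recommended = [
    "unclear save button",
    "no progress indicator",
    "missing error message",
    "small font size",
    "hidden help link",
]
expert_issues = {"missing error message", "hidden help link"}

for k in (1, 3, 5):
    print(k, hit_at_k(recommended, expert_issues, k),
          overlap_at_k(recommended, expert_issues, k))
```

In this made-up example there is no hit at k = 1, a first hit at k = 3, and two overlapping issues at k = 5.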

Looking Ahead

Despite these encouraging results, the study acknowledges limitations, including its scope (two experts, one application) and the influence of specific LLMs and prompt designs. Future work aims to enhance LLMs’ understanding of usability principles through advanced prompting strategies and, crucially, to incorporate video screen recordings as evaluation context to better capture the dynamic and interactive nature of user interfaces.

In conclusion, while multimodal LLMs are not yet a complete replacement for human experts, they show significant promise in assisting and accelerating usability evaluation. They can be particularly valuable for teams with limited resources or expertise, helping to identify and prioritize critical usability issues more efficiently.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
