AI-Powered Insights: Automating Usability Evaluation with Multimodal Language Models

TLDR: A research paper explores how multimodal Large Language Models (LLMs) can automate usability evaluation by analyzing software interface screenshots and generating ranked lists of usability issues. The study found that LLMs perform better with structured evaluation methods like cognitive walkthroughs than with abstract heuristics. While not a full substitute for human experts, LLMs show potential in assisting and accelerating the identification and prioritization of critical usability improvements, especially for organizations with limited resources.

Ensuring that software applications are easy and intuitive to use is crucial for their success. This quality, known as usability, directly impacts how users interact with a system and can influence everything from customer satisfaction to revenue. Traditionally, evaluating usability involves methods like usability testing and expert inspections, which are effective but often require significant resources and specialized knowledge, making them less accessible for smaller organizations.

A recent research paper, “Towards Recommending Usability Improvements with Multimodal Large Language Models,” explores a promising new approach: using multimodal Large Language Models (LLMs) to automate parts of the usability evaluation process. Unlike traditional LLMs that only process text, multimodal LLMs can analyze various forms of input, including text, images, and even structural aspects of software interfaces. This capability opens doors for more efficient and cost-effective usability assessments.

The core idea presented in the paper is to frame usability evaluation as a recommendation task. In this setup, multimodal LLMs analyze software interfaces and generate ranked lists of potential usability issues, prioritized by their severity. This dynamic generation of recommendations is a departure from traditional recommender systems that rely on fixed catalogs of items.

How the LLM-Powered Evaluation Works

The process involves providing the LLM with an evaluation context. This context typically includes a general description of the application, a persona definition (describing the intended user), specific criteria to guide the evaluation, and screenshots of the application interface. The LLM then assesses usability based on these inputs, focusing on one evaluation criterion at a time, such as a specific Nielsen heuristic or a cognitive walkthrough question. For Nielsen heuristics, the LLM assigns a school-grade rating from 1 to 5, where 1-4 count as passed and 5 as failed; for cognitive walkthroughs, it makes a binary decision (passed/failed). Based on these ratings, the usability issues are ranked by severity, and the LLM provides explanations for its assessments, much like human experts would.
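
To make the flow above concrete, here is a minimal Python sketch of the data involved and of the ranking step. It is an illustration only, not the paper's implementation: the class and field names (EvaluationContext, Assessment, rank_by_severity) are hypothetical, and it assumes that a worse school grade means a more severe issue.

```python
from dataclasses import dataclass, field

FAIL_GRADE = 5  # per the article: grades 1-4 count as passed, 5 as failed


@dataclass
class EvaluationContext:
    """Inputs given to the model for one run (field names are illustrative)."""
    app_description: str
    persona: str
    criteria: list[str] = field(default_factory=list)
    screenshots: list[str] = field(default_factory=list)  # image file paths


@dataclass
class Assessment:
    """The model's verdict for a single evaluation criterion."""
    criterion: str
    rating: int        # school grade, 1 (best) to 5 (failed)
    explanation: str


def passed(assessment: Assessment) -> bool:
    """A criterion passes unless it received the failing grade."""
    return assessment.rating < FAIL_GRADE


def rank_by_severity(assessments: list[Assessment]) -> list[Assessment]:
    """Worst grade first (assumption: worse grade = more severe issue).

    sorted() is stable, so criteria with equal grades keep their input order.
    """
    return sorted(assessments, key=lambda a: a.rating, reverse=True)


if __name__ == "__main__":
    results = [
        Assessment("Visibility of system status", 2, "Progress is shown."),
        Assessment("Error prevention", 5, "No confirmation before delete."),
        Assessment("Consistency and standards", 4, "Mixed button labels."),
    ]
    for a in rank_by_severity(results):
        status = "passed" if passed(a) else "failed"
        print(a.rating, status, a.criterion)
```

A binary cognitive walkthrough result could be mapped onto the same structure, for example by recording a failed question with the failing grade, though that mapping is also an assumption.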

The Study and Its Findings

To validate this approach, the researchers conducted a proof-of-concept study using the KnowledgeCheckR application, a learning platform. They defined two personas: a teacher creating a quiz and a student taking one. Two experienced usability experts independently evaluated the application using the same guidelines and rating schemes as the LLM-based approach. Their assessments, including screenshots, served as the ground truth for comparison.

Six different LLMs, including general-purpose and reasoning-optimized variants from OpenAI and Google, were evaluated. The study measured agreement with expert assessments using Cohen’s Kappa and assessed the predictive accuracy of recommended usability issues using Hit rate@k and Accuracy@k.
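
For readers unfamiliar with Cohen's Kappa, it measures agreement between two raters while correcting for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement. The Python sketch below computes it from scratch; the function name and the sample labels are made up for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same, non-zero number of labels")
    n = len(rater_a)

    # p_o: fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement, from each rater's own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined
        # here, and returning 1.0 is only a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    expert = ["pass", "pass", "fail", "fail"]  # hypothetical expert ratings
    model = ["pass", "fail", "fail", "fail"]   # hypothetical LLM ratings
    print(cohens_kappa(expert, model))
```

In this made-up example the raters agree on three of four items (p_o = 0.75) and chance agreement is 0.5, so kappa comes out at 0.5. In practice a library routine such as scikit-learn's cohen_kappa_score does the same job.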

The findings revealed several key insights:

LLMs showed better alignment with more structured evaluations, such as cognitive walkthroughs, which use explicit questions and binary ratings. They struggled more with the abstract nature of Nielsen heuristics and the school-grade rating scale.
While direct agreement on severity ratings was often low, LLMs demonstrated potential in prioritizing critical issues. From k = 3 onward (that is, within the top three recommendations), at least one recommended issue overlapped with the expert assessments, especially in cognitive walkthroughs. From k = 5 onward, all models showed at least one overlap.
Qualitative analysis indicated that LLMs could identify and describe relevant usability issues similar to experts. However, a current limitation is their inability to fully capture dynamic application interactions from static screenshots, sometimes missing issues related to user input or temporal changes.
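
The "at least one overlap within the top k" idea behind Hit rate@k can be sketched in a few lines of Python. This uses a common definition of the metric, which may differ in detail from the one in the paper, and the issue identifiers are invented for the example. It also assumes that LLM-reported and expert-reported issues have already been matched to shared identifiers, a step that in practice requires human judgment.

```python
def hit_at_k(recommended, relevant, k):
    """Return 1 if any of the first k recommended issues is in the expert set."""
    return int(any(issue in relevant for issue in recommended[:k]))


def overlap_at_k(recommended, relevant, k):
    """Count how many distinct issues in the top k also appear in the expert set."""
    return len(set(recommended[:k]) & set(relevant))


if __name__ == "__main__":
    # Hypothetical ranked LLM output, most severe first.
    llm_ranking = ["low-contrast", "no-undo", "jargon", "hidden-save", "tiny-targets"]
    # Hypothetical set of issues the experts flagged.
    expert_issues = {"jargon", "missing-feedback"}

    for k in (1, 3, 5):
        print(k, hit_at_k(llm_ranking, expert_issues, k),
              overlap_at_k(llm_ranking, expert_issues, k))
```

With these invented lists, the first expert-confirmed issue appears at rank 3, so the hit value is 0 for k = 1 and 1 for k = 3 and k = 5.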

Looking Ahead

Despite these encouraging results, the study acknowledges limitations, including its scope (two experts, one application) and the influence of specific LLMs and prompt designs. Future work aims to enhance LLMs’ understanding of usability principles through advanced prompting strategies and, crucially, to incorporate video screen recordings as evaluation context to better capture the dynamic and interactive nature of user interfaces.

In conclusion, while multimodal LLMs are not yet a complete replacement for human experts, they show significant promise in assisting and accelerating usability evaluation. They can be particularly valuable for teams with limited resources or expertise, helping to identify and prioritize critical usability issues more efficiently.
