TL;DR: Screen2AX is a new vision-based framework that automatically generates detailed, hierarchical accessibility metadata for macOS applications from screenshots. It addresses the common problem of incomplete or missing accessibility information in many apps, which hinders screen readers and AI agents. By using deep learning models for UI element detection, description, and grouping, Screen2AX significantly improves the ability of AI agents to understand and interact with macOS interfaces, outperforming existing methods and native accessibility features.
Despite years of progress in accessibility standards, many macOS applications still fall short of providing the features that users with diverse accessibility needs depend on. In practice, important information, such as what a button does or where an element is located, is often incomplete or missing entirely. This lack of proper accessibility data makes it difficult for tools like screen readers, which visually impaired users rely on, to function effectively. It also hinders the ability of artificial intelligence (AI) agents to understand and interact with complex desktop interfaces, leading to automation failures.
A recent investigation revealed that only about a third of macOS applications offer full accessibility support; many provide only partial support or none at all, and the problem is particularly pronounced in less popular applications. The root cause is that developers must manually add or update accessibility information for custom interface elements, a process that is complex, time-consuming, and error-prone.
Introducing Screen2AX: A Vision-Based Solution
To address this significant gap, researchers have introduced Screen2AX, a groundbreaking framework designed to automatically create real-time, tree-structured accessibility metadata directly from a single screenshot of a macOS application. Screen2AX uses advanced computer vision and language models to detect, describe, and organize user interface (UI) elements in a hierarchical way, mimicking how macOS itself structures accessibility information.
The core idea behind Screen2AX is to use visual input (a screenshot) to understand the layout and function of an application's interface. This is a significant step forward because it does not rely on developers to manually provide the data. The system processes a UI screenshot through several key stages, sketched in code after the list below:
- UI Element Detection: It first identifies and categorizes all visible elements on the screen, such as buttons, text fields, images, and links.
- Text Detection: It extracts any on-screen text using Optical Character Recognition (OCR).
- UI Element Description: For elements without clear text labels, especially icon-only buttons, it generates semantic descriptions to explain their purpose.
- Grouping UI Elements: It then organizes these detected elements into logical, meaningful groups, like toolbars or side panels.
- Hierarchy Generation: Finally, it builds a complete hierarchical representation of the UI, showing parent-child relationships and how elements are nested within groups.
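To make the pipeline concrete, here is a minimal Python sketch of the final two stages: taking detected, described, and grouped elements and nesting them into a tree. Everything in it (the `AXNode` structure, the bounding-box containment heuristic, and the pre-populated `detected` list standing in for the detection, OCR, and description models) is an illustrative assumption, not Screen2AX's actual code or data format.

```python
from dataclasses import dataclass, field

@dataclass
class AXNode:
    """One node of a generated accessibility tree (hypothetical schema)."""
    role: str          # e.g. "button", "textfield", "group"
    description: str   # OCR text or a model-generated label
    bbox: tuple        # (x, y, width, height) in screen pixels
    children: list = field(default_factory=list)

def contains(outer: tuple, inner: tuple) -> bool:
    """True if the outer box fully encloses the inner box."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def build_hierarchy(elements: list) -> AXNode:
    """Nest each element under the smallest already-placed box enclosing it."""
    root = AXNode("window", "application window", (0, 0, 10**6, 10**6))
    # Place larger boxes first so parents exist before their children arrive.
    for node in sorted(elements, key=lambda n: -(n.bbox[2] * n.bbox[3])):
        parent = root
        while True:
            nxt = next((c for c in parent.children
                        if contains(c.bbox, node.bbox)), None)
            if nxt is None:
                break
            parent = nxt
        parent.children.append(node)
    return root

# Toy input: a toolbar group with two icon buttons, plus a free-standing field.
detected = [
    AXNode("group", "toolbar", (0, 0, 800, 40)),
    AXNode("button", "back", (10, 5, 30, 30)),
    AXNode("button", "forward", (50, 5, 30, 30)),
    AXNode("textfield", "address bar", (100, 50, 600, 30)),
]
tree = build_hierarchy(detected)
for child in tree.children:
    print(child.role, child.description, [c.description for c in child.children])
# group toolbar ['back', 'forward']
# textfield address bar []
```

Nesting by geometric containment is only one plausible heuristic; Screen2AX uses learned models for grouping and hierarchy construction, so treat this as a mental model of the tree-building step rather than the method itself.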
Public Datasets and Performance
To overcome the scarcity of data for macOS desktop applications, the team behind Screen2AX compiled and publicly released three comprehensive datasets. They pair screenshots from 112 macOS applications with annotations for UI element detection, grouping, and hierarchical accessibility metadata, providing a valuable resource for future research in accessibility generation.
Screen2AX has shown impressive results. It accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. More importantly, these detailed hierarchy trees significantly improve the ability of autonomous AI agents to interpret and interact with complex desktop interfaces. On Screen2AX-Task, a new benchmark designed specifically for evaluating AI agent task execution in macOS environments, Screen2AX delivered a 2.2× performance improvement over native accessibility representations. It also surpassed the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.
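For context on that first number: an F1 score is the harmonic mean of precision (the fraction of predicted tree nodes that are correct) and recall (the fraction of ground-truth nodes that were recovered). Below is a minimal sketch, assuming nodes are matched by an exact key; real evaluation protocols typically match bounding boxes by overlap, so this is illustrative only, not the paper's exact procedure.

```python
def f1_score(predicted: set, ground_truth: set) -> float:
    """F1 over two sets of tree nodes, each keyed for exact matching."""
    true_positives = len(predicted & ground_truth)
    if not predicted or not ground_truth:
        return 0.0
    precision = true_positives / len(predicted)   # correct / predicted
    recall = true_positives / len(ground_truth)   # correct / expected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 8 of 10 predicted nodes appear in a 12-node ground-truth tree.
pred = {("button", i) for i in range(10)}
gt = {("button", i) for i in range(2, 14)}
print(round(f1_score(pred, gt), 2))  # precision 0.8, recall ~0.67 -> 0.73
```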
The project is open source and publicly available; more details can be found in the research paper.
Challenges and Future Directions
Despite its strong performance, Screen2AX has some limitations. Dataset quality is affected by inconsistencies in developer-provided accessibility annotations. The model that describes UI icons was initially trained on mobile UI datasets, so it sometimes struggles with the more complex or unusual icons found in desktop applications. And while inference speed is practical, it may need further optimization for truly real-time use.
Looking ahead, future research could focus on expanding the dataset to include a wider variety of desktop-specific icons and further optimizing the model for faster inference. Another promising direction is to classify UI element groups by their semantic roles, such as identifying them as toolbars or navigation bars, which could further enhance agent navigation and user experience for assistive technologies.
Conclusion
Screen2AX represents a significant leap forward in making macOS applications more accessible. By automating the generation of rich, hierarchical accessibility metadata directly from screenshots, it addresses a critical need for both human users relying on assistive technologies and AI-driven autonomous agents. This vision-based approach has the potential to overcome common accessibility challenges and pave the way for more inclusive computing experiences across various operating systems.