TLDR: The BTL (Blink-Think-Link) framework is a brain-inspired model for AI-driven GUI interaction, decomposing it into rapid visual detection (Blink), high-level reasoning (Think), and precise action generation (Link). It introduces automated Blink Data Generation and a process-outcome integrated BTL Reward mechanism. The resulting BTL-UI agent achieves state-of-the-art performance in GUI understanding and interaction tasks, making AI-GUI interaction more natural and efficient.
In the rapidly evolving world of AI, automating how we interact with graphical user interfaces (GUIs) is a major step towards truly intelligent digital assistants. While current AI models have made significant strides, their interaction methods often don’t quite match the natural way humans engage with screens.
To bridge this gap, researchers at MiLM Plus, Xiaomi Inc., have introduced a framework called “Blink-Think-Link” (BTL). The model is inspired by how the human brain processes information and makes decisions when using a computer or phone interface. It breaks complex interactions down into three distinct, biologically inspired stages:
Blink Phase
Imagine your eyes quickly scanning a screen, instantly spotting the most important areas. This is what the Blink phase mimics. It’s about rapidly detecting and focusing attention on relevant parts of the screen, much like our saccadic eye movements. This helps the AI agent quickly identify key elements without getting overwhelmed by visual clutter.
Think Phase
After spotting the relevant areas, humans then engage in higher-level thinking and decision-making. The Think phase in BTL mirrors this cognitive planning. Here, the AI integrates various pieces of information and reasons about the best course of action to achieve a specific goal.
Link Phase
Finally, once a decision is made, humans execute precise actions. The Link phase is where the BTL model generates executable commands for precise motor control, emulating how we select and perform actions like tapping a button or typing text.
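To make the three stages concrete, here is a minimal Python sketch of how a Blink, Think, Link pipeline could be wired together. It is an illustration only, not the BTL-UI implementation: the `Element` and `Action` types, the function names, and the word-overlap heuristics are all hypothetical stand-ins for what a trained vision-language model would actually do.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Element:
    """A hypothetical on-screen element with a text label and bounding box."""
    label: str
    box: Box


@dataclass
class Action:
    """A hypothetical executable command produced by the Link stage."""
    kind: str                 # e.g. "tap" or "type"
    point: Tuple[int, int]    # screen coordinates the action targets
    text: str = ""            # text payload for "type" actions


def overlap(label: str, goal: str) -> int:
    """Count the words a label shares with the goal (toy relevance score)."""
    return len(set(label.lower().split()) & set(goal.lower().split()))


def blink(elements: List[Element], goal: str) -> List[Element]:
    """Blink: narrow attention to elements that look relevant to the goal."""
    return [e for e in elements if overlap(e.label, goal) > 0]


def think(candidates: List[Element], goal: str) -> Element:
    """Think: choose the best candidate (stand-in for real reasoning).

    Raises ValueError if Blink returned no candidates.
    """
    return max(candidates, key=lambda e: overlap(e.label, goal))


def link(target: Element) -> Action:
    """Link: emit a tap at the centre of the chosen element."""
    left, top, right, bottom = target.box
    return Action("tap", ((left + right) // 2, (top + bottom) // 2))


screen = [
    Element("Search", (0, 0, 100, 40)),
    Element("Open settings", (0, 50, 100, 90)),
    Element("Help", (0, 100, 100, 140)),
]
goal = "open the settings page"

action = link(think(blink(screen, goal), goal))
print(action)  # Action(kind='tap', point=(50, 70), text='')
```

The point of the sketch is the separation of concerns: Blink reduces the screen to a few candidate regions, Think decides among them, and Link turns the decision into coordinates that can be executed.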
The BTL framework also introduces two key technical innovations. First, “Blink Data Generation” is an automated pipeline that creates annotations for the Blink phase, helping the AI learn which screen areas are most important. Second, “BTL Reward” is a rule-based reward system for reinforcement learning. Unlike reward schemes that score only the final outcome, BTL Reward scores both the interaction process and the final result, so the agent receives a learning signal for its intermediate steps as well as for the action it ultimately takes.
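The difference between an outcome-only reward and a process-plus-outcome reward can be shown with a small Python sketch. This is not the paper's reward definition: the IoU threshold, the equal weights, and the function names below are assumptions chosen purely for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def process_reward(blink_boxes: List[Box], target: Box,
                   threshold: float = 0.5) -> float:
    """Hypothetical process rule: did any attended region cover the target?"""
    return 1.0 if any(iou(b, target) >= threshold for b in blink_boxes) else 0.0


def outcome_reward(kind: str, point: Tuple[int, int],
                   gt_kind: str, target: Box) -> float:
    """Hypothetical outcome rule: right action type, landing inside the target."""
    x, y = point
    inside = target[0] <= x <= target[2] and target[1] <= y <= target[3]
    return 1.0 if kind == gt_kind and inside else 0.0


def combined_reward(blink_boxes: List[Box], kind: str, point: Tuple[int, int],
                    gt_kind: str, target: Box,
                    w_process: float = 0.5, w_outcome: float = 0.5) -> float:
    """Weighted sum of process and outcome terms (weights are assumptions)."""
    return (w_process * process_reward(blink_boxes, target)
            + w_outcome * outcome_reward(kind, point, gt_kind, target))


target = (0, 50, 100, 90)
attended = [(0, 48, 100, 92), (200, 200, 240, 240)]

# Correct attention and a correct tap: full reward.
hit = combined_reward(attended, "tap", (50, 70), "tap", target)
# Correct attention but the tap misses: partial credit for the process.
miss = combined_reward(attended, "tap", (150, 70), "tap", target)
print(hit, miss)  # 1.0 0.5
```

In the second case an outcome-only scheme would return zero, giving the agent no indication that it had at least looked in the right place. A process term keeps that signal.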
Building on this framework, the researchers developed a GUI agent model named BTL-UI. They report that the agent achieves consistent, state-of-the-art performance on both static GUI understanding and dynamic interaction tasks, which they present as evidence that the BTL framework is effective for building advanced GUI agents.
The BTL framework is a step toward making AI-driven GUI interaction more natural and efficient by aligning it more closely with human cognitive processes. For more details, see the full research paper.


