TLDR: This paper introduces two new attention mechanisms, Adaptive Attention and Gaussian Attention, for Transformer-based reinforcement learning agents operating in partially observable environments. Integrated into the UniZero agent, Gaussian Attention significantly improves performance on the Atari 100k benchmark by smoothly prioritizing informative past experiences, demonstrating that flexible temporal priors are more effective than rigid memory windows for efficient learning in sparse data settings.
Reinforcement Learning (RL) is a powerful framework for training artificial intelligence to make decisions in sequential environments. However, many real-world tasks present a challenge known as ‘partial observability,’ where the AI agent doesn’t have a complete picture of its environment. To overcome this, agents must learn to use their past experiences to make informed decisions.
Recent advancements have seen the rise of Transformers, a type of neural network architecture, in model-based RL. These Transformers are excellent at understanding long-term relationships in data, similar to how they excel in natural language processing. A notable example is UniZero, an RL agent that uses a Transformer as its ‘world model’ to plan actions under partial observability.
However, a key difference between natural language and RL data is that RL experiences are often sparse and reward-driven. Standard Transformer attention mechanisms tend to distribute their focus uniformly across all past information, which can be inefficient when only a few past events are truly critical for making good decisions. This is especially true in low-data scenarios where every piece of information counts.
To address this, researchers Daniel De Dios Allegue, Jinke He, and Frans A. Oliehoek from Delft University of Technology introduced two new structured attention mechanisms into UniZero’s dynamics model. These mechanisms are designed to help the AI ‘learn to focus’ on the most informative parts of its history. The paper, titled ‘Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning,’ details these innovations.
Two New Attention Priors
The first mechanism is a ‘memory-length prior,’ implemented as Adaptive Attention. This allows each attention head within the Transformer to learn a specific, limited window of past events to focus on. The idea is that for some tasks, only the most recent actions and observations are truly relevant.
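The paper does not publish its implementation here, but the idea of a memory-length prior can be illustrated with a minimal numpy sketch: attention logits outside a per-head window are masked out before the softmax. In the paper the window length is learned per head; in this illustration it is fixed, and all names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_window_attention(scores, window_len):
    """Hard memory-length prior: each query may only attend to the
    last `window_len` timesteps. scores: (T, T) causal attention
    logits; window_len would be learned per head in the paper."""
    T = scores.shape[-1]
    q_idx = np.arange(T)[:, None]   # query positions
    k_idx = np.arange(T)[None, :]   # key positions
    dist = q_idx - k_idx            # how far in the past each key is
    # keep only keys that are in the past (causal) and inside the window
    mask = (dist >= 0) & (dist < window_len)
    masked = np.where(mask, scores, -np.inf)
    return softmax(masked)

# Example: with a window of 3, timesteps older than 3 steps get zero weight.
weights = adaptive_window_attention(np.zeros((6, 6)), window_len=3)
print(weights[5])  # last query attends only to timesteps 3, 4, 5
```

The hard `-np.inf` mask is exactly what makes this prior a sharp cutoff: information just outside the window is discarded entirely, which is the failure mode the paper later reports.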
The second, and more impactful, mechanism is a ‘distributional prior,’ implemented as Gaussian Attention. Instead of a hard cutoff, this approach applies a smooth, Gaussian-shaped weighting over past experiences. This means that past state-action pairs that are more relevant to the current situation receive a higher ‘attention weight,’ allowing the model to smoothly emphasize important transitions without completely ignoring others.
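A toy version of this distributional prior can be sketched by adding a log-Gaussian bias over the distance between query and key before the softmax. The parameters `mu` and `sigma` would be learned per head in the paper; here they are fixed constants, and the function name is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gaussian_attention(scores, mu, sigma):
    """Smooth distributional prior: keys near distance `mu` in the
    past get a higher bias, with weight decaying as a Gaussian in
    the distance, so no past timestep is zeroed out entirely."""
    T = scores.shape[-1]
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]  # query - key
    bias = -((dist - mu) ** 2) / (2 * sigma ** 2)         # log-Gaussian
    causal = np.where(dist >= 0, bias, -np.inf)           # no future keys
    return softmax(scores + causal)

# Weight peaks one step back and decays smoothly over older timesteps.
weights = gaussian_attention(np.zeros((6, 6)), mu=1.0, sigma=2.0)
print(weights[5])
```

Unlike a hard window, every past timestep keeps strictly positive weight, which is the ‘smooth emphasis without ignoring others’ behavior described above.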
These mechanisms were integrated into UniZero, a model-based RL agent that uses a Transformer to predict future states and rewards. The dynamics head of UniZero, responsible for these predictions, was enhanced with these new attention priors.
Experimental Results and Key Findings
The researchers tested their enhanced UniZero agent on the Atari 100k benchmark, a standard testbed for sample efficiency in RL. The results were striking: Gaussian Attention achieved a significant 77% relative improvement in mean human-normalized scores over the standard UniZero. It also doubled the human-normalized median score, outperforming the baseline in 19 out of 26 games.
The success of Gaussian Attention largely comes from its ability to smoothly allocate attention across both immediate and moderately delayed dependencies. This flexibility allows it to capture relevant temporal patterns without imposing rigid boundaries.
In contrast, Adaptive Attention, with its hard memory windows, often struggled. It either cut off useful signals too early or included irrelevant information, leading to inconsistent or weaker performance. Combining both mechanisms (Gaussian Adaptive Attention) also degraded performance, as the hard cutoff of the memory-length prior truncated the beneficial smooth weighting of the Gaussian prior.
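The degradation from combining the two priors can be seen in a small self-contained demo (illustrative values only, not the paper’s implementation): stacking a hard 2-step window on top of the Gaussian bias zeroes out the Gaussian’s tail.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 6
dist = np.arange(T)[:, None] - np.arange(T)[None, :]   # query - key offset
gauss = -((dist - 1.0) ** 2) / (2 * 2.0 ** 2)          # smooth Gaussian bias

smooth = softmax(np.where(dist >= 0, gauss, -np.inf))  # Gaussian prior alone
hard = np.where((dist >= 0) & (dist < 2), gauss, -np.inf)
combined = softmax(hard)                               # plus a 2-step window

# The hard cutoff truncates the Gaussian: keys older than two steps
# get zero weight, even though the smooth prior alone would still
# assign them some mass.
print(smooth[5])    # positive weight at every past timestep
print(combined[5])  # zero weight beyond the window
```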
These findings suggest a crucial guideline for model-based RL in partially observable environments: smooth, learnable temporal priors are more robust and data-efficient for dynamics modeling than fixed or rigid memory windows. While the study focused on Atari games, the principles could extend to other complex RL domains.
Ablation Studies and Future Directions
Further analysis, including ablation studies, confirmed the robustness of the Gaussian prior. It was found that the initial width of the Gaussian distribution (sigma) was particularly important, with narrower initial priors leading to better results. This indicates that a focused starting point helps the model learn effectively.
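The effect of the initial sigma can be made concrete with a quick numeric check (illustrative numbers, not the paper’s settings): a narrower Gaussian concentrates almost all attention mass on the most recent steps, while a wide one spreads it nearly uniformly.

```python
import numpy as np

def gaussian_weights(sigma, horizon=8):
    """Normalized Gaussian weights over `horizon` steps into the past."""
    dist = np.arange(float(horizon))
    w = np.exp(-dist ** 2 / (2 * sigma ** 2))
    return w / w.sum()

# A narrow initial sigma gives a focused starting point; a wide one
# starts out close to uniform attention over the history.
print(gaussian_weights(1.0).round(3))
print(gaussian_weights(4.0).round(3))
```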
The research highlights that while Transformers are powerful, tailoring their attention mechanisms to the unique characteristics of RL data—sparse rewards and non-stationary dependencies—is key to unlocking their full potential. By encoding structured temporal priors directly into self-attention, AI agents can better prioritize informative histories, leading to more efficient and robust learning in complex environments.