TLDR: This research paper explores how to effectively apply transformer architectures in online model-free reinforcement learning for continuous control. It addresses key design challenges such as input conditioning, actor-critic component sharing, and sequential data slicing. The authors demonstrate that with specific architectural and training strategies, transformers can achieve competitive performance against established baselines across various fully and partially observable tasks, including vector- and image-based settings, providing practical guidance for their stable application in online RL.
Transformers, the neural network architecture that has revolutionized fields like natural language processing and computer vision, remain comparatively unexplored in online model-free reinforcement learning (RL), largely because they are sensitive to specific training setups and architectural choices. New research sheds light on how to integrate transformers effectively into online RL for continuous control tasks, demonstrating their potential as strong and stable baselines.
The core challenge with transformers in online RL stems from decisions around structuring policy and value networks, sharing components, and handling temporal information. Unlike offline or model-based RL, where transformers have seen significant success by reframing learning as sequence modeling over pre-collected datasets, online RL involves direct interaction with the environment, posing unique stability and optimization hurdles.
This paper investigates key design questions to make transformers viable in this challenging domain. The researchers focused on three critical areas: how to condition inputs to the transformer, how to share components between the actor and critic networks, and how to slice sequential data for training.
Input Conditioning for Better Performance
One of the first hurdles is determining how to feed relevant information into the transformer. In partially observable environments, where agents don’t have a complete view of the state (e.g., masked velocity information), the way inputs are conditioned significantly impacts performance. The study explored several methods, including feeding only observations, interleaving observations with past actions, concatenating embeddings of observations, actions, and rewards, and using a cross-attention mechanism.
The findings revealed that for partially observable tasks, combining embeddings of observations, actions, and rewards into a single input vector (the ‘EmbedConcat’ method) consistently improved performance and training stability. This approach allows the attention mechanism to focus purely on temporal dependencies within a homogeneous sequence. For tasks with full observability, simply using observations was sufficient.
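To make this concrete, here is a minimal PyTorch sketch of EmbedConcat-style conditioning. The module name and dimensions are illustrative assumptions, not the paper's exact implementation: each timestep's observation, previous action, and previous reward are embedded separately and fused into a single token before entering the transformer.

```python
import torch
import torch.nn as nn

class EmbedConcat(nn.Module):
    """Fuse (observation, previous action, previous reward) into one token per step."""
    def __init__(self, obs_dim: int, act_dim: int, d_model: int):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.act_proj = nn.Linear(act_dim, d_model)
        self.rew_proj = nn.Linear(1, d_model)
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, obs, prev_act, prev_rew):
        # obs: (B, T, obs_dim), prev_act: (B, T, act_dim), prev_rew: (B, T, 1)
        parts = [self.obs_proj(obs), self.act_proj(prev_act), self.rew_proj(prev_rew)]
        return self.fuse(torch.cat(parts, dim=-1))  # (B, T, d_model)
```

Because every timestep is already a single fused token, the transformer's attention only has to resolve dependencies across time, not across token types.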
Actor-Critic Backbone Sharing: A Balancing Act
In reinforcement learning, the actor network decides actions, while the critic network evaluates those actions. Sharing a transformer backbone between these two components can reduce the number of parameters, but it introduces a significant challenge: conflicting gradient signals. The actor aims to maximize rewards, while the critic minimizes prediction errors, and these opposing objectives can destabilize learning.
The research showed that using separate transformer backbones for the actor and critic ensures stable training, albeit at a higher computational cost. A more efficient and stable compromise was found by sharing the transformer backbone but ‘freezing’ it during critic updates. This prevents the critic’s gradients from interfering with the actor’s learning, maintaining stability without the need for two entirely separate networks.
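A simple way to implement this freezing, sketched below under the assumption of a DDPG/SAC-style actor-critic in PyTorch (the names `backbone`, `actor_head`, and `critic_head` are hypothetical): detach the shared transformer features in the critic loss, so only the actor's objective backpropagates into the backbone.

```python
import torch
import torch.nn.functional as F

def compute_losses(backbone, actor_head, critic_head, tokens, actions, q_targets):
    feats = backbone(tokens)[:, -1]  # (B, d_model), last hidden state

    # Actor loss: gradients flow through the shared backbone.
    # (The actor optimizer should step only backbone + actor_head parameters.)
    actor_loss = -critic_head(torch.cat([feats, actor_head(feats)], dim=-1)).mean()

    # Critic loss: features are detached, so critic gradients never reach
    # the backbone -- it is effectively frozen during critic updates.
    q_pred = critic_head(torch.cat([feats.detach(), actions], dim=-1))
    critic_loss = F.mse_loss(q_pred, q_targets)
    return actor_loss, critic_loss
```

This keeps a single set of transformer parameters while avoiding the conflicting gradient signals that destabilize fully shared backbones.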
Optimizing Data Slicing for Sequential Learning
Transformers operate on sequences, so how that sequential data is prepared and presented during training is crucial. The study examined two main ways to handle the transformer's output: predicting actions or Q-values only from the last hidden state, or predicting from every hidden state in the sequence.
A key insight emerged regarding ‘cross-episode slicing’ when predicting from only the last token. This method allows input sequences to span across episode boundaries, ensuring that early-episode data is included in the context. This is vital for the agent to learn effective behaviors from the very beginning of an episode, preventing suboptimal actions in the initial steps. While predicting from every token also works, cross-episode slicing with last-token prediction offers a stable and efficient alternative without added computational overhead.
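A sketch of such a cross-episode sampler, assuming a flat replay buffer stored as aligned NumPy arrays with a `done` flag marking episode boundaries (this layout is an illustrative assumption): windows are drawn uniformly over the whole buffer and are allowed to span boundaries, and the loss is then computed only on the prediction from each window's last token.

```python
import numpy as np

def sample_cross_episode_slices(buffer, seq_len, batch_size):
    """Sample fixed-length windows that may cross episode boundaries.
    `buffer` is a dict of aligned arrays (obs, act, rew, done) of length N."""
    n = len(buffer["obs"])
    starts = np.random.randint(0, n - seq_len + 1, size=batch_size)
    idx = starts[:, None] + np.arange(seq_len)  # (B, T) flat indices
    batch = {k: buffer[k][idx] for k in ("obs", "act", "rew", "done")}
    # `done` marks where a new episode starts inside a window; an attention
    # mask can use it to limit attention across the boundary if desired.
    return batch
```

Because every timestep, including the first steps of an episode, can appear at the final position of some window, the agent learns from early-episode states with a full-length context.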
Competitive Performance Across Diverse Environments
By integrating these practical takeaways, the researchers developed transformer-based agents that were rigorously tested against strong baselines like MLPs, LSTMs, CNNs, and other transformer variants across various continuous control tasks. These included standard MuJoCo environments (both fully and partially observable) and complex robotic manipulation tasks from ManiSkill3 (vector- and image-based).
The results consistently showed that the transformer setup, when configured with the proposed strategies, matched or surpassed these established baselines. This demonstrates that transformers, often perceived as unstable in online RL, can be competitive and robust alternatives.
This work offers valuable insights and practical guidance for applying transformers in online reinforcement learning, transforming them from notoriously sensitive models into a powerful and stable tool for continuous control. For more in-depth technical details, you can read the full research paper here.


