TL;DR: The research paper "HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning" introduces HRM-Agent, a variant of the Hierarchical Reasoning Model (HRM) trained exclusively with reinforcement learning. The model learns to reason in dynamic, uncertain maze environments by reusing computation from previous time-steps. A key mechanism, "carry z," lets the model maintain and adapt its internal recurrent state, leading to efficient navigation and faster learning than models that reset their state at every step. This work demonstrates HRM's potential for real-world applications where environments are constantly changing.
Artificial intelligence models have made incredible strides in recent years, particularly in tasks that involve complex reasoning. However, many of these advanced models, like the Hierarchical Reasoning Model (HRM), have primarily excelled in static, predictable environments where all information is available from the start. The real world, however, is rarely so neat. It’s dynamic, uncertain, and often only partially observable, presenting a significant challenge for AI.
A new research paper, "HRM-Agent: Training a recurrent reasoning model in dynamic environments using reinforcement learning," introduces a novel approach to bridge this gap. Authored by Long H Dang and David Rawlinson, this work presents HRM-Agent, a variant of the Hierarchical Reasoning Model specifically designed to learn and reason effectively in unpredictable settings using only reinforcement learning.
The Challenge of Dynamic Environments
Traditional reasoning models often struggle when the environment changes. They might generate a long sequence of steps (a "Chain of Thought"), but if the situation shifts mid-plan, they lack any intrinsic ability to adapt or to reuse previous computation. This leads to inefficient, unreliable performance in real-world scenarios where an agent must constantly integrate new information and adjust its strategy.
The original Hierarchical Reasoning Model (HRM) was notable for its ability to adapt its computational effort to problem difficulty and solve complex tasks like Sudoku and maze planning with remarkable efficiency. It uses a recurrent inference process with dual modules: a "high-level" module (H) that makes slower, abstract updates and a "low-level" module (L) that makes faster, detailed updates. However, its application was limited to problems where the correct action was well-defined and the environment remained constant.
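To make the dual-timescale recurrence concrete, here is a minimal PyTorch sketch of that inference loop. The module names, hidden sizes, GRU cells, and step counts are illustrative assumptions, not the authors' actual architecture:

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Illustrative dual-module recurrence in the spirit of HRM.
    GRU cells, sizes, and step counts are assumptions, not the paper's design."""

    def __init__(self, obs_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.low = nn.GRUCell(obs_dim + hidden_dim, hidden_dim)   # fast, detailed module (L)
        self.high = nn.GRUCell(hidden_dim, hidden_dim)            # slow, abstract module (H)

    def forward(self, x, z_high, z_low, n_high: int = 2, n_low: int = 4):
        for _ in range(n_high):
            # L runs several fast steps conditioned on the current H state...
            for _ in range(n_low):
                z_low = self.low(torch.cat([x, z_high], dim=-1), z_low)
            # ...then H makes one slower, more abstract update from L's result.
            z_high = self.high(z_low, z_high)
        return z_high, z_low
```

The nesting is the key idea: many cheap low-level updates happen for every expensive high-level one, letting the model refine details without constantly rewriting its abstract plan.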
Introducing HRM-Agent: Learning Through Reinforcement
HRM-Agent takes the core strengths of HRM and adapts them for dynamic, uncertain environments by training it exclusively with reinforcement learning (RL). Unlike supervised learning, where models are given explicit correct answers, RL allows an agent to learn by maximizing rewards received from its actions, making it suitable for problems where the optimal path isn’t predefined.
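As a concrete illustration of learning from rewards alone, the update could look like the REINFORCE sketch below. This is one standard policy-gradient objective; the paper's exact RL algorithm may differ:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma: float = 0.99):
    """One plausible reward-only objective (REINFORCE); an assumption,
    not necessarily the algorithm used in the paper."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # discounted return-to-go
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Push up the log-probability of actions that led to high returns.
    return -(torch.stack(log_probs) * returns).sum()
```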
A key innovation in HRM-Agent is its ability to “carry forward” its internal recurrent state (referred to as ‘z’) from one environment step to the next. Imagine an agent planning a route through a city. If a road suddenly closes, instead of starting its entire planning process from scratch, HRM-Agent can leverage its existing ‘mental map’ and current plan, only adjusting what’s necessary. This mechanism allows the model to integrate and reuse computation from previous time-steps, promoting consistency and efficiency in plan execution.
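A minimal rollout loop makes the distinction concrete. Assuming the HRMSketch model above, a linear policy_head, and a Gymnasium-style env (all assumptions for illustration), the only difference between the two variants is whether z survives across environment steps:

```python
import torch

def rollout(model, policy_head, env, carry_z: bool = True, hidden_dim: int = 128):
    obs, _ = env.reset()
    z_high = torch.zeros(1, hidden_dim)
    z_low = torch.zeros(1, hidden_dim)
    done = False
    while not done:
        if not carry_z:
            # "reset z" baseline: discard all previous reasoning each step.
            z_high = torch.zeros(1, hidden_dim)
            z_low = torch.zeros(1, hidden_dim)
        x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        # With carry_z=True, z_high/z_low flow in from the previous step,
        # so the model adjusts its existing plan instead of replanning.
        z_high, z_low = model(x, z_high, z_low)
        action = policy_head(z_high).argmax(dim=-1).item()
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
```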
Navigating Dynamic Mazes
To test HRM-Agent’s capabilities, the researchers used two types of dynamic maze environments:
- Four-rooms environment: A classic maze with four rooms connected by doorways. To make it dynamic, one door would randomly close and open, forcing the agent to re-plan its path to the goal (a toy sketch of this setup follows the list).
- Dynamic, random maze environment: A more complex setup with fixed walls, randomly placed temporary walls, and multiple doors that independently open and close. This environment pushed the agent to generalize its planning abilities to entirely novel maze configurations.
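For intuition, a toy version of the four-rooms environment might look like the sketch below. The 9x9 layout, door positions, rewards, and toggle probability are all assumptions and likely differ from the paper's setup:

```python
import random

class DynamicFourRooms:
    """Toy sketch of a dynamic four-rooms maze; layout and dynamics
    are assumptions, not the paper's exact environment."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, toggle_prob: float = 0.1, seed: int = 0):
        self.rng = random.Random(seed)
        self.toggle_prob = toggle_prob
        # Doorways in the cross-shaped interior walls; True means open.
        self.doors = {(4, 2): True, (2, 4): True, (4, 6): True, (6, 4): True}
        self.agent, self.goal = (1, 1), (7, 7)

    def _blocked(self, cell):
        r, c = cell
        if not (0 <= r < 9 and 0 <= c < 9):
            return True                       # outer boundary
        on_wall = (r == 4 or c == 4)          # interior cross walls
        return on_wall and not self.doors.get(cell, False)

    def step(self, action):
        # A door may randomly toggle each step, forcing the agent to re-plan.
        if self.rng.random() < self.toggle_prob:
            door = self.rng.choice(list(self.doors))
            self.doors[door] = not self.doors[door]
        r, c = self.agent
        dr, dc = self.MOVES[action]
        if not self._blocked((r + dr, c + dc)):
            self.agent = (r + dr, c + dc)
        done = self.agent == self.goal
        return self.agent, (1.0 if done else 0.0), done
```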
The results were highly encouraging. HRM-Agent successfully navigated to the goal in approximately 99% of episodes in both environments, demonstrating its ability to plan paths efficiently. Crucially, the "carry z" variant, which reused its internal state, reached high goal-achievement rates and efficient path lengths faster than the "reset z" variant, which started fresh at each step. This provides strong evidence that the model was indeed reusing its previous computations and plans, adapting them as the environment changed.
Implications and Future Directions
This research provides a proof of concept that recurrent reasoning models like HRM can be effectively trained with reinforcement learning to tackle dynamic and uncertain problems. The ability to maintain and adapt an internal plan across changing environments is a significant step towards more robust and intelligent AI agents.
The authors plan to further enhance HRM-Agent by restoring its Adaptive Computation Time (ACT) feature, allowing it to optimize its “thinking time” dynamically. They also aim to explore more complex environments, including those with partial observability, and investigate its potential for continual and few-shot learning, paving the way for AI that can learn and adapt continuously in the real world.
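For context, Adaptive Computation Time (Graves, 2016) typically accumulates a learned halting probability until it crosses a threshold, capping how long the model "thinks". A rough sketch, reusing the HRMSketch model above with a hypothetical halt_head (an assumption, not the authors' design):

```python
import torch
import torch.nn as nn

def reason_with_act(model, halt_head: nn.Linear, x, z_high, z_low,
                    max_steps: int = 8, threshold: float = 0.99):
    """Run reasoning segments until the accumulated halting probability
    crosses the threshold; halt_head is a hypothetical Linear(hidden, 1)."""
    halt = torch.zeros(x.shape[0])
    for _ in range(max_steps):
        z_high, z_low = model(x, z_high, z_low)
        halt = halt + torch.sigmoid(halt_head(z_high)).squeeze(-1)
        if bool((halt >= threshold).all()):
            break  # enough "thinking time" spent on this input
    return z_high, z_low
```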


