TLDR: A new hierarchical deep reinforcement learning algorithm, HDDPG, improves autonomous maze navigation for mobile robots. A high-level policy plans subgoals and a low-level policy produces the motion commands, supported by off-policy correction, adaptive exploration noise, and a reshaped reward function. In simulated maze environments, HDDPG achieves much higher success rates and average scores than standard DDPG and its variant D4PG.
Autonomous navigation for mobile robots, especially in complex environments like mazes, presents a significant challenge. Robots need to find efficient paths while avoiding obstacles, often without a complete map of their surroundings. Traditional methods that rely on pre-built global maps are often impractical in unknown or dynamic settings, limiting exploration and adaptability.
To address these limitations, researchers have turned to mapless navigation approaches, which use only local environmental information. Reinforcement Learning (RL), in which an agent learns by trial and error to maximize the reward it receives from its environment, has shown promise. Combined with Deep Learning (DL), which lets models learn from high-dimensional data, it forms Deep Reinforcement Learning (DRL). DRL lets a robot learn a navigation policy directly from interaction with its environment, which makes it well suited to mapless navigation.
One prominent DRL algorithm for continuous action spaces is the Deep Deterministic Policy Gradient (DDPG). While DDPG excels in many robotic control tasks, its application to complex maze navigation has faced hurdles. These include difficulties with sparse rewards (where positive feedback is rare), inefficient exploration strategies, and challenges in planning over long distances, often leading to low success rates and poor performance.
Introducing Hierarchical Deep Deterministic Policy Gradient (HDDPG)
To overcome these shortcomings, a new approach called Hierarchical DDPG (HDDPG) has been proposed. It splits the maze navigation task into a two-level structure.
The high-level policy acts as a strategic planner. Built on an improved DDPG framework, it generates intermediate “subgoals” from a long-term perspective. These subgoals steer the robot toward the final destination along a collision-free direction, which gives the system long-horizon path planning.
The low-level policy acts as the tactical executor. Also built on an improved DDPG algorithm, it takes the current environmental observations and the subgoal assigned by the high-level policy and generates primitive actions, such as linear and angular velocities, to reach that subgoal.
This hierarchy simplifies the overall task and makes learning more efficient: the high-level policy focuses on the strategic path, while the low-level policy handles motion control.
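The two-level loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: both policies are hand-written placeholders rather than trained DDPG networks, the subgoal is refreshed every `subgoal_interval` steps (a hypothetical parameter), the robot is a simple unicycle model, and obstacles and sensor readings are left out entirely.

```python
import math

def high_level_policy(position, goal, step=1.0):
    """Placeholder planner (not a learned network): propose a subgoal
    at most `step` metres from the robot, in the direction of the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        return goal
    return (position[0] + step * dx / dist, position[1] + step * dy / dist)

def low_level_policy(pose, subgoal, max_v=0.2, max_w=1.0):
    """Placeholder controller: return (linear, angular) velocity
    that turns toward the subgoal and drives forward when aligned."""
    x, y, theta = pose
    heading = math.atan2(subgoal[1] - y, subgoal[0] - x)
    # Wrap the heading error into [-pi, pi].
    error = math.atan2(math.sin(heading - theta), math.cos(heading - theta))
    w = max(-max_w, min(max_w, 2.0 * error))
    v = max_v * max(0.0, math.cos(error))
    return v, w

def step_unicycle(pose, action, dt=0.1):
    """Hypothetical obstacle-free kinematics used in place of a simulator."""
    x, y, theta = pose
    v, w = action
    return (x + v * math.cos(theta) * dt, y + v * math.sin(theta) * dt, theta + w * dt)

def run_episode(start_pose, goal, subgoal_interval=10, max_steps=2000, tol=0.1):
    pose, subgoal = start_pose, None
    for t in range(max_steps):
        if t % subgoal_interval == 0:           # high level: pick a new subgoal
            subgoal = high_level_policy(pose[:2], goal)
        action = low_level_policy(pose, subgoal)  # low level: act toward it
        pose = step_unicycle(pose, action)
        if math.hypot(goal[0] - pose[0], goal[1] - pose[1]) < tol:
            return True, t + 1
    return False, max_steps

reached, steps = run_episode((0.0, 0.0, 0.0), (2.0, 1.0))
```

The point of the sketch is the division of labor: the planner is consulted only occasionally and reasons about where to go next, while the controller runs at every step and only needs to reach a nearby target.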
Key Innovations for Enhanced Performance
The HDDPG algorithm incorporates several enhancements aimed at stability, efficiency, and exploration. First, Off-policy Correction: In hierarchical DRL, the low-level policy keeps changing during training, so a subgoal stored in the replay buffer may no longer produce the low-level behavior that was recorded with it. HDDPG addresses this with an off-policy correction method that relabels past subgoals in the high-level experience buffer, so that the stored data better matches what the current low-level policy would do. This leads to more accurate value estimates and more stable training.
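One common way to relabel subgoals, sketched below, is to pick from a small set of candidates the subgoal under which the current low-level policy best reproduces the actions that were actually stored. Whether HDDPG uses exactly this selection rule is an assumption here; the candidate set, the squared-error criterion, the 2D position subgoals, and all names (`relabel_subgoal`, `toy_policy`) are hypothetical.

```python
import random

def relabel_subgoal(states, actions, original_subgoal, low_level_policy,
                    n_random=8, sigma=0.5, rng=None):
    """Return the candidate subgoal whose predicted actions (under the
    current low-level policy) are closest to the stored actions.
    Assumed scheme for illustration, not the paper's confirmed procedure."""
    rng = rng or random.Random(0)
    reached = states[-1]  # position actually reached at the end of the segment
    candidates = [original_subgoal, reached]
    candidates += [(reached[0] + rng.gauss(0.0, sigma),
                    reached[1] + rng.gauss(0.0, sigma)) for _ in range(n_random)]

    def action_error(subgoal):
        total = 0.0
        # zip stops at the shorter list, so the final state has no action.
        for state, action in zip(states, actions):
            predicted = low_level_policy(state, subgoal)
            total += sum((p - a) ** 2 for p, a in zip(predicted, action))
        return total

    return min(candidates, key=action_error)

def toy_policy(state, subgoal, gain=0.5):
    """Stand-in for the current low-level actor: move a fraction of the way."""
    return (gain * (subgoal[0] - state[0]), gain * (subgoal[1] - state[1]))

# Build a stored segment whose actions are consistent with subgoal (1, 1)...
true_goal = (1.0, 1.0)
states, actions = [(0.0, 0.0)], []
for _ in range(5):
    a = toy_policy(states[-1], true_goal)
    actions.append(a)
    states.append((states[-1][0] + a[0], states[-1][1] + a[1]))

# ...but pretend the buffer holds a subgoal that no longer explains them.
stale_goal = (3.0, -2.0)
new_goal = relabel_subgoal(states, actions, stale_goal, toy_policy)
```

In this toy case the stale label is replaced by a subgoal close to the one that actually explains the stored actions, which is the effect the correction is meant to have on the high-level buffer.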
Second, Adaptive Parameter Space Noise: Instead of adding random noise directly to the robot’s actions, HDDPG applies adaptive noise to the parameters (weights and biases) of the actor networks. This approach promotes more consistent and effective exploration. The magnitude of this noise is dynamically adjusted based on the agent’s learning progress, ensuring that exploration remains efficient throughout training and helps avoid getting stuck in suboptimal solutions.
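A minimal sketch of parameter-space noise with an adaptive scale follows. The adaptation rule used here (shrink the noise when the perturbed actor's actions drift too far from the unperturbed ones, grow it otherwise) is borrowed from the usual parameter-space-noise recipe and is an assumption, since the exact progress signal HDDPG uses is not described above. The tiny linear actor, `target`, and `alpha` are hypothetical.

```python
import math
import random

def act(weights, bias, state):
    """Tiny linear actor with tanh squashing (stand-in for an actor network)."""
    return [math.tanh(sum(w * s for w, s in zip(row, state)) + b)
            for row, b in zip(weights, bias)]

def perturb(weights, bias, sigma, rng):
    """Add Gaussian noise to the parameters, not to the actions."""
    noisy_w = [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]
    noisy_b = [b + rng.gauss(0.0, sigma) for b in bias]
    return noisy_w, noisy_b

def action_distance(weights, bias, noisy_w, noisy_b, states):
    """Root-mean-square gap between clean and perturbed actions on a batch."""
    total = 0.0
    for s in states:
        clean, noisy = act(weights, bias, s), act(noisy_w, noisy_b, s)
        total += sum((c - n) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return math.sqrt(total / len(states))

def adapt_sigma(sigma, distance, target=0.1, alpha=1.01):
    """Assumed rule: too much behavioral change -> less noise, else more."""
    return sigma / alpha if distance > target else sigma * alpha

rng = random.Random(0)
weights, bias = [[0.5, -0.3], [0.1, 0.8]], [0.0, 0.0]
states = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(32)]
sigma = 0.5
for _ in range(200):
    noisy_w, noisy_b = perturb(weights, bias, sigma, rng)
    d = action_distance(weights, bias, noisy_w, noisy_b, states)
    sigma = adapt_sigma(sigma, d)
```

Because the same perturbed weights are used for a whole rollout, the resulting exploration is consistent across states, in contrast to independent noise added to each action.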
Third, Reshaped Intrinsic-Extrinsic Reward Function: The paper reshapes the reward into two parts. The low-level controller receives “intrinsic” rewards based on its progress toward the current subgoal (e.g., positive for getting closer, negative for moving away, and a large penalty for collisions). The high-level controller receives “extrinsic” rewards for reaching the final goal and also incorporates the cumulative rewards from the low-level policy. Together these provide dense feedback at every step rather than a single sparse signal at the goal, which accelerates learning.
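The structure of this two-part reward can be sketched as below. Only the shape follows the description above; the constants (`progress_scale`, `collision_penalty`, `goal_bonus`, `mix`) are hypothetical placeholders, not the values used in the paper.

```python
def intrinsic_reward(prev_dist, curr_dist, collided,
                     progress_scale=10.0, collision_penalty=-100.0):
    """Low-level reward: positive when the robot moved closer to the
    current subgoal, negative when it moved away, large penalty on collision.
    All constants are illustrative assumptions."""
    if collided:
        return collision_penalty
    return progress_scale * (prev_dist - curr_dist)

def extrinsic_reward(low_level_rewards, reached_goal, goal_bonus=100.0, mix=1.0):
    """High-level reward: bonus for reaching the final goal plus the
    cumulative low-level reward collected while pursuing the subgoal."""
    bonus = goal_bonus if reached_goal else 0.0
    return bonus + mix * sum(low_level_rewards)

closer = intrinsic_reward(1.0, 0.8, collided=False)   # moved toward the subgoal
farther = intrinsic_reward(0.8, 1.0, collided=False)  # moved away from it
crash = intrinsic_reward(1.0, 0.8, collided=True)
high = extrinsic_reward([1.0, 2.0], reached_goal=True)
```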
Finally, Further Optimizations: Techniques like gradient clipping (to prevent large updates that can destabilize training) and Xavier initialization (for setting initial network weights to ensure consistent variance across layers) are also employed to improve the overall robustness and stability of the algorithm.
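Both techniques are standard and easy to state in code. The sketch below assumes the uniform variant of Xavier initialization and clipping by global gradient norm; the text above does not say which variants HDDPG uses, so treat those choices, and the function names, as assumptions.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform init: sample weights from U(-limit, limit)
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat gradient vector so its L2 norm is at most max_norm.
    Assumed clipping variant; value clipping would be another option."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

rng = random.Random(0)
layer = xavier_uniform(fan_in=24, fan_out=64, rng=rng)  # hypothetical layer sizes
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # input norm is 5
```

Norm clipping keeps the direction of the update and only limits its length, which is why it is a common choice for stabilizing actor-critic training.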
Rigorous Evaluation and Promising Results
The proposed HDDPG algorithm was rigorously evaluated through numerical simulation experiments using the Robot Operating System (ROS) and Gazebo, a 3D simulation environment. The experiments involved a TurtleBot3 mobile robot navigating through three distinct maze scenarios with varying final target locations, ranging from easier to more complex. The performance of HDDPG was compared against the standard DDPG algorithm and its variant, D4PG, using two key metrics: success rate (SR) and average score (AS).
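For reference, the two metrics can be computed from per-episode logs roughly as follows. This assumes SR is the percentage of episodes that reach the target and AS is the mean cumulative reward per episode; the episode data shown is made up purely to exercise the functions and is not taken from the paper.

```python
def success_rate(outcomes):
    """Percentage of episodes in which the robot reached the target."""
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)

def average_score(episode_rewards):
    """Mean cumulative reward per episode (assumed definition of AS)."""
    return sum(episode_rewards) / len(episode_rewards)

# Hypothetical logs for four test episodes.
sr = success_rate([True, False, True, True])
avg = average_score([10.0, -5.0, 25.0])
```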
The results demonstrated HDDPG’s significant superiority. For instance, in the easiest maze scenario, HDDPG achieved an impressive average success rate of 89.90%, dramatically outperforming DDPG (0.75%) and D4PG (33.31%). In more challenging scenarios, where DDPG and D4PG often failed completely with 0% success rates, HDDPG consistently achieved high success rates (e.g., 82.43% in scenario 2 and 70.82% in scenario 3). The average scores also showed similar dramatic improvements, indicating that HDDPG not only succeeded more often but also did so more efficiently.
These findings highlight that HDDPG effectively addresses the limitations of traditional DDPG and its variants in complex maze navigation tasks. By breaking down long-horizon problems, enhancing exploration, and refining reward mechanisms, HDDPG provides a more reliable, stable, and scalable solution for autonomous mobile robot navigation. For more in-depth details, you can refer to the full research paper: Hierarchical Deep Deterministic Policy Gradient for Autonomous Maze Navigation of Mobile Robots.


