TLDR: Human-Guided Distributional Soft Actor-Critic (H-DSAC) is a new human-in-the-loop reinforcement learning method designed for safe and efficient real-world autonomous driving. It integrates human expertise by using a distributional proxy value function to guide policy learning, rewarding expert actions and penalizing interventions. This approach improves sample efficiency and safety, enabling autonomous vehicles to learn complex driving policies directly in real environments within practical training times, as validated in both simulations and real-world experiments.
Autonomous driving holds immense promise for transforming transportation, offering benefits like enhanced road safety, reduced traffic congestion, and expanded mobility. However, bringing this technology to life, especially through reinforcement learning (RL), faces significant hurdles. Traditional RL methods often struggle to stay safe during learning, require vast amounts of data, and transfer poorly from simulation to the unpredictable real world.
RL for autonomous driving faces four core challenges. The first is poor sample efficiency: the system needs extensive data collection, which is costly and risky, particularly for rare but critical events. The second is safety during training, since trial-and-error exploration can lead to dangerous maneuvers. The third is reward design, because driving involves balancing multiple objectives such as safety, comfort, and efficiency. The fourth is the ‘sim-to-real’ gap, which often causes models trained in simulation to perform poorly when deployed in actual vehicles.
To address these issues, researchers are increasingly turning to human-in-the-loop (HIL) reinforcement learning. Human experts possess invaluable insights into driving tasks, which can significantly boost exploration efficiency and reduce the reliance on risky trial-and-error learning. HIL methods establish a feedback loop where human experts can actively participate, refine policies, provide guidance when needed, or offer feedback on collected driving trajectories. This not only improves learning efficiency but also simplifies the challenge of designing complex reward functions.
A new method, Human-Guided Distributional Soft Actor-Critic (H-DSAC), has been proposed to tackle these challenges head-on. This innovative approach integrates human guidance directly into the learning process, aiming for safer, more robust, and sample-efficient autonomous driving in real-world environments. H-DSAC combines two key components: Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC).
The central idea behind H-DSAC is the creation of a distributional proxy value function. This function is designed to capture human intent by assigning higher expected returns to actions demonstrated by human experts and penalizing actions that require human intervention. By extrapolating these ‘labels’ to states where no human input was given, the system effectively guides the autonomous driving policy towards expert-like behavior. This process allows the agent to learn fundamental driving skills efficiently and safely, balancing human expertise with autonomous discovery.
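The labeling idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the proxy value is a Gaussian over returns, that expert actions are labeled +1 and overridden agent actions -1, and that the fit is measured with a Gaussian negative log-likelihood. The names `GaussianValue`, `gaussian_nll`, and `proxy_value_loss` are hypothetical.

```python
import math
from dataclasses import dataclass

# Assumed labels; the article does not state the exact target values.
HUMAN_TARGET = 1.0    # return assigned to the action the human demonstrated
NOVICE_TARGET = -1.0  # return assigned to the agent action the human overrode


@dataclass
class GaussianValue:
    """Hypothetical distributional value estimate for one (state, action) pair."""
    mean: float
    std: float


def gaussian_nll(pred: GaussianValue, target: float) -> float:
    """Negative log-likelihood of `target` under the predicted Gaussian."""
    var = pred.std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (target - pred.mean) ** 2 / (2 * var)


def proxy_value_loss(q_human: GaussianValue, q_novice: GaussianValue) -> float:
    """Loss for one intervened step: pull the human action's value toward
    the positive label and the overridden agent action's value toward the
    negative label."""
    return gaussian_nll(q_human, HUMAN_TARGET) + gaussian_nll(q_novice, NOVICE_TARGET)


# A value function that already prefers the human action incurs a lower loss
# than one that prefers the overridden agent action.
aligned = proxy_value_loss(GaussianValue(1.0, 1.0), GaussianValue(-1.0, 1.0))
misaligned = proxy_value_loss(GaussianValue(-1.0, 1.0), GaussianValue(1.0, 1.0))
print(aligned < misaligned)
```

In a full training loop, a loss of this kind would only apply to steps where the human took over; how the values spread to states without human input depends on the temporal-difference update, which this sketch omits.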
The H-DSAC framework operates with two types of data buffers: a ‘novice buffer’ for data collected by the agent during its own exploration, and a ‘human buffer’ for expert demonstrations provided by a human supervisor. Initially, human intervention is more frequent, guiding the randomly initialized agent. As the agent’s policy improves and converges towards expert-level driving, the need for human intervention gradually decreases, leading to fully autonomous capability.
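The two-buffer arrangement might look like the following sketch. The routing rule (a step goes to the human buffer whenever a human action is present), the even split when sampling, and all names (`Transition`, `DualBuffer`, `takeover_rate`) are assumptions made for illustration; the article does not specify them.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transition:
    """One environment step; `human_action` is None when no takeover occurred."""
    state: Tuple[float, ...]
    agent_action: Tuple[float, ...]
    human_action: Optional[Tuple[float, ...]]
    next_state: Tuple[float, ...]


class DualBuffer:
    """Hypothetical container holding the novice and human buffers."""

    def __init__(self, capacity: int = 50_000) -> None:
        self.novice: deque = deque(maxlen=capacity)
        self.human: deque = deque(maxlen=capacity)

    def add(self, t: Transition) -> None:
        # Steps with a human takeover go to the human buffer, the rest to the novice buffer.
        if t.human_action is not None:
            self.human.append(t)
        else:
            self.novice.append(t)

    def sample(self, batch_size: int, rng: random.Random) -> Tuple[List[Transition], List[Transition]]:
        # Assumed policy: draw half of each batch from each buffer when enough data exists.
        half = batch_size // 2
        human = rng.sample(list(self.human), min(half, len(self.human)))
        novice = rng.sample(list(self.novice), min(batch_size - half, len(self.novice)))
        return human, novice

    def takeover_rate(self) -> float:
        # Fraction of stored steps in which the human intervened.
        total = len(self.human) + len(self.novice)
        return len(self.human) / total if total else 0.0


buf = DualBuffer()
for i in range(10):
    human_action = (0.0, 0.5) if i < 3 else None  # pretend the first 3 steps were takeovers
    buf.add(Transition((float(i),), (0.1, 0.2), human_action, (float(i + 1),)))

human_batch, novice_batch = buf.sample(4, random.Random(0))
print(len(human_batch), len(novice_batch), buf.takeover_rate())
```

Tracking the takeover rate alongside the buffers gives a simple way to observe the decline in human intervention that the paragraph above describes.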
The effectiveness of H-DSAC was rigorously tested in both simulation and real-world environments. In simulation experiments conducted on the MetaDrive safety benchmark, H-DSAC significantly outperformed standard RL algorithms like SAC, PPO, and DSAC, as well as offline RL methods (CQL, BC) and other HIL approaches (HG-DAgger, IWR, PVP). It achieved higher episodic returns, lower safety costs (fewer collisions), and a higher success rate in navigating diverse scenarios.
Perhaps most impressively, H-DSAC demonstrated its capability in real-world experiments using an Unmanned Ground Vehicle (UGV) on the campus roads of Tianjin University. The vehicle was trained on a specific route for approximately two hours (100,000 steps). During the initial phase, human takeovers were frequent due to the untrained policy. However, as training progressed, the system gradually improved, with the takeover rate decreasing. Even when faced with complex scenarios involving pedestrians, cyclists, and other vehicles, the system adapted and became more robust. By 80,000 steps, the policy stabilized, allowing the vehicle to complete the route independently without human intervention.
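To put the step counts in wall-clock terms, here is a back-of-the-envelope calculation. It assumes the steps were spread evenly over the two hours, which the article does not confirm, so the results are rough estimates only.

```python
TOTAL_STEPS = 100_000     # reported training length
TOTAL_SECONDS = 2 * 3600  # "approximately two hours"
STABLE_STEP = 80_000      # step at which the policy reportedly stabilized

# Average step rate, assuming steps were evenly spaced in time.
steps_per_second = TOTAL_STEPS / TOTAL_SECONDS

# Elapsed training time when the policy stabilized, under the same assumption.
minutes_to_stable = STABLE_STEP / steps_per_second / 60

print(f"{steps_per_second:.1f} steps/s, about {minutes_to_stable:.0f} min to stabilize")
```

Under that assumption, the agent acted roughly 14 times per second and reached intervention-free driving after a little over an hour and a half of on-road training.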
The real-world tests showcased the UGV’s ability to handle various complex driving situations, including maintaining stable lane positions, executing left turns while avoiding pedestrians, yielding to crossing pedestrians, performing sharp turns, maneuvering around obstacles, navigating past stationary vehicles, and managing intersections with heavy traffic. These results underscore H-DSAC’s potential to enable efficient and safe real-world autonomous driving policy learning within practical training times.
In conclusion, H-DSAC represents a significant step forward in autonomous driving. By integrating human feedback through a distributional proxy value function, it improves sample efficiency, safety, and overall performance. The approach reduces the need for explicit reward engineering and supports robust learning, paving the way for safer and more capable self-driving vehicles on our roads. For more technical details, refer to the full research paper.


