TLDR: ZeST is a robot navigation system that uses Large Language Models (LLMs) to predict terrain traversability in real time in unknown environments, eliminating the need for dangerous data collection. It segments images, queries an LLM for traversability scores, models uncertainty using a Normal-Inverse-Gamma distribution, and uses risk-aware path planning (RRT*) and control (MPPI) to guide robots safely. In the reported experiments, ZeST outperforms other state-of-the-art methods in both indoor and outdoor settings, demonstrating a robust and safer approach to autonomous navigation.
Robotics and autonomous navigation are evolving quickly, and one critical challenge is getting robots to accurately assess and navigate diverse terrain. Traditionally, training robots to understand terrain traversability has meant putting them in potentially hazardous environments, risking equipment damage and compromising safety. This labor-intensive process often requires extensive manual labeling and expert annotations, making it costly and time-consuming.
A new approach called ZeST (Zero-Shot Traversability) changes this. ZeST leverages the visual reasoning capabilities of Large Language Models (LLMs) to create real-time traversability maps without exposing robots to danger. The method enables zero-shot traversability, meaning the robot can navigate unknown environments without prior training data for that specific terrain. It also accelerates the development of advanced navigation systems, offering a cost-effective and scalable solution.
How ZeST Works
ZeST operates as a modular navigation system designed for unstructured and unknown environments. Its core objective is to allow autonomous robots to navigate safely and efficiently without needing prior knowledge or extensive data collection. Here’s a breakdown of its key components:
- Mask Generation: Before querying an LLM, ZeST pre-processes input images. It uses off-the-shelf models like Segment Anything Model (SAM) or Simple Linear Iterative Clustering (SLIC) to automatically segment images into distinct regions based on visual similarity. These regions are then assigned unique identifiers, creating a numbered version of the image for the LLM.
- Querying the Large Language Model: ZeST then queries a multimodal LLM (like GPT-4o) to predict traversability for each segmented region. The prompts provided to the LLM include contextual information about the robot’s characteristics (e.g., size, mobility) and examples of terrain types with their corresponding traversability values. The LLM processes this input and outputs a list of traversability values for each region.
- Learning a Traversability Distribution: Recognizing that LLM predictions can vary, ZeST models traversability as a latent probabilistic distribution rather than a single value. It uses a Normal-Inverse-Gamma (NIG) distribution to capture both aleatoric uncertainty (inherent measurement noise) and epistemic uncertainty (uncertainty due to limited data), providing a more robust representation of terrain.
- Risk Assessment: To ensure safe navigation, ZeST quantifies risk using the Conditional Value at Risk (CVaR) metric. This involves computing the expected value of traversability given that it falls below a certain threshold, effectively identifying the worst-case scenarios within a given area. This risk information is crucial for making informed navigation decisions.
- Traversability-based Path Planning: ZeST employs a sampling-based RRT* (Rapidly-exploring Random Tree Star) algorithm for path planning. Unlike traditional methods that only check for collisions, ZeST’s RRT* incorporates the CVaR cost and epistemic uncertainty into its cost function. This guides the planner to favor routes that are not only short but also safer and easier for the robot to navigate, especially in areas with high uncertainty.
- Traversability-based Model Predictive Controller: For real-time control, ZeST uses a Model Predictive Path Integral (MPPI) controller. This controller samples random actions and minimizes a cost function that balances accurate path tracking with maximizing traversability (safety). Importantly, it includes a speed-conditioned epistemic uncertainty cost, prompting the robot to slow down in unknown areas to gather more information and reduce uncertainty.
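The uncertainty and risk steps above can be made concrete with a small sketch. This is not the authors' implementation: it assumes traversability scores lie in [0, 1] with higher meaning easier to cross, uses the textbook conjugate update for a Normal-Inverse-Gamma belief, estimates CVaR empirically from a set of scores, and combines them in a hypothetical planner edge cost. The class `NIG`, the functions `lower_tail_cvar` and `edge_cost`, the prior values, and the weights `w_risk` and `w_unc` are all illustrative names and numbers, not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class NIG:
    """Normal-Inverse-Gamma belief over one map cell's traversability."""
    mu: float = 0.5     # prior mean traversability (assumed value)
    kappa: float = 1.0  # pseudo-count backing the mean (assumed value)
    alpha: float = 2.0  # shape; must exceed 1 for finite variances
    beta: float = 0.1   # scale (assumed value)

    def update(self, x: float) -> "NIG":
        """Textbook conjugate update for a single observed score x."""
        kappa_n = self.kappa + 1.0
        mu_n = (self.kappa * self.mu + x) / kappa_n
        alpha_n = self.alpha + 0.5
        beta_n = self.beta + self.kappa * (x - self.mu) ** 2 / (2.0 * kappa_n)
        return NIG(mu_n, kappa_n, alpha_n, beta_n)

    @property
    def aleatoric(self) -> float:
        """E[sigma^2]: expected noise in individual observations."""
        return self.beta / (self.alpha - 1.0)

    @property
    def epistemic(self) -> float:
        """Var[mu]: uncertainty about the mean itself."""
        return self.beta / (self.kappa * (self.alpha - 1.0))


def lower_tail_cvar(values, level=0.2):
    """Mean of the worst `level` fraction of traversability values."""
    ordered = sorted(values)
    k = max(1, math.ceil(level * len(ordered)))
    return sum(ordered[:k]) / k


def edge_cost(length, cvar, epistemic, w_risk=5.0, w_unc=2.0):
    """Hypothetical edge cost: distance plus risk and uncertainty penalties."""
    return length + w_risk * (1.0 - cvar) + w_unc * epistemic


# Feed three made-up LLM scores for the same cell into the belief.
cell = NIG()
for score in [0.8, 0.7, 0.9]:
    cell = cell.update(score)
print(cell.mu, cell.aleatoric, cell.epistemic)
```

With these example scores the epistemic term drops after each update, which is the behavior the planner and controller rely on: cells that have been observed repeatedly become cheaper to cross than cells the robot knows little about.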
Real-World Performance
ZeST was implemented on a TerraSentia robot, equipped with a LiDAR and a Jetson AGX for onboard computation, along with a GSM router for online GPT-4o API calls. The system was rigorously tested in both controlled indoor and unstructured outdoor environments, and its performance was compared against state-of-the-art methods like NoMaD and CoNVOI.
The results were compelling: ZeST achieved a 100% success rate in indoor cluttered environments (10 out of 10 runs) and outdoor forest-like environments (5 out of 5 runs). In contrast, NoMaD and CoNVOI struggled, demonstrating that ZeST’s zero-shot approach and robust uncertainty modeling provide superior generalization capabilities in novel settings.
While querying an LLM can introduce latency (typically 1-2.5 seconds, with occasional spikes up to 5 seconds), ZeST addresses this by generating a 10-meter OctoMap and slowing down the robot in unknown areas. This allows the robot to update its map and learn the true distribution of the location, enhancing safety. For mask generation, ZeST opts for SLIC over SAM due to its significantly faster processing time (0.1 seconds vs. 1 second per image) while yielding similar LLM responses.
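A quick back-of-the-envelope calculation shows why slowing down matters under these latencies. The sketch below uses the latency figures quoted above; the robot speeds are assumed for illustration, and `blind_distance` and `speed_uncertainty_cost` are hypothetical helpers, not functions from the ZeST codebase. The second helper shows one simple way a speed-conditioned uncertainty penalty could be shaped, as an assumption rather than the paper's actual cost term.

```python
def blind_distance(speed_mps: float, latency_s: float) -> float:
    """Distance covered while waiting for one LLM response."""
    return speed_mps * latency_s


def speed_uncertainty_cost(speed_mps: float, epistemic: float, weight: float = 1.0) -> float:
    """Hypothetical penalty that grows with both speed and uncertainty."""
    return weight * speed_mps * epistemic


for latency in (1.0, 2.5, 5.0):   # latencies quoted in the text, in seconds
    for speed in (0.5, 1.0):      # assumed robot speeds, in m/s
        dist = blind_distance(speed, latency)
        print(f"{speed} m/s, {latency} s latency -> {dist:.2f} m")
```

At an assumed 1 m/s, a 5-second spike means the robot covers 5 m before the next response arrives, while halving the speed halves that distance. Under the product-form penalty, halving the speed also halves the cost of crossing an uncertain cell, so a controller minimizing it would tend to slow down where the map is least certain.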
Conclusion
ZeST represents a significant step forward in autonomous navigation. By integrating multimodal LLMs with probabilistic mapping, it enables robots to create global traversability maps in a zero-shot manner, eliminating the need for dangerous physical interaction during training. The system’s ability to quantify and manage uncertainty, combined with its efficient path planning and control, results in safer and more efficient navigation. This research paves the way for more robust and autonomous robotic systems capable of understanding complex environments.


