TLDR: EfficientNav is a novel system that allows robots to perform object-goal navigation using smaller language models directly on local devices, overcoming the limitations of cloud-based LLMs. It introduces discrete memory caching to efficiently store and reuse navigation map information, attention-based memory clustering for accurate object grouping, and semantics-aware memory retrieval to prune redundant data. This approach significantly boosts navigation success rates and reduces latency, making advanced robot navigation practical for on-device deployment.
Object-goal navigation (ObjNav) is a challenging robotics task in which an agent must find a specific object in an unfamiliar environment. Traditionally, advanced ObjNav systems have relied heavily on powerful large language models (LLMs) like GPT-4, which typically run on cloud servers. While effective, this approach comes with significant drawbacks: high communication latency, privacy concerns, and substantial computational costs.
The goal is to enable these intelligent navigation capabilities directly on local devices, such as the NVIDIA Jetson AGX Orin, which have limited memory (e.g., 32GB). However, simply switching to smaller LLMs like LLaMA3.2-11b often leads to a considerable drop in success rates because these models struggle to understand complex navigation maps. Furthermore, the detailed descriptions of these maps can create very long prompts, causing high planning latency on local devices.
Introducing EfficientNav: Smart Navigation for Local Devices
A new research paper, EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval, proposes an innovative solution called EfficientNav. This system is designed to enable efficient, LLM-based, zero-shot ObjNav directly on local devices. EfficientNav tackles the core challenges of limited memory and model capacity through three key innovations:
1. Discrete Memory Caching
One major hurdle is the memory constraint of local devices, which prevents storing the entire KV (key-value) cache of navigation map descriptions, while recomputing this cache at each planning step is too slow. EfficientNav addresses this by clustering objects in the navigation map into groups and computing the KV cache for each group independently. At each planning step, only the selected groups are loaded into the LLM, which significantly reduces memory transfer costs and avoids redundant computation. Because each group's cache is computed independently of the others, saved KV caches can be reused even when the order of the context changes.
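The per-group caching idea can be illustrated with a small sketch. This is not the paper's implementation: `GroupCache`, `fake_prefill`, and the group descriptions are hypothetical names chosen for illustration, and `fake_prefill` is only a stand-in for the LLM prefill pass that would produce a real KV cache. The point is that each group's entry is computed once and reused, whatever order the groups are requested in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class GroupCache:
    """Hypothetical per-group cache: one entry per object group."""
    encode: Callable[[str], Tuple[int, ...]]  # stand-in for an LLM prefill pass
    entries: Dict[str, Tuple[int, ...]] = field(default_factory=dict)
    versions: Dict[str, str] = field(default_factory=dict)
    prefill_calls: int = 0

    def get(self, group_id: str, description: str) -> Tuple[int, ...]:
        # Recompute only when the group is new or its description changed.
        if self.versions.get(group_id) != description:
            self.entries[group_id] = self.encode(description)
            self.versions[group_id] = description
            self.prefill_calls += 1
        return self.entries[group_id]

    def assemble(self, selected: List[Tuple[str, str]]) -> List[Tuple[int, ...]]:
        # Collect cached entries for the selected groups, in the requested order.
        return [self.get(gid, desc) for gid, desc in selected]


def fake_prefill(text: str) -> Tuple[int, ...]:
    # Toy encoding (word lengths); a real system would return KV tensors.
    return tuple(len(word) for word in text.split())


cache = GroupCache(encode=fake_prefill)
kitchen = ("kitchen", "oven at (1,2); pot at (1,3)")
bedroom = ("bedroom", "bed at (7,4); lamp at (7,5)")

cache.assemble([kitchen, bedroom])   # two prefill passes
cache.assemble([bedroom, kitchen])   # order changed, nothing is recomputed
print(cache.prefill_calls)           # prints 2
```

In this sketch a group is recomputed only when its description changes, so adding an object to the kitchen group would invalidate that one entry and leave the bedroom entry untouched.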
2. Attention-based Memory Clustering
Simply dividing the map into uniform chunks can lead to a loss of important relationships between objects. EfficientNav introduces attention-based memory clustering to group related information more accurately. It uses the LLM’s own attention mechanisms to cluster newly detected objects into existing groups or form new ones. For example, an oven and a pot are more closely related than an oven and a bed. By grouping objects with strong relationships, the LLM can better understand the environment, improving navigation success rates without adding significant computational overhead.
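As a rough illustration of the grouping step, the sketch below assigns a newly detected object to the existing group it attends to most strongly, or opens a new group when no group is related enough. The article does not give the exact rule, so the mean-attention score, the `threshold` value, and the attention numbers are all assumptions made for this example; in a real system the scores would be read from the LLM's attention weights.

```python
from typing import Dict, List


def assign_to_group(
    new_obj: str,
    groups: Dict[str, List[str]],
    attn: Dict[str, float],
    threshold: float = 0.3,  # assumed cut-off, not taken from the paper
) -> str:
    """Place new_obj in the group with the highest mean attention score.

    attn maps each existing object to a (hypothetical) attention weight
    between it and the new object. A new group is created when no existing
    group scores above the threshold.
    """
    best_group, best_score = None, threshold
    for group_id, members in groups.items():
        score = sum(attn.get(m, 0.0) for m in members) / len(members)
        if score > best_score:
            best_group, best_score = group_id, score

    if best_group is None:
        best_group = f"group_{len(groups)}"
        groups[best_group] = []
    groups[best_group].append(new_obj)
    return best_group


groups = {"group_0": ["oven"], "group_1": ["bed"]}

# A pot attends strongly to the oven, so it joins the oven's group.
pot_group = assign_to_group("pot", groups, {"oven": 0.8, "bed": 0.1})

# A sofa is weakly related to everything seen so far, so it starts a new group.
sofa_group = assign_to_group("sofa", groups, {"oven": 0.05, "pot": 0.05, "bed": 0.2})

print(groups)
```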
3. Semantics-aware Memory Retrieval
Smaller LLMs can struggle to process and understand complex navigation maps, leading to performance drops. To combat this, EfficientNav employs semantics-aware memory retrieval. This mechanism efficiently prunes redundant map information by using a lightweight CLIP model (around 100M parameters) to assess the semantic similarity between object groups and the final navigation goal. The system then formulates this as a knapsack problem to select the most relevant groups within the device’s memory budget. This ensures the LLM focuses only on crucial information, improving its planning performance and overall success rate.
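The selection step can be sketched as a standard 0/1 knapsack solved with dynamic programming. The group names, similarity scores, and memory costs below are made-up values: in the described system the scores would come from the CLIP model, and the cost units and budget would depend on the device. The solver shown here is a textbook one and is not necessarily the one the authors use.

```python
from typing import List, Tuple


def select_groups(groups: List[Tuple[str, float, int]], budget: int) -> List[str]:
    """Pick groups maximizing total similarity within an integer memory budget.

    Each group is (name, similarity_to_goal, memory_cost).
    """
    n = len(groups)
    # best[i][b] = best total similarity using the first i groups with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, value, cost) in enumerate(groups, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]            # skip group i
            if cost <= b:
                candidate = best[i - 1][b - cost] + value
                if candidate > best[i][b]:
                    best[i][b] = candidate         # take group i

    # Walk back through the table to recover which groups were taken.
    chosen: List[str] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, _, cost = groups[i - 1]
            chosen.append(name)
            b -= cost
    return chosen[::-1]


# Hypothetical scores for the goal "mug"; costs are in arbitrary memory units.
candidates = [
    ("kitchen", 0.9, 4),
    ("dining", 0.6, 3),
    ("bedroom", 0.2, 3),
    ("bathroom", 0.3, 2),
]
selected = select_groups(candidates, budget=7)
print(selected)  # prints ['kitchen', 'dining']
```

With a budget of 7 units, the kitchen and dining groups fill the budget exactly and give the highest combined similarity, so the less relevant bedroom and bathroom groups are pruned from the prompt.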
Impressive Results
Extensive experiments demonstrate EfficientNav's effectiveness. It achieves an 11.1% improvement in success rate on the HM3D benchmark compared to GPT-4-based baselines. It also shows a 6.7x reduction in real-time latency and a 4.7x reduction in end-to-end latency compared to a GPT-4 planner. Even when compared to naive LLaMA/LLaVA planners, EfficientNav significantly reduces latency and improves success rates, showing that advanced ObjNav can run efficiently on local devices.
While EfficientNav marks a significant step towards on-device robot navigation, the authors note that LLM inference speed, even after acceleration, may not match that of smaller, specialized models. Therefore, applications requiring extremely low real-time latency should consider this trade-off. Nevertheless, EfficientNav opens new possibilities for deploying intelligent, autonomous agents in real-world environments without constant reliance on cloud infrastructure.


