TLDR: VL-KnG is a system that helps robots understand visual scenes for navigation by building spatiotemporal knowledge graphs from video. It overcomes limitations of traditional vision-language models by providing persistent scene memory, improved spatial reasoning, and computational efficiency. The system processes video in chunks, maintains object identity over time using semantic association, and uses a GraphRAG approach for efficient query processing. It was evaluated on a new benchmark, WalkieKnowledge, and demonstrated practical applicability on a real robot, matching the performance of advanced VLMs like Gemini 2.5 Pro while offering explainable reasoning.
Robots navigating complex, unstructured environments need more than just basic vision; they require a deep understanding of their surroundings, including spatial relationships and how objects change over time. Traditional vision-language models (VLMs), while powerful, often struggle with maintaining a persistent memory of a scene, performing intricate spatial reasoning, and scaling efficiently for real-time applications, especially with long video sequences.
Addressing these critical challenges, researchers have introduced VL-KnG (Vision-Language Knowledge Graph), a novel system designed for Visual Scene Understanding. VL-KnG tackles these limitations head-on by constructing spatiotemporal knowledge graphs and employing computationally efficient query processing to identify navigation goals.
How VL-KnG Works
The core of VL-KnG lies in its ability to process video sequences in manageable chunks. It leverages modern VLMs to extract detailed descriptors for objects within each chunk. Instead of treating each chunk in isolation, VL-KnG iteratively builds a persistent knowledge graph. This graph is crucial because it maintains the identity of objects over time, even if their appearance changes due to lighting, occlusion, or different viewpoints. This “semantic-based association” uses large language models to understand if an object seen in one video chunk is the same as an object seen in another, rather than relying solely on visual similarity.
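The paper does not publish its association code, but the idea can be illustrated with a minimal Python sketch. Everything here is hypothetical scaffolding: `ObjectNode`, `same_object`, and `merge_chunk` are illustrative names, and `llm` stands in for any callable that returns a yes/no answer from a language model.

```python
from dataclasses import dataclass

@dataclass
class ObjectNode:
    node_id: str       # persistent identity in the graph
    label: str         # e.g. "chair"
    descriptors: dict  # color, material, size, ... extracted by the VLM

def same_object(llm, a: ObjectNode, b: ObjectNode) -> bool:
    """Ask a language model whether two descriptions denote one physical object."""
    prompt = (
        "Do these two descriptions refer to the same physical object?\n"
        f"A: {a.label}, {a.descriptors}\n"
        f"B: {b.label}, {b.descriptors}\n"
        "Answer yes or no."
    )
    return llm(prompt).strip().lower().startswith("yes")

def merge_chunk(graph: list, detections: list, llm) -> list:
    """Fold one video chunk's detections into the persistent graph."""
    for det in detections:
        match = next((n for n in graph if same_object(llm, n, det)), None)
        if match:
            # Same object seen again (possibly under new lighting/viewpoint):
            # keep its identity, refresh its descriptors.
            match.descriptors.update(det.descriptors)
        else:
            # Genuinely new object: add a node to the persistent memory.
            graph.append(det)
    return graph
```

Because the match is semantic rather than purely visual, a red chair seen in shadow in a later chunk can still be merged into the same node it received earlier.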
The knowledge graph stores rich semantic information about each object, including its color, material, size, potential uses (affordances), and its spatial relationships with other entities. This structured representation acts as a persistent memory of the environment, allowing for advanced, explainable spatial reasoning through queryable graph structures.
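The paper does not specify the exact graph schema, but a persistent, queryable structure of this kind might look like the following toy sketch (class and method names are illustrative, not the authors' API):

```python
class SceneGraph:
    """Toy persistent scene memory: attribute-rich nodes, typed spatial edges."""

    def __init__(self):
        self.nodes = {}  # object_id -> attribute dict (color, material, size, affordances, ...)
        self.edges = []  # (subject_id, relation, object_id) spatial triples

    def add_object(self, oid, **attrs):
        # Re-observing an object updates its attributes rather than duplicating it.
        self.nodes.setdefault(oid, {}).update(attrs)

    def relate(self, subj, relation, obj):
        self.edges.append((subj, relation, obj))

    def related(self, oid):
        """All spatial relations touching one object -- the queryable part of the memory."""
        return [e for e in self.edges if oid in (e[0], e[2])]
```

A query like "what is next to the table?" then reduces to an edge lookup rather than re-watching the video.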
Efficient Navigation Goal Identification
For identifying navigation goals, VL-KnG employs a GraphRAG-based query processing pipeline. When a robot receives a natural language query (e.g., “Find the red chair next to the wooden table”), the system first breaks down the query into key entities, relationships, and constraints. It then efficiently retrieves only the relevant subgraphs from the larger knowledge graph. Finally, using large language model reasoning, it processes this retrieved subgraph to pinpoint the most relevant video frames or locations for the goal, considering both spatial relationships and temporal dynamics.
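The retrieval step can be sketched in a few lines. In the actual system an LLM decomposes the query into entities, relationships, and constraints; the naive keyword matching below is a stand-in for that step, and `retrieve_subgraph` is a hypothetical name, not the paper's implementation.

```python
def retrieve_subgraph(nodes, edges, query):
    """Pull only the query-relevant slice of the knowledge graph (GraphRAG-style).

    nodes: dict of object_id -> attribute dict
    edges: list of (subject_id, relation, object_id) triples
    """
    terms = set(query.lower().split())
    # 1) Seed: nodes whose attributes mention a query term
    #    (an LLM-based parser would do this far more robustly).
    seeds = {nid for nid, attrs in nodes.items()
             if terms & {str(v).lower() for v in attrs.values()}}
    # 2) Expand one hop along spatial relations so the LLM reasoner
    #    sees each candidate's context, not the whole graph.
    hop = [e for e in edges if e[0] in seeds or e[2] in seeds]
    keep = seeds | {e[0] for e in hop} | {e[2] for e in hop}
    return {nid: nodes[nid] for nid in keep}, hop
```

Only this small subgraph, rather than the full scene memory, is handed to the LLM for final goal selection, which is where the speedup over whole-video VLM queries comes from.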
This retrieval-based approach offers significant computational efficiency. While a general-purpose VLM might take minutes to process a query, VL-KnG can provide answers in approximately one second. This speed is vital for real-time deployment in tasks like localization, navigation, and planning.
Introducing WalkieKnowledge Benchmark
To objectively evaluate VL-KnG and other similar methods, the researchers also introduced WalkieKnowledge, a new benchmark dataset. This benchmark features about 200 manually annotated questions across eight diverse trajectories, spanning approximately 100 minutes of video data. It includes four unique query types: object search, scene description, action-place association, and spatial relationship queries. WalkieKnowledge allows for a fair comparison between structured approaches like VL-KnG and general-purpose VLMs, assessing performance through metrics like retrieval accuracy, answer accuracy, and ranking quality.
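The article names the metrics but not their exact definitions, so the sketch below makes two assumptions: accuracy is exact-match over question/answer pairs, and mean reciprocal rank (MRR) stands in for "ranking quality".

```python
def accuracy(predictions, gold):
    """Fraction of exact matches (assumed form of retrieval/answer accuracy)."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Assumed ranking-quality metric: average of 1/rank of the first correct hit."""
    score = 0.0
    for ranked, g in zip(ranked_lists, gold):
        score += 1.0 / (ranked.index(g) + 1) if g in ranked else 0.0
    return score / len(gold)
```

With ~200 questions across eight trajectories, per-query-type breakdowns of these scores are what allows a fair comparison between graph-based and end-to-end VLM approaches.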
Real-World Validation
VL-KnG’s practical applicability was demonstrated through real-world deployment on a differential drive robot. The system achieved a 77.27% success rate and 76.92% answer accuracy, matching the performance of a powerful VLM like Gemini 2.5 Pro. Crucially, VL-KnG provides explainable reasoning, a significant advantage over black-box VLM approaches, making its decisions transparent and understandable.
In conclusion, VL-KnG represents a significant step forward in robot navigation, offering a robust, efficient, and explainable framework for visual scene understanding by combining the strengths of modern VLMs with persistent spatiotemporal knowledge graphs.


