TLDR: SmolRGPT is a compact 600-million-parameter vision-language model designed for efficient spatial reasoning in environments like warehouses. It integrates RGB and depth cues through separate connector and refiner pathways and a three-stage training curriculum. The model placed 3rd in the AI City Challenge 2025 Track 3, with competitive performance on spatial tasks such as left-right relations, counting, and distance estimation. On general qualitative spatial benchmarks it often matches or exceeds much larger models like GPT-4, which makes advanced spatial AI feasible on resource-constrained hardware.
In the rapidly evolving world of artificial intelligence, vision-language models (VLMs) have shown incredible potential for understanding and interacting with the visual world. However, these powerful models often come with a significant drawback: their immense size and computational demands. This makes them challenging to deploy in real-world, resource-constrained environments such as warehouses, robotics, and industrial settings, where efficiency and precise spatial understanding are crucial.
A new research paper introduces SmolRGPT, a compact and efficient vision-language architecture designed to tackle this very challenge. With only 600 million parameters, SmolRGPT aims to provide robust spatial reasoning capabilities without the prohibitive computational and memory requirements of much larger models.
Understanding SmolRGPT’s Approach
SmolRGPT distinguishes itself by explicitly incorporating region-level spatial reasoning. It achieves this by integrating both traditional RGB (color) images and depth cues. This dual input allows the model to understand not just what objects are, but also their three-dimensional arrangement and relationships in space.
The model’s architecture builds upon existing efficient VLM frameworks but introduces key innovations. It uses a shared visual feature extractor (SigLip2) for both RGB and depth images. Crucially, it employs separate pathways—an RGB Connector and Refiner, and a Depth Connector and Refiner—to process these distinct visual cues. This design ensures that the model maintains clear representations for each modality, preventing confusion between color and depth information. A technique called pixel shuffling is used in the RGB Connector to create denser feature representations, which helps in capturing more detailed spatial information.
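The pixel shuffling step can be illustrated with a minimal sketch. It assumes the space-to-depth form of the operation, where each small block of neighbouring visual tokens is merged into one wider token, so the token count drops while each token carries more channels. The function name, the grid layout, and the ratio of 2 are illustrative assumptions, not details taken from the SmolRGPT code.

```python
def pixel_shuffle_tokens(grid, ratio):
    """Merge each ratio x ratio block of tokens into a single, wider token.

    grid is a height x width nested list whose cells are feature vectors
    (lists of numbers). The result has (height / ratio) x (width / ratio)
    cells, each of length channels * ratio * ratio.
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid sides must be divisible by ratio")
    shuffled = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            # Concatenate the block's vectors in row-major order.
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        shuffled.append(row)
    return shuffled


# Toy input: a 4x4 grid of 2-channel tokens; the token at (y, x) holds [y, x].
grid = [[[y, x] for x in range(4)] for y in range(4)]
shuffled = pixel_shuffle_tokens(grid, ratio=2)  # ratio=2 is an assumed value
print(len(shuffled), len(shuffled[0]), len(shuffled[0][0]))
```

With a ratio of 2, sixteen 2-channel tokens become four 8-channel tokens, so the language model sees fewer tokens without local spatial detail being discarded.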
These refined, region-level features are then integrated into a compact language model, SmolLM2-360M, allowing SmolRGPT to generate natural language responses to complex spatial queries.
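The article does not say how a region-level feature is computed. One common approach in region-aware models is masked average pooling: average the feature vectors that fall inside a region mask to get a single vector for that region. The sketch below shows that idea only; the function name and data layout are hypothetical, and SmolRGPT's refiners may work differently.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors at cells where mask is truthy.

    features: height x width nested list of feature vectors (lists).
    mask:     height x width nested list of 0/1 flags marking the region.
    Returns one vector summarising the region.
    """
    channels = len(features[0][0])
    totals = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    totals[c] += vector[c]
    if count == 0:
        raise ValueError("mask selects no cells")
    return [total / count for total in totals]


# Toy 2x2 feature map with 2 channels and a mask covering the diagonal.
features = [[[1, 2], [3, 4]],
            [[5, 6], [7, 8]]]
mask = [[1, 0],
        [0, 1]]
region_vector = masked_average_pool(features, mask)
print(region_vector)  # [4.0, 5.0]
```

In a design like the one described, the same pooling could be applied to the RGB and depth feature maps separately, giving one RGB vector and one depth vector per region to pass to the language model.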
A Progressive Training Strategy
To achieve its impressive performance with a smaller footprint, SmolRGPT utilizes a carefully designed three-stage training curriculum:
- RGB Connector Alignment: Initially, the model focuses on general vision-language understanding, training only the RGB connector on a large dataset of image-text pairs (LLaVA-CC3M). This stage establishes a foundational understanding of global scenes.
- Depth Connector and Refiner Warmup: The next stage introduces depth information. The depth connector and both RGB and depth refiners are trained on the Open Spatial Dataset (OSD), which provides extensive 3D spatial annotations. This helps the model begin to grasp spatial relationships.
- Supervised Finetuning: In the final stage, all trainable components are jointly finetuned on a specialized warehouse dataset (PhysicalAI-Spatial-Intelligence-Warehouse). This stage adapts the model to the specific spatial reasoning tasks required in industrial environments, such as distance estimation, object counting, and identifying spatial relations.
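The three stages above can be summarized as a freeze schedule. This is a sketch under stated assumptions: the component names are illustrative labels rather than identifiers from the SmolRGPT code, and the exact set of components trained in the final stage is not given in this article, so the example assumes everything except the shared vision encoder is trained.

```python
# Illustrative component labels (assumed, not taken from the SmolRGPT code).
ALL_COMPONENTS = (
    "vision_encoder",
    "rgb_connector",
    "depth_connector",
    "rgb_refiner",
    "depth_refiner",
    "language_model",
)

STAGES = [
    {
        "name": "rgb_connector_alignment",
        "dataset": "LLaVA-CC3M",
        "trainable": {"rgb_connector"},
    },
    {
        "name": "depth_connector_and_refiner_warmup",
        "dataset": "Open Spatial Dataset (OSD)",
        "trainable": {"depth_connector", "rgb_refiner", "depth_refiner"},
    },
    {
        "name": "supervised_finetuning",
        "dataset": "PhysicalAI-Spatial-Intelligence-Warehouse",
        # Assumption: the vision encoder stays frozen; the article only says
        # "all trainable components" are finetuned jointly.
        "trainable": set(ALL_COMPONENTS) - {"vision_encoder"},
    },
]


def frozen_components(stage):
    """Return the components that stay frozen in a stage, in model order."""
    return [name for name in ALL_COMPONENTS if name not in stage["trainable"]]


for stage in STAGES:
    print(f"{stage['name']} on {stage['dataset']}")
    print(f"  train:  {sorted(stage['trainable'])}")
    print(f"  freeze: {frozen_components(stage)}")
```

In a real training loop, a schedule like this would typically be applied by enabling gradients only for the parameters of the components listed as trainable in the current stage.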
Competitive Performance in Warehouse Environments
SmolRGPT’s effectiveness was rigorously evaluated, particularly in the context of the AI City Challenge 2025 Track 3, which focuses on spatial intelligence in warehouses. The model secured 3rd place, demonstrating that a 600M-parameter architecture can compete effectively against significantly larger models.
Key performance highlights include:
- Left-Right Directional Tasks: Achieved an accuracy of 99.80%, indicating a strong grasp of precise spatial semantics.
- Counting Tasks: Showed robust performance with 92.76% accuracy, benefiting from the integration of depth information for better object separation.
- Multiple-Choice Questions: Demonstrated 88.02% accuracy, reflecting a solid understanding of complex spatial queries.
- Distance Estimation: The most challenging task, with 82.13% accuracy, which is still a strong result for a 600M-parameter model.
Beyond warehouse-specific tasks, SmolRGPT also showed competitive results on general qualitative spatial reasoning benchmarks, often matching or exceeding the performance of models like GPT-4 (reportedly about 1.76 trillion parameters, a figure OpenAI has not confirmed) and LLaVA-v1.6-34B (34 billion parameters) on tasks like identifying ‘Behind/Front’ or ‘Tall/Short’ relationships. A model this small is far easier to run on consumer hardware and edge devices than those larger systems.
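To give a rough sense of what the parameter counts mean for deployment, the sketch below estimates the memory needed just to hold model weights at a few common precisions. It is a back-of-the-envelope calculation, not a measurement: it ignores activations, the KV cache, and runtime overhead, and it uses 1 GB = 10^9 bytes. The function name is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Parameter counts as stated in the article.
MODELS = {
    "SmolRGPT": 600_000_000,
    "LLaVA-v1.6-34B": 34_000_000_000,
}

for name, params in MODELS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```

Under these assumptions, the 600M-parameter weights take roughly 1.2 GB in float16, compared with roughly 68 GB for a 34B-parameter model at the same precision.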
The Future of Efficient Spatial AI
The work on SmolRGPT, led by Abdarahmane Traore, Eric Hervet, and Andy Couturier from Embia and Université de Moncton, highlights a crucial step towards deployable multimodal intelligence. By carefully designing the architecture and training curriculum, SmolRGPT narrows the gap between compact models and very large vision-language models without the heavy computational overhead. While there are still areas for improvement, such as absolute size estimation, SmolRGPT paves the way for efficient and practical spatial AI in real-world, resource-constrained settings. You can find more details about this research in the paper available at arXiv:2509.15490.


