TLDR: RynnEC is a new video multimodal large language model (MLLM) designed for embodied cognition. It uses a region encoder and mask decoder for flexible, fine-grained video interaction, achieving state-of-the-art performance in object understanding, segmentation, and spatial reasoning. To overcome data limitations, it employs an egocentric video data generation pipeline and introduces RynnEC-Bench for evaluation. RynnEC aims to be a general-purpose cognitive core for embodied agents, enabling more precise interactions in the physical world.
In the rapidly evolving landscape of artificial intelligence, the ability of machines to perceive and interact with the physical world, known as embodied cognition, is an active research frontier. A new development from DAMO Academy, Alibaba Group, in collaboration with Hupan Lab and Zhejiang University, introduces RynnEC, a video multimodal large language model (MLLM) specifically engineered for this challenging domain.
RynnEC stands out by integrating a region encoder and a mask decoder into a general-purpose vision-language foundation model. This unique architecture allows for highly flexible and precise region-level video interaction. Unlike traditional MLLMs that might struggle with the nuances of physical world perception, RynnEC is designed to offer fine-grained understanding, making it a powerful “brain” for embodied agents.
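To make the idea of a region-level feature concrete, here is a minimal sketch of masked average pooling, one common way to turn a binary region mask and a feature map into a single region vector. Whether RynnEC's region encoder works this way is not stated in this article; the function name `masked_average_pool` and the nested-list layout are assumptions chosen purely for illustration.

```python
from typing import List

Grid = List[List[List[float]]]  # H x W x C feature map (hypothetical layout)
Mask = List[List[int]]          # H x W binary mask, 1 = inside the region


def masked_average_pool(features: Grid, mask: Mask) -> List[float]:
    """Average the feature vectors at positions where the mask is 1.

    Illustrative only: this is a generic pooling technique, not the
    documented RynnEC region encoder.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for row_feats, row_mask in zip(features, mask):
        for feat, inside in zip(row_feats, row_mask):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        # Empty region: return a zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


# Usage: a 2 x 2 feature map with 2 channels and a diagonal region mask.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
mask = [[1, 0],
        [0, 1]]
region_vector = masked_average_pool(features, mask)
```

The resulting vector could then be handed to a language model as a token standing for the region, which is the general pattern region-level MLLMs tend to follow.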
Despite its compact design, RynnEC has demonstrated impressive capabilities, achieving state-of-the-art performance across crucial areas such as object property understanding, accurate object segmentation, and sophisticated spatial reasoning. This region-centric approach to video processing provides a new paradigm for how embodied agents can interpret their surroundings, leading to more accurate and effective interactions.
A significant hurdle in developing embodied cognition models is the scarcity of annotated 3D datasets. To overcome this, the researchers behind RynnEC have devised an innovative egocentric video-based pipeline for generating high-quality embodied cognition data. This method transforms raw egocentric RGB videos into rich question-answering datasets, making data expansion more cost-effective and scalable.
To rigorously evaluate these capabilities, the team also introduced RynnEC-Bench, a comprehensive, region-centered benchmark. It assesses 22 distinct tasks spanning both object and spatial cognition, providing a robust framework for evaluating embodied understanding models in open-world scenarios. RynnEC-Bench is designed to reflect real-world object frequencies and includes diverse question types, from numerical and textual to segmentation-based queries.
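For segmentation-based queries, a predicted mask has to be compared against a reference mask. The article does not say which metric RynnEC-Bench uses, so the sketch below shows intersection-over-union (IoU), a standard choice for this kind of comparison, as an assumption. The name `mask_iou` is hypothetical.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape.

    Generic metric shown for illustration; not confirmed as the
    RynnEC-Bench scoring rule.
    """
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    if union == 0:
        # Both masks are empty: treat them as a perfect match.
        return 1.0
    return intersection / union


# Usage: the masks overlap in one pixel and cover three pixels in total.
pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [1, 0]]
score = mask_iou(pred, target)
```

Numerical and textual answers would need different scoring, for example a tolerance on numeric error or a match against reference text.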
The training of RynnEC follows a progressive four-stage pipeline: Mask Alignment, Object Understanding, Spatial Understanding, and Referring Segmentation. This curriculum-based approach incrementally enhances the model’s fine-grained, object-centric understanding, ensuring a gradual integration of visual, spatial, and grounding knowledge without compromising its overall performance.
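The staged curriculum can be pictured as a simple ordered loop in which each stage starts from the state the previous one produced. The sketch below uses the four stage names given above; everything else (`Stage`, `run_curriculum`, the `train_stage` callback and the list-based state) is a hypothetical scaffold, not the authors' training code.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

State = TypeVar("State")


@dataclass(frozen=True)
class Stage:
    name: str  # stage name as listed in the article


# Order matters: each stage builds on what the previous one learned.
STAGES: List[Stage] = [
    Stage("Mask Alignment"),
    Stage("Object Understanding"),
    Stage("Spatial Understanding"),
    Stage("Referring Segmentation"),
]


def run_curriculum(
    stages: List[Stage],
    train_stage: Callable[[Stage, State], State],
    state: State,
) -> State:
    """Run stages in order, threading the model state through each one."""
    for stage in stages:
        state = train_stage(stage, state)
    return state


# Usage with a stand-in trainer that only records which stages ran.
def record_stage(stage: Stage, history: List[str]) -> List[str]:
    return history + [stage.name]


history = run_curriculum(STAGES, record_stage, [])
```

In a real setup the state would be model weights and `train_stage` would pick the data mix and trainable modules for that stage; those details are not covered here.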
Experimental results highlight RynnEC’s strong performance. Even with only 7 billion parameters, it surpasses leading proprietary models like Gemini-2.5 Pro in overall embodied cognitive abilities. Its performance is balanced across tasks and particularly strong in spatial cognition. A smaller 2-billion-parameter version offers near-parity performance, making it suitable for resource-constrained environments and on-device deployment.
RynnEC’s generalization capabilities have also been validated on purely textual spatial intelligence benchmarks like VSI-Bench, demonstrating that its spatial awareness can effectively transfer across different modalities. This suggests that robust foundational spatial cognition is key to achieving superior performance in high-level planning and decision-making tasks for embodied agents.
In practical terms, RynnEC holds significant promise for assisting robots with complex, long-horizon tasks in intricate environments. Its fine-grained object localization and understanding, direction and distance perception, spatial scale estimation, and counting abilities allow robots to perform more delicate manipulations and navigate efficiently. This advancement paves the way for more valuable real-world applications of embodied intelligence.
The researchers view RynnEC as a foundational step towards a general embodied intelligence model. Future work aims to further enhance its reasoning capabilities by integrating its diverse skills for joint reasoning and to develop a unified perception and planning framework, ultimately forming a closed-loop embodied system. For more technical details, you can refer to the full research paper available at arXiv.


