RynnEC: A Video Multimodal Model for Embodied Cognition

TLDR: RynnEC is a new video multimodal large language model (MLLM) designed for embodied cognition. It uses a region encoder and mask decoder for flexible, fine-grained video interaction, achieving state-of-the-art performance in object understanding, segmentation, and spatial reasoning. To overcome data limitations, it employs an egocentric video data generation pipeline and introduces RynnEC-Bench for evaluation. RynnEC aims to be a general-purpose cognitive core for embodied agents, enabling more precise interactions in the physical world.

Embodied cognition, the ability of machines to perceive and interact with the physical world, is an active research frontier in artificial intelligence. Researchers at DAMO Academy (Alibaba Group), in collaboration with Hupan Lab and Zhejiang University, have introduced RynnEC, a video multimodal large language model (MLLM) built specifically for this domain.

RynnEC integrates a region encoder and a mask decoder into a general-purpose vision-language foundation model. This architecture supports flexible, precise region-level video interaction. Whereas general-purpose MLLMs can struggle with the nuances of physical-world perception, RynnEC is designed for fine-grained understanding, which makes it a candidate “brain” for embodied agents.
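To make the idea of a region encoder more concrete, here is a minimal sketch of one common way to turn a mask into a single region feature: average the visual features at the masked positions. This is an illustrative assumption, not RynnEC's documented implementation; the function name `mask_pool` and the toy feature map are hypothetical.

```python
from typing import List

Feature = List[float]


def mask_pool(features: List[List[Feature]], mask: List[List[int]]) -> Feature:
    """Average the feature vectors at positions where the mask is 1.

    Hypothetical helper: a generic mask-pooling scheme, assumed here
    only for illustration of what a region encoder might compute.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, selected in zip(feat_row, mask_row):
            if selected:
                count += 1
                for i, value in enumerate(feat):
                    total[i] += value
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


# Toy 2x2 feature map with 2-dimensional features per position.
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# The mask selects the top row only.
mask = [
    [1, 1],
    [0, 0],
]
region_token = mask_pool(features, mask)
```

In a design like this, the pooled vector would be passed to the language model as a token standing for the referenced region, so a question can point at an object instead of describing it in words.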

Despite its compact design, RynnEC achieves state-of-the-art performance in object property understanding, object segmentation, and spatial reasoning. Its region-centric approach to video processing gives embodied agents a new way to interpret their surroundings, leading to more accurate interactions.

A significant hurdle in developing embodied cognition models is the scarcity of annotated 3D datasets. To overcome this, the researchers devised a pipeline based on egocentric video for generating high-quality embodied cognition data. It transforms raw egocentric RGB videos into question-answering datasets, making data expansion more cost-effective and scalable.
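As a rough illustration of the last step of such a pipeline, the sketch below turns per-object annotations into question-answer pairs using templates. Everything here is assumed for the example: the `ObjectTrack` fields, the `<objectN>` reference syntax, and the templates are hypothetical and are not taken from the RynnEC pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ObjectTrack:
    """Hypothetical record for one object tracked through a video."""
    object_id: int
    category: str
    color: str


def make_qa(tracks: List[ObjectTrack]) -> List[Dict[str, str]]:
    """Build templated QA pairs from object annotations (illustrative only)."""
    qa: List[Dict[str, str]] = []
    for track in tracks:
        ref = f"<object{track.object_id}>"  # assumed placeholder for a region
        qa.append({"question": f"What is {ref}?", "answer": track.category})
        qa.append({"question": f"What color is {ref}?", "answer": track.color})

    # Counting questions, one per category, in alphabetical order.
    counts: Dict[str, int] = {}
    for track in tracks:
        counts[track.category] = counts.get(track.category, 0) + 1
    for category, n in sorted(counts.items()):
        qa.append({
            "question": f"How many objects of category '{category}' appear in the video?",
            "answer": str(n),
        })
    return qa


tracks = [
    ObjectTrack(1, "mug", "white"),
    ObjectTrack(2, "mug", "blue"),
    ObjectTrack(3, "kettle", "black"),
]
qa_pairs = make_qa(tracks)
```

The point of the sketch is the cost argument: once objects are segmented and tracked in ordinary RGB video, many questions can be derived from the same annotations without new 3D labeling.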

To evaluate these capabilities, the team also introduced RynnEC-Bench, a region-centered benchmark. It assesses 22 distinct tasks spanning both object and spatial cognition, providing a framework for evaluating embodied understanding models in open-world scenarios. RynnEC-Bench is designed to reflect real-world object frequencies and includes diverse question types, from numerical and textual to segmentation-based queries.
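Different question types call for different scoring rules. The sketch below shows two generic choices: a relative-error check for numerical answers and intersection-over-union (IoU) for segmentation answers. These are standard metrics chosen for illustration; the article does not state which metrics RynnEC-Bench uses, and the `tolerance` value is an arbitrary assumption.

```python
from typing import List

Mask = List[List[int]]


def numeric_score(pred: float, target: float, tolerance: float = 0.25) -> float:
    """Return 1.0 if the relative error is within `tolerance`, else 0.0.

    The 25% tolerance is an assumed value, not a benchmark setting.
    """
    if target == 0:
        return 1.0 if pred == 0 else 0.0
    relative_error = abs(pred - target) / abs(target)
    return 1.0 if relative_error <= tolerance else 0.0


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


close_enough = numeric_score(1.1, 1.0)   # distance estimate within tolerance
too_far = numeric_score(2.0, 1.0)        # distance estimate outside tolerance
overlap = mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

Textual answers are harder to score mechanically and are left out of this sketch.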

The training of RynnEC follows a progressive four-stage pipeline: Mask Alignment, Object Understanding, Spatial Understanding, and Referring Segmentation. This curriculum-based approach incrementally enhances the model’s fine-grained, object-centric understanding, ensuring a gradual integration of visual, spatial, and grounding knowledge without compromising its overall performance.
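The staged schedule can be pictured as a loop in which each stage starts from the state left by the previous one. The sketch below uses the four stage names from the article; the `run_curriculum` function, the state dictionary, and the placeholder training step are hypothetical stand-ins and say nothing about the actual data or hyperparameters used at each stage.

```python
from typing import Callable, Dict, List

# Stage names as listed in the article, in training order.
STAGES: List[str] = [
    "Mask Alignment",
    "Object Understanding",
    "Spatial Understanding",
    "Referring Segmentation",
]

State = Dict[str, List[str]]


def run_curriculum(stages: List[str],
                   train_stage: Callable[[str, State], State]) -> State:
    """Run stages in order, feeding each stage the previous stage's result."""
    state: State = {"completed": []}
    for stage in stages:
        state = train_stage(stage, state)
    return state


def placeholder_train(stage: str, state: State) -> State:
    """Stand-in for real training: only records that the stage ran."""
    return {"completed": state["completed"] + [stage]}


final_state = run_curriculum(STAGES, placeholder_train)
```

Carrying the state forward is the essential property: later stages build on earlier skills rather than training each capability from scratch.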

In the reported experiments, RynnEC, with only 7 billion parameters, surpasses leading proprietary models such as Gemini-2.5 Pro in overall embodied cognitive abilities. Its performance is balanced across tasks and particularly strong in spatial cognition. A smaller 2-billion-parameter version offers near-parity performance, making it suitable for resource-constrained environments and on-device deployment.

RynnEC’s generalization has also been tested on VSI-Bench, a video-based spatial intelligence benchmark whose questions are posed purely in text, without region masks. Its results there indicate that its spatial awareness carries over from region-based interaction to text-only queries. This suggests that robust foundational spatial cognition is key to achieving strong performance in high-level planning and decision-making tasks for embodied agents.

In practical terms, RynnEC shows promise for assisting robots with complex, long-horizon tasks in intricate environments. Its fine-grained object localization and understanding, direction and distance perception, spatial scale estimation, and counting abilities can help robots perform more delicate manipulations and navigate efficiently.

The researchers view RynnEC as a foundational step towards a general embodied intelligence model. Future work aims to strengthen its reasoning by combining its diverse skills for joint reasoning, and to develop a unified perception and planning framework that ultimately forms a closed-loop embodied system. For more technical details, refer to the full research paper on arXiv.
