Pelican-VL 1.0: A New Open-Source Brain for Real-World Robotics

TLDR: Pelican-VL 1.0 is a new family of open-source embodied brain models (7B-72B parameters) designed to bridge the gap between digital perception and real-world embodied cognition. It uses a novel training framework called Deliberate Practice Policy Optimization (DPPO), inspired by human metacognition, which iteratively refines skills and expands competence through a synergistic RL-SFT loop. This model achieves state-of-the-art performance on embodied benchmarks, demonstrates robust capabilities in tasks like tactile manipulation, affordance reasoning, multi-robot collaboration, and long-horizon planning, and is the largest open-source embodied multimodal brain model available.

The field of Artificial Intelligence is constantly pushing boundaries, and a significant leap has been made with the introduction of Pelican-VL 1.0. This new family of open-source embodied brain models aims to bridge the critical gap between digital perception and real-world embodied cognition, enabling AI to truly interact with our physical environment.

Developed by the WFM System Group and the Beijing Innovation Center of Humanoid Robotics (X-Humanoid), Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model available. It comes in various parameter scales, from 7 billion to 72 billion, making powerful intelligence accessible for diverse robotic embodiments. The core mission behind Pelican-VL 1.0 is to embed powerful intelligence into various physical forms, allowing them to perceive, reason, and interact with the world around us.

A Novel Training Approach: Deliberate Practice Policy Optimization (DPPO)

What sets Pelican-VL 1.0 apart is its innovative training framework, Deliberate Practice Policy Optimization (DPPO). Inspired by human metacognition – the ability to learn how to learn – DPPO employs a dynamic “metaloop” that teaches the AI to practice deliberately. This metaloop alternates between two synergistic phases: a Reinforcement Learning (RL) phase for skill refinement and autonomous weakness detection, and a Supervised Fine-Tuning (SFT) phase for consolidating knowledge and expanding competence.

This iterative process allows the model to continuously diagnose, target, and remediate its embodied weaknesses. The RL phase helps the model explore and strengthen brittle abilities, while the SFT phase absorbs and stabilizes these improvements. This ensures that Pelican-VL 1.0 not only learns from vast amounts of data but also intelligently refines its understanding and capabilities over time.
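In code form, this metaloop is essentially a diagnose-refine-consolidate cycle. The article does not publish DPPO's actual training interface, so the sketch below is a minimal illustration under assumed names: `score`, `rl_refine`, and `sft_consolidate` are placeholders for the benchmark scorer, the RL update, and the SFT update, not the authors' API.

```python
from typing import Callable, List

def dppo_metaloop(
    model: object,
    skills: List[str],
    score: Callable[[object, str], float],          # benchmark score per skill
    rl_refine: Callable[[object, List[str]], object],
    sft_consolidate: Callable[[object, List[str]], object],
    threshold: float = 0.7,
    num_cycles: int = 5,
) -> object:
    """Alternate RL refinement and SFT consolidation on diagnosed weaknesses."""
    for _ in range(num_cycles):
        # Diagnose: benchmark every capability and flag the brittle ones.
        weak = [s for s in skills if score(model, s) < threshold]
        if not weak:
            break  # capability profile is balanced; deliberate practice done
        # RL phase: explore and strengthen the weak skills via rewards.
        model = rl_refine(model, weak)
        # SFT phase: absorb and stabilize the improvements (replaying broader
        # data at this step is what guards against catastrophic forgetting).
        model = sft_consolidate(model, weak)
    return model
```

The design point the article emphasizes is the alternation itself: RL discovers what the model cannot yet do, and SFT locks in what it has just learned, so neither phase undoes the other's progress.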

Massive Data and Robust Performance

Pelican-VL 1.0 is trained on a massive, high-quality dataset of over 4 billion tokens drawn from 231 million images and 29,000 hours of video. This diverse data pool covers a wide range of task formats, from open-ended question answering to grounding annotations and multiple-choice questions. The data is specifically curated for the challenges of embodied AI: physical, spatial, and numerical reasoning; perception; grounding; multi-object consistency; temporal understanding; and decision-making.

The model’s performance is impressive, achieving state-of-the-art results on well-known embodied benchmarks. It demonstrates a significant performance uplift from its base model and even outperforms many 100B-level open-source counterparts. Crucially, the DPPO framework ensures stability during training, preventing issues like catastrophic forgetting while enhancing embodied capabilities. A new 9-dimension capability taxonomy was used to analyze models, revealing that Pelican-VL 1.0 achieves a comprehensively balanced capability profile across all dimensions, including critical areas like Decision and Task Planning and Scene and Action Understanding.

Real-World Applications and Embodied Intelligence

Pelican-VL 1.0 has been validated through extensive real-world experiments across various downstream applications:

  • Zero-shot Object Manipulation with Affordance: The model can perform complex pick-and-place tasks without prior fine-tuning or human demonstrations. It uses multi-view visual inputs to generate consistent affordances, allowing robots to understand how objects can be manipulated and execute actions effectively.
  • Closing the Sensorimotor Loop: According to the authors, it is the first Vision-Language Model to close the sensorimotor loop by proactively predicting and then continuously refining grasp force. This enables robust yet gentle grasping of delicate and compliant objects, mimicking human-like tactile adaptation (a minimal control-loop sketch follows this list).
  • Embodied Function Call: Pelican-VL 1.0 facilitates multi-robot collaboration by decomposing complex system-level tasks into behavior-level plans and parameterized action function calls for different robotic embodiments (see the dispatch sketch after this list). An industrial light-bulb inspection pipeline was successfully constructed using this capability.
  • Long-Horizon Task Reasoning and Planning: The model can interpret natural-language instructions and autonomously complete sequences of manipulation and navigation actions in complex domestic environments, demonstrating coherent understanding of spatial relations and long-term task control.
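To make the "closing the loop" idea concrete, here is a hypothetical control loop of the kind that proactive force prediction plus tactile refinement implies. The article gives no actual robot interface, so `predict_force`, `apply_force`, and `sense` are assumed callbacks, not Pelican-VL's API:

```python
def gentle_grasp(predict_force, apply_force, sense, max_steps=50, tol=0.05):
    """Closed sensorimotor loop: start from a predicted grasp force, then
    refine it from tactile feedback until the hold is stable and gentle.
    All three callbacks are hypothetical stand-ins for robot/VLM interfaces."""
    force = predict_force()              # proactive estimate from the model
    for _ in range(max_steps):
        apply_force(force)
        slip, deformation = sense()      # tactile readings, normalized 0..1
        if slip < tol and deformation < tol:
            return force                 # stable grip without crushing
        # Tighten when the object slips; ease off when it deforms.
        force += 0.1 if slip >= tol else -0.1
    return force
```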
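The embodied function call pattern can be sketched in the same spirit. The article does not publish Pelican-VL's actual call schema, so the robot names, functions, and parameters below are invented for illustration:

```python
import json

# A behavior-level plan the model might emit for the bulb-inspection pipeline.
plan = [
    {"robot": "arm_1",  "call": "pick",     "args": {"object": "bulb_17", "force_n": 2.0}},
    {"robot": "arm_1",  "call": "place",    "args": {"target": "inspection_rig"}},
    {"robot": "tester", "call": "run_test", "args": {"test": "filament_continuity"}},
    {"robot": "arm_2",  "call": "sort",     "args": {"bin": "defective"}},
]

def dispatch(plan, executors):
    """Route each parameterized call to the matching robot's executor."""
    for step in plan:
        executors[step["robot"]](step["call"], **step["args"])

def log_executor(name):
    # Stand-in for real hardware: just print what would be executed.
    return lambda call, **args: print(f"[{name}] {call}({json.dumps(args)})")

dispatch(plan, {r: log_executor(r) for r in ("arm_1", "arm_2", "tester")})
```

This decomposition is what enables multi-robot collaboration: one system-level instruction fans out into per-embodiment calls that each robot executes with its own controller.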

The development of Pelican-VL 1.0 marks a significant step towards creating truly generalist robots. By open-sourcing the inference codebase, training code, and base model checkpoints, the creators hope to empower the community to train and customize their own embodied brain models. This work lays a strong foundation for future advances in autonomous, self-evolving intelligence, aiming for a "ChatGPT moment" for robotics. You can learn more about this research by reading the full paper.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
