Pelican-VL 1.0: A New Open-Source Brain for Real-World Robotics

TLDR: Pelican-VL 1.0 is a new family of open-source embodied brain models (7B-72B parameters) designed to bridge the gap between digital perception and real-world embodied cognition. It uses a novel training framework called Deliberate Practice Policy Optimization (DPPO), inspired by human metacognition, which iteratively refines skills and expands competence through a synergistic RL-SFT loop. This model achieves state-of-the-art performance on embodied benchmarks, demonstrates robust capabilities in tasks like tactile manipulation, affordance reasoning, multi-robot collaboration, and long-horizon planning, and is the largest open-source embodied multimodal brain model available.

The field of Artificial Intelligence is constantly pushing boundaries, and a significant leap has been made with the introduction of Pelican-VL 1.0. This new family of open-source embodied brain models aims to bridge the critical gap between digital perception and real-world embodied cognition, enabling AI to truly interact with our physical environment.

Developed by the WFM System Group and the Beijing Innovation Center of Humanoid Robotics (X-Humanoid), Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model available. It comes in various parameter scales, from 7 billion to 72 billion, making powerful intelligence accessible for diverse robotic embodiments. The core mission behind Pelican-VL 1.0 is to embed powerful intelligence into various physical forms, allowing them to perceive, reason, and interact with the world around us.

A Novel Training Approach: Deliberate Practice Policy Optimization (DPPO)

What sets Pelican-VL 1.0 apart is its innovative training framework, Deliberate Practice Policy Optimization (DPPO). Inspired by human metacognition – the ability to learn how to learn – DPPO employs a dynamic “metaloop” that teaches the AI to practice deliberately. This metaloop alternates between two synergistic phases: a Reinforcement Learning (RL) phase for skill refinement and autonomous weakness detection, and a Supervised Fine-Tuning (SFT) phase for consolidating knowledge and expanding competence.

This iterative process allows the model to continuously diagnose, target, and remediate its embodied weaknesses. The RL phase helps the model explore and strengthen brittle abilities, while the SFT phase absorbs and stabilizes these improvements. This ensures that Pelican-VL 1.0 not only learns from vast amounts of data but also intelligently refines its understanding and capabilities over time.
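In code form, this metaloop is essentially a diagnose-refine-consolidate cycle. The article does not publish DPPO's actual training interface, so the sketch below is a minimal illustration under assumed names: `score`, `rl_refine`, and `sft_consolidate` are placeholders for the benchmark scorer, the RL update, and the SFT update, not the authors' API.

```python
from typing import Callable, List

def dppo_metaloop(
    model: object,
    skills: List[str],
    score: Callable[[object, str], float],          # benchmark score per skill
    rl_refine: Callable[[object, List[str]], object],
    sft_consolidate: Callable[[object, List[str]], object],
    threshold: float = 0.7,
    num_cycles: int = 5,
) -> object:
    """Alternate RL refinement and SFT consolidation on diagnosed weaknesses."""
    for _ in range(num_cycles):
        # Diagnose: benchmark every capability and flag the brittle ones.
        weak = [s for s in skills if score(model, s) < threshold]
        if not weak:
            break  # capability profile is balanced; deliberate practice done
        # RL phase: explore and strengthen the weak skills via rewards.
        model = rl_refine(model, weak)
        # SFT phase: absorb and stabilize the improvements (replaying broader
        # data at this step is what guards against catastrophic forgetting).
        model = sft_consolidate(model, weak)
    return model
```

The design point the article emphasizes is the alternation itself: RL discovers what the model cannot yet do, and SFT locks in what it has just learned, so neither phase undoes the other's progress.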

Massive Data and Robust Performance

Pelican-VL 1.0 is trained on a massive, high-quality dataset of over 4 billion tokens drawn from 231 million images and 29,000 hours of video. This diverse data pool covers a wide range of task formats, from open-ended question answering to grounding annotations and multiple-choice questions. The data is specifically curated for the challenges of embodied AI: physical, spatial, and numerical reasoning; perception; grounding; multi-object consistency; temporal understanding; and decision-making.

The model’s performance is impressive, achieving state-of-the-art results on well-known embodied benchmarks. It demonstrates a significant performance uplift from its base model and even outperforms many 100B-level open-source counterparts. Crucially, the DPPO framework ensures stability during training, preventing issues like catastrophic forgetting while enhancing embodied capabilities. A new 9-dimension capability taxonomy was used to analyze models, revealing that Pelican-VL 1.0 achieves a comprehensively balanced capability profile across all dimensions, including critical areas like Decision and Task Planning and Scene and Action Understanding.

Real-World Applications and Embodied Intelligence

Pelican-VL 1.0 has been validated through extensive real-world experiments across various downstream applications:

  • Zero-shot Object Manipulation with Affordance: The model can perform complex pick-and-place tasks without prior fine-tuning or human demonstrations. It uses multi-view visual inputs to generate consistent affordances, allowing robots to understand how objects can be manipulated and execute actions effectively.
  • Closing the Sensorimotor Loop: According to the authors, it is the first Vision-Language Model to close the sensorimotor loop by proactively predicting and then continuously refining grasp force. This enables robust yet gentle grasping of delicate and compliant objects, mimicking human-like tactile adaptation (a minimal control-loop sketch follows this list).
  • Embodied Function Call: Pelican-VL 1.0 facilitates multi-robot collaboration by decomposing complex system-level tasks into behavior-level plans and parameterized action function calls for different robotic embodiments (see the dispatch sketch after this list). An industrial light-bulb inspection pipeline was successfully constructed using this capability.
  • Long-Horizon Task Reasoning and Planning: The model can interpret natural-language instructions and autonomously complete sequences of manipulation and navigation actions in complex domestic environments, demonstrating coherent understanding of spatial relations and long-term task control.
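To make the "closing the loop" idea concrete, here is a hypothetical control loop of the kind that proactive force prediction plus tactile refinement implies. The article gives no actual robot interface, so `predict_force`, `apply_force`, and `sense` are assumed callbacks, not Pelican-VL's API:

```python
def gentle_grasp(predict_force, apply_force, sense, max_steps=50, tol=0.05):
    """Closed sensorimotor loop: start from a predicted grasp force, then
    refine it from tactile feedback until the hold is stable and gentle.
    All three callbacks are hypothetical stand-ins for robot/VLM interfaces."""
    force = predict_force()              # proactive estimate from the model
    for _ in range(max_steps):
        apply_force(force)
        slip, deformation = sense()      # tactile readings, normalized 0..1
        if slip < tol and deformation < tol:
            return force                 # stable grip without crushing
        # Tighten when the object slips; ease off when it deforms.
        force += 0.1 if slip >= tol else -0.1
    return force
```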
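The embodied function call pattern can be sketched in the same spirit. The article does not publish Pelican-VL's actual call schema, so the robot names, functions, and parameters below are invented for illustration:

```python
import json

# A behavior-level plan the model might emit for the bulb-inspection pipeline.
plan = [
    {"robot": "arm_1",  "call": "pick",     "args": {"object": "bulb_17", "force_n": 2.0}},
    {"robot": "arm_1",  "call": "place",    "args": {"target": "inspection_rig"}},
    {"robot": "tester", "call": "run_test", "args": {"test": "filament_continuity"}},
    {"robot": "arm_2",  "call": "sort",     "args": {"bin": "defective"}},
]

def dispatch(plan, executors):
    """Route each parameterized call to the matching robot's executor."""
    for step in plan:
        executors[step["robot"]](step["call"], **step["args"])

def log_executor(name):
    # Stand-in for real hardware: just print what would be executed.
    return lambda call, **args: print(f"[{name}] {call}({json.dumps(args)})")

dispatch(plan, {r: log_executor(r) for r in ("arm_1", "arm_2", "tester")})
```

This decomposition is what enables multi-robot collaboration: one system-level instruction fans out into per-embodiment calls that each robot executes with its own controller.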

The development of Pelican-VL 1.0 marks a significant step towards creating truly generalist robots. By open-sourcing the inference codebase, training code, and base model checkpoints, the creators hope to empower the community to train and customize their own embodied brain models. This work lays a strong foundation for future advances in autonomous, self-evolving intelligence, aiming for a "ChatGPT moment" for robotics. You can learn more about this research by reading the full paper.

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
