TLDR: Researchers have developed VLAC, a Vision-Language-Action-Critic model that significantly improves how robots learn tasks in the real world. By providing dense progress rewards and unifying action generation with task understanding, VLAC enables robots to learn complex manipulation tasks faster and more reliably, even adapting to new environments and benefiting from human guidance. It achieves high success rates and strong generalization across diverse real-world scenarios.
Robots are becoming increasingly capable, but teaching them to perform complex tasks in the real world remains a significant challenge. Traditional methods often rely on sparse, handcrafted rewards and suffer from inefficient exploration, making it difficult for robots to learn new skills effectively. A new research paper introduces a groundbreaking solution: the Vision-Language-Action-Critic (VLAC) model.
The VLAC model is designed to overcome these limitations by providing robots with a more intuitive way to understand progress and generate actions. Built upon a state-of-the-art multi-modal model called InternVL, VLAC is trained on vast and diverse datasets, including vision-language data, robot trajectories, and human demonstrations. This comprehensive training strengthens its perception, dialogue, and reasoning abilities, while also teaching it to estimate progress and generate appropriate actions.
One of VLAC’s key innovations is its ability to act as both a ‘critic’ and a ‘policy’ within a single architecture. As a critic, it can analyze a pair of observations and a language goal to output a ‘dense progress delta’ – essentially, a score indicating how much closer (or further) the robot is to completing its task. This eliminates the need for engineers to manually design rewards for each specific task, a process that is often time-consuming and non-generalizable. As a policy, VLAC directly generates action tokens, guiding the robot’s movements.
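The paper does not publish code alongside this summary, so the following is a hypothetical sketch of what that dual critic/policy interface could look like. The class name `VLACStub`, the `Observation` type, and the scoring logic are all illustrative assumptions; only the shape of the interface, a critic that maps an observation pair plus a language goal to a signed progress delta, and a policy that emits action tokens, comes from the article.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: list  # stand-in for an RGB frame


class VLACStub:
    """Hypothetical sketch of VLAC's dual critic/policy roles.

    Progress is a scalar in [0, 1]; the critic reports the signed
    delta between two observations, which doubles as a dense reward.
    """

    def estimate_progress(self, obs: Observation, goal: str) -> float:
        # Placeholder scoring: a real model would evaluate the image
        # against the language goal with its multi-modal backbone.
        return sum(obs.image) / max(len(obs.image), 1)

    def progress_delta(self, prev: Observation, curr: Observation, goal: str) -> float:
        # Critic role: positive when the robot moved toward the goal,
        # negative when it regressed. No handcrafted reward needed.
        return self.estimate_progress(curr, goal) - self.estimate_progress(prev, goal)

    def act(self, obs: Observation, goal: str) -> list:
        # Policy role: emit discrete action tokens (placeholder values).
        return [0, 1, 2]


model = VLACStub()
reward = model.progress_delta(
    Observation([0.1]), Observation([0.4]), "pick up the bowl"
)
```

The key design point is that both roles share one architecture, so the same forward pass that understands the scene can also judge how far along the task is.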
The model’s training involves several clever strategies to enhance its understanding of task progression. It uses a ‘pair-wise progress understanding’ method, comparing two images to determine relative advancement, regardless of the task’s starting point. This is bolstered by techniques like image difference filtering, joint sampling of forward and backward processes, and cross-sampling of task descriptions to ensure robust and semantically aligned learning. Crucially, VLAC also incorporates ‘in-context learning,’ allowing it to quickly adapt to new tasks and environments by learning from a single reference example.
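As a rough illustration of the pair-wise sampling idea, here is a toy sketch with assumed details: trajectories are lists of (frame, progress) tuples, the image-difference filter is approximated by a threshold on the progress gap, and the function name and parameters are invented for this example.

```python
import random


def sample_progress_pair(traj, min_diff=0.05, rng=random, max_tries=100):
    """Toy sketch of pair-wise progress sampling (details assumed).

    `traj` is a list of (frame, progress) tuples with progress in [0, 1].
    Returns (frame_a, frame_b, delta), where a signed delta covers both
    forward and backward processes in one joint sampling scheme.
    """
    for _ in range(max_tries):
        i, j = rng.sample(range(len(traj)), 2)
        (frame_a, p_a), (frame_b, p_b) = traj[i], traj[j]
        delta = p_b - p_a  # negative deltas label backward motion
        # Stand-in for image difference filtering: reject pairs that
        # are too similar to carry a useful progress signal.
        if abs(delta) >= min_diff:
            return frame_a, frame_b, delta
    raise ValueError("no sufficiently distinct pair found")
```

Because the label is a relative delta rather than an absolute score, the comparison works regardless of where in the task the trajectory starts, which is the property the paper highlights.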
To enable real-world learning, VLAC is integrated into an asynchronous reinforcement learning framework. This infrastructure ensures that robots can continuously interact with their environment, uploading observations and executing actions with minimal delay. The model then uses a policy optimization algorithm called Proximal Policy Optimization (PPO) to refine its actions based on the dense rewards provided by its critic component.
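To make the reward pathway concrete, the sketch below shows (under stated assumptions) how per-step progress deltas from the critic could feed a standard PPO update: deltas are accumulated into discounted returns, and the clipped surrogate objective limits how far each update moves the policy. Function names, the discount factor, and the clipping value are illustrative; only the use of PPO with dense critic rewards is from the article.

```python
def discounted_returns(progress_deltas, gamma=0.99):
    """Accumulate per-step progress deltas (dense rewards) into returns."""
    g, out = 0.0, []
    for r in reversed(progress_deltas):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))


def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping keeps each update
    close to the behavior policy, which matters when experience arrives
    asynchronously from real robots.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The dense signal is what makes this practical: instead of one sparse success bit per episode, every step contributes a small, informative reward.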
Recognizing that initial robot capabilities can be limited, the researchers also introduced a ‘human-in-the-loop’ protocol. This involves three levels of human intervention: ‘Offline Demonstration Replay’ (pre-populating a buffer with expert human data), ‘Return and Explore’ (manually resetting the robot to challenging states for targeted practice), and ‘Human Guided Explore’ (providing micro-demonstrations for specific behaviors). These interventions significantly accelerate exploration and stabilize early learning, leading to faster and more reliable skill acquisition.
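A minimal sketch of how the first intervention level might plug into training, assuming a conventional replay buffer (the class and method names here are hypothetical, not from the paper):

```python
from collections import deque


class ReplayBuffer:
    """Minimal replay buffer sketch for the human-in-the-loop protocol."""

    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def seed_with_demos(self, demos):
        # Level 1, "Offline Demonstration Replay": pre-populate the
        # buffer with expert transitions before any robot interaction,
        # so early updates draw on known-good behavior.
        for transition in demos:
            self.data.append(transition)

    def add(self, transition):
        # Online transitions from the robot (including those gathered
        # after a manual reset or a human micro-demonstration) are
        # appended the same way.
        self.data.append(transition)
```

The other two levels, "Return and Explore" and "Human Guided Explore", shape *which* transitions reach `add` rather than changing the buffer itself: the human steers the robot into hard states or demonstrates a short behavior, and the resulting experience enters training normally.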
Experiments across four distinct real-world manipulation tasks – Rice Scooping and Transfer, Unfold Mat, Pick and Place Bowl, and Desktop Sweep Disposal – demonstrated VLAC’s impressive capabilities. The model lifted success rates from approximately 30% to about 90% within 200 real-world interaction episodes. Incorporating human-in-the-loop interventions further boosted sample efficiency by 50% and achieved up to 100% final success rates. VLAC also showed strong generalization abilities, performing well even under challenging conditions like lighting disturbances and scene changes without requiring additional data collection.
Furthermore, the research explored multi-robot scaling, showing that increasing the number of robots can decrease the data required per robot to reach the same success level, highlighting the potential for efficient large-scale deployment. While the VLAC model represents a significant leap forward, the researchers acknowledge limitations, such as the heuristic nature of human intervention and the model’s current coupling to specific action interfaces. Future work aims to address these by formalizing intervention metrics and designing more architecture-agnostic progress/value bridging layers.
This work provides a practical blueprint for making large Vision-Language-Action models truly self-improving in the physical world, paving the way for more intelligent and adaptable robots. You can read the full research paper here.


