TLDR: Researchers have developed VLAC, a Vision-Language-Action-Critic model that significantly improves how robots learn tasks in the real world. By providing dense progress rewards and unifying action generation with task understanding, VLAC enables robots to learn complex manipulation tasks faster and more reliably, even adapting to new environments and benefiting from human guidance. It achieves high success rates and strong generalization across diverse real-world scenarios.
Robots are becoming increasingly capable, but teaching them to perform complex tasks in the real world remains a significant challenge. Traditional methods often rely on sparse, handcrafted rewards and suffer from inefficient exploration, making it difficult for robots to learn new skills effectively. A new research paper introduces a groundbreaking solution: the Vision-Language-Action-Critic (VLAC) model.
The VLAC model is designed to overcome these limitations by providing robots with a more intuitive way to understand progress and generate actions. Built upon a state-of-the-art multi-modal model called InternVL, VLAC is trained on vast and diverse datasets, including vision-language data, robot trajectories, and human demonstrations. This comprehensive training strengthens its perception, dialogue, and reasoning abilities, while also teaching it to estimate progress and generate appropriate actions.
One of VLAC’s key innovations is its ability to act as both a ‘critic’ and a ‘policy’ within a single architecture. As a critic, it can analyze a pair of observations and a language goal to output a ‘dense progress delta’ – essentially, a score indicating how much closer (or further) the robot is to completing its task. This eliminates the need for engineers to manually design rewards for each specific task, a process that is often time-consuming and non-generalizable. As a policy, VLAC directly generates action tokens, guiding the robot’s movements.
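The paper does not publish code alongside this summary, so the following is a hypothetical sketch of what that dual critic/policy interface could look like. The class name `VLACStub`, the `Observation` type, and the scoring logic are all illustrative assumptions; only the shape of the interface, a critic that maps an observation pair plus a language goal to a signed progress delta, and a policy that emits action tokens, comes from the article.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    image: list  # stand-in for an RGB frame


class VLACStub:
    """Hypothetical sketch of VLAC's dual critic/policy roles.

    Progress is a scalar in [0, 1]; the critic reports the signed
    delta between two observations, which doubles as a dense reward.
    """

    def estimate_progress(self, obs: Observation, goal: str) -> float:
        # Placeholder scoring: a real model would evaluate the image
        # against the language goal with its multi-modal backbone.
        return sum(obs.image) / max(len(obs.image), 1)

    def progress_delta(self, prev: Observation, curr: Observation, goal: str) -> float:
        # Critic role: positive when the robot moved toward the goal,
        # negative when it regressed. No handcrafted reward needed.
        return self.estimate_progress(curr, goal) - self.estimate_progress(prev, goal)

    def act(self, obs: Observation, goal: str) -> list:
        # Policy role: emit discrete action tokens (placeholder values).
        return [0, 1, 2]


model = VLACStub()
reward = model.progress_delta(
    Observation([0.1]), Observation([0.4]), "pick up the bowl"
)
```

The key design point is that both roles share one architecture, so the same forward pass that understands the scene can also judge how far along the task is.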
The model’s training involves several clever strategies to enhance its understanding of task progression. It uses a ‘pair-wise progress understanding’ method, comparing two images to determine relative advancement, regardless of the task’s starting point. This is bolstered by techniques like image difference filtering, joint sampling of forward and backward processes, and cross-sampling of task descriptions to ensure robust and semantically aligned learning. Crucially, VLAC also incorporates ‘in-context learning,’ allowing it to quickly adapt to new tasks and environments by learning from a single reference example.
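As a rough illustration of the pair-wise sampling idea, here is a toy sketch with assumed details: trajectories are lists of (frame, progress) tuples, the image-difference filter is approximated by a threshold on the progress gap, and the function name and parameters are invented for this example.

```python
import random


def sample_progress_pair(traj, min_diff=0.05, rng=random, max_tries=100):
    """Toy sketch of pair-wise progress sampling (details assumed).

    `traj` is a list of (frame, progress) tuples with progress in [0, 1].
    Returns (frame_a, frame_b, delta), where a signed delta covers both
    forward and backward processes in one joint sampling scheme.
    """
    for _ in range(max_tries):
        i, j = rng.sample(range(len(traj)), 2)
        (frame_a, p_a), (frame_b, p_b) = traj[i], traj[j]
        delta = p_b - p_a  # negative deltas label backward motion
        # Stand-in for image difference filtering: reject pairs that
        # are too similar to carry a useful progress signal.
        if abs(delta) >= min_diff:
            return frame_a, frame_b, delta
    raise ValueError("no sufficiently distinct pair found")
```

Because the label is a relative delta rather than an absolute score, the comparison works regardless of where in the task the trajectory starts, which is the property the paper highlights.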
To enable real-world learning, VLAC is integrated into an asynchronous reinforcement learning framework. This infrastructure ensures that robots can continuously interact with their environment, uploading observations and executing actions with minimal delay. The model then uses a policy optimization algorithm called Proximal Policy Optimization (PPO) to refine its actions based on the dense rewards provided by its critic component.
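To make the reward pathway concrete, the sketch below shows (under stated assumptions) how per-step progress deltas from the critic could feed a standard PPO update: deltas are accumulated into discounted returns, and the clipped surrogate objective limits how far each update moves the policy. Function names, the discount factor, and the clipping value are illustrative; only the use of PPO with dense critic rewards is from the article.

```python
def discounted_returns(progress_deltas, gamma=0.99):
    """Accumulate per-step progress deltas (dense rewards) into returns."""
    g, out = 0.0, []
    for r in reversed(progress_deltas):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))


def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping keeps each update
    close to the behavior policy, which matters when experience arrives
    asynchronously from real robots.
    """
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

The dense signal is what makes this practical: instead of one sparse success bit per episode, every step contributes a small, informative reward.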
Recognizing that initial robot capabilities can be limited, the researchers also introduced a ‘human-in-the-loop’ protocol. This involves three levels of human intervention: ‘Offline Demonstration Replay’ (pre-populating a buffer with expert human data), ‘Return and Explore’ (manually resetting the robot to challenging states for targeted practice), and ‘Human Guided Explore’ (providing micro-demonstrations for specific behaviors). These interventions significantly accelerate exploration and stabilize early learning, leading to faster and more reliable skill acquisition.
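A minimal sketch of how the first intervention level might plug into training, assuming a conventional replay buffer (the class and method names here are hypothetical, not from the paper):

```python
from collections import deque


class ReplayBuffer:
    """Minimal replay buffer sketch for the human-in-the-loop protocol."""

    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)

    def seed_with_demos(self, demos):
        # Level 1, "Offline Demonstration Replay": pre-populate the
        # buffer with expert transitions before any robot interaction,
        # so early updates draw on known-good behavior.
        for transition in demos:
            self.data.append(transition)

    def add(self, transition):
        # Online transitions from the robot (including those gathered
        # after a manual reset or a human micro-demonstration) are
        # appended the same way.
        self.data.append(transition)
```

The other two levels, "Return and Explore" and "Human Guided Explore", shape *which* transitions reach `add` rather than changing the buffer itself: the human steers the robot into hard states or demonstrates a short behavior, and the resulting experience enters training normally.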
Experiments across four distinct real-world manipulation tasks – Rice Scooping and Transfer, Unfold Mat, Pick and Place Bowl, and Desktop Sweep Disposal – demonstrated VLAC’s impressive capabilities. The model lifted success rates from approximately 30% to about 90% within 200 real-world interaction episodes. Incorporating human-in-the-loop interventions further boosted sample efficiency by 50% and achieved up to 100% final success rates. VLAC also showed strong generalization abilities, performing well even under challenging conditions like lighting disturbances and scene changes without requiring additional data collection.
Furthermore, the research explored multi-robot scaling, showing that increasing the number of robots can decrease the data required per robot to reach the same success level, highlighting the potential for efficient large-scale deployment. While the VLAC model represents a significant leap forward, the researchers acknowledge limitations, such as the heuristic nature of human intervention and the model’s current coupling to specific action interfaces. Future work aims to address these by formalizing intervention metrics and designing more architecture-agnostic progress/value bridging layers.
This work provides a practical blueprint for making large Vision-Language-Action models truly self-improving in the physical world, paving the way for more intelligent and adaptable robots. You can read the full research paper here.


