Hybrid Training: Enabling Fast and Intelligent Robots with Vision-Language-Action Models

TLDR: Hybrid Training (HyT) is a new framework for Vision-Language-Action (VLA) models that allows robots to learn from complex ‘Chain-of-Thought’ reasoning during training, internalizing knowledge for improved performance. Crucially, HyT enables the VLA to then execute actions directly and quickly during real-time operation, avoiding the inference slowdown typically associated with generating intermediate thoughts. This approach delivers high performance and fast inference, making VLAs more practical for real-world robotic tasks.

In the rapidly evolving field of robotics, Vision-Language-Action (VLA) models are paving the way for more generalist robots. These advanced models take language instructions and camera images as input, then output low-level robotic actions, enabling robots to perform complex tasks. However, a common challenge with these powerful models, especially those using ‘Chain-of-Thought’ (CoT) reasoning, has been a trade-off between performance and speed.

Traditional CoT strategies, where a VLA generates intermediate ‘thoughts’ before taking an action, have significantly boosted performance. This is similar to how humans might consciously deliberate before acting on a complex problem. While these ‘embodied CoT’ (ECoT) methods improve a robot’s ability to understand and execute tasks, generating the intermediate thoughts adds to the model’s processing time, slowing down the robot’s actions. In real-world scenarios, particularly in robotic manipulation, such delays can severely impact usability.

A new research paper introduces an innovative approach called Hybrid Training (HyT) that aims to resolve this dilemma. The core idea behind HyT is to allow VLAs to learn from these valuable ‘thoughts’ during training, internalizing the knowledge and performance benefits, without necessarily needing to generate them during real-time operation. This means the robot can still act quickly, much like a human developing ‘skilled intuition’ – where complex decisions become effortless over time due to learned patterns.

How Hybrid Training Works

HyT enables a single VLA model to learn multiple ways of generating outputs, depending on a ‘modality variable’. During training, the model is exposed to a mix of data, learning to:

Act Directly: Like a standard VLA, predicting actions immediately.
Think First: Similar to ECoT, generating intermediate thoughts before actions.
Follow Instructions: Acting as a low-level policy, following provided thoughts or instructions (e.g., from a human or another system).
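The three behaviours listed above can be pictured as three ways of turning the same demonstration into a training sample. The sketch below is a minimal illustration, not the paper's implementation: the mode tags (`<act>`, `<think>`, `<follow>`), the `Demo` fields, the text serialization of actions, and the uniform mixing weights are all assumptions made for this example, and camera images are left out for brevity.

```python
import random
from dataclasses import dataclass

# Hypothetical mode names; the article only says a 'modality variable'
# selects the behaviour, not how it is encoded.
MODES = ("act", "think", "follow")


@dataclass
class Demo:
    instruction: str  # language instruction for the task
    thought: str      # annotated intermediate reasoning for this step
    action: str       # low-level action, serialized as text (assumption)


def build_example(demo, mode):
    """Return a (prompt, target) pair for one training sample in `mode`."""
    if mode == "act":
        # Standard VLA behaviour: predict the action directly.
        return f"<act> {demo.instruction}", demo.action
    if mode == "think":
        # ECoT-style behaviour: generate the thought, then the action.
        return f"<think> {demo.instruction}", f"{demo.thought} {demo.action}"
    if mode == "follow":
        # Low-level policy: the thought is given as input, only the action is predicted.
        return f"<follow> {demo.instruction} {demo.thought}", demo.action
    raise ValueError(f"unknown mode: {mode}")


def sample_batch(demos, rng, weights=(1, 1, 1)):
    """Mix the three objectives by drawing one mode per demonstration."""
    return [build_example(d, rng.choices(MODES, weights=weights)[0]) for d in demos]


demo = Demo(
    instruction="put the red block in the bowl",
    thought="the red block is left of the bowl; move the gripper left",
    action="dx=-0.02 dy=0.00 grip=open",
)
for mode in MODES:
    print(build_example(demo, mode))
batch = sample_batch([demo] * 4, random.Random(0))
```

Only the `think` target contains the thought text; in `act` and `follow` the target is the action alone, which is what keeps decoding short in those modes.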

By learning from this diverse set of objectives, the model internalizes a deeper understanding of tasks and environments. Crucially, at inference time, the model can be instructed to operate in an ‘act’ mode, directly predicting actions without generating intermediate thoughts. This allows HyT-trained VLAs to maintain the same fast inference speed as standard VLAs, while still benefiting from the knowledge acquired through CoT training.
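A rough way to see why ‘act’ mode is faster: with autoregressive decoding, the time to produce an action grows with the number of tokens generated before the action is complete. The sketch below is back-of-the-envelope arithmetic only. The token counts and per-token latency are made-up illustrative values, not measurements from the paper, and the cost of processing the prompt is ignored.

```python
def tokens_to_decode(mode, n_thought_tokens, n_action_tokens):
    """Tokens the model must generate before the action is complete."""
    if mode == "think":
        # Thoughts are generated first, then the action.
        return n_thought_tokens + n_action_tokens
    if mode in ("act", "follow"):
        # No thoughts are generated; in 'follow' they arrive as input.
        return n_action_tokens
    raise ValueError(f"unknown mode: {mode}")


def step_latency_ms(mode, n_thought_tokens, n_action_tokens, ms_per_token):
    """Decode time for one control step, ignoring prompt processing."""
    return tokens_to_decode(mode, n_thought_tokens, n_action_tokens) * ms_per_token


# Illustrative numbers only (assumptions, not results from the paper).
N_THOUGHT, N_ACTION, MS_PER_TOKEN = 100, 7, 20

for mode in ("act", "think", "follow"):
    print(mode, step_latency_ms(mode, N_THOUGHT, N_ACTION, MS_PER_TOKEN), "ms")
```

With these assumed values, ‘act’ and ‘follow’ each decode 7 tokens (140 ms), while ‘think’ decodes 107 tokens (2140 ms). The real gap depends on the model, the hardware, and how long the thoughts are.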

The flexibility of HyT also means the model can still be used in ‘think’ mode for interpretability (to understand the robot’s intentions) or ‘follow’ mode for fine-grained instruction following, offering a versatile tool for robotic control.

Real-World Impact and Performance

The researchers rigorously tested HyT across various simulated benchmarks, including ClevrSkills and LIBERO, and in real-world robotic experiments. The results consistently showed that HyT-trained models not only outperform standard VLAs but also generally surpass models trained with ECoT and hierarchical VLA methods across different data scales. This performance boost was particularly evident in more complex tasks and in scenarios requiring generalization to new, out-of-distribution settings.

In real-world tests using a UFactory xArm 6, HyT demonstrated superior performance compared to OpenVLA, especially in tasks involving novel objects or placements. The HyT-trained robot reached pick and place positions more precisely and avoided common errors such as reaching for the wrong object.

This research highlights that the true value of Chain-of-Thought techniques for VLAs lies in the enhanced understanding and representation learning they provide during training. By internalizing this reasoning, HyT allows robots to achieve higher performance with faster execution, making them more practical and efficient for real-world applications. More details are available in the full research paper.
