iFlyBot-VLA: A New Approach to General Robot Control with Vision, Language, and Action

TLDR: iFlyBot-VLA is a new Vision-Language-Action (VLA) model that enables dual-arm robots to perform complex manipulation tasks. It uses a novel framework with a latent action model trained on diverse videos, a dual-level action representation for both high-level intent and low-level control, and a mixed training strategy combining robot data with spatial QA datasets. The model demonstrates superior performance in both simulated and real-world scenarios, excelling in tasks like pick-and-place, parcel sorting, and cloth folding, showcasing strong generalization and robust control.

A new research paper introduces iFlyBot-VLA, a Vision-Language-Action (VLA) model designed to control dual-arm robots. Developed by researchers from the iFLYTEK Research and Development Group and LindenBot, the model aims to bridge the gap between strong AI perception and the precise, continuous control that complex robotic manipulation requires. The full technical details are in their paper: iFlyBot-VLA Technical Report.

Understanding iFlyBot-VLA’s Core Innovations

Latent Action Model: The team developed a model trained on a vast collection of human and robotic manipulation videos. This allows iFlyBot-VLA to learn “latent actions,” which are high-level, implicit intentions behind movements, rather than just raw control signals.
Dual-Level Action Representation: The system uses a unique framework that supervises both high-level latent actions and explicit low-level “structured discrete action tokens.” This dual approach helps the model align language, vision, and action representations, allowing the Vision-Language Model (VLM) to directly contribute to generating actions.
Mixed Training Strategy: To improve the robot’s 3D perception and reasoning, iFlyBot-VLA combines robot trajectory data with general question-answering (QA) and spatial QA datasets. This diverse training ensures the model maintains its broad understanding capabilities while learning specific manipulation skills.
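To make the idea of explicit discrete action tokens concrete, here is a minimal sketch of one common way to turn a continuous action vector into integer tokens and back. It assumes uniform binning over a normalized range; the article does not say how iFlyBot-VLA builds its structured tokens, so the functions `discretize` and `undiscretize`, the range [-1, 1], and the 256 bins are all illustrative assumptions.

```python
def discretize(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer token in [0, bins - 1].

    Uniform binning is an assumption for illustration, not the paper's tokenizer.
    """
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)       # keep the value inside the range
        scaled = (clipped - low) / (high - low)    # now in [0, 1]
        tokens.append(min(int(scaled * bins), bins - 1))
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, bins=256):
    """Map each token back to the centre of its bin."""
    width = (high - low) / bins
    return [low + (t + 0.5) * width for t in tokens]


print(discretize([-1.0, 0.0, 1.0]))  # → [0, 128, 255]
```

The round trip loses at most half a bin width per dimension, which is one reason a separate continuous action head is useful when fine control matters.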

How iFlyBot-VLA Works

At its heart, iFlyBot-VLA takes natural language instructions, multi-view images of the environment, and the robot’s current state as input. It then outputs “action chunks” to control a dual-arm robot. The model builds upon a pre-trained Vision-Language Model (VLM), specifically Qwen2.5-VL, for its strong perception and reasoning. To translate this understanding into physical actions, a “Flow-Matching Diffusion Transformer” acts as an action expert, generating continuous control signals.
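As a rough illustration of how a flow-matching action expert can produce an action chunk, the sketch below starts from Gaussian noise and integrates a velocity field from t = 0 to t = 1 with fixed Euler steps. In the real model the velocity would be predicted by the learned transformer; here `toy_velocity` is a hand-written stand-in that points along a straight line toward a known target, and `sample_action_chunk`, the horizon, the action dimension, and the step count are all hypothetical choices.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network: straight-line velocity toward `target`."""
    return [
        [(tg - xi) / (1.0 - t) for xi, tg in zip(row, target_row)]
        for row, target_row in zip(x, target)
    ]


def sample_action_chunk(velocity_fn, horizon, action_dim, steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from noise (t = 0) to an action chunk (t = 1)."""
    rng = random.Random(seed)
    # Start from Gaussian noise with one row per future time step in the chunk.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # never reaches 1.0, so toy_velocity never divides by zero
        v = velocity_fn(x, t)
        x = [[xi + dt * vi for xi, vi in zip(row, v_row)] for row, v_row in zip(x, v)]
    return x


target = [[0.5, -0.5], [0.25, 0.0], [0.0, 1.0]]
chunk = sample_action_chunk(lambda x, t: toy_velocity(x, t, target), horizon=3, action_dim=2)
print(chunk)
```

With this toy field the samples land on the target chunk; a trained model would instead produce actions conditioned on the images, instruction, and robot state.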

The model’s design ensures that the VLM implicitly learns action-related semantics through discrete action tokens, while latent action tokens, derived from the specialized latent action model, provide a compact representation for efficient downstream action planning. This separation helps maintain the VLM’s general perception abilities while enabling the action expert to produce precise movements.

A Multi-Stage Training Approach

iFlyBot-VLA is trained in three stages:

Latent Action Training: This initial stage focuses on teaching the model to extract high-level latent action representations from a large dataset of human and robot manipulation videos using a VQ-VAE-based architecture.
Foundational Pre-training: Here, the goal is to build a robust foundation with broad spatial perception, object recognition, and generalization. The model learns to follow natural language commands and understand spatial relationships.
Task-specific Post-training: For more intricate and precise operations, the model undergoes a final training phase using high-quality, self-collected datasets. This stage adapts iFlyBot-VLA to specific complex tasks like cloth folding or manipulating objects in cluttered scenes.
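The core operation in a VQ-VAE, the architecture named for the latent action stage, is snapping a continuous encoder output to its nearest codebook entry. The sketch below shows only that lookup. The encoder that would turn video frames into a latent vector is omitted, and the function `quantize`, the tiny two-dimensional codebook, and the example latent are illustrative assumptions rather than details from the paper.

```python
def quantize(latent, codebook):
    """Return (index, code) for the codebook vector nearest to `latent` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 3-entry codebook; a real one would hold many learned vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
print(index, code)  # → 1 [1.0, 0.0]
```

The returned index is what serves as a discrete latent action token: a compact label for the kind of motion observed between frames.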

The training data includes a mix of internally developed spatial QA datasets, public datasets like OXE and AgiBot-World, and extensive self-collected data from iFLYTEK, featuring tasks like cloth folding, general pick-and-place, and long-horizon parcel sorting using 26 dual-arm robots.
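One simple way to combine several data sources during training is to pick the source of each example according to fixed mixing weights. The sketch below does this with the standard library. The source names and the 70/15/15 split in `MIX` are hypothetical; the article does not report the actual mixing ratios.

```python
import random
from collections import Counter

# Hypothetical mixing weights; the real ratios are not given in the article.
MIX = {"robot_trajectories": 0.7, "general_qa": 0.15, "spatial_qa": 0.15}


def sample_sources(mix, n, seed=0):
    """Draw `n` source names, each chosen with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(MIX, 1000))
print(counts)
```

Keeping QA examples in every batch mix is the mechanism by which a model of this kind can retain general understanding while it learns manipulation.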

Impressive Performance in Simulations and the Real World

iFlyBot-VLA was evaluated both in simulation and on real robots.

LIBERO Simulator: On the LIBERO benchmark, iFlyBot-VLA achieved an average success rate of 93.8% across tasks, outperforming leading VLA models such as π0 (86%) and OpenVLA (76.5%). This indicates strong generalization in simulated robotic manipulation.
Real-World General Pick-and-Place: The model generalized to unseen objects, varying lighting conditions, and novel scenes with high success rates (e.g., 96.04% under lighting variations and 93.57% in unseen scenes).
Long-Horizon Manipulation (Parcel Sorting): For complex tasks like sorting deformable packages on a conveyor belt, iFlyBot-VLA demonstrated superior dual-arm coordination, achieving a 7.5% higher success rate than baselines when allowing for minor corrections.
Challenging Dual-Arm Manipulation (Cloth Folding): This task, involving highly deformable objects and precise grasping, highlighted iFlyBot-VLA’s robustness. The model showed strong performance in flattening and folding clothes, even from crumpled initial states.
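For a quick sense of scale, the LIBERO averages quoted above can be turned into percentage-point margins. The snippet uses only the three figures stated in this article; the dictionary name and helper function are illustrative.

```python
# LIBERO averages as quoted in the article, in percent.
LIBERO_AVERAGE = {"iFlyBot-VLA": 93.8, "π0": 86.0, "OpenVLA": 76.5}


def margin_over(baseline, model="iFlyBot-VLA", scores=LIBERO_AVERAGE):
    """Percentage-point gap between `model` and `baseline`, rounded to one decimal."""
    return round(scores[model] - scores[baseline], 1)


print(margin_over("π0"))       # → 7.8
print(margin_over("OpenVLA"))  # → 17.3
```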

Future Directions

While iFlyBot-VLA shows outstanding performance, the researchers acknowledge limitations, such as challenges with entirely novel instructions or unseen object shapes. Future work aims to scale the model, expand datasets, incorporate richer spatial representations, and integrate reinforcement learning to further enhance its generalization and robustness, moving beyond the inherent limitations of imitation learning.

In conclusion, iFlyBot-VLA represents a significant step towards developing general-purpose robotic systems capable of assisting humans in a wide array of daily tasks, contributing to the advancement of general robotic intelligence.
