iFlyBot-VLA: A New Approach to General Robot Control with Vision, Language, and Action

TLDR: iFlyBot-VLA is a new Vision-Language-Action (VLA) model that enables dual-arm robots to perform complex manipulation tasks. It uses a novel framework with a latent action model trained on diverse videos, a dual-level action representation for both high-level intent and low-level control, and a mixed training strategy combining robot data with spatial QA datasets. The model demonstrates superior performance in both simulated and real-world scenarios, excelling in tasks like pick-and-place, parcel sorting, and cloth folding, showcasing strong generalization and robust control.

A new research paper introduces iFlyBot-VLA, a significant advancement in Vision-Language-Action (VLA) models designed to control dual-arm robots. Developed by researchers from iFlyTek Research and Development Group and LindenBot, this model aims to bridge the gap between powerful AI perception and the precise, continuous control required for complex robotic manipulation tasks. The full technical details can be found in their paper: iFlyBot-VLA Technical Report.

Understanding iFlyBot-VLA’s Core Innovations

  • Latent Action Model: The team developed a model trained on a vast collection of human and robotic manipulation videos. This allows iFlyBot-VLA to learn “latent actions,” which are high-level, implicit intentions behind movements, rather than just raw control signals.
  • Dual-Level Action Representation: The system uses a unique framework that supervises both high-level latent actions and explicit low-level “structured discrete action tokens.” This dual approach helps the model align language, vision, and action representations, allowing the Vision-Language Model (VLM) to directly contribute to generating actions.
  • Mixed Training Strategy: To improve the robot’s 3D perception and reasoning, iFlyBot-VLA combines robot trajectory data with general question-answering (QA) and spatial QA datasets. This diverse training ensures the model maintains its broad understanding capabilities while learning specific manipulation skills.
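
To make the idea of discrete action tokens concrete, here is a minimal sketch of one common way to turn continuous robot actions into integer tokens: uniform binning per dimension. The article does not describe iFlyBot-VLA's actual tokenization scheme, so the binning method, the [-1, 1] range, the 256 bins, and the function names are all assumptions for illustration.

```python
def discretize(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin index (token)."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)            # clip to the assumed range
        frac = (a - low) / (high - low)       # normalize to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(tokens, low=-1.0, high=1.0, n_bins=256):
    """Recover the bin-center value for each token."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


action = [0.0, -1.0, 1.0, 0.37]     # hypothetical normalized action values
tokens = discretize(action)
recovered = undiscretize(tokens)
```

The round trip loses at most half a bin width per dimension, which is why discrete tokens like these suit supervision of the language model, while a separate continuous head can handle fine control.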

How iFlyBot-VLA Works

At its heart, iFlyBot-VLA takes natural language instructions, multi-view images of the environment, and the robot’s current state as input. It then outputs “action chunks” to control a dual-arm robot. The model builds upon a pre-trained Vision-Language Model (VLM), specifically Qwen2.5-VL, for its strong perception and reasoning. To translate this understanding into physical actions, a “Flow-Matching Diffusion Transformer” acts as an action expert, generating continuous control signals.
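
The sketch below illustrates what controlling a robot with action chunks can look like: the policy predicts several future actions at once, the robot executes them, and the policy is queried again from the new state. The policy here is a hypothetical stub, and whether iFlyBot-VLA executes whole chunks or re-plans earlier is not stated in the article; the names `dummy_policy` and `run_episode` are illustrative only.

```python
def dummy_policy(instruction, images, state, horizon):
    """Hypothetical stand-in for the model: returns `horizon` future actions."""
    return [[s + 0.01 * (k + 1) for s in state] for k in range(horizon)]


def run_episode(policy, instruction, state, total_steps, horizon):
    """Execute each predicted chunk, then re-query the policy from the new state."""
    policy_calls = 0
    executed = 0
    while executed < total_steps:
        chunk = policy(instruction, images=None, state=state, horizon=horizon)
        policy_calls += 1
        for action in chunk[: total_steps - executed]:
            state = action      # toy robot: the action becomes the next joint state
            executed += 1
    return state, policy_calls


final_state, calls = run_episode(
    dummy_policy, "fold the cloth", [0.0, 0.0], total_steps=20, horizon=8
)
```

With a horizon of 8 and 20 control steps, the policy is queried only three times, which is the main practical appeal of chunked prediction when each model call is expensive.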

The model’s design ensures that the VLM implicitly learns action-related semantics through discrete action tokens, while latent action tokens, derived from the specialized latent action model, provide a compact representation for efficient downstream action planning. This separation helps maintain the VLM’s general perception abilities while enabling the action expert to produce precise movements.
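
As a rough illustration of how a flow-matching action expert can produce continuous actions, the sketch below starts from Gaussian noise and integrates a velocity field from t=0 to t=1 with Euler steps. In a real model the velocity comes from a trained network conditioned on the VLM's features; here `toy_velocity` is a hand-written stand-in that already knows the target, so the example shows only the sampling loop, not the paper's architecture or training.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network: velocity of a straight noise-to-target path."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_actions(target, n_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]    # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                               # t stays below 1, so no division by zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_chunk = [0.1, -0.2, 0.3, 0.05]    # hypothetical flattened action chunk
sampled = sample_actions(target_chunk)
```

Because the toy field follows a straight path, a handful of Euler steps lands on the target; a learned field is only approximately straight, so the number of integration steps becomes a trade-off between latency and accuracy.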

A Multi-Stage Training Approach

The training of iFlyBot-VLA is a sophisticated three-stage process:

  1. Latent Action Training: This initial stage focuses on teaching the model to extract high-level latent action representations from a large dataset of human and robot manipulation videos using a VQ-VAE-based architecture.
  2. Foundational Pre-training: Here, the goal is to build a robust foundation with broad spatial perception, object recognition, and generalization. The model learns to follow natural language commands and understand spatial relationships.
  3. Task-specific Post-training: For more intricate and precise operations, the model undergoes a final training phase using high-quality, self-collected datasets. This stage adapts iFlyBot-VLA to specific complex tasks like cloth folding or manipulating objects in cluttered scenes.
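
The first stage relies on a VQ-VAE-based architecture, whose defining step is vector quantization: a continuous encoder output is snapped to its nearest entry in a learned codebook, and the index of that entry serves as the discrete latent. The following is a minimal sketch of that lookup only; the 4-entry codebook, the 2-dimensional vectors, and the encoder output are made-up values, not details from the paper.

```python
def quantize(z, codebook):
    """Return the index and vector of the codebook entry closest to z (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 4-entry codebook
z = [0.9, 0.2]          # hypothetical encoder output for a short video clip
index, z_q = quantize(z, codebook)
```

Here `index` plays the role of a latent action token: a compact label for the kind of motion observed, independent of any particular robot's control signals.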

The training data includes a mix of internally developed spatial QA datasets, public datasets like OXE and AgiBot-World, and extensive self-collected data from iFLYTEK, featuring tasks like cloth folding, general pick-and-place, and long-horizon parcel sorting using 26 dual-arm robots.
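
One simple way to combine robot trajectories with QA data during training is to draw each example's source according to fixed mixture weights. The sketch below shows that sampling step; the source names and the 60/20/20 weights are hypothetical, since the article does not give the actual mixing ratios.

```python
import random


def sample_batch_sources(weights, batch_size, seed=0):
    """Pick a data source for each example in a batch according to mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


# Hypothetical mixture weights, for illustration only
mixture = {"robot_trajectories": 0.6, "general_qa": 0.2, "spatial_qa": 0.2}
sources = sample_batch_sources(mixture, batch_size=1000)
counts = {name: sources.count(name) for name in mixture}
```

Keeping QA examples in every batch is one way to keep the language model's general understanding from drifting while it learns manipulation skills.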

Impressive Performance in Simulations and the Real World

iFlyBot-VLA’s capabilities were rigorously tested in both simulated and real-world environments.

  • LIBERO Simulator: In the LIBERO benchmark, iFlyBot-VLA achieved an average accuracy of 93.8% across various tasks, outperforming leading VLA models like π0 (86%) and OpenVLA (76.5%). This demonstrates its strong generalization and performance in simulated robotic manipulation.
  • Real-World General Pick-and-Place: The model generalized well to unseen objects, varying lighting conditions, and novel scenes, achieving high success rates (e.g., 96.04% under lighting variations and 93.57% in unseen scenes).
  • Long-Horizon Manipulation (Parcel Sorting): For complex tasks like sorting deformable packages on a conveyor belt, iFlyBot-VLA demonstrated superior dual-arm coordination, achieving a 7.5% higher success rate than baselines when allowing for minor corrections.
  • Challenging Dual-Arm Manipulation (Cloth Folding): This task, involving highly deformable objects and precise grasping, highlighted iFlyBot-VLA’s robustness. The model showed strong performance in flattening and folding clothes, even from crumpled initial states.

Future Directions

While iFlyBot-VLA shows outstanding performance, the researchers acknowledge limitations, such as challenges with entirely novel instructions or unseen object shapes. Future work aims to scale the model, expand datasets, incorporate richer spatial representations, and integrate reinforcement learning to further enhance its generalization and robustness, moving beyond the inherent limitations of imitation learning.

In conclusion, iFlyBot-VLA represents a significant step towards developing general-purpose robotic systems capable of assisting humans in a wide array of daily tasks, contributing to the advancement of general robotic intelligence.

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
