TLDR: This research demonstrates the first successful deployment of Vision-Language-Action (VLA) models on a soft continuum robot. While VLA models typically control rigid robots, this study introduces a finetuning pipeline that allows state-of-the-art VLA models like OpenVLA-OFT and π0 to effectively control a soft robot named Embuddy. The findings show that out-of-the-box VLA policies fail on soft robots due to physical differences, but with targeted finetuning, the soft robot achieves performance comparable to rigid counterparts in manipulation and human-interactive tasks. This work highlights the potential for combining VLA intelligence with the inherent safety and flexibility of soft robots for human-centered environments.
Robotic systems are increasingly becoming a part of our daily lives, from manufacturing to assistance in homes. For these robots to truly integrate into human-centered environments, they need to be safe, adaptable, and capable of understanding complex instructions. This is where Vision-Language-Action (VLA) models come into play, offering a way for robots to perceive, understand language, and execute actions through a single intelligent system.
Historically, the deployment of VLA models has been largely confined to rigid, industrial-style robotic arms. While these arms offer predictable control, their inherent rigidity can pose safety concerns and limit their adaptability when interacting closely with humans or in unstructured environments. This research explores a crucial step forward: deploying VLA models on soft continuum manipulators, which are designed with compliant structures that deform upon interaction, offering intrinsic safety and resilience.
The core challenge lies in the significant difference, or "embodiment gap," between rigid and soft robots. Policies trained on rigid arms often fail when applied directly to soft robots due to their non-linear, underactuated dynamics and unique morphology. This paper introduces a structured finetuning and deployment pipeline to bridge this gap, enabling VLA models to effectively control soft robots.
The researchers utilized a custom-designed soft continuum robot named Embuddy for their experiments. Embuddy features three modular sections, each with a revolute joint followed by a tendon-driven soft continuum segment made from 3D-printed Thermoplastic Polyurethane (TPU). Its underactuated sections and lightweight design (totaling 5kg) contribute to inherently safe interactions, as the arm remains deformable to external forces.
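The summary above does not give Embuddy's kinematic model, but tendon-driven continuum segments like these are commonly approximated with the piecewise-constant-curvature model from continuum-robot research. As a purely illustrative sketch (not from the paper), the tip position of one such segment can be computed like this:

```python
import math

def segment_tip(length: float, theta: float, phi: float) -> tuple:
    """Tip position of one constant-curvature continuum segment.

    length: arc length of the segment (m)
    theta:  total bending angle of the segment (rad)
    phi:    direction of the bending plane (rad)

    Assumes the piecewise-constant-curvature approximation; a real
    tendon-driven TPU segment will deviate from this under load.
    """
    if abs(theta) < 1e-9:              # straight segment: limit as theta -> 0
        return (0.0, 0.0, length)
    r = length / theta                 # radius of the bending arc
    x = r * (1.0 - math.cos(theta))    # in-plane displacement
    return (x * math.cos(phi), x * math.sin(phi), r * math.sin(theta))

# Example: a 0.2 m segment bent 90 degrees in the x-z plane
tip = segment_tip(0.2, math.pi / 2, 0.0)
```

Chaining one such function per section (after each revolute joint) would give a simple forward-kinematics model of the full three-section arm.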
Two state-of-the-art VLA models, OpenVLA-OFT and π0, were evaluated. The study involved three representative manipulation tasks: “Put the orange in the plate” (simple pick-and-place), “Put the X in the plate” (pick-and-place with choices, where X could be orange or milk), and “Feed the person with marshmallow” (a close human-interactive task).
Initial experiments confirmed that out-of-the-box VLA policies, without any specific training for soft robots, consistently failed. This was primarily due to the models generating motions suitable for rigid manipulators, which were incompatible with Embuddy’s unique kinematics and maximum bending angle constraints. This highlights the critical need for adaptation when transferring VLA intelligence to new robotic embodiments.
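One concrete symptom of this mismatch is a rigid-arm policy commanding bends beyond what a tendon-driven segment can physically reach. A minimal safeguard, hypothetical rather than taken from the paper, is to clamp each commanded bending angle to its segment's limit before it reaches the tendon controller:

```python
# Hypothetical per-segment bending limits (rad); Embuddy's real
# constraints are not stated in the summary above.
MAX_BEND = [1.4, 1.4, 1.4]

def clamp_action(action: list) -> list:
    """Clamp per-segment bending commands into the reachable range."""
    return [max(-limit, min(limit, a)) for a, limit in zip(action, MAX_BEND)]

safe = clamp_action([2.0, -0.3, 1.6])   # -> [1.4, -0.3, 1.4]
```

Clamping alone only prevents infeasible commands; recovering task success still required the finetuning described next.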
However, through targeted finetuning using a small, custom dataset of soft robot demonstrations, the adapted VLA policies achieved remarkable success. The finetuned OpenVLA-OFT model, for instance, achieved the exact same success rates on Task 1 and Task 2 as its rigid counterpart (a UR5 robot), demonstrating that the finetuning strategy successfully bridged the rigid-to-soft domain gap. While π0 also achieved high success rates after finetuning, OpenVLA-OFT showed superior performance on the compliant platform.
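The actual pipeline finetunes full VLA models on soft-robot demonstrations; as a toy stand-in for that idea, behavior cloning on demonstration pairs can be sketched with a linear policy fit by least squares (all names, shapes, and data here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative demo set: observation features -> demonstrated commands
obs = rng.normal(size=(200, 8))       # stand-in for vision/proprioception features
w_demo = rng.normal(size=(8, 4))
actions = obs @ w_demo                # stand-in for teleoperated soft-robot actions

# Behavior cloning: fit a policy that imitates the demonstrations
w_policy, *_ = np.linalg.lstsq(obs, actions, rcond=None)
mse = float(np.mean((obs @ w_policy - actions) ** 2))
```

The same imitation objective, applied to a small set of real Embuddy demonstrations with the VLA's action head, is what bridges the rigid-to-soft embodiment gap in the study.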
The research also delved into the robustness of these systems. The VLA models proved robust to human presence in the scene, maintaining focus on the workspace. Furthermore, Embuddy demonstrated impressive resilience during human interaction; when manually pushed away during an inference task, the soft robot was able to recover its original pose, continue its trajectory, and successfully complete the task without performance degradation.
This work marks a significant milestone, presenting the first systematic deployment of VLA models on a soft continuum robot. It unequivocally demonstrates that by addressing the embodiment mismatch through targeted finetuning, the advanced reasoning capabilities of VLA models can be effectively combined with the intrinsic safety and flexibility of soft robotics. This opens up a promising avenue for developing safe, adaptable, and intelligent embodied AI agents that can operate seamlessly in human-shared environments.


