Exploring the Capabilities of Octopi-1.5: A Visual-Tactile-Language Model for Robots

TLDR: Octopi-1.5 is a new visual-tactile-language model for robots that improves upon its predecessor by processing tactile signals from multiple object parts and using retrieval-augmented generation (RAG). It is demonstrated with a handheld interface (TMI) in tasks like object identification and sorting, showing improved tactile inference and the ability to learn new objects on the fly.

In the evolving world of robotics, the sense of touch is proving to be as crucial for machines as it is for humans. Just as we rely on touch for delicate tasks, identifying materials, or navigating in low visibility, robots too can benefit immensely from this sensory input. Building on recent advancements in touch foundation models, researchers have introduced Octopi-1.5, a cutting-edge visual-tactile-language model designed to enhance robotic perception and interaction.

What is Octopi-1.5?

Octopi-1.5 is an advanced model that integrates visual, tactile, and language information. It’s an improvement over its predecessor, bringing several key enhancements. One significant upgrade is its ability to process tactile signals from multiple parts of an object, allowing for a more comprehensive understanding of what it’s touching. Additionally, it incorporates a simple retrieval-augmented generation (RAG) module. This RAG module helps Octopi-1.5 improve its performance on various tasks and even learn about new objects on the fly by retrieving information about similar items it has encountered before.

How Does Octopi-1.5 Work?

At its core, Octopi-1.5 is built upon the Qwen2-VL 7B open-source vision-language model. It uses a specialized tactile encoder, which is a fine-tuned CLIP module, to translate raw tactile data from sensors into a format the model can understand. This encoder was trained on a large dataset combining existing GelSight data with an expanded PhysiCLeAR dataset, as well as hardness and ObjectFolder datasets. The training process involved two stages: first, training the CLIP module to predict human-annotated hardness and roughness scores, and then fine-tuning the entire model for description and ranking tasks. The RAG module further enhances its descriptions by finding and presenting information about similar objects, making its inferences more robust and informative.
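The retrieval step can be illustrated with a small, self-contained sketch. This is not the paper's implementation: the three-dimensional embeddings, the stored descriptions, and the function names (`cosine_similarity`, `retrieve_similar`, `build_prompt`) are all assumptions made for illustration. It only shows the general RAG pattern of ranking previously stored objects by similarity to a new tactile embedding and prepending the best matches to the prompt.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_similar(query, memory, k=2):
    """Return the k stored entries whose embeddings are closest to the query."""
    ranked = sorted(
        memory,
        key=lambda entry: cosine_similarity(query, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Prepend retrieved object descriptions to the user's question."""
    lines = ["Similar objects touched before:"]
    for entry in retrieved:
        lines.append(f"- {entry['name']}: {entry['description']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical memory of previously touched objects (toy 3-D embeddings).
memory = [
    {"name": "sponge", "embedding": [0.9, 0.1, 0.0], "description": "soft, slightly rough"},
    {"name": "steel spoon", "embedding": [0.0, 0.2, 0.9], "description": "hard, smooth"},
    {"name": "ripe peach", "embedding": [0.8, 0.3, 0.1], "description": "soft, fuzzy"},
]

query = [0.9, 0.15, 0.0]  # stand-in for the encoder output of a new touch
top = retrieve_similar(query, memory, k=2)
prompt = build_prompt("Describe the object being touched.", top)
print(prompt)
```

With these toy vectors, the two soft objects are retrieved and the steel spoon is left out of the prompt. A real system would use the tactile encoder's embeddings and a proper vector index rather than a sorted Python list.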

The Tactile Manipulation Interface (TMI)

To demonstrate Octopi-1.5’s capabilities, the researchers developed a special handheld interface called the Tactile Manipulation Interface (TMI). This device is a modified Universal Manipulation Interface equipped with two types of tactile sensors: a GelSight Mini sensor, which provides high-resolution tactile images, and a TAC-02 piezoresistive sensor, which directly captures pressure readings. This portable and accessible setup allows users to interact with Octopi-1.5 without needing a complex robotic arm, making it ideal for live demonstrations and broader exploration of tactile sensing.

Demonstrating Octopi-1.5’s Capabilities

The researchers plan to showcase Octopi-1.5 through several interactive scenarios. The setup is highly portable, primarily consisting of the TMI and a laptop connected to a remote server running Octopi-1.5. Users can interact with the system through a chat interface, providing natural language prompts and physical tactile inputs.

The Guessing Game

One of the main demonstrations is a “Guessing Game.” In this scenario, Octopi-1.5 is presented with a set of objects, either visually or through language. Users then grasp an object with the TMI, providing only tactile input. Octopi-1.5 then infers which object was touched, leveraging both the tactile data and its commonsense knowledge. For example, it can identify a grasped apple and even offer advice on how to handle it, such as recommending careful handling for soft fruits. Preliminary tests show high accuracy, especially with the RAG module enabled, and the model can even be “taught” new items on the fly, further boosting its performance.
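The structure of the game, including the effect of teaching a new item, can be sketched as a toy nearest-neighbor matcher. This is a deliberately simplified stand-in: Octopi-1.5 reasons with a language model rather than a distance function, and the `GuessingGame` class, the 0-10 score scale, and all numeric values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GuessingGame:
    # Maps object name -> (hardness, roughness) reference scores.
    # The 0-10 scale is an assumption made for this sketch.
    known: dict = field(default_factory=dict)

    def teach(self, name, hardness, roughness):
        """Store reference scores for an item, e.g. one taught on the fly."""
        self.known[name] = (hardness, roughness)

    def guess(self, reading, candidates):
        """Pick the candidate whose stored scores are closest to the reading.

        Candidates the game has never been taught are skipped; returns None
        if no candidate is known.
        """
        scored = [name for name in candidates if name in self.known]
        if not scored:
            return None

        def squared_distance(name):
            hardness, roughness = self.known[name]
            return (hardness - reading[0]) ** 2 + (roughness - reading[1]) ** 2

        return min(scored, key=squared_distance)


game = GuessingGame()
game.teach("apple", 7.0, 2.0)
game.teach("tennis ball", 4.0, 6.0)

candidates = ["apple", "tennis ball", "avocado"]
reading = (2.5, 3.0)  # hypothetical (hardness, roughness) from one grasp

before = game.guess(reading, candidates)  # avocado is still unknown here
game.teach("avocado", 2.0, 3.5)           # teach the new item
after = game.guess(reading, candidates)
print(before, "->", after)  # prints "tennis ball -> avocado"
```

Before the avocado is taught, the matcher can only fall back on the closest known object; once the new entry is stored, the same reading resolves to the correct item. This mirrors, in miniature, why teaching new items would be expected to improve accuracy.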

Sorting Objects by Touch

Another demonstration involves a “Sorting Task,” where Octopi-1.5 distinguishes objects based on their tactile properties, such as hardness. Users provide tactile inputs for a set of items, and Octopi-1.5 then sorts them accordingly. This capability has practical applications in classifying fruits or other items based on their material characteristics. While highly effective for familiar objects, the demonstration also aims to highlight current limitations, such as challenges with unseen objects or sorting by more nuanced properties like roughness.
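As a rough illustration of the sorting step, the sketch below averages several hardness readings per object (for example, touches on different parts) and orders the objects from hardest to softest. The function name `sort_by_hardness`, the score scale, and the sample values are assumptions for illustration, not outputs of Octopi-1.5.

```python
from statistics import mean


def sort_by_hardness(readings):
    """Sort object names from hardest to softest.

    readings maps each object name to a list of hardness scores, one per
    touch. Objects with no readings are ignored.
    """
    averaged = {name: mean(scores) for name, scores in readings.items() if scores}
    return sorted(averaged, key=averaged.get, reverse=True)


# Hypothetical hardness scores from multiple touches per object.
readings = {
    "ripe banana": [2.0, 3.0],
    "apple": [7.0, 6.5, 7.5],
    "orange": [5.0, 4.0],
}

order = sort_by_hardness(readings)
print(order)  # ['apple', 'orange', 'ripe banana']
```

Averaging across touches is one simple way to combine readings from multiple object parts; the same pattern would apply to roughness, although the article notes that sorting by such properties is currently less reliable.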

Free Interaction and Future Directions

Beyond structured tasks, users will have the opportunity for “Free Interaction” with Octopi-1.5. This allows for open-ended exploration, where participants can provide both visual and tactile inputs, engage in unconstrained chat, and test experimental features such as teaching the model new items. This segment aims to uncover new insights into the system’s strengths and weaknesses. While Octopi-1.5 shows promising results, the researchers acknowledge limitations such as generalization challenges and the need for larger base language models. Future work includes exploring more advanced RAG setups, incorporating tactile sample retrieval, and ultimately linking Octopi-1.5 to direct robot manipulation tasks, moving towards a Vision-Tactile-Language-Action (VTLA) model. For more technical details, see the full research paper.