Enhancing Robot Dexterity: How Phys2Real Combines Visual Intelligence and Interactive Learning for Real-World Tasks

TLDR: Phys2Real is a robotics framework that improves sim-to-real transfer for manipulation tasks by fusing visual estimates of physical properties from Vision-Language Models (VLMs) with online adaptation based on robot interactions. It uses uncertainty-aware weighting to combine these sources, leading to significantly higher success rates and faster task completion compared to traditional methods, especially for objects with complex dynamics like varying centers of mass.

Teaching robots to manipulate objects in the real world is costly and time-consuming. Training in simulated environments is more scalable, but transferring the learned skills to the real world remains a significant hurdle, especially for tasks requiring precise movements. This challenge is often referred to as the “sim-to-real gap.”

To tackle this, researchers have introduced Phys2Real, a novel approach that combines insights from vision-language models (VLMs) with interactive learning to help robots adapt to real-world objects. The core idea is to enable robots to make initial judgments about an object’s physical properties from its visual appearance, much like humans do, and then refine these estimates through interaction.

Phys2Real operates through a three-stage pipeline. The first stage is a “real-to-sim” reconstruction process. For objects without existing digital models, the system can create high-fidelity, simulation-ready 3D meshes directly from video frames. It does this by segmenting the object from the images and then using 3D reconstruction techniques such as Gaussian Splatting to build a detailed digital twin.

The second stage focuses on “policy learning.” Here, deep reinforcement learning policies are trained in simulation to manipulate these digital objects. Crucially, these policies are not just trained to be robust to a wide range of parameters (a common technique called domain randomization), but they are specifically conditioned on interpretable physical parameters, such as an object’s center of mass (CoM). This allows the robot to learn optimal behaviors for different physical configurations. An optional fine-tuning phase also introduces noisy parameter estimates to make the policy more robust to real-world uncertainties.
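To illustrate what conditioning a policy on a physical parameter can look like, here is a minimal sketch. It is not the paper's implementation: the function names, the CoM sampling range, the flat observation vector, and the Gaussian noise model are all assumptions chosen for illustration. The idea is that each simulated episode samples a CoM, and the policy input is the observation with a (possibly noisy) CoM estimate appended.

```python
import random

def sample_com(rng, low=-0.05, high=0.05):
    """Sample a planar center-of-mass offset for one simulated episode.

    The range (in metres) is an illustrative assumption, not a value from the paper.
    """
    return [rng.uniform(low, high), rng.uniform(low, high)]

def build_policy_input(observation, com_xy, noise_std=0.0, rng=None):
    """Append a CoM estimate to the observation so the policy is conditioned on it.

    noise_std > 0 mimics a fine-tuning phase with noisy parameter estimates;
    the Gaussian noise model is a hypothetical choice.
    """
    rng = rng or random.Random()
    noisy_com = [c + rng.gauss(0.0, noise_std) for c in com_xy]
    return list(observation) + noisy_com

rng = random.Random(0)
com = sample_com(rng)
obs = [0.1, 0.2, 0.3]  # placeholder observation (e.g. object pose)
clean_input = build_policy_input(obs, com)  # exact CoM
noisy_input = build_policy_input(obs, com, noise_std=0.01, rng=rng)  # perturbed CoM
```

In contrast, a pure domain-randomization policy would see only `obs` and would have to be robust to every sampled CoM without being told which one it is facing.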

The third stage is “sim-to-real transfer,” which is built on uncertainty-aware fusion. It combines two sources of information: initial physical parameter estimates from Vision-Language Models (VLMs) and online estimates derived from the robot’s real-world interactions. VLMs, like GPT-5, are queried with images of the object to provide an estimated CoM and an associated uncertainty. Simultaneously, an “adaptation model” learns to predict physical properties from the robot’s history of observations and actions during interaction. The system then fuses these two estimates using inverse-variance weighting: each estimate is weighted by the inverse of its variance, so the system relies more on the VLM when the interaction-based estimate is uncertain, and more on the interaction data when the VLM’s visual estimate is less certain. This allows for continuous adaptation during tasks, even when contact with the object is intermittent.
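Inverse-variance weighting itself is a standard technique and can be sketched in a few lines. The example below is an illustration under assumptions, not the paper's code: the `Estimate` class, the `fuse` function, and the numeric values are hypothetical, and the CoM is reduced to a single scalar coordinate for simplicity.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    mean: float      # estimated CoM coordinate (scalar, for simplicity)
    variance: float  # uncertainty of that estimate; must be > 0

def fuse(vlm: Estimate, interaction: Estimate) -> Estimate:
    """Fuse two estimates by inverse-variance weighting."""
    w_vlm = 1.0 / vlm.variance
    w_int = 1.0 / interaction.variance
    fused_variance = 1.0 / (w_vlm + w_int)
    fused_mean = fused_variance * (w_vlm * vlm.mean + w_int * interaction.mean)
    return Estimate(fused_mean, fused_variance)

# Equal confidence: the fused mean sits halfway between the two estimates.
balanced = fuse(Estimate(0.0, 1.0), Estimate(1.0, 1.0))

# Confident VLM, uncertain interaction estimate: the result stays near the VLM value.
vlm_dominant = fuse(Estimate(0.0, 0.01), Estimate(1.0, 1.0))
```

Because the weights are recomputed from the current variances, the balance can shift over time: if the interaction-based estimate becomes more certain as contact data accumulates, its weight grows accordingly.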

The effectiveness of Phys2Real was tested on two challenging planar pushing tasks: manipulating a T-block with a varying center of mass and pushing a hammer with an off-center mass distribution. In the T-block pushing task, Phys2Real achieved a 100% success rate for a bottom-weighted T-block, significantly outperforming a domain randomization baseline (79%). For the more challenging top-weighted T-block, it achieved a 57% success rate compared to 23% for the baseline. On the hammer pushing task, while both Phys2Real and the baseline achieved 100% success, Phys2Real completed the task 15% faster on average, demonstrating more efficient trajectories.

A key finding from the research is that neither VLM estimates nor interactive adaptation alone were sufficient for high performance in challenging scenarios; the combination of both sources of information was essential for success. This highlights the power of integrating visual understanding with physical interaction for robust robotic manipulation.

This work represents a significant step towards more general and adaptive robotic systems that can learn from both perception and physical interaction, enabling them to handle novel objects in the real world with greater precision and efficiency. You can read the full research paper here: Phys2Real Research Paper.
