How Robots Get Smarter by Reflecting on Their Actions

TLDR: LITEN (Learning from Inference-Time Execution) is a new method that allows robots to learn from their experiences in the real world without additional training. It uses a high-level vision-language model (VLM) to plan tasks and a low-level vision-language-action (VLA) model to execute them. When the robot fails, a VLM “judge” assesses what went wrong and why, providing feedback that the VLM planner uses to refine its strategy for future attempts. This iterative process helps robots understand their own capabilities (affordances) and improve performance on complex, multi-step tasks.

Solving complex tasks in the real world often requires trial and error. If we fail the first time, we reflect on what went wrong and adjust our approach. This human-like ability to learn from mistakes is crucial for robots, especially those powered by Vision-Language-Action (VLA) models, which are designed to understand and execute commands.

However, current VLA models typically operate in a “single-shot” manner, meaning they are evaluated on their ability to follow individual commands without dynamically adjusting their behavior when faced with unexpected outcomes or failures. This limitation prevents them from tackling more intricate, long-horizon tasks that demand continuous adaptation.

A new research paper, “Learning Affordances at Inference-Time for Vision-Language-Action Models,” introduces a method called LITEN (Learning from Inference-Time Execution). Developed by Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A. Seshia, and Sergey Levine, LITEN lets robots learn from their real-world experiences without additional training.

How LITEN Works: A Two-Phase Approach

LITEN operates through an iterative, two-phase process: a reasoning phase and an assessment phase. This cycle allows the robot to progressively understand its own capabilities, known as “affordances,” and refine its task-solving strategies.

In the reasoning phase, a high-level Vision-Language Model (VLM) acts as the planner. Given a task, like “Empty two of the bowls,” the VLM breaks it down into a sequence of smaller, manageable subtasks. It then instructs a low-level VLA policy to execute these subtasks in the physical world. Crucially, this VLM planner considers insights gathered from previous attempts, which are included in its context.
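
The way earlier insights reach the planner can be pictured as plain prompt construction. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the function name build_planner_prompt, the prompt wording, and the example insight are all hypothetical.

```python
from typing import List


def build_planner_prompt(task: str, insights: List[str]) -> str:
    """Assemble a planner prompt from the task and notes from earlier attempts.

    Hypothetical helper: the wording below is an assumption, not the prompt
    used in the paper.
    """
    lines = [f"Task: {task}"]
    if insights:
        # Earlier conclusions are placed in the planner's context verbatim.
        lines.append("Lessons from previous attempts:")
        lines.extend(f"- {note}" for note in insights)
    else:
        lines.append("No previous attempts recorded.")
    lines.append("Break the task into short subtasks, one instruction per line.")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_planner_prompt(
        "Empty two of the bowls",
        # Made-up insight, for illustration only.
        ["The bowl at the back was out of the gripper's reach."],
    )
    print(prompt)
```

On a first attempt the insight list is empty, so the planner sees only the task; on later attempts the same function produces a longer prompt that carries what was learned.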

Following execution, the system enters the assessment phase. Here, a VLM “judge” evaluates the outcome of each subtask. Unlike traditional methods that might rely on precise simulated feedback, LITEN’s judge must interpret unstructured real-world data, such as raw videos or images of the robot’s actions. It systematically determines if a subtask succeeded, what happened if it failed, why it failed, and what minimal changes could improve the chances of success in the future. These valuable conclusions are then fed back into the VLM planner for the next reasoning phase, allowing it to generate more effective plans.
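
The two phases can be sketched as a short loop. This is a hedged illustration rather than the authors' code: the Assessment fields mirror the four questions the judge answers, but the field names, the run_attempts function, the callable signatures, and the rule that an attempt counts as successful only when every subtask succeeds are assumptions made for this example. In a real system the three callables would wrap the VLM planner, the VLA policy, and the VLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Assessment:
    """One judge verdict for one subtask (field names are hypothetical)."""
    subtask: str
    succeeded: bool
    what_happened: Optional[str] = None     # observed outcome, if it failed
    why_failed: Optional[str] = None        # the judge's explanation
    suggested_change: Optional[str] = None  # minimal change to try next time

    def as_note(self) -> str:
        """Render the verdict as text for the planner's context."""
        if self.succeeded:
            return f"'{self.subtask}' succeeded."
        return (f"'{self.subtask}' failed: {self.what_happened} "
                f"Reason: {self.why_failed} Next time: {self.suggested_change}")


Planner = Callable[[str, List[str]], List[str]]  # (task, notes) -> subtasks
Executor = Callable[[str], str]                  # subtask -> observation
Judge = Callable[[str, str], Assessment]         # (subtask, observation) -> verdict


def run_attempts(task: str, plan: Planner, execute: Executor, judge: Judge,
                 max_attempts: int = 3) -> List[str]:
    """Alternate reasoning and assessment; only the list of notes changes."""
    notes: List[str] = []
    for _ in range(max_attempts):
        subtasks = plan(task, notes)                         # reasoning phase
        verdicts = [judge(s, execute(s)) for s in subtasks]  # assessment phase
        notes.extend(v.as_note() for v in verdicts)
        if all(v.succeeded for v in verdicts):               # assumed stop rule
            break
    return notes
```

Nothing in the loop updates model weights. The only state carried between attempts is the growing list of notes, which is what learning at inference time amounts to in this sketch.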

Learning Without Extra Training

One of LITEN’s most significant contributions is its ability to learn affordances at inference time. This means the robot learns what it can and cannot do, given its physical embodiment, the environment’s constraints, and the VLA policy’s learned behaviors, all without any additional policy training. The high-level VLM essentially “feels out” the low-level policy’s capabilities, gradually strengthening its interface and improving its high-level task reasoning as it accumulates experience.

Real-World Performance and Insights

The researchers implemented LITEN using GPT-5-mini as the high-level VLM and π0.5-DROID, a state-of-the-art VLA, as the low-level policy. They tested LITEN on a DROID Franka robot setup across three challenging multi-stage tasks: Stacking, Emptying Bowls, and Moving Off Table. These tasks require the robot to understand complex interactions, such as which objects can be stacked without falling or which bowls are accessible to its gripper.

The experimental results showed that LITEN consistently improved its success rates over consecutive attempts, learning from both successes and failures. It significantly outperformed baseline approaches that used no feedback, used only positive examples, or relied on less structured reflection methods. For instance, in the Stacking task LITEN learned that the VLA might be biased towards manipulating larger objects and that certain objects were too difficult for precise control.

An ablation study further highlighted the importance of LITEN’s structured assessment process. Removing steps like failure reasoning or outcome analysis dramatically reduced performance, underscoring that detailed feedback is critical for meaningful learning.

Challenges and Future Directions

While LITEN marks a significant step forward, the researchers also identified areas for improvement. Failures sometimes arose from the inherent unpredictability of the VLA, from control failures being misattributed to language instructions, or from difficulty reasoning causally about the optimal order of subtasks. For example, placing one object might accidentally knock another off, a sequence that the VLM reasoner found difficult to anticipate and correct in future plans.

Despite these challenges, LITEN’s broad applicability is a key strength. It is hardware-agnostic and can be used with any off-the-shelf VLM and VLA, requiring only prompt adjustments for new robot setups. As VLM video comprehension capabilities and VLA language following improve, LITEN is expected to become even more powerful, enabling robots to solve increasingly complex tasks in the real world through continuous, inference-time learning.
