TLDR: LITEN (Learning from Inference-Time Execution) is a new method that allows robots to learn from their experiences in the real world without additional training. It uses a high-level vision-language model (VLM) to plan tasks and a low-level vision-language-action (VLA) model to execute them. When the robot fails, a VLM “judge” assesses what went wrong and why, providing feedback that the VLM planner uses to refine its strategy for future attempts. This iterative process helps robots understand their own capabilities (affordances) and improve performance on complex, multi-step tasks.
Solving complex tasks in the real world often requires trial and error. If we fail the first time, we reflect on what went wrong and adjust our approach. This human-like ability to learn from mistakes is crucial for robots, especially those powered by Vision-Language-Action (VLA) models, which are designed to understand and execute commands.
However, current VLA models typically operate in a “single-shot” manner, meaning they are evaluated on their ability to follow individual commands without dynamically adjusting their behavior when faced with unexpected outcomes or failures. This limitation prevents them from tackling more intricate, long-horizon tasks that demand continuous adaptation.
A new research paper, “Learning Affordances at Inference-Time for Vision-Language-Action Models,” introduces a method called LITEN (Learning from Inference-Time Execution). Developed by Ameesh Shah, William Chen, Adwait Godbole, Federico Mora, Sanjit A. Seshia, and Sergey Levine, LITEN lets robots learn from their real-world experience without additional training.
How LITEN Works: A Two-Phase Approach
LITEN operates through an iterative, two-phase process: a reasoning phase and an assessment phase. This cycle allows the robot to progressively understand its own capabilities, known as “affordances,” and refine its task-solving strategies.
In the reasoning phase, a high-level Vision-Language Model (VLM) acts as the planner. Given a task, like “Empty two of the bowls,” the VLM breaks it down into a sequence of smaller, manageable subtasks. It then instructs a low-level VLA policy to execute these subtasks in the physical world. Crucially, this VLM planner considers insights gathered from previous attempts, which are included in its context.
Following execution, the system enters the assessment phase. Here, a VLM “judge” evaluates the outcome of each subtask. Unlike traditional methods that might rely on precise simulated feedback, LITEN’s judge must interpret unstructured real-world data, such as raw videos or images of the robot’s actions. It systematically determines if a subtask succeeded, what happened if it failed, why it failed, and what minimal changes could improve the chances of success in the future. These valuable conclusions are then fed back into the VLM planner for the next reasoning phase, allowing it to generate more effective plans.
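The two phases described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Assessment` fields, the `run_attempts` function, and the toy planner, policy, and judge are all hypothetical stand-ins for real VLM and VLA calls, and the sketch assumes every subtask in a plan is attempted even after an earlier one fails.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assessment:
    """Hypothetical record of the judge's structured conclusions for one subtask."""
    subtask: str
    succeeded: bool
    what_happened: str = ""
    why_failed: str = ""
    suggested_change: str = ""

    def as_note(self) -> str:
        # Text form that is placed in the planner's context on later attempts.
        if self.succeeded:
            return f"'{self.subtask}' succeeded."
        return (f"'{self.subtask}' failed: {self.what_happened} "
                f"Likely cause: {self.why_failed} "
                f"Try next time: {self.suggested_change}")


# Assumed interfaces; a real system would wrap VLM and VLA calls here.
Planner = Callable[[str, List[str]], List[str]]  # (task, notes) -> subtasks
Policy = Callable[[str], str]                    # subtask -> observation (e.g. a video)
Judge = Callable[[str, str], Assessment]         # (subtask, observation) -> assessment


def run_attempts(task: str, planner: Planner, policy: Policy,
                 judge: Judge, num_attempts: int) -> List[List[Assessment]]:
    notes: List[str] = []  # accumulated experience; no model weights change
    history: List[List[Assessment]] = []
    for _ in range(num_attempts):
        plan = planner(task, notes)                   # reasoning phase
        results = [judge(s, policy(s)) for s in plan]  # execute, then assess
        notes.extend(a.as_note() for a in results)    # feed conclusions back
        history.append(results)
    return history


# Toy stand-ins so the loop can run without a robot or a model.
def toy_planner(task: str, notes: List[str]) -> List[str]:
    if any("far bowl" in n and "failed" in n for n in notes):
        return ["empty the near bowl", "empty the middle bowl"]
    return ["empty the near bowl", "empty the far bowl"]


def toy_policy(subtask: str) -> str:
    return f"video of: {subtask}"


def toy_judge(subtask: str, observation: str) -> Assessment:
    if "far bowl" in subtask:
        return Assessment(subtask, False,
                          "The gripper stopped short of the bowl.",
                          "The bowl is outside the arm's reach.",
                          "Choose a closer bowl.")
    return Assessment(subtask, True)


history = run_attempts("Empty two of the bowls",
                       toy_planner, toy_policy, toy_judge, num_attempts=2)
```

In this toy run, the first attempt fails on the far bowl, and the resulting note steers the second plan toward a bowl the robot can reach. All of the "learning" lives in the `notes` list passed to the planner.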
Learning Without Extra Training
One of LITEN’s most significant contributions is its ability to learn affordances at inference time. This means the robot learns what it can and cannot do, given its physical embodiment, the environment’s constraints, and the VLA policy’s learned behaviors, all without any additional policy training. The high-level VLM essentially “feels out” the low-level policy’s capabilities, gradually strengthening its interface and improving its high-level task reasoning as it accumulates experience.
Real-World Performance and Insights
The researchers implemented LITEN using GPT-5-mini as the high-level VLM and π0.5-DROID, a state-of-the-art VLA, as the low-level policy. They tested LITEN on a DROID Franka robot setup across three challenging multi-stage tasks: Stacking, Emptying Bowls, and Moving Off Table. These tasks require the robot to understand complex interactions, such as which objects can be stacked without falling or which bowls are accessible to its gripper.
The experimental results demonstrated that LITEN consistently improved its success rates over consecutive attempts, effectively learning from both successes and failures. It significantly outperformed baseline approaches that either didn’t use feedback, only used positive examples, or relied on less structured reflection methods. For instance, LITEN learned that the VLA might be biased towards manipulating larger objects in the Stacking task or that certain objects were too difficult for precise control.
An ablation study further highlighted the importance of LITEN’s structured assessment process. Removing steps like failure reasoning or outcome analysis dramatically reduced performance, underscoring that detailed feedback is critical for meaningful learning.
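One way to picture such an ablation is as dropping fields from the feedback that reaches the planner. The sketch below is only an assumption about how that could be set up; the step names and the `render_feedback` helper are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable

# Hypothetical names for the judge's assessment steps, in order.
ASSESSMENT_STEPS = ("success", "outcome", "failure_reason", "suggested_change")


def render_feedback(assessment: Dict[str, str],
                    keep: Iterable[str] = ASSESSMENT_STEPS) -> str:
    """Format only the kept assessment steps for the planner's context."""
    keep = set(keep)
    unknown = keep - set(ASSESSMENT_STEPS)
    if unknown:
        raise ValueError(f"unknown assessment steps: {sorted(unknown)}")
    return " | ".join(f"{step}: {assessment[step]}"
                      for step in ASSESSMENT_STEPS
                      if step in keep and step in assessment)


example = {
    "success": "no",
    "outcome": "the top block slid off",
    "failure_reason": "the base object was too small",
    "suggested_change": "stack onto the larger block",
}

full = render_feedback(example)
# Ablated variant: the planner no longer sees what happened or why.
ablated = render_feedback(example, keep=("success", "suggested_change"))
```

With the outcome and failure reason removed, the planner receives a verdict and a suggestion but no explanation. That loss of explanation is the kind of missing detail the ablation results point to.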
Challenges and Future Directions
While LITEN marks a significant step forward, the researchers also identified areas for improvement. Failures sometimes stemmed from the inherent unpredictability of the VLA, from control failures being misattributed to language instructions, or from difficulty reasoning causally about the best order of subtasks. For example, placing one object might accidentally knock another off, a sequence that the VLM reasoner found difficult to anticipate and correct in future plans.
Despite these challenges, LITEN’s broad applicability is a key strength. It is hardware-agnostic and can be used with any off-the-shelf VLM and VLA, requiring only prompt adjustments for new robot setups. As VLM video comprehension capabilities and VLA language following improve, LITEN is expected to become even more powerful, enabling robots to solve increasingly complex tasks in the real world through continuous, inference-time learning.


