RICL: Enhancing Robot Learning with In-Context Adaptability

TLDR: The research paper introduces RICL (Retraining for In-Context Learning), a method to inject in-context learning (ICL) abilities into pre-trained Vision-Language-Action (VLA) models for robotics. Unlike traditional VLAs that require extensive retraining for new tasks, RICL enables them to adapt to unseen objects, novel motions, and new environments using only a small number of demonstrations as context, without parameter updates. The RICL-enhanced VLA, specifically RICL-π0-FAST-DROID, significantly outperforms baseline models in task success and shows even greater improvements when further fine-tuned on task-specific data.

Robotics is undergoing a significant transformation with the emergence of general-purpose Vision-Language-Action (VLA) models. These models are designed to understand visual information, follow language instructions, and execute physical actions, holding immense promise for tackling complex robotic tasks. However, a key challenge has been their inability to adapt to new tasks without extensive retraining. The missing capability is in-context learning (ICL), the ability to pick up a new task from a few examples placed in the model's context, which is common in large language models (LLMs).

Unlike LLMs, which naturally acquire ICL abilities from their vast training data, VLAs trained through imitation learning typically do not. This means that to teach a VLA a new skill, users often have to go through a cumbersome process of fine-tuning its parameters with new demonstration datasets. This paper introduces a novel approach called Retraining for In-Context Learning (RICL), which aims to inject this crucial adaptability into pre-trained VLA models.

RICL works by post-training an existing VLA, such as the state-of-the-art π0-FAST-DROID model, using a specific recipe and a small dataset of robot demonstrations. The core idea is to enable the VLA to leverage in-context learning, similar to how Retrieval-Augmented Generation (RAG) enhances LLMs. When a user provides a small number of demonstrations (typically 10-20) for a new task, RICL fetches the most relevant parts of these demonstrations and integrates them into the VLA’s context. This allows the VLA to perform the new task and significantly improve its performance without any parameter updates.
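The retrieval step described above can be sketched as a nearest-neighbor lookup. This is a minimal illustration, not the authors' implementation: it assumes each demonstration step has already been embedded as a vector (for example by a frozen image encoder), and the names `retrieve_context`, `demo_steps`, and the `"embedding"`/`"action"` keys are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_context(query_embedding, demo_steps, k=4):
    """Return the k demonstration steps closest to the query, nearest first.

    Assumption: each step is a dict with an "embedding" and an "action".
    The selected steps would then be placed in the VLA's context.
    """
    ranked = sorted(
        demo_steps,
        key=lambda step: l2_distance(query_embedding, step["embedding"]),
    )
    return ranked[:k]

# Toy data: three demonstration steps with 2-D embeddings (hypothetical values).
demo_steps = [
    {"embedding": [0.0, 0.0], "action": [0.1, 0.0]},
    {"embedding": [1.0, 1.0], "action": [0.0, 0.2]},
    {"embedding": [5.0, 5.0], "action": [0.3, 0.3]},
]
context = retrieve_context([0.9, 1.1], demo_steps, k=2)
```

Because the lookup runs over the user's 10-20 demonstrations rather than the training set, adding a new task only means adding entries to `demo_steps`; no model weights change.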

The RICL architecture involves fine-tuning only the language model component of the VLA while keeping the image encoder frozen. It uses an action interpolation layer that combines the actions from the closest retrieved demonstration with the VLA’s own predictions, effectively blending learned behaviors with new contextual information. This process primes the VLA to effectively use its context for adaptation.
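One way to picture the action interpolation layer is as a distance-weighted blend. The article does not give the exact formula, so the exponential weighting, the `lam` parameter, and the function name `interpolate_action` below are assumptions made for illustration; the sketch also blends continuous action vectors, which may differ from how the actual model represents actions.

```python
import math

def interpolate_action(retrieved_action, predicted_action, distance, lam=1.0):
    """Blend the nearest demonstration's action with the model's prediction.

    Assumption: the weight on the retrieved action decays exponentially
    with the distance between the query and the nearest retrieved step,
    so a close match is trusted more and a poor match defers to the model.
    """
    weight = math.exp(-lam * distance)
    return [
        weight * r + (1.0 - weight) * p
        for r, p in zip(retrieved_action, predicted_action)
    ]

# Hypothetical 2-D actions for illustration only.
retrieved = [0.2, -0.1]
predicted = [0.6, 0.3]

exact_match = interpolate_action(retrieved, predicted, distance=0.0)
far_match = interpolate_action(retrieved, predicted, distance=1000.0)
```

With a distance of zero the blend returns the retrieved action, and as the distance grows it approaches the model's own prediction.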

The researchers applied RICL to the π0-FAST-DROID VLA and evaluated the result on a variety of new manipulation tasks. These tasks included handling unseen objects, performing novel motions, and operating in new environments such as a kitchen sink. RICL-π0-FAST-DROID showed a dramatic improvement in task success rates over the baseline π0-FAST-DROID: aggregated across all evaluated tasks, the RICL-enhanced model achieved a complete task success rate of 31.25%, compared with the baseline's 2.5%.

Notably, RICL-π0-FAST-DROID demonstrated improved language grounding, allowing it to correctly identify and interact with unseen objects. More importantly, it overcame adaptation challenges, inferring novel grasps and motions from its context. In some cases, the model even predicted and executed action sequences that were not explicitly present in the retrieval dataset, suggesting an ability to elicit latent knowledge.

The study also explored the benefits of further fine-tuning the RICL-VLA on the target task demonstrations. This led to even greater performance boosts, with the fine-tuned RICL-VLA achieving a 61.67% aggregate complete task success rate, nearly double that of a vanilla VLA fine-tuned on the same data. This suggests that RICL prepares the VLA to learn more efficiently from new data.

While RICL represents a significant step forward, the authors acknowledge limitations. The current approach primarily focuses on pick-and-place tasks, which are the main strength of the base VLA. It may struggle with significantly more complex or diverse motions. Additionally, it still relies on a few teleoperated demonstrations, and future work aims to explore using human video demonstrations to reduce this dependency.

In conclusion, RICL offers a practical and effective method for injecting in-context adaptability into pre-trained Vision-Language-Action models, making them more versatile and easier for end-users to teach new skills without parameter updates. This work paves the way for more adaptable, generalist robots. More details are available in the research paper, "RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models."
