TLDR: iLearnRobot is a system that lets multi-modal robots keep improving after deployment by learning from natural dialogues with non-expert users. It uses a ‘Chain of Question’ module to clarify user intent, a ‘Dual-Modality Retrieval’ module to avoid repeating past mistakes, and a ‘Data Construction’ module to build a history event database. The stored events are then used to fine-tune the robot’s underlying Multi-modal Large Language Model (MLLM), improving adaptability and user experience in novel scenarios.
Robots are increasingly becoming part of our daily lives, but a significant challenge remains: how can they improve their performance after being deployed, especially when encountering situations they’ve never seen before? Traditional methods often involve expert teams collecting and annotating data, which is not practical for the vast array of scenarios robots might face. Furthermore, robots powered by Multi-modal Large Language Models (MLLMs) often struggle with ambiguous user questions and tend to repeat mistakes until a full model update can be performed, which can take considerable time and resources.
A new research paper introduces an innovative solution called iLearnRobot, an interactive learning-based multi-modal robot system designed for continuous improvement. This system stands out by enabling robots to learn directly from natural conversations with everyday users, addressing the common issues of ambiguity and repetitive errors.
How iLearnRobot Works
The iLearnRobot system integrates several key modules to achieve its continuous learning capability:
First, the Chain of Question module tackles the problem of ambiguous user queries. Imagine asking a robot, “What is that?” without further context. Instead of guessing, iLearnRobot will engage in a series of clarifying questions with the user until it confidently understands the precise intent. This iterative dialogue process significantly enhances the user experience by ensuring the robot provides relevant answers.
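The clarification loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say how ambiguity is detected or how clarifying questions are generated, so `is_ambiguous`, `make_clarifying_question`, `ask_user`, and the `max_rounds` cap are hypothetical stand-ins for calls to the MLLM and the user interface.

```python
from typing import Callable, List, Tuple


def chain_of_question(
    question: str,
    is_ambiguous: Callable[[str], bool],
    make_clarifying_question: Callable[[str], str],
    ask_user: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Refine `question` through clarifying turns until it is unambiguous.

    All three callables are hypothetical placeholders; in a real system
    they would wrap the MLLM and the dialogue interface.
    """
    dialogue: List[Tuple[str, str]] = []
    refined = question
    for _ in range(max_rounds):
        if not is_ambiguous(refined):
            break
        clarifying = make_clarifying_question(refined)
        reply = ask_user(clarifying)
        dialogue.append((clarifying, reply))
        # Fold the user's reply into the working question.
        refined = f"{refined} ({reply})"
    return refined, dialogue
```

The function returns both the refined question and the clarifying turns, so the full exchange stays available for later storage. The round cap is a design choice of this sketch, added so the loop cannot run indefinitely.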
Second, the Dual-Modality Retrieval module is designed to prevent the robot from repeating past mistakes. In real-world applications, errors are inevitable. Before a full model update can occur, this module allows the robot to search its history event database for similar past interactions. If a match is found, the robot uses the previously corrected answer as a reference to generate a more accurate response for the current query. This ensures immediate improvement and a smoother user experience, even before the underlying MLLM is updated.
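One way to picture the lookup is as a similarity search over both the image and the question. The sketch below is an assumption, not the paper's method: the article does not specify the similarity measure, so cosine similarity over precomputed embeddings, the rule of taking the weaker of the two scores, the `min_score` threshold, and the `StoredEvent` type are all illustrative choices.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class StoredEvent:
    """Hypothetical history entry with precomputed embeddings."""
    image_vec: Sequence[float]  # embedding of the stored image
    text_vec: Sequence[float]   # embedding of the stored question
    answer: str                 # answer as corrected by the user


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    image_vec: Sequence[float],
    text_vec: Sequence[float],
    history: List[StoredEvent],
    min_score: float = 0.8,  # assumed threshold, not from the paper
) -> Optional[StoredEvent]:
    """Return the best-matching past event, or None if nothing is close enough."""
    best, best_score = None, min_score
    for event in history:
        # Require both modalities to match by scoring with the weaker one.
        score = min(
            cosine(image_vec, event.image_vec),
            cosine(text_vec, event.text_vec),
        )
        if score >= best_score:
            best, best_score = event, score
    return best
```

When `retrieve` returns an event, its `answer` would be passed to the model as a reference for the current query; when it returns `None`, the robot would answer without a reference.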
Third, the Data Construction module builds the robot’s knowledge base from users’ verbal corrections. It distills each complete interaction (the multi-round dialogue, the associated image, the user’s precise question, and the correct answer) into a structured format. The distilled data is then stored in a history event database, ready for retrieval and future model updates.
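A structured event of this kind might look like the record below. The article names the four ingredients but not the storage format, so the field names, the `InteractionRecord` type, and the choice of one JSON object per line are assumptions made for illustration. The distillation step itself, which the system would perform on the raw dialogue, is not modelled here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class InteractionRecord:
    """Hypothetical schema for one distilled interaction."""
    image_path: str                 # where the associated image is stored
    dialogue: List[Dict[str, str]]  # turns as {"speaker": ..., "text": ...}
    question: str                   # the user's clarified, precise question
    answer: str                     # the correct answer after user correction


def to_json_line(record: InteractionRecord) -> str:
    """Serialize a record as a single JSON line for an append-only store."""
    return json.dumps(asdict(record), ensure_ascii=False)


def from_json_line(line: str) -> InteractionRecord:
    """Rebuild a record from a line written by `to_json_line`."""
    return InteractionRecord(**json.loads(line))
```

Keeping the raw dialogue next to the distilled question and answer means one record can serve both uses mentioned above: retrieval needs the image and question, while fine-tuning needs the question and answer pair.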
Finally, the Model Update process ensures long-term improvement. When a sufficient number of interaction events accumulate in the history database, the MLLM (specifically, LLaVA-NeXT in this research) is fine-tuned using this new data. This iterative training cycle allows the robot to continuously enhance its intrinsic capabilities, adapting to new scenarios and improving its perception, understanding, and recognition over time.
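The trigger for an update can be sketched as a simple counter over events not yet used for training. The article does not state how many events count as sufficient, so the `threshold` value, the `UpdateScheduler` class, and the `fine_tune` callback are hypothetical; the callback stands in for the actual LLaVA-NeXT training job.

```python
from typing import Callable, List


class UpdateScheduler:
    """Collects new events and starts fine-tuning once enough have arrived."""

    def __init__(self, threshold: int, fine_tune: Callable[[List[dict]], None]):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.fine_tune = fine_tune      # placeholder for the real training job
        self.pending: List[dict] = []   # events not yet used for training

    def add_event(self, event: dict) -> bool:
        """Record an event; return True if this call triggered fine-tuning."""
        self.pending.append(event)
        if len(self.pending) < self.threshold:
            return False
        # Hand the accumulated batch to training and start a fresh batch.
        batch, self.pending = self.pending, []
        self.fine_tune(batch)
        return True
```

In this sketch only the pending list is cleared after training. The history event database itself would stay intact, so retrieval keeps working between updates.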
Experimental Validation
The researchers conducted experiments using 10 similar medicine bottles, a challenging novel scenario for the robot. They performed three rounds of testing with 25 participants:
- Round 1 (Initial Test): The robot used a public pre-trained MLLM with the Chain of Question module but no retrieval. Accuracy was very low, as expected for novel items.
- Round 2 (Retrieval Test): The robot could access the history events collected from Round 1. Performance significantly improved due to the Dual-Modality Retrieval module, even without fine-tuning.
- Round 3 (Fine-tuned Test): The MLLM was fine-tuned on data from the first two rounds. The robot achieved the best results, demonstrating superior accuracy and user experience based on its newly learned intrinsic knowledge.
A key finding from the fine-tuning process was that updating the visual encoder’s weights alongside the projector layer and LLM significantly improved performance in novel scenarios with subtle visual differences. This suggests that for highly specific or new visual tasks, allowing the visual perception component to learn is crucial.
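The difference between the two fine-tuning setups comes down to which parameter groups are allowed to change. The sketch below shows that selection in plain Python. The prefixes `vision_tower.`, `mm_projector.`, and `language_model.` are assumed names for the three components and may not match the checkpoint used in the paper.

```python
from typing import Dict, Iterable

# Assumed parameter-name prefixes for the three components (hypothetical).
VISION_PREFIX = "vision_tower."
PROJECTOR_PREFIX = "mm_projector."
LLM_PREFIX = "language_model."


def trainable_flags(
    param_names: Iterable[str], train_vision_encoder: bool
) -> Dict[str, bool]:
    """Map each parameter name to whether it should be updated in training."""
    prefixes = [PROJECTOR_PREFIX, LLM_PREFIX]
    if train_vision_encoder:
        prefixes.append(VISION_PREFIX)
    # str.startswith accepts a tuple of candidate prefixes.
    return {name: name.startswith(tuple(prefixes)) for name in param_names}
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` on each parameter before training starts.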
The iLearnRobot framework represents a significant step forward in making robots more adaptable and user-friendly. By integrating interactive learning, it allows robots to continuously improve from everyday dialogues, paving the way for more capable and reliable robotic systems in diverse environments. You can read more about this research in the full paper available at arXiv.org.


