TLDR: WorMI (World Model Implanting) is a novel framework that enables embodied AI agents to adapt robustly to new environments at test time without extensive retraining. It combines large language models (LLMs) with independently learned, domain-specific world models through a two-stage process: prototype-based retrieval to select relevant models and a world-wise compound attention mechanism to integrate and align their knowledge with the LLM’s reasoning. This approach significantly improves zero-shot and few-shot performance in unseen domains, demonstrating enhanced adaptability and data efficiency for AI agents.
In embodied artificial intelligence, a significant hurdle has been enabling AI agents to adapt to new and unfamiliar environments without extensive retraining or data collection. Consider a robot trained in one type of kitchen that must suddenly operate in a completely different one: traditionally, this would require collecting new data and retraining the agent. A new framework called WorMI (World Model Implanting) addresses this challenge by letting embodied agents adapt their knowledge at test time, at the moment of action.
WorMI tackles this problem by combining the reasoning capabilities of large language models (LLMs) with specialized, domain-specific “world models.” These world models act like mini-experts, each trained on a particular environment or task. WorMI’s key idea is that these expert world models can be “implanted” and “removed” as needed, so the agent’s core policy stays flexible and adaptable across domains.
How WorMI Works: A Dual-Stage Approach
The framework integrates two key methods to achieve its adaptive capabilities:
1. Prototype-based World Model Retrieval: At test time, when an agent encounters a new situation, WorMI doesn’t try to use all its world models at once. Instead, it intelligently retrieves only the most relevant ones. It does this by comparing the current environment’s characteristics (represented by “object-wise state embeddings”) to “prototypes” derived from each world model’s training data. These prototypes are like concise summaries of what each world model knows, allowing for efficient and accurate selection of the best-suited models.
2. World-wise Compound Attention: Once the relevant world models are identified, their knowledge needs to be integrated and aligned with the LLM’s reasoning. This is where the compound attention mechanism comes in. It uses a hierarchical cross-attention process. First, it integrates the intermediate representations from the selected world models, essentially combining their domain-specific insights. Then, it aligns this integrated knowledge with the LLM’s own reasoning process, ensuring that the agent’s decisions are informed by both general intelligence and specific environmental understanding.
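To make the retrieval step concrete, here is a minimal sketch of prototype-based selection. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of cosine similarity, the "best-matching prototype per object" scoring rule, and the toy three-dimensional embeddings are all hypothetical choices made for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_world_models(state_embeddings, prototypes, k=2):
    """Rank world models by how well their prototypes match the current state.

    state_embeddings: list of object-wise embedding vectors for the current scene.
    prototypes: dict mapping a world-model name to its list of prototype vectors.
    Returns the names of the top-k models and the score of every model.
    """
    scores = {}
    for name, protos in prototypes.items():
        # Assumed scoring rule: each observed object is matched to its closest
        # prototype, and the per-object similarities are averaged.
        best = [max(cosine(e, p) for p in protos) for e in state_embeddings]
        scores[name] = sum(best) / len(best)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy data (hypothetical): three world models, one scene with two objects.
prototypes = {
    "kitchen": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "bathroom": [[0.0, 1.0, 0.0]],
    "garage": [[0.0, 0.0, 1.0]],
}
state = [[0.9, 0.2, 0.0], [0.8, 0.0, 0.1]]

selected, scores = retrieve_world_models(state, prototypes, k=2)
print(selected)  # ['kitchen', 'bathroom']
```

Only the selected models would then be passed on to the attention stage, which keeps the cost of integration independent of how many world models exist in total.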
This dual-stage design allows WorMI to fuse domain-specific knowledge from multiple sources, leading to robust adaptation even in completely unseen environments. The compound attention module is also trained with meta-learning (that is, it learns how to learn), so it can adapt efficiently with minimal new data.
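The hierarchical cross-attention described above can be sketched in a few lines. This is a simplified illustration, not the authors' architecture: it uses a single LLM hidden vector as the query, plain scaled dot-product attention with no learned projections, and a residual addition as the "alignment" step. All function names and the toy vectors are assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights

def compound_attention(llm_hidden, world_model_states):
    """Two-level cross-attention over the selected world models.

    llm_hidden: one hidden vector from the LLM (the query).
    world_model_states: dict mapping a world-model name to a list of its
        intermediate representation vectors.
    Returns the updated hidden vector and the attention weight per world model.
    """
    names = list(world_model_states)
    # Level 1: summarize each world model's representations for this query.
    summaries = [attend(llm_hidden, world_model_states[n], world_model_states[n])[0]
                 for n in names]
    # Level 2: attend across world models to integrate their summaries.
    fused, world_weights = attend(llm_hidden, summaries, summaries)
    # Alignment (assumed here to be a simple residual connection).
    output = [h + f for h, f in zip(llm_hidden, fused)]
    return output, dict(zip(names, world_weights))

# Toy data (hypothetical): the query points along the first axis.
llm_hidden = [1.0, 0.0]
states = {
    "kitchen": [[2.0, 0.0], [1.0, 0.0]],
    "garage": [[0.0, 2.0], [0.0, 1.0]],
}
output, weights = compound_attention(llm_hidden, states)
print(weights)
```

In this toy case the "kitchen" model receives the larger world-level weight because its representations align with the query; a different query would shift the weights toward another model.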
Impressive Performance in Complex Environments
WorMI’s effectiveness has been rigorously tested on two prominent embodied AI benchmarks: VirtualHome, a 3D simulation for household tasks, and ALFWorld, a text-based environment for indoor task simulation. The results demonstrate superior performance compared to several state-of-the-art LLM-based approaches, particularly in scenarios where the agent encounters entirely new tasks and scenes.
For instance, in VirtualHome, WorMI showed a significant improvement in success rate (SR) and a reduction in pending steps (PS) over a leading baseline, SayCanPay. In zero-shot scenarios (where no target domain data is provided), WorMI achieved a 20.41% increase in SR and a 20.32% improvement in PS. In few-shot scenarios (with minimal new data), it achieved an average 26.58% gain in SR and a 4.98 step reduction in PS in VirtualHome, with similar gains in ALFWorld.
Further analysis revealed that WorMI’s world-level attention dynamically shifts its focus among different world models based on the current task, highlighting its context-aware reasoning. Ablation studies confirmed the critical roles of both the prototype-based retrieval and the compound attention mechanism in achieving these results. The framework also demonstrated consistent outperformance across various LLM sizes and showed promising scalability with the number of world models.
WorMI also proved robust in handling complex instructions, such as long-horizon tasks (sequences of sub-goals) and multiple concurrent instructions, achieving higher success rates and greater efficiency compared to baselines.
Looking Ahead
The WorMI framework represents a significant step toward scalable, real-world deployment of embodied agents. By allowing dynamic composition of domain-specific knowledge at test time, it addresses the need for adaptability and data efficiency in changing environments. Its current limitations are the computational overhead that comes with many world models and its reliance on the underlying LLM, but the framework’s potential for flexible, intelligent agents is clear. You can read the full research paper here: World Model Implanting for Test-time Adaptation of Embodied Agents.


