TLDR: CrafterDojo introduces a suite of foundation models (CrafterVPT, CrafterCLIP, CrafterSteve-1) and data generation toolkits (Expert Behavior Generator, Caption Generator) for the Crafter environment. This initiative aims to make Crafter a lightweight, prototyping-friendly testbed for general-purpose embodied AI, addressing limitations of Minecraft. The models enable behavioral priors, vision-language grounding, and instruction-following, demonstrating strong performance individually and when integrated for complex, long-horizon tasks through hierarchical planning.
Researchers from KAIST have introduced CrafterDojo, a new suite of foundation models and tools designed to advance the development of general-purpose embodied agents. This initiative aims to transform Crafter, a lightweight alternative to Minecraft, into a more accessible and prototyping-friendly testbed for AI research. While Minecraft has been a popular environment for embodied AI, its complexity, slow simulation speed, and engineering overhead often hinder rapid experimentation.
Crafter, a 2D top-down, grid-based environment, offers similar challenges to Minecraft, such as procedural map generation, resource collection, tool crafting, survival, and combat, but with a simpler Python implementation. Despite its advantages, Crafter’s utility has been limited due to the absence of foundation models, which have been crucial for progress in Minecraft-based research.
CrafterDojo addresses this gap by providing three key foundation models: CrafterVPT (C-VPT), CrafterCLIP (C-CLIP), and CrafterSteve-1 (C-Steve-1). C-VPT focuses on learning behavioral priors, essentially teaching agents how to perform basic actions and skills. C-CLIP enables vision-language grounding, allowing agents to understand and connect visual information with textual descriptions. Lastly, C-Steve-1 facilitates instruction-following, enabling agents to execute commands given in natural language.
A significant challenge in developing these models for Crafter was the lack of large-scale behavioral and caption data, unlike Minecraft which benefits from abundant online human demonstrations. To overcome this, CrafterDojo introduces two innovative toolkits for automatic data generation: the Expert Behavior Generator and the Caption Generator.
Expert Behavior Generator and CrafterPlay Dataset
The Expert Behavior Generator toolkit trains an expert policy using reinforcement learning to create high-quality, large-scale synthetic demonstrations. This toolkit was used to generate the CrafterPlay dataset, comprising 20,000 episodes with approximately 180 million timesteps. This dataset provides robust behavioral data, allowing C-VPT to learn diverse and general-purpose behaviors, including constructing shelters, strategically moving, blocking attacks, and building paths over water.
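The pipeline described above boils down to rolling out a trained expert policy and logging its observation-action pairs. The sketch below illustrates that loop under stated assumptions: `ToyEnv` and `expert_policy` are stand-ins I've invented for illustration (the real toolkit uses the actual Crafter environment, which follows a similar gym-style `reset()`/`step()` interface, and an RL-trained policy over pixels):

```python
import random

class ToyEnv:
    """Hypothetical stand-in for the Crafter environment's gym-style
    reset()/step() API (the real env returns RGB image observations)."""
    def __init__(self, horizon=50):
        self.horizon = horizon
        self.t = 0
    def reset(self):
        self.t = 0
        return {"t": self.t}
    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return {"t": self.t}, 0.0, done, {}

def expert_policy(obs):
    # Placeholder for the RL-trained expert; the real one maps pixels to
    # one of Crafter's 17 discrete actions.
    return random.randrange(17)

def generate_episodes(env, policy, n_episodes):
    """Roll out the expert and log (obs, action) pairs -- the core of how
    a CrafterPlay-style behavioral dataset could be assembled."""
    dataset = []
    for _ in range(n_episodes):
        obs, done, episode = env.reset(), False, []
        while not done:
            action = policy(obs)
            episode.append((obs, action))
            obs, _, done, _ = env.step(action)
        dataset.append(episode)
    return dataset

data = generate_episodes(ToyEnv(), expert_policy, n_episodes=3)
```

Scaling this loop to 20,000 episodes (and a competent expert) is what yields a dataset on the order of CrafterPlay's 180 million timesteps.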
Caption Generator and CrafterCaption Dataset
The Caption Generator toolkit employs a rule-based system to automatically create descriptive captions for video segments based on agent behavior and environment state changes. This process led to the creation of the CrafterCaption dataset, containing around 2.3 million video-caption pairs. To enhance linguistic diversity, the toolkit also integrates LLM-based augmentation, generating paraphrased variants of captions. This significantly improves the quality of representations and downstream task performance for models like C-CLIP.
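A rule-based captioner of this kind can be sketched as a diff over the agent's state before and after a video segment. The field names and caption templates below are illustrative, not the paper's actual schema:

```python
def diff_captions(prev, curr):
    """Illustrative rule-based captioning: compare inventory and achievement
    state across a video segment and emit template captions for each change."""
    captions = []
    for item, n in curr.get("inventory", {}).items():
        gained = n - prev.get("inventory", {}).get(item, 0)
        if gained > 0:
            captions.append(f"collect {gained} {item}")
    # Newly unlocked achievements become captions of their own.
    for ach in curr.get("achievements", set()) - prev.get("achievements", set()):
        captions.append(ach.replace("_", " "))
    return captions

prev = {"inventory": {"wood": 1}, "achievements": {"collect_wood"}}
curr = {"inventory": {"wood": 3, "stone": 1},
        "achievements": {"collect_wood", "make_wood_pickaxe"}}
caps = diff_captions(prev, curr)
```

The LLM-based augmentation step would then paraphrase each template caption (e.g., "collect 2 wood" into "gather some wood") to diversify the language the model sees.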
CrafterDojo Foundation Models in Detail
CrafterVPT (C-VPT): This model builds upon the Video PreTraining (VPT) concept from Minecraft. Trained on the CrafterPlay dataset, C-VPT learns a wide spectrum of agent capabilities. Experiments show that C-VPT significantly outperforms previous methods in Crafter Score and Return, demonstrating its effectiveness as a behavioral foundation. It even exhibits emergent behaviors not explicitly tied to achievements, like building shelters or bridges.
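VPT-style training is, at its core, behavioral cloning: the policy is optimized to assign high probability to the expert's recorded action at each timestep. A minimal sketch of that per-step objective (a numerically stable softmax cross-entropy over action logits; the batch below is toy data, not CrafterPlay):

```python
import math

def cross_entropy(logits, action):
    """Behavioral-cloning loss for one timestep: the negative log-likelihood
    of the expert's action under the policy's predicted distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[action]

# Toy batch of (action logits, expert action) pairs.
batch = [([2.0, 0.5, -1.0], 0), ([0.1, 0.1, 3.0], 2)]
loss = sum(cross_entropy(l, a) for l, a in batch) / len(batch)
```

Minimizing this loss over millions of CrafterPlay timesteps is what lets behaviors like shelter-building emerge without ever being an explicit training target.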
CrafterCLIP (C-CLIP): Inspired by MineCLIP, C-CLIP enables vision-language alignment in Crafter. Trained on the CrafterCaption dataset, C-CLIP achieves high recall rates in vision-language alignment tasks, proving its reliability for understanding visual scenes in conjunction with textual descriptions. This is crucial for agents that need to interpret instructions based on what they see.
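The recall metric mentioned here is the standard way CLIP-style alignment is evaluated: for each video, check whether its matching caption ranks among the top-k by cosine similarity of the learned embeddings. A self-contained sketch on synthetic embeddings:

```python
import numpy as np

def recall_at_k(video_emb, text_emb, k):
    """Recall@k for paired video/caption embeddings: fraction of videos whose
    true caption (same row index) appears in the top-k by cosine similarity."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = v @ t.T                             # (N, N) cosine similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]    # top-k caption indices per video
    hits = (topk == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Well-aligned embeddings (caption = video + small noise) score near 1.0.
r1 = recall_at_k(emb, emb + 0.01 * rng.normal(size=emb.shape), k=1)
```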
CrafterSteve-1 (C-Steve-1): Adapting Steve-1 from Minecraft, C-Steve-1 is designed for instruction-following. It leverages C-VPT for behavioral priors and C-CLIP for vision-language understanding. To address the unique challenge of short, distinct tasks in Crafter, the researchers developed an “event-based packed hindsight relabeling” method for dataset generation. C-Steve-1 demonstrates near-perfect success rates on single-step instruction tasks, showcasing its ability to interpret and execute language commands.
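The idea behind event-based hindsight relabeling can be illustrated simply: cut a long trajectory at the timesteps where events (e.g., achievements) fire, and label each resulting segment with the event it culminates in, turning unlabeled play into (instruction, behavior) training pairs. This is my reading of the method, sketched with toy data; the paper's exact packing scheme may differ:

```python
def pack_hindsight(trajectory, events):
    """Split a trajectory at event timesteps and relabel each segment with
    the caption of the event it ends with, yielding (caption, segment) pairs.
    `trajectory` is a list of steps; `events` maps timestep -> caption."""
    pairs, start = [], 0
    for t in sorted(events):
        segment = trajectory[start:t + 1]  # steps leading up to the event
        pairs.append((events[t], segment))
        start = t + 1
    return pairs

traj = list(range(10))  # stand-in for 10 (obs, action) steps
events = {3: "collect wood", 7: "make wood pickaxe"}
pairs = pack_hindsight(traj, events)
```

Because Crafter's tasks are short and frequent, this yields many tight, well-labeled segments per episode, which is what makes the relabeling effective here.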
Enabling Long-Horizon Tasks with Hierarchical Planning
The paper also explores how these foundation models can be integrated into hierarchical agents to tackle complex, long-horizon tasks. By combining a high-level planner (e.g., PPO-based or heuristic) that selects language instructions with C-Steve-1 as a low-level controller, the PPO-Steve agent achieves competitive performance on multi-step tasks. This hierarchical approach significantly outperforms agents trained from scratch or those relying solely on behavioral priors or single instructions, highlighting the importance of comprehensive planning for complex sequential tasks in Crafter.
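The control loop described above can be sketched as follows. `plan` and `execute` are stubs standing in for the high-level planner and the C-Steve-1 controller respectively; all names and the fixed per-goal horizon are illustrative assumptions, not the paper's implementation:

```python
def hierarchical_rollout(plan, execute, goals, steps_per_goal=3):
    """Toy hierarchical loop: a high-level planner orders language subgoals,
    and a low-level controller runs for a fixed number of steps per subgoal."""
    log = []
    for goal in plan(goals):                   # planner: choose instruction order
        for _ in range(steps_per_goal):
            log.append((goal, execute(goal)))  # controller acts toward the goal
    return log

plan = lambda goals: sorted(goals)             # trivial heuristic planner stub
execute = lambda goal: f"action-for-{goal}"    # stub low-level policy
log = hierarchical_rollout(plan, execute, ["b_craft_pickaxe", "a_gather_wood"])
```

The division of labor is the point: the planner only reasons over a small vocabulary of instructions, while the instruction-conditioned controller absorbs all the low-level pixel-to-action complexity.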
CrafterDojo represents a significant step towards making Crafter a robust and efficient testbed for general-purpose embodied AI research. By providing essential foundation models, datasets, and toolkits, it enables faster iteration and innovation in agent development before scaling to more complex environments like Minecraft. For more technical details, refer to the full research paper.