
Advancing Humanoid AI: Team Revontuli’s Winning World Models

TL;DR: Team Revontuli took first place in both tracks of the 1X World Model Challenge, a benchmark for real-world humanoid interaction. For the ‘sampling’ track, which forecasts future image frames, the team adapted the Wan-2.2 TI2V-5B video generation model, conditioning it on robot states via AdaLN-Zero and fine-tuning with LoRA to reach 23.0 dB PSNR. For the ‘compression’ track, which predicts future discrete latent codes, they trained a Spatio-Temporal Transformer from scratch, achieving a Top-500 CE of 6.6386. Both methods combined strong performance with remarkable training efficiency, significantly outperforming the other competitors.

World models are a fascinating and powerful concept in the field of artificial intelligence and robotics. Imagine a robot that can think about its actions before it performs them, predicting what might happen next in its environment. This ability to ‘imagine’ the future allows robots to plan, anticipate outcomes, and make better decisions without needing constant real-world trial and error. This is precisely what world models aim to achieve, equipping agents with an internal simulator of their surroundings.

The recent 1X World Model Challenge put these advanced concepts to the test, providing an open-source benchmark for real-world humanoid interaction. The challenge was divided into two distinct but complementary tracks: ‘sampling’ and ‘compression’. Team Revontuli, a collaboration of researchers from Aalto University, University of Edinburgh, Deep Render, DataCrunch, and University of Helsinki, emerged victorious in both categories, showcasing cutting-edge approaches to generative world modeling.

The Sampling Challenge: Predicting Future Visuals

In the sampling track, the primary goal was to forecast future image frames, essentially predicting what the robot would ‘see’ in the future. Team Revontuli tackled this by adapting a powerful video generation foundation model called Wan-2.2 TI2V-5B. This model, originally designed for text-image-to-video generation, was modified to predict future frames based on existing video footage and, crucially, the robot’s internal states.

To integrate the robot’s state information, the team employed a technique called AdaLN-Zero within the model’s architecture, allowing the video generation process to be directly conditioned on the robot’s movements and internal state. Further refinement came from post-training with LoRA (Low-Rank Adaptation), a method for efficiently fine-tuning large models. The resulting model achieved an impressive 23.0 dB PSNR (Peak Signal-to-Noise Ratio), a standard image-quality metric, securing first place in the sampling task. Interestingly, their inference strategy averaged multiple predictions to selectively blur regions of high predictive uncertainty, which proved more effective than traditional blurring for optimizing PSNR. You can read their technical report here: Generative World Modelling for Humanoids.
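To make the AdaLN-Zero idea concrete, here is a minimal numpy sketch of one conditioned sublayer. The shapes, the `sublayer` stand-in, and the single modulation matrix are illustrative assumptions, not the team's actual implementation; the key property shown is the zero-initialised gate, which makes the block an identity at the start of training so the pretrained video model is initially untouched by the new robot-state pathway.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without a learned affine; AdaLN supplies scale/shift instead.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_block(x, cond, W_mod, b_mod, sublayer):
    """One AdaLN-Zero-conditioned sublayer (hypothetical shapes).

    x:    (tokens, dim) hidden states
    cond: (cond_dim,)   embedded robot state
    W_mod, b_mod project cond to (shift, scale, gate). With zero
    initialisation, shift = scale = gate = 0, so the block reduces
    to the residual identity at step 0.
    """
    shift, scale, gate = np.split(cond @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift   # state-conditioned modulation
    return x + gate * sublayer(h)               # zero gate => pure residual

rng = np.random.default_rng(0)
dim, cond_dim = 8, 4
x = rng.normal(size=(3, dim))
cond = rng.normal(size=(cond_dim,))
W_mod = np.zeros((cond_dim, 3 * dim))   # zero init: shift = scale = gate = 0
b_mod = np.zeros(3 * dim)
out = adaln_zero_block(x, cond, W_mod, b_mod, sublayer=lambda h: h * 2.0)
assert np.allclose(out, x)  # identity at initialisation
```

Once training updates `W_mod` and `b_mod` away from zero, the robot state starts to steer the normalisation statistics and gate the sublayer's contribution, which is what lets state information influence the generated frames.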

The Compression Challenge: Predicting Latent States

The compression track took a different approach, focusing on predicting future discrete latent codes rather than direct pixel-level images. This involves compressing video sequences into a more compact, tokenized representation. For this challenge, Team Revontuli developed a Spatio-Temporal Transformer model from scratch. This model efficiently processes both spatial (within a single frame) and temporal (across frames) information, making it well-suited for video data.
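A common way such a model factorises its attention, and a reasonable reading of "processes both spatial and temporal information", is to alternate attention within each frame with attention across frames at each spatial position. The sketch below illustrates that factorisation in numpy with a single head and identity projections; the shapes and the simplified attention are assumptions for illustration, not the team's architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product attention with identity Q/K/V
    # projections, kept deliberately minimal for illustration.
    d = x.shape[-1]
    att = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return att @ x

def spatio_temporal_block(tokens):
    """Factorised attention over a (T, S, D) token grid:
    spatial attention mixes the S tokens inside each frame,
    temporal attention mixes each spatial position across T frames."""
    x = tokens + self_attention(tokens)   # spatial: attends within each frame
    xt = np.swapaxes(x, 0, 1)             # (S, T, D)
    xt = xt + self_attention(xt)          # temporal: attends across frames
    return np.swapaxes(xt, 0, 1)          # back to (T, S, D)

out = spatio_temporal_block(np.random.default_rng(1).normal(size=(4, 6, 8)))
assert out.shape == (4, 6, 8)
```

Splitting attention this way keeps the cost linear in the number of frames times the per-frame cost, rather than quadratic in the full token count, which is one reason factorised designs suit video data.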

The video sequences were first encoded into discrete tokens using a specialized Cosmos8x8x8 tokeniser. The Spatio-Temporal Transformer then learned to predict the next sequence of these tokens, effectively forecasting the compressed future state of the environment. This model achieved a Top-500 Cross-Entropy (CE) of 6.6386, again earning them first place. The team found that a greedy decoding strategy during inference, which selects the most probable sequence of tokens at each step, offered a practical balance of speed and accuracy.
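Greedy decoding itself is simple to state: at each step, feed the tokens so far to the model and append the single most probable next token. The sketch below shows the loop with a toy stand-in for the trained transformer's next-token distribution; `logits_fn` and the five-token vocabulary are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np

def greedy_decode(logits_fn, prompt, n_new):
    """Greedy autoregressive decoding: append the argmax token each step.
    logits_fn stands in for the trained model's next-token logits."""
    seq = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(seq)
        seq.append(int(np.argmax(logits)))  # deterministic: no sampling
    return seq

def toy_logits(seq, vocab=5):
    # Toy model over a 5-token vocabulary: always favour (last + 1) mod vocab.
    logits = np.zeros(vocab)
    logits[(seq[-1] + 1) % vocab] = 1.0
    return logits

print(greedy_decode(toy_logits, prompt=[0], n_new=4))  # [0, 1, 2, 3, 4]
```

Greedy decoding trades away the diversity of sampling for a single pass per token and a deterministic output, which matches the speed-versus-accuracy balance the team describes.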


Remarkable Performance and Efficiency

Beyond their top-ranking performance in both challenges, Team Revontuli highlighted the remarkable efficiency of their training process. They achieved their first-place sampling results in just 36 hours on a DataCrunch instant cluster, significantly faster than the runner-up, who reportedly took about a month. Their compression model trained in under 17 hours. This speed demonstrates the power of leveraging pre-trained foundation models and efficient training infrastructure.

In conclusion, Team Revontuli’s work in the 1X World Model Challenge represents a significant step forward in equipping humanoid robots with sophisticated internal simulators. By excelling in both visual prediction (sampling) and efficient state representation (compression), their methods offer valuable insights that will likely influence future developments in robotics and generative AI.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
