
Advancing Humanoid AI: Team Revontuli’s Winning World Models

TL;DR: Team Revontuli took first place in both tracks of the 1X World Model Challenge, a benchmark for real-world humanoid interaction. For the ‘sampling’ track, which forecasts future image frames, the team adapted the Wan-2.2 TI2V-5B video generation model, conditioning it on robot states via AdaLN-Zero and fine-tuning with LoRA to reach 23.0 dB PSNR. For the ‘compression’ track, which predicts future discrete latent codes, they trained a Spatio-Temporal Transformer from scratch, achieving a Top-500 CE of 6.6386. Both methods combined strong performance with remarkable training efficiency, significantly outperforming the other competitors.

World models are a fascinating and powerful concept in the field of artificial intelligence and robotics. Imagine a robot that can think about its actions before it performs them, predicting what might happen next in its environment. This ability to ‘imagine’ the future allows robots to plan, anticipate outcomes, and make better decisions without needing constant real-world trial and error. This is precisely what world models aim to achieve, equipping agents with an internal simulator of their surroundings.

The recent 1X World Model Challenge put these advanced concepts to the test, providing an open-source benchmark for real-world humanoid interaction. The challenge was divided into two distinct but complementary tracks: ‘sampling’ and ‘compression’. Team Revontuli, a collaboration of researchers from Aalto University, University of Edinburgh, Deep Render, DataCrunch, and University of Helsinki, emerged victorious in both categories, showcasing cutting-edge approaches to generative world modeling.

The Sampling Challenge: Predicting Future Visuals

In the sampling track, the primary goal was to forecast future image frames, essentially predicting what the robot would ‘see’ in the future. Team Revontuli tackled this by adapting a powerful video generation foundation model called Wan-2.2 TI2V-5B. This model, originally designed for text-image-to-video generation, was modified to predict future frames based on existing video footage and, crucially, the robot’s internal states.

To integrate the robot’s state information, the team employed a technique called AdaLN-Zero within the model’s architecture, allowing the video generation process to be directly conditioned on the robot’s movements and internal state. Further refinement came from post-training with LoRA (Low-Rank Adaptation), a method for efficiently fine-tuning large models. The resulting model achieved an impressive 23.0 dB PSNR (Peak Signal-to-Noise Ratio), a standard image-quality metric, securing first place in the sampling task. Interestingly, their inference strategy averaged multiple predictions to selectively blur regions of high predictive uncertainty, which proved more effective than traditional blurring for optimizing PSNR. You can read their technical report here: Generative World Modelling for Humanoids.
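To make the AdaLN-Zero idea concrete, here is a minimal numpy sketch of one conditioned sublayer. The shapes, the `sublayer` stand-in, and the single modulation matrix are illustrative assumptions, not the team's actual implementation; the key property shown is the zero-initialised gate, which makes the block an identity at the start of training so the pretrained video model is initially untouched by the new robot-state pathway.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without a learned affine; AdaLN supplies scale/shift instead.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_block(x, cond, W_mod, b_mod, sublayer):
    """One AdaLN-Zero-conditioned sublayer (hypothetical shapes).

    x:    (tokens, dim) hidden states
    cond: (cond_dim,)   embedded robot state
    W_mod, b_mod project cond to (shift, scale, gate). With zero
    initialisation, shift = scale = gate = 0, so the block reduces
    to the residual identity at step 0.
    """
    shift, scale, gate = np.split(cond @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift   # state-conditioned modulation
    return x + gate * sublayer(h)               # zero gate => pure residual

rng = np.random.default_rng(0)
dim, cond_dim = 8, 4
x = rng.normal(size=(3, dim))
cond = rng.normal(size=(cond_dim,))
W_mod = np.zeros((cond_dim, 3 * dim))   # zero init: shift = scale = gate = 0
b_mod = np.zeros(3 * dim)
out = adaln_zero_block(x, cond, W_mod, b_mod, sublayer=lambda h: h * 2.0)
assert np.allclose(out, x)  # identity at initialisation
```

Once training updates `W_mod` and `b_mod` away from zero, the robot state starts to steer the normalisation statistics and gate the sublayer's contribution, which is what lets state information influence the generated frames.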

The Compression Challenge: Predicting Latent States

The compression track took a different approach, focusing on predicting future discrete latent codes rather than direct pixel-level images. This involves compressing video sequences into a more compact, tokenized representation. For this challenge, Team Revontuli developed a Spatio-Temporal Transformer model from scratch. This model efficiently processes both spatial (within a single frame) and temporal (across frames) information, making it well-suited for video data.
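A common way such a model factorises its attention, and a reasonable reading of "processes both spatial and temporal information", is to alternate attention within each frame with attention across frames at each spatial position. The sketch below illustrates that factorisation in numpy with a single head and identity projections; the shapes and the simplified attention are assumptions for illustration, not the team's architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product attention with identity Q/K/V
    # projections, kept deliberately minimal for illustration.
    d = x.shape[-1]
    att = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return att @ x

def spatio_temporal_block(tokens):
    """Factorised attention over a (T, S, D) token grid:
    spatial attention mixes the S tokens inside each frame,
    temporal attention mixes each spatial position across T frames."""
    x = tokens + self_attention(tokens)   # spatial: attends within each frame
    xt = np.swapaxes(x, 0, 1)             # (S, T, D)
    xt = xt + self_attention(xt)          # temporal: attends across frames
    return np.swapaxes(xt, 0, 1)          # back to (T, S, D)

out = spatio_temporal_block(np.random.default_rng(1).normal(size=(4, 6, 8)))
assert out.shape == (4, 6, 8)
```

Splitting attention this way keeps the cost linear in the number of frames times the per-frame cost, rather than quadratic in the full token count, which is one reason factorised designs suit video data.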

The video sequences were first encoded into discrete tokens using a specialized Cosmos8x8x8 tokeniser. The Spatio-Temporal Transformer then learned to predict the next sequence of these tokens, effectively forecasting the compressed future state of the environment. This model achieved a Top-500 Cross-Entropy (CE) of 6.6386, again earning them first place. The team found that a greedy decoding strategy during inference, which selects the most probable sequence of tokens at each step, offered a practical balance of speed and accuracy.
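Greedy decoding itself is simple to state: at each step, feed the tokens so far to the model and append the single most probable next token. The sketch below shows the loop with a toy stand-in for the trained transformer's next-token distribution; `logits_fn` and the five-token vocabulary are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np

def greedy_decode(logits_fn, prompt, n_new):
    """Greedy autoregressive decoding: append the argmax token each step.
    logits_fn stands in for the trained model's next-token logits."""
    seq = list(prompt)
    for _ in range(n_new):
        logits = logits_fn(seq)
        seq.append(int(np.argmax(logits)))  # deterministic: no sampling
    return seq

def toy_logits(seq, vocab=5):
    # Toy model over a 5-token vocabulary: always favour (last + 1) mod vocab.
    logits = np.zeros(vocab)
    logits[(seq[-1] + 1) % vocab] = 1.0
    return logits

print(greedy_decode(toy_logits, prompt=[0], n_new=4))  # [0, 1, 2, 3, 4]
```

Greedy decoding trades away the diversity of sampling for a single pass per token and a deterministic output, which matches the speed-versus-accuracy balance the team describes.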


Remarkable Performance and Efficiency

Beyond their top-ranking performance in both challenges, Team Revontuli highlighted the remarkable efficiency of their training process. They achieved their first-place sampling results in just 36 hours on a DataCrunch instant cluster, significantly faster than the runner-up, who reportedly took about a month. Their compression model trained in under 17 hours. This speed demonstrates the power of leveraging pre-trained foundation models and efficient training infrastructure.

In conclusion, Team Revontuli’s work in the 1X World Model Challenge represents a significant step forward in equipping humanoid robots with sophisticated internal simulators. By excelling in both visual prediction (sampling) and efficient state representation (compression), their methods offer valuable insights that will likely influence future developments in robotics and generative AI.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
