
Pixels to Play: A New AI Model Learns to Master 3D Games from Visuals

TLDR: Pixels2Play-0.1 (P2P0.1) is a new foundation model designed to play a wide range of 3D video games directly from pixel input, mimicking human-like behavior. It uses behavior cloning, combining labeled human gameplay with unlabeled public videos, where an inverse-dynamics model infers actions. The model, built on a decoder-only transformer, aims to generalize to new titles with minimal game-specific engineering. Early results show competent play in Roblox and MS-DOS games at a novice human level, with unlabeled data significantly improving generalization. The research paves the way for AI companions, adaptive NPCs, and automated game testing.

A new foundation model, Pixels2Play-0.1 (P2P0.1), has been introduced that learns to play a wide variety of 3D video games by observing the same pixel stream available to human players. The model aims to generalize to new game titles with minimal game-specific engineering while exhibiting human-like behavior in its gameplay.

The motivation behind P2P0.1 stems from several emerging applications in the gaming world. Imagine AI teammates that can genuinely cooperate, non-player characters (NPCs) that adapt dynamically rather than following rigid scripts, personalized live-streamers that play on demand, or automated quality-assurance testers that explore game environments for bugs. Current large language models (LLMs) and vision-language models (VLMs), despite their advancements, often struggle with the complex, real-time demands of video games, a gap that P2P0.1 seeks to bridge.

P2P0.1 is trained using a method called behavior cloning, where it learns from demonstrations of human gameplay. This involves a combination of carefully labeled demonstrations and a vast amount of unlabeled public video content. To make use of the unlabeled videos, the researchers developed an inverse-dynamics model (IDM) that can infer the actions taken by players in those videos, effectively turning them into additional training data. The core of P2P0.1 is a decoder-only transformer, a type of neural network architecture, which processes video frames and generates actions in an autoregressive manner, predicting each action step by step conditioned on the frames and actions that came before. This design allows it to handle the complex and varied action spaces found in games, from keyboard presses to mouse movements, while remaining efficient enough to run on a single consumer graphics card.
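To make that design concrete, here is a minimal PyTorch sketch of a decoder-only transformer policy of this general shape. It is illustrative only: the class name PixelPolicy, the token counts, the action vocabulary size, and every other dimension are assumptions, not the paper's actual architecture. Each frame arrives as a set of visual tokens, the previous action is interleaved into the stream, and a causal mask makes each action prediction depend only on what came before:

```python
# Illustrative sketch of a decoder-only transformer policy; not the actual
# P2P0.1 architecture. All names and sizes are assumptions.
import torch
import torch.nn as nn

class PixelPolicy(nn.Module):
    def __init__(self, n_frame_tokens=16, d_model=256, n_actions=32,
                 n_layers=4, n_heads=4, max_frames=8):
        super().__init__()
        self.frame_proj = nn.Linear(d_model, d_model)        # project visual tokens
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.pos_embed = nn.Embedding(max_frames * (n_frame_tokens + 1), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, frame_tokens, prev_actions):
        # frame_tokens: (B, T, N, D) visual tokens per frame, e.g. from a
        # game-specific tokenizer; prev_actions: (B, T) int64 action ids.
        B, T, N, D = frame_tokens.shape
        acts = self.action_embed(prev_actions).unsqueeze(2)   # (B, T, 1, D)
        frames = self.frame_proj(frame_tokens)                # (B, T, N, D)
        # Interleave [a_{t-1}, frame_t tokens] for every step t.
        seq = torch.cat([acts, frames], dim=2).reshape(B, T * (N + 1), D)
        seq = seq + self.pos_embed(torch.arange(seq.shape[1], device=seq.device))
        # Causal mask: each position attends only to earlier positions,
        # which is what makes action prediction autoregressive.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.shape[1]).to(seq.device)
        h = self.backbone(seq, mask=mask)
        # Predict action a_t from the last visual token of frame t.
        idx = torch.arange(N, seq.shape[1], N + 1, device=seq.device)
        return self.action_head(h[:, idx])                    # (B, T, n_actions)
```

Under these assumptions, behavior cloning reduces to a cross-entropy loss between these per-frame action logits and the actions the human demonstrator actually took.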

The model’s ability to learn from unlabeled data is a crucial aspect, as curated, labeled gameplay demonstrations are far less abundant than general gameplay videos available online. The IDM acts as a bridge, allowing the model to leverage this wealth of unannotated content. The researchers also experimented with different ways to process game images, finding that tokenizers specifically trained on game visuals performed better than those trained on general photos, as games often require attention to small, fast-moving details.
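As a hedged sketch of how such pseudo-labeling might look in practice (the idm model, its window size, and its interface here are assumptions, not the paper's published details): the IDM can see a short window of frames around each time step, including a few future frames, which makes inferring the action that was taken easier than choosing one, and its predictions become training labels for behavior cloning.

```python
# Hypothetical IDM pseudo-labeling loop; the `idm` interface is assumed.
import torch

@torch.no_grad()
def pseudo_label(idm, frames, window=5):
    """frames: (T, C, H, W) tensor decoded from an unlabeled gameplay video.
    Returns (frame_index, inferred_action) pairs usable as training data."""
    half = window // 2
    labels = []
    for t in range(half, frames.shape[0] - half):
        # The IDM may look at future frames too: inferring what action was
        # taken is easier than deciding what action to take next.
        clip = frames[t - half : t + half + 1].unsqueeze(0)  # (1, window, C, H, W)
        logits = idm(clip)                                   # (1, n_actions)
        labels.append((t, int(logits.argmax(dim=-1))))
    return labels
```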

For data collection, the team used a two-step filtering process for unlabeled videos, employing commercial VLMs to ensure relevance and remove non-gameplay segments. Labeled data was gathered from paid annotators playing specific games, and they are also exploring collecting gameplay data from product users with their consent. The team addressed challenges like differences in video compression and image resizing between training and inference by using data augmentation and consistent processing methods.
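A minimal sketch of what those two fixes could look like follows; the quality range, target resolution, and function names are illustrative assumptions, not the team's actual pipeline. Training frames are re-encoded at random JPEG qualities so the model tolerates the varied compression of public videos, and a single resampling filter is pinned so training and inference see identically resized pixels:

```python
# Assumed preprocessing sketch, not the team's actual pipeline.
import io
import random
from PIL import Image

TARGET_SIZE = (256, 256)  # hypothetical model input resolution

def augment_compression(img: Image.Image) -> Image.Image:
    """Re-encode the frame as JPEG at a random quality to mimic the
    varied compression of public gameplay videos."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=random.randint(30, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def preprocess(img: Image.Image, train: bool) -> Image.Image:
    if train:
        img = augment_compression(img)
    # Pin one resampling filter for both training and inference.
    return img.resize(TARGET_SIZE, Image.BILINEAR)
```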

Currently, P2P0.1 has been tested on simpler Roblox games and classic MS-DOS titles. Qualitatively, the model demonstrates competent play at the level of a novice human player, meaning it can play most games it was trained on, though a skilled human would still outperform it. Evaluating performance across a wide variety of games automatically is a significant challenge that the researchers are actively working on. Initial experiments show that incorporating unlabeled data significantly improves the model’s ability to generalize to new situations, reducing overfitting compared to models trained only on limited labeled data.

The developers envision a future where P2P0.1 evolves to handle more complex 3D titles, with ongoing work focused on refining its architecture, expanding its training data, and increasing its capacity. They also aim to extend the model’s ability to reason over longer periods of gameplay, which is essential for mastering more intricate games. This research represents an exciting step towards creating versatile AI agents that can interact with and play games in a truly human-like and adaptable manner. You can find more details about this work in the full research paper: Pixels to Play: A Foundation Model for 3D Gameplay.

Nikhil Patel
https://blogs.edgentiq.com
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
