TLDR: The paper introduces STORI (STOchastic-ataRI), a new benchmark that adds diverse stochastic effects to Atari environments, and proposes a taxonomy for classifying stochasticity in Reinforcement Learning (RL) environments. It aims to address the limitations of current deterministic benchmarks, which fail to prepare RL agents for real-world uncertainty. Experiments with DreamerV3 and STORM show that stochasticity generally degrades performance, but also reveal how different algorithms cope with various types of uncertainty, highlighting the need for more robust RL systems.
Reinforcement Learning (RL) agents have made significant strides in controlled environments like Atari games, but their ability to perform robustly in the unpredictable real world remains a challenge. Many real-world scenarios, such as autonomous driving or robot navigation, involve inherent noise, partial information, and dynamic changes that current benchmarks often fail to capture.
To bridge this gap, a new research paper introduces STORI (STOchastic-ataRI), a novel benchmark designed to rigorously evaluate RL methods under various forms of uncertainty. Alongside STORI, the researchers propose an updated taxonomy of stochasticity, offering a unified framework to analyze and compare different RL approaches.
The Problem with Current Benchmarks
Traditional RL benchmarks, including many Atari games, are often deterministic or nearly so. While Model-Based RL (MBRL) approaches, which use learned ‘World Models,’ theoretically handle partial observability, they struggle with true stochasticity. This mismatch between simplified benchmarks and the complex, uncertain nature of real-world settings creates a significant barrier to applying RL advances beyond simulations. Existing methods to introduce stochasticity, like ‘sticky actions,’ have been limited in scope, failing to cover the full spectrum of environmental uncertainties such as noisy observations, variable dynamics, or non-stationarity.
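As a concrete reference point, here is a minimal sketch of the sticky-actions mechanism as a Gymnasium wrapper. This is illustrative only, not the paper's code; ALE also exposes the same behavior natively through its repeat_action_probability argument.

```python
import random

import gymnasium as gym


class StickyActions(gym.Wrapper):
    """With probability p, repeat the previous action instead of the chosen one."""

    def __init__(self, env: gym.Env, p: float = 0.25):
        super().__init__(env)
        self.p = p
        self.last_action = 0  # action 0 is NOOP in ALE

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action  # the intended action is overridden
        self.last_action = action
        return self.env.step(action)
```

Notably, sticky actions (with p = 0.25 by default in the "v5" Atari environments) are essentially the only stochasticity the standard benchmark applies, which is exactly the narrow coverage the paper criticizes.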
Introducing STORI: A New Standard for Uncertainty
STORI extends the classic Atari Learning Environment (ALE) by systematically integrating diverse and controllable sources of stochasticity. This allows for a fine-grained evaluation of how RL agents adapt and perform under multiple dimensions of uncertainty. The benchmark’s design enables researchers to probe algorithmic robustness and strategy formation in scenarios that mirror real-world unpredictability.
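STORI's actual interface is not reproduced in this summary, so the snippet below is a purely hypothetical sketch of what "controllable sources of stochasticity" can look like in practice: a factory that layers independent, individually tunable noise wrappers over a standard ALE environment. The names make_noisy_env and ObservationNoise are invented for illustration; only repeat_action_probability is a real ALE option.

```python
import ale_py
import gymnasium as gym
import numpy as np

gym.register_envs(ale_py)  # registers ALE/* ids (needed on recent gymnasium/ale-py)


class ObservationNoise(gym.ObservationWrapper):
    """Action-independent uncertainty: corrupt each frame with Gaussian pixel noise."""

    def __init__(self, env: gym.Env, sigma: float):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        noise = np.random.normal(0.0, self.sigma, obs.shape)
        return np.clip(obs + noise, 0, 255).astype(np.uint8)


def make_noisy_env(game: str, sticky_p: float = 0.0, obs_sigma: float = 0.0):
    """Hypothetical factory: each knob switches on one independent source of
    stochasticity, loosely mirroring STORI's controllable design."""
    env = gym.make(f"ALE/{game}-v5", repeat_action_probability=sticky_p)
    if obs_sigma > 0:
        env = ObservationNoise(env, obs_sigma)
    return env


# Each source of uncertainty can be enabled and dialed independently:
env = make_noisy_env("Breakout", sticky_p=0.25, obs_sigma=8.0)
```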
A Unified Taxonomy of Stochasticity
The paper presents a comprehensive taxonomy to classify different types of stochasticity in RL environments. This framework helps in understanding and categorizing the various challenges agents might face (a code sketch showing how some of these types can be approximated as wrappers follows the list):
- Type 0: Deterministic Environment – The most basic case, where actions always lead to predictable outcomes and the state is fully observable.
- Type 1: Intrinsic Action-Dependent Stochasticity – The environment might randomly replace the agent’s chosen action with another, as seen in ‘sticky actions’, where a previous action might repeat. The same action can therefore have varied results.
- Type 2: Intrinsic Action-Independent – Random Stochasticity – Randomness arises independently of the agent’s actions, often due to external factors. For example, in Atari Breakout, the ball might randomly bounce back without destroying a block, regardless of the paddle’s movement.
- Type 3: Intrinsic Action-Independent – Concept Drift – The environment’s dynamics change over time, independent of the agent’s actions. This can be sudden (abrupt changes), gradual (slow transitions), or recurring (previously seen dynamics reappear). An Atari example is game difficulty increasing as an agent levels up.
- Type 4: Partially Observed – Representation Learning – The agent doesn’t have full state information and must infer hidden details or learn a suitable representation from limited observations, like perceiving only screen images in the default Atari settings.
- Type 5: Partially Observed – Missing State Variable(s) – Critical information about certain state variables is entirely missing. Examples include invisible blocks in Breakout or hidden score/clock information in Boxing, forcing agents to make decisions despite perceptual gaps.
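To make the taxonomy concrete, here is a minimal sketch, again assuming the Gymnasium API rather than STORI's actual implementation, of how Types 3 and 5 could be approximated as wrappers. Type 1 was sketched above; Type 2 effects such as the random ball bounce live inside the game dynamics and generally cannot be added from outside by a wrapper, which is the kind of deeper integration STORI provides.

```python
import gymnasium as gym
import numpy as np


class GradualDrift(gym.Wrapper):
    """Type 3 (gradual concept drift): a dynamics parameter shifts slowly over
    time, independent of the agent's actions. As a stand-in for changing game
    dynamics, the chance of a forced NOOP grows a little every step."""

    def __init__(self, env: gym.Env, drift_per_step: float = 1e-5, cap: float = 0.5):
        super().__init__(env)
        self.drift_per_step = drift_per_step
        self.cap = cap
        self.noop_p = 0.0

    def step(self, action):
        self.noop_p = min(self.cap, self.noop_p + self.drift_per_step)
        if np.random.random() < self.noop_p:
            action = 0  # NOOP in ALE
        return self.env.step(action)


class MaskScoreRegion(gym.ObservationWrapper):
    """Type 5 (missing state variables): black out a fixed screen region,
    e.g. the rows holding the score and clock, so those variables are
    simply absent from the observation."""

    def __init__(self, env: gym.Env, rows: slice = slice(0, 20)):
        super().__init__(env)
        self.rows = rows

    def observation(self, obs):
        obs = obs.copy()
        obs[self.rows] = 0  # assumes image observations in row-major (HWC) layout
        return obs
```

Whether masking the top rows actually hides the score depends on each game's screen layout; the slice here is an arbitrary placeholder, not the region STORI uses.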
Experiments and Key Findings
The researchers used STORI to evaluate two prominent model-based RL algorithms, DreamerV3 and STORM, on modified versions of Atari Breakout and Boxing. These games were chosen for their contrasting action spaces and suitability as baselines.
The experiments revealed that introducing stochasticity generally led to a noticeable decline in performance for both algorithms compared to deterministic settings. However, interesting differences emerged:
- In Breakout, DreamerV3 outperformed STORM in the deterministic baseline, but STORM showed slightly stronger performance and greater robustness across most stochasticity types.
- In Boxing, the performance drop due to stochasticity was less severe. DreamerV3 often outperformed STORM in stochastic Boxing environments, while STORM had a slight edge in the baseline. This difference was attributed to Boxing’s larger action space, which offers more action redundancy and recovery opportunities, mitigating the impact of unexpected outcomes.
- Intriguingly, in some partially observed scenarios (Type 5A Boxing, where the score and clock were hidden), DreamerV3 sometimes performed better than its own baseline. This suggests that removing non-essential information can, in certain cases, simplify the agent’s learning process.
- Agents also demonstrated adaptability: in a Type 5B Boxing environment (partially hidden screen), they learned to exploit spatial constraints by confining opponents to the visible area.
Looking Ahead
While the study faced limitations, such as computational constraints restricting the number of experimental runs and algorithms tested, STORI provides a powerful and flexible framework for future research. It allows researchers to systematically explore how different forms of uncertainty impact RL algorithms, guiding the development of more resilient and adaptable AI systems for real-world applications. For full details, see the original research paper.