TLDR: The paper introduces STORI (STOchastic-ataRI), a new benchmark that adds diverse stochastic effects to Atari environments, and proposes a taxonomy for classifying stochasticity in Reinforcement Learning (RL) environments. It aims to address the limitations of current deterministic benchmarks, which fail to prepare RL agents for real-world uncertainty. Experiments with DreamerV3 and STORM show that stochasticity generally degrades performance, but also reveal how different algorithms cope with various types of uncertainty, highlighting the need for more robust RL systems.
Reinforcement Learning (RL) agents have made significant strides in controlled environments like Atari games, but their ability to perform robustly in the unpredictable real world remains a challenge. Many real-world scenarios, such as autonomous driving or robot navigation, involve inherent noise, partial information, and dynamic changes that current benchmarks often fail to capture.
To bridge this gap, a new research paper introduces STORI (STOchastic-ataRI), a novel benchmark designed to rigorously evaluate RL methods under various forms of uncertainty. Alongside STORI, the researchers propose an updated taxonomy of stochasticity, offering a unified framework to analyze and compare different RL approaches.
The Problem with Current Benchmarks
Traditional RL benchmarks, including many Atari games, are often deterministic or nearly so. While Model-Based RL (MBRL) approaches, which use learned ‘World Models,’ theoretically handle partial observability, they struggle with true stochasticity. This mismatch between simplified benchmarks and the complex, uncertain nature of real-world settings creates a significant barrier to applying RL advances beyond simulations. Existing methods to introduce stochasticity, like ‘sticky actions,’ have been limited in scope, failing to cover the full spectrum of environmental uncertainties such as noisy observations, variable dynamics, or non-stationarity.
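As a concrete reference point, here is a minimal sketch of the sticky-actions mechanism as a Gymnasium wrapper. This is illustrative only, not the paper's code; ALE also exposes the same behavior natively through its repeat_action_probability argument.

```python
import random

import gymnasium as gym


class StickyActions(gym.Wrapper):
    """With probability p, repeat the previous action instead of the chosen one."""

    def __init__(self, env: gym.Env, p: float = 0.25):
        super().__init__(env)
        self.p = p
        self.last_action = 0  # action 0 is NOOP in ALE

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.last_action  # the intended action is overridden
        self.last_action = action
        return self.env.step(action)
```

Notably, sticky actions (with p = 0.25 by default in the "v5" Atari environments) are essentially the only stochasticity the standard benchmark applies, which is exactly the narrow coverage the paper criticizes.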
Introducing STORI: A New Standard for Uncertainty
STORI extends the classic Atari Learning Environment (ALE) by systematically integrating diverse and controllable sources of stochasticity. This allows for a fine-grained evaluation of how RL agents adapt and perform under multiple dimensions of uncertainty. The benchmark’s design enables researchers to probe algorithmic robustness and strategy formation in scenarios that mirror real-world unpredictability.
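STORI's actual interface is not reproduced in this summary, so the snippet below is a purely hypothetical sketch of what "controllable sources of stochasticity" can look like in practice: a factory that layers independent, individually tunable noise wrappers over a standard ALE environment. The names make_noisy_env and ObservationNoise are invented for illustration; only repeat_action_probability is a real ALE option.

```python
import ale_py
import gymnasium as gym
import numpy as np

gym.register_envs(ale_py)  # registers ALE/* ids (needed on recent gymnasium/ale-py)


class ObservationNoise(gym.ObservationWrapper):
    """Action-independent uncertainty: corrupt each frame with Gaussian pixel noise."""

    def __init__(self, env: gym.Env, sigma: float):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        noise = np.random.normal(0.0, self.sigma, obs.shape)
        return np.clip(obs + noise, 0, 255).astype(np.uint8)


def make_noisy_env(game: str, sticky_p: float = 0.0, obs_sigma: float = 0.0):
    """Hypothetical factory: each knob switches on one independent source of
    stochasticity, loosely mirroring STORI's controllable design."""
    env = gym.make(f"ALE/{game}-v5", repeat_action_probability=sticky_p)
    if obs_sigma > 0:
        env = ObservationNoise(env, obs_sigma)
    return env


# Each source of uncertainty can be enabled and dialed independently:
env = make_noisy_env("Breakout", sticky_p=0.25, obs_sigma=8.0)
```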
A Unified Taxonomy of Stochasticity
The paper presents a comprehensive taxonomy to classify different types of stochasticity in RL environments. This framework helps in understanding and categorizing the various challenges agents might face (a code sketch showing how some of these types can be approximated as wrappers follows the list):
- Type 0: Deterministic Environment – The most basic case, where actions always lead to predictable outcomes and the state is fully observable.
- Type 1: Intrinsic Action-Dependent Stochasticity – The environment might randomly replace the agent’s chosen action with another, as seen in ‘sticky actions’, where a previous action might repeat. The same action can therefore have varied results.
- Type 2: Intrinsic Action-Independent – Random Stochasticity – Randomness arises independently of the agent’s actions, often due to external factors. For example, in Atari Breakout, the ball might randomly bounce back without destroying a block, regardless of the paddle’s movement.
- Type 3: Intrinsic Action-Independent – Concept Drift – The environment’s dynamics change over time, independent of the agent’s actions. This can be sudden (abrupt changes), gradual (slow transitions), or recurring (previously seen dynamics reappear). An Atari example is game difficulty increasing as an agent levels up.
- Type 4: Partially Observed – Representation Learning – The agent doesn’t have full state information and must infer hidden details or learn a suitable representation from limited observations, like perceiving only screen images in the default Atari settings.
- Type 5: Partially Observed – Missing State Variable(s) – Critical information about certain state variables is entirely missing. Examples include invisible blocks in Breakout or hidden score/clock information in Boxing, forcing agents to make decisions despite perceptual gaps.
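To make the taxonomy concrete, here is a minimal sketch, again assuming the Gymnasium API rather than STORI's actual implementation, of how Types 3 and 5 could be approximated as wrappers. Type 1 was sketched above; Type 2 effects such as the random ball bounce live inside the game dynamics and generally cannot be added from outside by a wrapper, which is the kind of deeper integration STORI provides.

```python
import gymnasium as gym
import numpy as np


class GradualDrift(gym.Wrapper):
    """Type 3 (gradual concept drift): a dynamics parameter shifts slowly over
    time, independent of the agent's actions. As a stand-in for changing game
    dynamics, the chance of a forced NOOP grows a little every step."""

    def __init__(self, env: gym.Env, drift_per_step: float = 1e-5, cap: float = 0.5):
        super().__init__(env)
        self.drift_per_step = drift_per_step
        self.cap = cap
        self.noop_p = 0.0

    def step(self, action):
        self.noop_p = min(self.cap, self.noop_p + self.drift_per_step)
        if np.random.random() < self.noop_p:
            action = 0  # NOOP in ALE
        return self.env.step(action)


class MaskScoreRegion(gym.ObservationWrapper):
    """Type 5 (missing state variables): black out a fixed screen region,
    e.g. the rows holding the score and clock, so those variables are
    simply absent from the observation."""

    def __init__(self, env: gym.Env, rows: slice = slice(0, 20)):
        super().__init__(env)
        self.rows = rows

    def observation(self, obs):
        obs = obs.copy()
        obs[self.rows] = 0  # assumes image observations in row-major (HWC) layout
        return obs
```

Whether masking the top rows actually hides the score depends on each game's screen layout; the slice here is an arbitrary placeholder, not the region STORI uses.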
Experiments and Key Findings
The researchers used STORI to evaluate two prominent model-based RL algorithms, DreamerV3 and STORM, on modified versions of Atari Breakout and Boxing. These games were chosen for their contrasting action spaces and suitability as baselines.
The experiments revealed that introducing stochasticity generally led to a noticeable decline in performance for both algorithms compared to deterministic settings. However, interesting differences emerged:
- In Breakout, DreamerV3 outperformed STORM in the deterministic baseline, but STORM showed slightly stronger performance and greater robustness across most stochasticity types.
- In Boxing, the performance drop due to stochasticity was less severe. DreamerV3 often outperformed STORM in stochastic Boxing environments, while STORM had a slight edge in the baseline. This difference was attributed to Boxing’s larger action space, which offers more action redundancy and recovery opportunities, mitigating the impact of unexpected outcomes.
- Intriguingly, in some partially observed scenarios (Type 5A Boxing, where the score and clock were hidden), DreamerV3 sometimes performed better than its own baseline. This suggests that removing non-essential information can, in certain cases, simplify the agent’s learning process.
- Agents also demonstrated adaptability: in a Type 5B Boxing environment (partially hidden screen), they learned to exploit spatial constraints by confining opponents to the visible area.
Looking Ahead
While the study faced limitations, such as computational constraints restricting the number of experimental runs and algorithms tested, STORI provides a powerful and flexible framework for future research. It allows researchers to systematically explore how different forms of uncertainty impact RL algorithms, guiding the development of more resilient and adaptable AI systems for real-world applications. For full details, see the original research paper.